
NIFI-16011: Reduce number of FlowFiles used in LoadBalanceIT from 100…#11325

Merged
kevdoran merged 2 commits into
apache:main from
markap14:NIFI-16011
Jun 12, 2026

@markap14

Contributor

… to 20 in order to avoid the excessive number of requests to the cluster in order to iterate over each FlowFile in the queue

Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull request contains commits signed with a registered key indicating Verified status

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using ./mvnw clean install -P contrib-check
    • JDK 21
    • JDK 25

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@markap14 markap14 marked this pull request as ready for review June 10, 2026 19:10
@markap14

Contributor Author

[claude-opus-4.7] System-tests run 27299783298 finished with three failed shards. All three are pre-existing flakes unrelated to the change in this PR; main has been failing with the same symptoms (8 of the last 10 main system-tests runs failed).

ubuntu-24.04 Java 21 (job)

  • 1 failure: ClusteredStatelessFlowIT.testChangeStatelessFlowWhileNotDisconnected (in teardown → emptyQueue)
  • Cause: HTTP 409 from cluster replication, rooted in:
Caused by: java.io.IOException: RST_STREAM received Stream cancelled
    at java.net.http/jdk.internal.net.http.HttpClientImpl.send(HttpClientImpl.java:938)
    at java.net.http/jdk.internal.net.http.HttpClientFacade.send(HttpClientFacade.java:133)
    at org.apache.nifi.web.client.StandardWebClientService$StandardHttpRequestBodySpec.getResponse(StandardWebClientService.java:381)

ubuntu-24.04 Java 25 (job)

  • 3 failures, all with the same RST_STREAM received Stream cancelled root cause: LoadBalanceIT.testPartitionByAttribute, ClusteredRegistryClientIT.testChangeVersionOnParentThatCascadesToChild (teardown), OffloadContentClaimTruncationIT.testOffloadedFlowFileContentNotPrematurelyTruncated. The 100 → 20 batch-size reduction in this PR lowers the rate but doesn't fully eliminate the underlying HTTP/2 issue.

macos-15 Java 21 (job)

  • 1 failure: FlowSynchronizationIT.testReconnectionWithUpdatedConnection. Different symptom (Timed out waiting for queue to empty in teardown), unrelated to LoadBalanceIT or this PR. The only recent change to that test file is NIFI-15844 ("Add logging to help troubleshoot flaky system tests").

Have triggered gh run rerun 27299783298 --failed to re-run only the failed shards.

@markap14

Contributor Author

[claude-opus-4.7] Rerun attempt 2 of the system-tests run also failed on ubuntu-24.04 Java 21 and ubuntu-24.04 Java 25 (macos-15 Java 21 is still in progress). Different tests fail on each attempt, which is the classic flake signature, but LoadBalanceIT.testPartitionByAttribute has now failed in both attempts on Java 25.

ubuntu-24.04 Java 21 attempt 2 (job)

  • ClusteredStatelessFlowIT.testUpdateParameterReferencedByStatelessFlow → listQueue HTTP 409 / 500 (same RST_STREAM family)
  • OffloadContentClaimTruncationIT.testOffloadedFlowFileContentNotPrematurelyTruncated — same symptom

ubuntu-24.04 Java 25 attempt 2 (job)

  • LoadBalanceIT.testPartitionByAttribute — this time the 409 is on DELETE /listing-requests/{id}, with message Node localhost:5671 is currently connecting. The test runs immediately after LoadBalanceIT.testRoundRobinWithRestartAndPortChange, which restarts a node. Because NiFiInstanceCache reuses the cluster between tests in the same class, when this test runs after the restart test the cluster may still be reconnecting. That is a separate issue from the RST_STREAM flake.
  • LoadBalanceIT.testRoundRobinWithRestartAndPortChange → emptyQueue teardown timed out after 132s, downstream of the same node restart.
  • ClusteredReplayProvenanceIT.testReplayLastEvent[1] PRIMARY → emptyQueue 409 / 500.

The non-system-test jobs (Windows FR, Scan, Ubuntu integration-tests, CodeQL, Corretto EN, macOS JP, macos-15 Java 25 attempt 2) are all green.

The reduction from 100 → 20 FlowFiles in testPartitionByAttribute materially lowers the rate of the RST_STREAM failure but is clearly not sufficient on the GitHub Actions ubuntu-24.04 runners. Recommending we hold off on additional reruns and decide on a direction. Three options I see:

  1. Accept that this PR is a partial mitigation and merge anyway (still strictly better than main, where 8/10 recent system-tests runs have failed with the same family of errors).
  2. Push a follow-up change that further reduces load in testPartitionByAttribute (smaller batch, fewer distinct attribute values) and/or addresses the testRoundRobinWithRestartAndPortChange → testPartitionByAttribute ordering by waiting for the cluster to be fully reconnected before testPartitionByAttribute proceeds.
  3. Pursue a real fix at the framework layer for the RST_STREAM on cluster replication (the original goal earlier in this investigation), separate from this PR.
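
For option 2, the "wait for the cluster to be fully reconnected" step could be a plain poll-with-timeout. A minimal sketch, assuming a hypothetical WaitForSketch helper (not an existing NiFi class); the caller would supply whatever "all nodes connected" check the test harness already exposes:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only; names are not taken from the NiFi code base.
final class WaitForSketch {

    // Polls the condition until it holds or the timeout elapses.
    // Returns true if the condition became true, false on timeout or interruption.
    static boolean waitFor(final BooleanSupplier condition, final Duration timeout, final Duration interval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (final InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

testPartitionByAttribute would call this before queuing any FlowFiles and fail fast if it returns false, rather than surfacing a 409 from a node that is still connecting.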

@markap14 please advise — I will pause aggressive polling and switch to once-per-hour until you weigh in.

@markap14 markap14 marked this pull request as draft June 11, 2026 13:15
@markap14

Contributor Author

Experimental commit: temporarily revert Jetty 12.1.10 → 12.1.9

Pushed 451d60b to test the hypothesis that the recent system-tests flakes are a server-side HTTP/2 regression introduced by the Jetty 12.1.9 → 12.1.10 bump in NIFI-15993 (2026-06-03), not by anything in this PR.

Why I think Jetty 12.1.10 is the trigger

  • The failures all surface as the JDK java.net.http.HttpClient receiving an HTTP/2 RST_STREAM with code CANCEL from the in-JVM Jetty server during the request body upload of a replicated cluster request:
    java.io.IOException: RST_STREAM received Stream cancelled
        at java.net.http/jdk.internal.net.http.Stream.incompleteRequestBodyReset(Stream.java:730)
        at java.net.http/jdk.internal.net.http.Stream.incoming_reset(Stream.java:712)
    
  • I counted status codes in nifi-request.log for the failing test: zero HTTP 421 responses on either node, which rules out ProxyHeaderValidatorCustomizer / HostPortValidatorCustomizer as the source of the reset.
  • Correlation with main-branch system-tests workflow history:
    • 2026-06-03 13:05 UTC: main run SUCCESS (last green run).
    • 2026-06-03 21:48 UTC: f5b9c13 NIFI-15993 bumps Jetty 12.1.9 → 12.1.10 (and several other unrelated deps).
    • 2026-06-09 02:21 UTC: main run FAILURE — first system-tests run on main after the Jetty bump.
    • Every system-tests workflow run on main since then has failed.
  • Jetty 12.1.10's notable HTTP/2 changes per the release notes include #15009 "Make processing of RST_STREAM more lenient" and #15161 "Reduce memory footprint for persistent HttpConnections", both of which touch HTTP/2 stream/connection lifecycle.

What this commit is and isn't

This is not intended to be merged as-is. The PR remains a Draft. If this commit's CI run shows the flakes disappear, we will:

  1. File an upstream Jetty bug with a minimal repro.
  2. Decide whether to pin Jetty to 12.1.9 in main until it's fixed, or wait for a 12.1.11 with a targeted fix.

If the flakes persist on 12.1.9, the hypothesis is wrong and we keep digging.

The existing LoadBalanceIT batch-size reduction in this PR stays in place as defense in depth either way.

@markap14

Contributor Author

Hypothesis confirmed: Jetty 12.1.10 introduced the HTTP/2 RST_STREAM regression

Workflow run 27353150066 with commit 451d60b (Jetty pinned back to 12.1.9):

Job                  | Result | Notes
ubuntu-24.04 Java 21 | PASS   | All system tests green.
macos-15 Java 21     | PASS   | All system tests green.
ubuntu-24.04 Java 25 | FAIL   | One test failed, but not the RST_STREAM pattern (see below).
macos-15 Java 25     | FAIL   | Two tests failed, neither is RST_STREAM (see below).

Primary finding

rg -l "RST_STREAM" /tmp/nifi-pr11325-jetty129-logs/ returns zero hits across both Java 25 troubleshooting archives, and the previously affected tests (LoadBalanceIT, ClusteredStatelessFlowIT, ClusteredRegistryClientIT, OffloadContentClaimTruncationIT, FlowSynchronizationIT) all passed on every OS/JDK combination. The RST_STREAM Stream cancelled / incompleteRequestBodyReset failure mode that was reproducing on every recent main run is gone on Jetty 12.1.9.
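
For anyone without ripgrep on hand, the same check can be approximated in plain Java. This is only a sketch; LogScanSketch and its method name are made up for illustration, and it does a simple substring match rather than a regex search:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical utility for illustration; roughly mirrors `rg -l <needle> <root>`.
final class LogScanSketch {

    // Returns every regular file under root that has at least one line containing needle.
    static List<Path> filesContaining(final Path root, final String needle) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(file -> contains(file, needle))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static boolean contains(final Path file, final String needle) {
        // ISO-8859-1 maps every byte to a character, so stray binary content cannot break decoding.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.ISO_8859_1)) {
            return lines.anyMatch(line -> line.contains(needle));
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An empty result from filesContaining(logDir, "RST_STREAM") corresponds to the "zero hits" outcome reported above.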

Remaining Java 25 failures (not Jetty)

These look like pre-existing flakes in different code paths:

  • ClusteredConnectorDrainIT.testDrainWithNodeCompletingAtDifferentTimes (ubuntu-24.04 Java 25): the test's @BeforeEach waitForAllNodesConnected timed out after 60s with node-2 still DISCONNECTED. Cluster join, not replication.
  • OffloadIT.testOffload (macos-15 Java 25): TimeoutException: testOffload() timed out after 10 minutes. Test hang.
  • ClusteredReplayProvenanceIT[2].testReplayLastEvent (macos-15 Java 25): AssertionFailedError: expected: <2> but was: <1>. Looks like a real test/assertion issue.

None of these involve HTTP/2 RST_STREAM or cluster replication failures.

Proposed next step

Suggest we proceed in this order:

  1. Pin <jetty.version> to 12.1.9 in main (separate PR) as the immediate fix for the RST_STREAM regression, with a TODO/comment referencing the upstream Jetty bug.
  2. File an upstream Jetty bug with a minimal repro (HTTP/2 client+server in same JVM over loopback, mTLS, many short POSTs ⇒ RST_STREAM(CANCEL) during request body upload). I can put that together.
  3. Treat the three remaining Java 25 flakes as separate Jiras and triage them independently — they were almost certainly there before, just masked by the all-red RST_STREAM noise.
  4. Once 12.1.9 is pinned in main, this PR's LoadBalanceIT batch-size reduction stands on its own as a defense-in-depth load reduction and can be reviewed/merged on its merits.

Waiting on direction before proceeding.

@markap14 markap14 force-pushed the NIFI-16011 branch 3 times, most recently from 43fe62e to 7539793 June 12, 2026 16:31
…reduce LoadBalanceIT request volume

Introduce a new nifi.properties setting,
nifi.cluster.node.protocol.http.version, that configures the HTTP
version that the cluster node web client prefers when replicating
requests to other nodes. Accepts HTTP_2 (the default) or HTTP_1_1;
invalid values log a warning and fall back to HTTP_2.

The setting is wired into FlowControllerConfiguration.webClientService()
so the cluster replication HttpClient honors the configured version.
The shipped conf/nifi.properties template defaults to HTTP_2 so
production traffic is unchanged. The system test resource templates
under nifi-system-test-suite/src/test/resources/conf/* set the value to
HTTP_1_1 to work around intermittent "RST_STREAM received Stream
cancelled" failures the JDK java.net.http.HttpClient produces when it
talks to Jetty 12.1.10 (jetty PR #15087 / issue #15009) under the heavy
disconnect / offload / restart patterns the system tests exercise. The
replicator cannot retry replicated POSTs, so when Jetty sends RST_STREAM
mid-stream the in-flight request is lost.

Also reduces the number of FlowFiles used in LoadBalanceIT from 100 to
20 so each test iterates over the queue with far fewer API calls,
reducing replication pressure and shortening the test.

Includes administration-guide documentation for the new property and
unit-test coverage in NiFiPropertiesTest for the default, override, and
blank-fallback cases.
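
As a rough illustration of the behavior this commit message describes (default to HTTP_2, accept HTTP_1_1, warn and fall back on invalid values), the resolution logic might look like the sketch below. ClusterHttpVersionSketch and its methods are hypothetical names, not the actual NiFiProperties or FlowControllerConfiguration code, and the warning goes to stderr here instead of a logger:

```java
import java.net.http.HttpClient;
import java.util.Properties;

// Hypothetical sketch; class and method names are illustrative, not NiFi's.
final class ClusterHttpVersionSketch {
    static final String PROPERTY = "nifi.cluster.node.protocol.http.version";
    static final HttpClient.Version DEFAULT_VERSION = HttpClient.Version.HTTP_2;

    // Missing or blank values use the default; unrecognized values warn and use the default.
    static HttpClient.Version resolve(final Properties properties) {
        final String configured = properties.getProperty(PROPERTY);
        if (configured == null || configured.isBlank()) {
            return DEFAULT_VERSION;
        }
        try {
            // Strict match: only the exact enum constant names are accepted.
            return HttpClient.Version.valueOf(configured.trim());
        } catch (final IllegalArgumentException e) {
            System.err.printf("Invalid value [%s] for %s; falling back to %s%n", configured, PROPERTY, DEFAULT_VERSION);
            return DEFAULT_VERSION;
        }
    }

    // Builds a JDK HttpClient that prefers the resolved version.
    static HttpClient newClient(final Properties properties) {
        return HttpClient.newBuilder().version(resolve(properties)).build();
    }
}
```

Note that HttpClient.Builder.version(...) sets a preference only: a client that prefers HTTP_2 can still end up speaking HTTP/1.1 if the server does not negotiate HTTP/2.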
@markap14 markap14 marked this pull request as ready for review June 12, 2026 16:34
@markap14

Contributor Author

[claude-opus-4.7] Pre-existing flake on Ubuntu Java 21: PutCouchbaseIT.testPutDocument timed out opening a Couchbase Testcontainers bucket.

  • Failing job: Ubuntu Java 21
  • Module: nifi-extension-bundles/nifi-couchbase-bundle/nifi-couchbase-processors — fully unrelated to anything touched by this PR (this PR only modifies nifi-properties, nifi-framework-core, nifi-resources, nifi-docs, and nifi-system-test-suite).
  • Error excerpt:
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 46.81 s <<< FAILURE! -- in org.apache.nifi.processors.couchbase.integration.PutCouchbaseIT
[ERROR] PutCouchbaseIT.testPutDocument -- Time elapsed: 4.373 s <<< ERROR!
com.couchbase.client.core.error.UnambiguousTimeoutException: GetRequest, Reason: TIMEOUT {"reason":"TIMEOUT","requestId":7,"requestType":"GetRequest","retried":2,"retryReasons":["BUCKET_OPEN_IN_PROGRESS"],"service":{"bucket":"test_bucket",...,"vbucket":837},"timeoutMs":2500,"timings":{"totalMicros":2545096}}

Confirmed pre-existing on main: the same PutCouchbaseIT.testPutDocument failed in the integration-tests run 27372057801 for the NIFI-15880 merge commit (2026-06-11), which is upstream of this PR.

Reran just the Ubuntu Java 21 shard of run 27429035459 rather than pushing an empty commit.

@kevdoran kevdoran left a comment

Contributor

Thanks for this @markap14! It will be great to have more reliable system tests!

The rationale for keeping HTTP_2 as the production default and using HTTP_1_1 only for system tests makes sense.

A few non-blocking suggestions:

1. Missing @return on the new getter. getClusterNodeProtocolHttpVersion() has a full Javadoc block but no @return tag. It won't fail checkstyle (allowMissingReturnTag=true), but it's inconsistent with the coding standard now that the method is documented:

/**
 * ...
 * @return the configured cluster node protocol HTTP version, or the default when the property is missing or blank
 */
public String getClusterNodeProtocolHttpVersion() {

2. Property scope is broader than the name suggests. The webClientService() bean this configures isn't only used for request replication — it's also injected into NarRestApiClient, AssetsRestApiClient, NiFiRestApiClient, and StandardUploadRequestReplicator. So nifi.cluster.node.protocol.http.version effectively governs the HTTP version for all framework intra-cluster HTTP clients. They're all node-to-node operations so the name is defensible, but it would be worth noting in the admin-guide entry that it applies to all framework cluster HTTP communication, not just request replication.

3. Consider being case-insensitive. HttpClient.Version.valueOf(configured) requires an exact match, so http_2 / http_1_1 would fall back to the default with nothing more than a logged warning. Upper-casing the trimmed value first would avoid that surprise:

return HttpClient.Version.valueOf(configured.toUpperCase(Locale.ROOT));

4. Track the stopgap. The Jira notes this isn't intended as a permanent solution to run system tests on HTTP/1.1, but nothing in the code points back to that intent. Could we add a short comment next to the nifi.cluster.node.protocol.http.version=HTTP_1_1 lines in the system-test configs (or reference a follow-up Jira for the upstream Jetty 12.1.10 RST_STREAM regression) so the override isn't forgotten once Jetty is fixed?
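
To make suggestion 3 concrete, here is a small self-contained sketch of the lenient parsing I have in mind. LenientHttpVersionSketch is a hypothetical name and the fallback parameter stands in for whatever default the real getter uses:

```java
import java.net.http.HttpClient;
import java.util.Locale;

// Hypothetical sketch of case-insensitive parsing; not the code in this PR.
final class LenientHttpVersionSketch {

    // Trims and upper-cases the configured value before matching the enum constant,
    // so "http_1_1" and " Http_2 " are accepted. Anything else yields the fallback.
    static HttpClient.Version parse(final String configured, final HttpClient.Version fallback) {
        if (configured == null || configured.isBlank()) {
            return fallback;
        }
        try {
            return HttpClient.Version.valueOf(configured.trim().toUpperCase(Locale.ROOT));
        } catch (final IllegalArgumentException e) {
            return fallback;
        }
    }
}
```

Locale.ROOT keeps the upper-casing independent of the JVM's default locale (the usual concern being the Turkish dotless i).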

…hort-term workaround and drop admin-guide entry

- Add a short comment above the nifi.cluster.node.protocol.http.version=HTTP_1_1 line in each
  nifi-system-test-suite conf template explaining that the override is a temporary workaround for the
  Jetty 12.1.10 HTTP/2 RST_STREAM regression and should be removed once Jetty is fixed.
- Remove the nifi.cluster.node.protocol.http.version entry from the administration guide. The property
  is not intended to be a long-term supported setting and should not be documented as such.
- Add the missing @return tag to NiFiProperties#getClusterNodeProtocolHttpVersion to complete its
  Javadoc.
@markap14

Contributor Author

[claude-opus-4.7] Thanks for the review, @kevdoran! Addressed in 3d0350401bf:

#1 — missing @return: added the @return tag on NiFiProperties#getClusterNodeProtocolHttpVersion.

#2 — admin-guide scope note / #4 — track the stopgap: taking a slightly different tack on these. Since this property is not intended as a long-term supported configuration knob (the goal is for it to disappear once the upstream Jetty 12.1.10 RST_STREAM regression is resolved), I removed the entry from the administration guide entirely rather than document it. To keep the trail visible to future maintainers, I added a comment block above the nifi.cluster.node.protocol.http.version=HTTP_1_1 line in each system-test conf/*/nifi.properties template explaining that the override is a short-term workaround for the Jetty 12.1.10 HTTP/2 stream-reset rewrite (jetty/jetty.project#15087) and should be removed once Jetty is fixed.

#3 — case-insensitive parsing: leaving the strict HttpClient.Version.valueOf(...) for now. Since this isn't a setting we want operators to be relying on, I think surfacing typos via the existing warn-and-fall-back-to-HTTP_2 path is the better behavior; happy to revisit if you feel strongly.

@kevdoran kevdoran merged commit d1c7bd2 into apache:main Jun 12, 2026
10 of 12 checks passed
2 participants