Skip to content

CAMEL-21438: Fix six flaky tests and missing aarch64 image for TensorFlow Serving#23927

Merged
davsclaus merged 6 commits into
apache:mainfrom
ammachado:CAMEL-21438
Jun 12, 2026
Merged

CAMEL-21438: Fix six flaky tests and missing aarch64 image for TensorFlow Serving#23927
davsclaus merged 6 commits into
apache:mainfrom
ammachado:CAMEL-21438

Conversation

@ammachado

@ammachado ammachado commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Description

Fix six test reliability issues:

1. DefaultConsumerBridgeErrorHandlerContinuedTest (flaky assertion)
The test asserted mock:result would receive zero messages when continued(true) is set on the error handler. In practice, continued(true) causes the exchange to continue through the main route after error handling, so mock:result does receive at least one message. Changed the expectation to expectedMinimumMessageCount(1).

2. SimpleLRUCacheTest (race condition after concurrent puts)
After the concurrent-put latch completed, the test immediately asserted cache size and eviction counter. The eviction listener callback can fire asynchronously, causing intermittent failures. Wrapped the post-latch assertions in Awaitility.await().atMost(10, TimeUnit.SECONDS).untilAsserted(...) so the assertions wait for the eviction to settle.

3. TensorFlow Serving test-infra: missing aarch64 image
The bitnami/tensorflow-serving:2.18.0 image on Docker Hub is not available for aarch64. Replaced it with mirror.gcr.io/bitnamilegacy/tensorflow-serving:2.19.1, which is available on the Google mirror registry for that architecture, consistent with the project's preference for mirror.gcr.io over Docker Hub.

4. FileSortByExpressionTest, FileSortByNestedExpressionTest, FileSortByIgnoreCaseExpressionTest (mock expectation race)
MockEndpoint.expectedBodiesReceived() must be called before routes start delivering messages. With initialDelay=0, the file consumer fires on a background thread as soon as addRoutes() returns. If messages arrive before expectedBodiesReceived() is called, expectedBodyValues is still null so performAssertions() never populates actualBodyValues, causing body order assertions to compare each expected value against null.

Fix: move getMockEndpoint() and expectedBodiesReceived() before context.addRoutes() in all three test classes. Also removed no-op context.start() calls that widened the race window. The @DisabledOnOs(ppc64le) guard on FileSortByIgnoreCaseExpressionTest is retained until ppc64le CI can confirm the fix is sufficient on that platform.

5. FileProducerCharsetUTFtoISOConvertBodyToTest (0-byte file race + cleanup masking)
Two issues remained after two prior fixes (CAMEL-18608):

  • @BeforeEach writeTestData() runs after the superclass setUp() starts the context and route. With initialDelay=0 the consumer polls every 10 ms. A poll landing between Files.newOutputStream() (creates a 0-byte file) and fos.close() picks up an empty file, producing 0-byte output and causing the size assertion to never be satisfied.
    Fix: write to a .tmp file first then atomically rename via Files.move(..., ATOMIC_MOVE). The consumer's fileName filter ignores the .tmp file and only ever sees the fully-written target.
  • @AfterEach used Files.delete() instead of Files.deleteIfExists(): if the test fails before the output file is created, cleanup also throws, masking the original failure with a misleading "cannot run due to an error cleaning up" message.
    Fix: switch to Files.deleteIfExists().

6. QueueProducerQoSTest (ANYCAST/MULTICAST mismatch + zero-timeout mock assertion)
Two issues caused non-deterministic failures:

  • The expiry route consumed from sjms:topic:ExpiryQueue but Artemis routes expired messages to an ANYCAST address. A JMS Topic consumer creates a MULTICAST subscription, which does not receive ANYCAST messages. Whether the test passed depended on which subscriber created the ExpiryQueue address first. Fixed by switching to sjms:queue:ExpiryQueue.
  • Both test methods called the static MockEndpoint.assertIsSatisfied(context) which delegates to each mock's own resultWaitTime (defaults to 0). With requestTimeout=500ms and timeToLive=1000ms, the expiry notification arrives after the assertion was already evaluated. Fixed by calling mockExpiredAdvisory.assertIsSatisfied(10_000) with an explicit 10-second timeout.

Target

  • I checked that the commit is targeting the correct branch (Camel 4 uses the main branch)

Tracking

  • If this is a large change, bug fix, or code improvement, I checked there is a JIRA issue filed for the change (usually before you start working on it).

Apache Camel coding standards and style

  • I checked that each commit in the pull request has a meaningful subject line and body.
  • I have run mvn clean install -DskipTests locally from root folder and I have committed all auto-generated changes.

Claude Code on behalf of Adriano Machado

ammachado and others added 3 commits June 10, 2026 23:46
…rrent puts

callEvictionIfNeeded() runs after OperationContext releases its read lock,
so the cache can still be oversized when the CountDownLatch fires. Replace
immediate post-latch assertions with Awaitility.await().untilAsserted() to
give in-flight evictions time to complete before checking size and counter.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
With continued(true) and bridgeErrorHandler=true, the bridge creates
a synthetic exchange, the onException route executes, and the exchange
then continues through the main route body (including mock:result).
The test incorrectly expected mock:result to receive 0 messages.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@ammachado ammachado marked this pull request as ready for review June 11, 2026 03:49
@github-actions

Copy link
Copy Markdown
Contributor

🌟 Thank you for your contribution to the Apache Camel project! 🌟
🤖 CI automation will test this PR automatically.

🐫 Apache Camel Committers, please review the following items:

  • First-time contributors require MANUAL approval for the GitHub Actions to run
  • You can use the command /component-test (camel-)component-name1 (camel-)component-name2.. to request a test from the test bot although they are normally detected and executed by CI.
  • You can label PRs using skip-tests and test-dependents to fine-tune the checks executed by this PR.
  • Build and test logs are available in the summary page. Only Apache Camel committers have access to the summary.

⚠️ Be careful when sharing logs. Review their contents before sharing them publicly.

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

🧪 CI tested the following changed modules:

  • components/camel-sjms
  • core/camel-core
  • test-infra/camel-test-infra-tensorflow-serving

⚠️ Some tests are disabled on GitHub Actions (@DisabledIfSystemProperty(named = "ci.env.name")) and require manual verification:

  • core/camel-core: 2 test(s) disabled on GitHub Actions
All tested modules (13 modules)
  • Camel :: AI :: TensorFlow Serving
  • Camel :: JBang :: MCP
  • Camel :: JBang :: Plugin :: Route Parser
  • Camel :: JBang :: Plugin :: TUI
  • Camel :: JBang :: Plugin :: Validate
  • Camel :: Launcher :: Container
  • Camel :: MLLP
  • Camel :: Simple JMS
  • Camel :: Simple JMS2
  • Camel :: Test Infra :: All test services
  • Camel :: Test Infra :: TensorFlow Serving
  • Camel :: YAML DSL :: Validator
  • Camel :: YAML DSL :: Validator Maven Plugin

⚙️ View full build and test results

@davsclaus davsclaus left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: CAMEL-21438 — Fix flaky tests and missing aarch64 image

CI is green. The test reliability fixes are well-motivated and follow project conventions. A few notes below.

Findings

1. PR description is incomplete (Minor)

The description mentions 3 changes, but the diff modifies 7 files. Four additional fixes are not covered:

  • FileProducerCharsetUTFtoISOConvertBodyToTest — atomic file write + deleteIfExists cleanup
  • FileSortByExpressionTest — move expectations before routes, remove hardcoded 1s timeout
  • FileSortByIgnoreCaseExpressionTest — remove redundant context.start(), fix initialDelay/delay, typo fix
  • FileSortByNestedExpressionTest — remove redundant context.start(), move expectations

Please update the PR description to cover all changes.

2. DefaultConsumerBridgeErrorHandlerContinuedTest — assertion reversal

This reverses the intent of the original CAMEL-22907 test (fe9abb35c3b7), which explicitly asserted mock:result should receive 0 messages. The change to expectedMinimumMessageCount(1) is consistent with how continued(true) should work — after onException processing, the exchange continues through the route to mock:result. The original expectedMessageCount(0) was a timing-dependent assertion that passed only when the mock check completed before the continued exchange arrived.

This looks correct to me, but since it reverses an intentional assertion from CAMEL-22907, it would be good to get confirmation from @gnodet that this interpretation is right.

3. bitnamilegacy image

The mirror.gcr.io/bitnamilegacy/tensorflow-serving:2.19.1 image uses a "legacy" namespace. Worth checking that it has ongoing maintenance for long-term reliability.

Verified (no issues)

  • SimpleLRUCacheTest — Awaitility usage is correct; eviction callbacks fire asynchronously. Follows the project's "no Thread.sleep()" convention.
  • File sort tests — Moving mock expectations before context.addRoutes() and removing redundant context.start() (context is auto-started by ContextTestSupport.setUp()) is the right pattern.
  • FileProducerCharsetUTFtoISOConvertBodyToTest — Atomic rename prevents file consumer from seeing a half-written file. Good defensive fix.
  • TensorFlow aarch64 image — Switching from Docker Hub to mirror.gcr.io follows project conventions.

Note: This review covers project conventions, test patterns, and behavioral correctness. It does not replace specialized tools such as CodeRabbit or SonarCloud.

This review was generated by an AI agent and may contain inaccuracies. Please verify all suggestions before applying.

Claude Code on behalf of Claus Ibsen

getMockEndpoint("mock:result").expectedMessageCount(0);
// With continued(true), the exchange continues through the main route after error handling,
// so mock:result should receive messages.
getMockEndpoint("mock:result").expectedMinimumMessageCount(1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reverses the original CAMEL-22907 assertion which expected 0 messages here, with the rationale that "the consumer throws before creating a valid exchange". With continued(true), the bridge error handler creates an exchange, onException processes it, and continued(true) allows it to proceed through the route to mock:result. The original expectedMessageCount(0) was a race condition — it passed only when the mock assertion completed before the continued exchange arrived.

The new assertion looks correct, but since it contradicts the original test's explicit intent, confirmation from the CAMEL-22907 author would be valuable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gnodet, you added this test in CAMEL-22907 with expectedMessageCount(0) and the comment "since the consumer throws before creating a valid exchange, mock:result won't receive messages".

We found it flaky: after the CAMEL-22907 fix, continued(true) causes the exchange to continue through the main route after onException handling, so mock:result does receive messages. The assertion was changed to expectedMinimumMessageCount(1).

Could you confirm whether this matches the intended post-fix behavior, or whether mock:result was expected to remain at zero for a reason we may have missed?

Claude Code on behalf of Adriano Machado

## ---------------------------------------------------------------------------
tensorflow.serving.container=mirror.gcr.io/tensorflow/serving:2.19.1
tensorflow.serving.container.aarch64=bitnami/tensorflow-serving:2.18.0
tensorflow.serving.container.aarch64=mirror.gcr.io/bitnamilegacy/tensorflow-serving:2.19.1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: the bitnamilegacy namespace suggests this image may be deprecated upstream. Worth verifying it has ongoing maintenance so this doesn't need another fix soon.

ammachado and others added 2 commits June 11, 2026 10:25
… addRoutes

MockEndpoint.expectedBodiesReceived() must be called before routes start
delivering messages. With initialDelay=0 the file consumer fires on a
background thread as soon as addRoutes() returns; if messages arrive before
expectedBodiesReceived() is called, expectedBodyValues is still null so
performAssertions() never populates actualBodyValues, causing the order
assertion to compare each expected body against null.

Fix: move getMockEndpoint() + expectedBodiesReceived() before addRoutes()
in all three FileSortBy test classes. Also remove the no-op context.start() calls that widened the race window.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Two remaining issues after two prior fixes:

1. Race between @beforeeach and the file consumer: writeTestData() runs
   after setUp() has already started the context and route. With
   initialDelay=0 the consumer polls every 10ms; a poll landing between
   Files.newOutputStream() (creates 0-byte file) and fos.close() picks up
   an empty file and produces 0-byte output, causing the size > 0
   assertion to never be satisfied.
   Fix: write to a .tmp file first, then atomically rename (POSIX rename
   syscall). The consumer's fileName filter ignores the .tmp file and only
   ever sees the fully-written target.

2. @AfterEach used Files.delete() instead of Files.deleteIfExists(): if
   the test fails before the output file is created, cleanup also throws,
   masking the original failure with "cannot run due to an error cleaning
   up". Fix: switch to Files.deleteIfExists().

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@ammachado ammachado changed the title CAMEL-21438: Fix flaky tests and missing aarch64 image for TensorFlow Serving CAMEL-21438: Fix five flaky tests and missing aarch64 image for TensorFlow Serving Jun 11, 2026
@ammachado

Copy link
Copy Markdown
Contributor Author

The mirror.gcr.io/bitnamilegacy/tensorflow-serving:2.19.1 image uses a "legacy" namespace. Worth checking that it has ongoing maintenance for long-term reliability.

Thanks for flagging this. The bitnamilegacy namespace is indeed deprecated: Bitnami restructured their container catalog in August 2025 and moved all versioned/pinned tags to bitnamilegacy with no further updates. The concern is valid.

However, there is currently no well-maintained aarch64 image for TensorFlow Serving from any trusted registry:

  • The official tensorflow/serving image is amd64-only (long-standing open issue, no arm64 support planned).
  • emacski/tensorflow-serving, the main community aarch64 build, is last published at 2.6.0 and is not actively maintained.
  • AWS Deep Learning Containers publish a 2.19 aarch64 image, but it is SageMaker-specific and not suitable here.

bitnamilegacy/tensorflow-serving:2.19.1 is frozen but functional. It is the only option that matches our required version on aarch64. We can revisit this when either the official TF Serving project adds aarch64 support or a community project catches up to 2.19+.

The remaining alternative is to apply the skipITs.aarch64 fix: add <skipITs.aarch64>true</skipITs.aarch64> to camel-tensorflow-serving/pom.xml and remove the bitnamilegacy line from container.properties. Please advice.

Claude Code on behalf of Adriano Machado

…cit mock timeout

Two issues caused non-deterministic failures:

1. The expiry route consumed from `sjms:topic:ExpiryQueue` but Artemis routes
   expired messages to an ANYCAST address. A JMS Topic consumer creates a
   MULTICAST subscription, which does not receive ANYCAST messages. Whether
   the test passed depended on which subscriber created the ExpiryQueue address
   first. Fixed by switching to `sjms:queue:ExpiryQueue`.

2. Both test methods called the static `MockEndpoint.assertIsSatisfied(context)`
   which delegates to each mock's own `resultWaitTime` (defaults to 0). With
   `requestTimeout=500ms` and `timeToLive=1000ms`, the expiry notification
   arrives after the assertion was already evaluated. Fixed by calling
   `mockExpiredAdvisory.assertIsSatisfied(10_000)` with an explicit 10-second
   timeout on the instance directly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@ammachado ammachado changed the title CAMEL-21438: Fix five flaky tests and missing aarch64 image for TensorFlow Serving CAMEL-21438: Fix six flaky tests and missing aarch64 image for TensorFlow Serving Jun 12, 2026
@davsclaus

Copy link
Copy Markdown
Contributor

The mirror.gcr.io/bitnamilegacy/tensorflow-serving:2.19.1 image uses a "legacy" namespace. Worth checking that it has ongoing maintenance for long-term reliability.

Thanks for flagging this. The bitnamilegacy namespace is indeed deprecated: Bitnami restructured their container catalog in August 2025 and moved all versioned/pinned tags to bitnamilegacy with no further updates. The concern is valid.

However, there is currently no well-maintained aarch64 image for TensorFlow Serving from any trusted registry:

  • The official tensorflow/serving image is amd64-only (long-standing open issue, no arm64 support planned).
  • emacski/tensorflow-serving, the main community aarch64 build, is last published at 2.6.0 and is not actively maintained.
  • AWS Deep Learning Containers publish a 2.19 aarch64 image, but it is SageMaker-specific and not suitable here.

bitnamilegacy/tensorflow-serving:2.19.1 is frozen but functional. It is the only option that matches our required version on aarch64. We can revisit this when either the official TF Serving project adds aarch64 support or a community project catches up to 2.19+.

The remaining alternative is to apply the skipITs.aarch64 fix: add <skipITs.aarch64>true</skipITs.aarch64> to camel-tensorflow-serving/pom.xml and remove the bitnamilegacy line from container.properties. Please advice.

Claude Code on behalf of Adriano Machado

Yeah though aarch64 is for mac developers to run on the laptop and not for production users. So it may be okay to allow them to test and validate tensorflow. Can we add a comment that about so we will not forget.

@davsclaus davsclaus merged commit 859fcf1 into apache:main Jun 12, 2026
3 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants