Skip to content

Fix/issue 212 upgrade checkpoint#6302

Merged
venkatamudili-fel merged 1 commit into
mainfrom
fix/issue-212-upgrade-checkpoint
Jun 15, 2026
Merged

Fix/issue 212 upgrade checkpoint#6302
venkatamudili-fel merged 1 commit into
mainfrom
fix/issue-212-upgrade-checkpoint

Conversation

@venkatamudili-fel

@venkatamudili-fel venkatamudili-fel commented May 22, 2026

Copy link
Copy Markdown

Describe Manual Test Plan

added a runtime integration test for feldera-qa#212.

locally, i verified that the test file is picked up by pytest. full execution needs the enterprise runtime ci/staging environment because the test uses enterprise checkpoint support and runtime version routing.

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

Describe Incompatible Changes

none.

@venkatamudili-fel venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from b897c2e to f9965e7 Compare May 22, 2026 20:56
@gz

gz commented May 22, 2026

Copy link
Copy Markdown
Contributor

in CI we already have a pretty comprehensive test for this here https://github.com/feldera/feldera/blob/main/python/tests/runtime/test_checkpoint_runtime_upgrade.py#L511

is there anything this new test checks what isn't tested in test_checkpoint_runtime_upgrade?

but overall the right direction, lets try to make sure we can exercise this path for qa workloads before a release

@venkatamudili-fel

venkatamudili-fel commented May 22, 2026

Copy link
Copy Markdown
Author

in CI we already have a pretty comprehensive test for this here https://github.com/feldera/feldera/blob/main/python/tests/runtime/test_checkpoint_runtime_upgrade.py#L511

is there anything this new test checks what isn't tested in test_checkpoint_runtime_upgrade?

but overall the right direction, lets try to make sure we can exercise this path for qa workloads before a release

Good point. I checked this against the existing test, and it covers a different recovery path.

The existing test clears local storage before phase 2, so it validates the S3/MinIO restore path.

This new test deliberately keeps local storage, so it validates local checkpoint detection and resume after a runtime upgrade.

It also adds ARM64 coverage since the existing test is skipped there due to the Delta Lake dependency. I updated the docstring to make this clearer.

@venkatamudili-fel venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from f9965e7 to 5c67c72 Compare May 22, 2026 21:17
@gz

gz commented May 22, 2026

Copy link
Copy Markdown
Contributor

The existing test clears local storage before phase 2, so it validates the S3/MinIO restore path.

makes sense! can we use the same sql from the s3-sync checkpoint test? it is carefully crafted to leverage every operator that has a custom checkpoint implementation to maximize coverage of checkpoint testing

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, additive test-only PR — fills a real gap: test_runtime_upgrade_round_trip skips @skip_on_arm64 because of the deltalake wheel and forces MinIO-pull recovery via clear_storage(). This new test has no native deps and exercises the local-disk recovery path. Good rationale, well-documented in the docstring.

A few soft notes, none blocking:

  1. Manual test plan is thin. The PR description says "i verified that the test file is picked up by pytest. full execution needs the enterprise runtime ci/staging environment." That means the test has not actually been run end-to-end against an enterprise runtime locally — only collected. Please run it through staging CI once before merge and confirm phase 2 actually resumes from the on-disk checkpoint, otherwise we may merge a green-on-collection / red-in-reality test.

  2. Recovery isn't validated explicitly. Phase 2 asserts result_after == [{"c": 5}], which proves the rows survived. But if the manager ever changes default-resume behavior so a missing start_from_checkpoint="latest" skips the local checkpoint and replays from empty input, the assertion would still hold by accident on this dataset (no Delta input is wired in phase 2 — there's no replay source either, so 0 is the only alternative, granted). Worth a comment in the test saying "an absent local checkpoint would produce c == 0, which is why the assertion is sufficient" so the intent is clear to future readers.

  3. No wait_for_idle() before the phase 2 query. The companion test uses wait_for_token(...) after start to ensure ingest is drained. Phase 2 here has no input, so technically nothing to drain, but a current_pipeline.wait_for_idle() between start() and query() is a cheap belt-and-braces against a future change where bootstrap completion lags start().

  4. legacy_builder.runtime_version = legacy_version after constructor is a code smell copied verbatim from the round-trip test (the constructor lets FELDERA_RUNTIME_VERSION win). Not introduced here, but it would be nice to file a tracking issue against PipelineBuilder so the override is explicit rather than mutate-after-init. (Same point I flagged on #6185.)

LGTM modulo (1) — please confirm a staging run passes.

@venkatamudili-fel venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from 5c67c72 to 7e65792 Compare May 24, 2026 21:25
@venkatamudili-fel

Copy link
Copy Markdown
Author

Thanks, addressed the soft notes.
I updated the test to use the same checkpoint-heavy operator families as the existing upgrade test where possible, while still avoiding Delta/MinIO so it can run on ARM64.
I also added wait_for_idle() before the phase 2 query, made the local checkpoint recovery expectation explicit in the comment, and clarified why the runtime_version override is still needed here.

@venkatamudili-fel venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from 7e65792 to 260a5a1 Compare May 24, 2026 21:30

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approve on 260a5a1. The SQL now matches _build_sql's operator families (DISTINCT, GROUP BY, binary/three-way JOIN, recursive closure via pair_seed), current_pipeline.wait_for_idle() is in place before the phase-2 query, and the docstring spells out exactly why local-checkpoint recovery is the path under test. Coverage is solid for an ARM64-runnable upgrade test that doesn't pull in deltalake.

The legacy_builder.runtime_version = legacy_version stomp after the constructor is now justified inline (FELDERA_RUNTIME_VERSION env wins in PipelineBuilder.__init__); fine to leave as-is in this PR, but the underlying PipelineBuilder API smell is still worth a tracking issue.

My only remaining ask is the original soft note #1: please confirm a green staging-CI run end-to-end before merge — "pytest picks it up" is not the same as "phase 2 actually restored from the on-disk checkpoint." If staging passes, this is ready to go.

@gz

gz commented May 27, 2026

Copy link
Copy Markdown
Contributor

while still avoiding Delta/MinIO so it can run on ARM64

fyi this is fixed on main you can now drop the restriction that only runs on amd64

@gz

gz commented May 27, 2026

Copy link
Copy Markdown
Contributor

(when dropped please queue it)

@venkatamudili-fel venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from 260a5a1 to 01bf599 Compare May 27, 2026 17:05
@venkatamudili-fel

Copy link
Copy Markdown
Author

@gz done, I removed the amd64-only restriction and pushed the update.

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-APPROVE on 01bf5997 (Add runtime upgrade checkpoint validation test).

New test_upgrade_from_local_checkpoint complements the existing test_runtime_upgrade_round_trip by deliberately skipping clear_storage() between phases — forcing recovery from the manager's local-disk checkpoint rather than the MinIO/S3 sync path. Solid coverage delta: same operator families (DISTINCT, GROUP BY, binary + three-way JOIN, recursive view, MV passthrough) with a Delta connector intentionally avoided so the test runs on ARM64 too. The @skip_on_arm64 decorator is also dropped from the companion test now that delta-rs#4413 is fixed upstream.

The legacy_builder.runtime_version = legacy_version stomp after construction is well-commented and the right call given PipelineBuilder.__init__ lets FELDERA_RUNTIME_VERSION win over the constructor argument in CI; future work could expose a proper override flag on the builder, but that's out of scope here.

Assertions check row counts only (SELECT COUNT(*) FROM t) — proves checkpoint detection but not per-operator state correctness. For a deeper invariant, asserting per-view aggregates (e.g. v_grouped, v_three_way, closure row counts) would catch regressions that change distinct/group/recursive state shape but leave the source table intact. Soft nit; not blocking.

No blockers.

@venkatamudili-fel venkatamudili-fel added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@venkatamudili-fel venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch 2 times, most recently from 0040aea to dd3f2fc Compare June 8, 2026 06:37

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-APPROVE. The switch from create_or_replace() to update_runtime() is a genuine fix — create_or_replace() calls clear_storage() on an existing pipeline, which would destroy the local checkpoint before the upgraded runtime can load it, defeating the test's purpose. Good catch.

@venkatamudili-fel venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from dd3f2fc to 74fd154 Compare June 15, 2026 05:34
@venkatamudili-fel venkatamudili-fel added this pull request to the merge queue Jun 15, 2026
Merged via the queue into main with commit e7b30ba Jun 15, 2026
1 check passed
@venkatamudili-fel venkatamudili-fel deleted the fix/issue-212-upgrade-checkpoint branch June 15, 2026 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants