Fix/issue 212 upgrade checkpoint by venkatamudili-fel · Pull Request #6302 · feldera/feldera

venkatamudili-fel · 2026-05-22T20:44:10Z

Describe Manual Test Plan

added a runtime integration test for feldera-qa#212.

locally, i verified that the test file is picked up by pytest. full execution needs the enterprise runtime ci/staging environment because the test uses enterprise checkpoint support and runtime version routing.

Checklist

Unit tests added/updated
Integration tests added/updated
Documentation updated
Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

OpenAPI / REST HTTP API / feldera-types / manager (What is a breaking change?)
Feldera SQL (Syntax, Semantics)
feldera-sqllib (incl. dependencies fxp, etc.) (What is a breaking change?)
Python SDK (What is a breaking change?)
fda (CLI arguments)
Adapters (including configuration)
Storage Format / Checkpoints
Others (specify)

Describe Incompatible Changes

none.

gz · 2026-05-22T20:57:50Z

in CI we already have a pretty comprehensive test for this here https://github.com/feldera/feldera/blob/main/python/tests/runtime/test_checkpoint_runtime_upgrade.py#L511

is there anything this new test checks what isn't tested in test_checkpoint_runtime_upgrade?

but overall the right direction, lets try to make sure we can exercise this path for qa workloads before a release

venkatamudili-fel · 2026-05-22T21:09:13Z

in CI we already have a pretty comprehensive test for this here https://github.com/feldera/feldera/blob/main/python/tests/runtime/test_checkpoint_runtime_upgrade.py#L511

is there anything this new test checks what isn't tested in test_checkpoint_runtime_upgrade?

but overall the right direction, lets try to make sure we can exercise this path for qa workloads before a release

Good point. I checked this against the existing test, and it covers a different recovery path.

The existing test clears local storage before phase 2, so it validates the S3/MinIO restore path.

This new test deliberately keeps local storage, so it validates local checkpoint detection and resume after a runtime upgrade.

It also adds ARM64 coverage since the existing test is skipped there due to the Delta Lake dependency. I updated the docstring to make this clearer.

gz · 2026-05-22T21:25:45Z

The existing test clears local storage before phase 2, so it validates the S3/MinIO restore path.

makes sense! can we use the same sql from the s3-sync checkpoint test? it is carefully crafted to leverage every operator that has a custom checkpoint implementation to maximize coverage of checkpoint testing

mythical-fred

Clean, additive test-only PR — fills a real gap: test_runtime_upgrade_round_trip skips @skip_on_arm64 because of the deltalake wheel and forces MinIO-pull recovery via clear_storage(). This new test has no native deps and exercises the local-disk recovery path. Good rationale, well-documented in the docstring.

A few soft notes, none blocking:

Manual test plan is thin. The PR description says "i verified that the test file is picked up by pytest. full execution needs the enterprise runtime ci/staging environment." That means the test has not actually been run end-to-end against an enterprise runtime locally — only collected. Please run it through staging CI once before merge and confirm phase 2 actually resumes from the on-disk checkpoint, otherwise we may merge a green-on-collection / red-in-reality test.
Recovery isn't validated explicitly. Phase 2 asserts result_after == [{"c": 5}], which proves the rows survived. But if the manager ever changes default-resume behavior so a missing start_from_checkpoint="latest" skips the local checkpoint and replays from empty input, the assertion would still hold by accident on this dataset (no Delta input is wired in phase 2 — there's no replay source either, so 0 is the only alternative, granted). Worth a comment in the test saying "an absent local checkpoint would produce c == 0, which is why the assertion is sufficient" so the intent is clear to future readers.
No wait_for_idle() before the phase 2 query. The companion test uses wait_for_token(...) after start to ensure ingest is drained. Phase 2 here has no input, so technically nothing to drain, but a current_pipeline.wait_for_idle() between start() and query() is a cheap belt-and-braces against a future change where bootstrap completion lags start().
legacy_builder.runtime_version = legacy_version after constructor is a code smell copied verbatim from the round-trip test (the constructor lets FELDERA_RUNTIME_VERSION win). Not introduced here, but it would be nice to file a tracking issue against PipelineBuilder so the override is explicit rather than mutate-after-init. (Same point I flagged on #6185.)

LGTM modulo (1) — please confirm a staging run passes.

venkatamudili-fel · 2026-05-24T21:26:31Z

Thanks, addressed the soft notes.
I updated the test to use the same checkpoint-heavy operator families as the existing upgrade test where possible, while still avoiding Delta/MinIO so it can run on ARM64.
I also added wait_for_idle() before the phase 2 query, made the local checkpoint recovery expectation explicit in the comment, and clarified why the runtime_version override is still needed here.

mythical-fred

Re-approve on 260a5a1. The SQL now matches _build_sql's operator families (DISTINCT, GROUP BY, binary/three-way JOIN, recursive closure via pair_seed), current_pipeline.wait_for_idle() is in place before the phase-2 query, and the docstring spells out exactly why local-checkpoint recovery is the path under test. Coverage is solid for an ARM64-runnable upgrade test that doesn't pull in deltalake.

The legacy_builder.runtime_version = legacy_version stomp after the constructor is now justified inline (FELDERA_RUNTIME_VERSION env wins in PipelineBuilder.__init__); fine to leave as-is in this PR, but the underlying PipelineBuilder API smell is still worth a tracking issue.

My only remaining ask is the original soft note #1: please confirm a green staging-CI run end-to-end before merge — "pytest picks it up" is not the same as "phase 2 actually restored from the on-disk checkpoint." If staging passes, this is ready to go.

gz · 2026-05-27T16:13:39Z

while still avoiding Delta/MinIO so it can run on ARM64

fyi this is fixed on main you can now drop the restriction that only runs on amd64

gz · 2026-05-27T16:16:17Z

(when dropped please queue it)

venkatamudili-fel · 2026-05-27T17:06:29Z

@gz done, I removed the amd64-only restriction and pushed the update.

mythical-fred

Re-APPROVE on 01bf5997 (Add runtime upgrade checkpoint validation test).

New test_upgrade_from_local_checkpoint complements the existing test_runtime_upgrade_round_trip by deliberately skipping clear_storage() between phases — forcing recovery from the manager's local-disk checkpoint rather than the MinIO/S3 sync path. Solid coverage delta: same operator families (DISTINCT, GROUP BY, binary + three-way JOIN, recursive view, MV passthrough) with a Delta connector intentionally avoided so the test runs on ARM64 too. The @skip_on_arm64 decorator is also dropped from the companion test now that delta-rs#4413 is fixed upstream.

The legacy_builder.runtime_version = legacy_version stomp after construction is well-commented and the right call given PipelineBuilder.__init__ lets FELDERA_RUNTIME_VERSION win over the constructor argument in CI; future work could expose a proper override flag on the builder, but that's out of scope here.

Assertions check row counts only (SELECT COUNT(*) FROM t) — proves checkpoint detection but not per-operator state correctness. For a deeper invariant, asserting per-view aggregates (e.g. v_grouped, v_three_way, closure row counts) would catch regressions that change distinct/group/recursive state shape but leave the source table intact. Soft nit; not blocking.

No blockers.

mythical-fred

Re-APPROVE. The switch from create_or_replace() to update_runtime() is a genuine fix — create_or_replace() calls clear_storage() on an existing pipeline, which would destroy the local checkpoint before the upgraded runtime can load it, defeating the test's purpose. Good catch.

venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from b897c2e to f9965e7 Compare May 22, 2026 20:56

venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from f9965e7 to 5c67c72 Compare May 22, 2026 21:17

mythical-fred approved these changes May 22, 2026

View reviewed changes

venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from 5c67c72 to 7e65792 Compare May 24, 2026 21:25

venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from 7e65792 to 260a5a1 Compare May 24, 2026 21:30

mythical-fred approved these changes May 24, 2026

View reviewed changes

gz approved these changes May 27, 2026

View reviewed changes

venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from 260a5a1 to 01bf599 Compare May 27, 2026 17:05

mythical-fred approved these changes May 27, 2026

View reviewed changes

venkatamudili-fel added this pull request to the merge queue May 29, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026

venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch 2 times, most recently from 0040aea to dd3f2fc Compare June 8, 2026 06:37

mythical-fred approved these changes Jun 14, 2026

View reviewed changes

Add runtime upgrade checkpoint validation test

74fd154

venkatamudili-fel force-pushed the fix/issue-212-upgrade-checkpoint branch from dd3f2fc to 74fd154 Compare June 15, 2026 05:34

venkatamudili-fel added this pull request to the merge queue Jun 15, 2026

Merged via the queue into main with commit e7b30ba Jun 15, 2026
1 check passed

venkatamudili-fel deleted the fix/issue-212-upgrade-checkpoint branch June 15, 2026 07:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/issue 212 upgrade checkpoint#6302

Fix/issue 212 upgrade checkpoint#6302
venkatamudili-fel merged 1 commit into
mainfrom
fix/issue-212-upgrade-checkpoint

venkatamudili-fel commented May 22, 2026 •

edited

Loading

Uh oh!

gz commented May 22, 2026 •

edited

Loading

Uh oh!

venkatamudili-fel commented May 22, 2026 •

edited

Loading

Uh oh!

gz commented May 22, 2026 •

edited

Loading

Uh oh!

mythical-fred left a comment

Uh oh!

venkatamudili-fel commented May 24, 2026

Uh oh!

mythical-fred left a comment

Uh oh!

gz commented May 27, 2026

Uh oh!

gz commented May 27, 2026 •

edited

Loading

Uh oh!

venkatamudili-fel commented May 27, 2026

Uh oh!

mythical-fred left a comment

Uh oh!

Uh oh!

mythical-fred left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

venkatamudili-fel commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe Manual Test Plan

Checklist

Breaking Changes?

Describe Incompatible Changes

Uh oh!

gz commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

venkatamudili-fel commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gz commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

venkatamudili-fel commented May 24, 2026

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

gz commented May 27, 2026

Uh oh!

gz commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

venkatamudili-fel commented May 27, 2026

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

venkatamudili-fel commented May 22, 2026 •

edited

Loading

gz commented May 22, 2026 •

edited

Loading

venkatamudili-fel commented May 22, 2026 •

edited

Loading

gz commented May 22, 2026 •

edited

Loading

gz commented May 27, 2026 •

edited

Loading