Fix/issue 212 upgrade checkpoint#6302
Conversation
b897c2e to
f9965e7
Compare
|
in CI we already have a pretty comprehensive test for this here https://github.com/feldera/feldera/blob/main/python/tests/runtime/test_checkpoint_runtime_upgrade.py#L511 is there anything this new test checks what isn't tested in test_checkpoint_runtime_upgrade? but overall the right direction, lets try to make sure we can exercise this path for qa workloads before a release |
Good point. I checked this against the existing test, and it covers a different recovery path. The existing test clears local storage before phase 2, so it validates the S3/MinIO restore path. This new test deliberately keeps local storage, so it validates local checkpoint detection and resume after a runtime upgrade. It also adds ARM64 coverage since the existing test is skipped there due to the Delta Lake dependency. I updated the docstring to make this clearer. |
f9965e7 to
5c67c72
Compare
makes sense! can we use the same sql from the s3-sync checkpoint test? it is carefully crafted to leverage every operator that has a custom checkpoint implementation to maximize coverage of checkpoint testing |
mythical-fred
left a comment
There was a problem hiding this comment.
Clean, additive test-only PR — fills a real gap: test_runtime_upgrade_round_trip skips @skip_on_arm64 because of the deltalake wheel and forces MinIO-pull recovery via clear_storage(). This new test has no native deps and exercises the local-disk recovery path. Good rationale, well-documented in the docstring.
A few soft notes, none blocking:
-
Manual test plan is thin. The PR description says "i verified that the test file is picked up by pytest. full execution needs the enterprise runtime ci/staging environment." That means the test has not actually been run end-to-end against an enterprise runtime locally — only collected. Please run it through staging CI once before merge and confirm phase 2 actually resumes from the on-disk checkpoint, otherwise we may merge a green-on-collection / red-in-reality test.
-
Recovery isn't validated explicitly. Phase 2 asserts
result_after == [{"c": 5}], which proves the rows survived. But if the manager ever changes default-resume behavior so a missingstart_from_checkpoint="latest"skips the local checkpoint and replays from empty input, the assertion would still hold by accident on this dataset (no Delta input is wired in phase 2 — there's no replay source either, so 0 is the only alternative, granted). Worth a comment in the test saying "an absent local checkpoint would producec == 0, which is why the assertion is sufficient" so the intent is clear to future readers. -
No
wait_for_idle()before the phase 2 query. The companion test useswait_for_token(...)after start to ensure ingest is drained. Phase 2 here has no input, so technically nothing to drain, but acurrent_pipeline.wait_for_idle()betweenstart()andquery()is a cheap belt-and-braces against a future change where bootstrap completion lagsstart(). -
legacy_builder.runtime_version = legacy_versionafter constructor is a code smell copied verbatim from the round-trip test (the constructor letsFELDERA_RUNTIME_VERSIONwin). Not introduced here, but it would be nice to file a tracking issue againstPipelineBuilderso the override is explicit rather than mutate-after-init. (Same point I flagged on #6185.)
LGTM modulo (1) — please confirm a staging run passes.
5c67c72 to
7e65792
Compare
|
Thanks, addressed the soft notes. |
7e65792 to
260a5a1
Compare
mythical-fred
left a comment
There was a problem hiding this comment.
Re-approve on 260a5a1. The SQL now matches _build_sql's operator families (DISTINCT, GROUP BY, binary/three-way JOIN, recursive closure via pair_seed), current_pipeline.wait_for_idle() is in place before the phase-2 query, and the docstring spells out exactly why local-checkpoint recovery is the path under test. Coverage is solid for an ARM64-runnable upgrade test that doesn't pull in deltalake.
The legacy_builder.runtime_version = legacy_version stomp after the constructor is now justified inline (FELDERA_RUNTIME_VERSION env wins in PipelineBuilder.__init__); fine to leave as-is in this PR, but the underlying PipelineBuilder API smell is still worth a tracking issue.
My only remaining ask is the original soft note #1: please confirm a green staging-CI run end-to-end before merge — "pytest picks it up" is not the same as "phase 2 actually restored from the on-disk checkpoint." If staging passes, this is ready to go.
fyi this is fixed on |
|
(when dropped please queue it) |
260a5a1 to
01bf599
Compare
|
@gz done, I removed the amd64-only restriction and pushed the update. |
mythical-fred
left a comment
There was a problem hiding this comment.
Re-APPROVE on 01bf5997 (Add runtime upgrade checkpoint validation test).
New test_upgrade_from_local_checkpoint complements the existing test_runtime_upgrade_round_trip by deliberately skipping clear_storage() between phases — forcing recovery from the manager's local-disk checkpoint rather than the MinIO/S3 sync path. Solid coverage delta: same operator families (DISTINCT, GROUP BY, binary + three-way JOIN, recursive view, MV passthrough) with a Delta connector intentionally avoided so the test runs on ARM64 too. The @skip_on_arm64 decorator is also dropped from the companion test now that delta-rs#4413 is fixed upstream.
The legacy_builder.runtime_version = legacy_version stomp after construction is well-commented and the right call given PipelineBuilder.__init__ lets FELDERA_RUNTIME_VERSION win over the constructor argument in CI; future work could expose a proper override flag on the builder, but that's out of scope here.
Assertions check row counts only (SELECT COUNT(*) FROM t) — proves checkpoint detection but not per-operator state correctness. For a deeper invariant, asserting per-view aggregates (e.g. v_grouped, v_three_way, closure row counts) would catch regressions that change distinct/group/recursive state shape but leave the source table intact. Soft nit; not blocking.
No blockers.
0040aea to
dd3f2fc
Compare
mythical-fred
left a comment
There was a problem hiding this comment.
Re-APPROVE. The switch from create_or_replace() to update_runtime() is a genuine fix — create_or_replace() calls clear_storage() on an existing pipeline, which would destroy the local checkpoint before the upgraded runtime can load it, defeating the test's purpose. Good catch.
dd3f2fc to
74fd154
Compare
Describe Manual Test Plan
added a runtime integration test for feldera-qa#212.
locally, i verified that the test file is picked up by pytest. full execution needs the enterprise runtime ci/staging environment because the test uses enterprise checkpoint support and runtime version routing.
Checklist
Breaking Changes?
Mark if you think the answer is yes for any of these components:
Describe Incompatible Changes
none.