Skip to content

[SQL] Support INTERSECT/EXCEPT ALL by rewriting into JOIN of inputs with ROW_NUMBER#6438

Open
mihaibudiu wants to merge 3 commits into
feldera:mainfrom
mihaibudiu:issue6436
Open

[SQL] Support INTERSECT/EXCEPT ALL by rewriting into JOIN of inputs with ROW_NUMBER#6438
mihaibudiu wants to merge 3 commits into
feldera:mainfrom
mihaibudiu:issue6436

Conversation

@mihaibudiu

@mihaibudiu mihaibudiu commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Fixes #6436
Fixes #5483

The first commit cleans up some tests.
The second commit implements INTERSECT ALL.
The third commit implementsEXCEPT ALL.

INTERSECT ALL is eliminated using a Calcite rewrite rule, and converted into a window query for each input and a join tree.
For two inputs A and B with columns c0, c1, ..., cn the transformation produces:

SELECT a.c0, a.c1, ..., a.cn
FROM (SELECT c0,...,cn,
             ROW_NUMBER() OVER (PARTITION BY c0,...,cn ORDER BY c0,...,cn) AS $rn
      FROM A) AS a
JOIN (SELECT c0,...,cn,
             ROW_NUMBER() OVER (PARTITION BY c0,...,cn ORDER BY c0,...,cn) AS $rn
      FROM B) AS b
ON  a.c0 IS NOT DISTINCT FROM b.c0
AND ...
AND a.cn IS NOT DISTINCT FROM b.cn
AND a.$rn = b.$rn

The algorithm for EXCEPT ALL is very similar, just using a left join.

Describe Manual Test Plan

Ran all the new Java tests.

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated

@mihaibudiu mihaibudiu force-pushed the issue6436 branch 2 times, most recently from 3a3cdae to 63a50ba Compare June 10, 2026 02:54
@mihaibudiu mihaibudiu changed the title [SQL] Support INSERSECT ALL by rewriting into JOIN of inputs with ROW_NUMBER [SQL] Support INTERSECT ALL by rewriting into JOIN of inputs with ROW_NUMBER Jun 10, 2026

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues — one real bug, one prose nit. The rewrite rule itself is clean and well-documented; good test coverage including NULL semantics and multi-way INTERSECT ALL.

Also: commit subject says "INSERSECT" — worth fixing with git rebase -i before merge.

Comment thread docs.feldera.com/docs/sql/unsupported-operations.md Outdated

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mihaibudiu mihaibudiu changed the title [SQL] Support INTERSECT ALL by rewriting into JOIN of inputs with ROW_NUMBER [SQL] Support INTERSECT/EXCEPT ALL by rewriting into JOIN of inputs with ROW_NUMBER Jun 14, 2026

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review for EXCEPT ALL scope expansion (prior approval was at 26fa626 for INTERSECT ALL only).

Verdict: LGTM

The EXCEPT ALL implementation mirrors the (already-reviewed) INTERSECT ALL one and is correct.

Correctness notes

  • ExceptAllToJoinRule: LEFT JOIN A vs B on IS NOT DISTINCT FROM over data cols + $rn, filtered by b.$rn IS NULL. A row (v, k) from A survives iff B has no row (v, k), i.e. iff k > count_B(v). Surviving k's are count_B(v)+1 .. count_A(v), giving exactly max(0, count_A(v) - count_B(v)) copies. Multiset-difference semantics. ✓
  • Field-offset arithmetic checks out: left $rn at position fieldCount, right $rn at 2*fieldCount + 1. ✓
  • Nice reuse of IntersectAllToJoinRule.augmentWithRowNumber from ExceptAllToJoinRule. ✓
  • ExceptAllFoldRule (n>2 → binary) is defensive: only fires if Calcite produced an n-ary Minus upstream (e.g. via MinusMergeRule). Harmless if it never matches; the parser-emitted left-associative nested binaries are handled by ExceptAllToJoinRule directly.
  • CalciteToDBSPCompiler.visitMinus/visitIntersect correctly converted from UnimplementedException to InternalCompilerError — good fail-loud if the rule didn't fire.

Test coverage gap (non-blocking, please consider)

SetOpTests.java has a thorough testIntersectAll* series (basic, minCounts, empty, null, nullBothSides, multiColumn, threeWay) but no symmetric testExceptAll* series. EXCEPT ALL is only covered by two inline cases in calciteExcept (basic + nulls). Worth adding tests for:

  • empty left / empty right
  • multi-column rows
  • three-way left-associative A EXCEPT ALL B EXCEPT ALL C
  • count_B(v) > count_A(v) edge (result has 0 copies)
  • count_B(v) == count_A(v) edge (all consumed)

Not a blocker — calciteExcept plus the planner symmetry with INTERSECT ALL give reasonable confidence.

Minor

  • IS_NOT_DISTINCT_FROM on $rn is fine (ROW_NUMBER is non-null so it reduces to =); a plain = on $rn would be slightly clearer but won't change planning.

Approved.

// Right side contains all left values: result is empty.
this.qst("""
SELECT * FROM (VALUES (1), (2)) AS T(x)
EXCEPT

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: please add testExceptAll* tests symmetric to the testIntersectAll* set: empty inputs, multi-column, three-way (left-associative), and the count_B(v) >= count_A(v) boundary cases. Currently EXCEPT ALL is covered only by the two inline cases in calciteExcept, which is much thinner than the INTERSECT ALL coverage in this same file.

builder.push(inputs.get(0));

// UNION ALL inputs[1..n-1] into a single right-side relation.
builder.push(inputs.get(1));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: Calcite's SQL parser emits A EXCEPT ALL B EXCEPT ALL C as nested binary Minus nodes, so this rule only fires if something upstream (e.g. MinusMergeRule) merges them into an n-ary node. I don't see MinusMergeRule registered in the optimizer pipeline, so this rule may be dead code today. Harmless either way — leaving it in is reasonable defensive coding — but worth a one-line comment about the precondition, or a test that exercises an n-ary Minus to make sure the path is alive.

@mihaibudiu

Copy link
Copy Markdown
Contributor Author

Needs #6472 to be merged first

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-APPROVE on ea2ce79.

Since the prior review at af46f21:

  • Rebased onto 92f03be (EXCEPT/INTERSECT nullability fix merged into main). Rebase brings extra null-input filtering into CalciteToDBSPCompiler for n-ary Minus/Intersect, plus the corresponding filterNonNullFields signature refactor — orthogonal to this PR, already approved on its own merit.
  • New [SQL] Use stepWeightOne in more tests commit: purely mechanical replacement of step(... empty ...) with stepWeightOne(...) across 8 test files. Empty-expected blocks shrink from r | weight\n---- to r\n---. No behavioural change, no risk.
  • New SetOpTests cases pick up my prior nit asking for a symmetric testExceptAll* ladder via the checkSetOpNullTwoInts helper (16-combination NULL cross-product for both EXCEPT and INTERSECT) plus targeted testExceptMultiField, testExceptNullExcluded, testIntersectMultiField. Good coverage.

Carry forward: my earlier soft note about ExceptAllFoldRule possibly being dead code (no MinusMergeRule in the pipeline today) stands but is non-blocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[SQL] INSERSECT ALL is not supported [DBSP][SQL] EXCEPT ALL is not implemented

2 participants