Skip to content

perf: syntax highlighting skips patterns that cannot match a line#20371

Open
h-east wants to merge 1 commit into
vim:masterfrom
h-east:syntax-leadbyte-prefilter
Open

perf: syntax highlighting skips patterns that cannot match a line#20371
h-east wants to merge 1 commit into
vim:masterfrom
h-east:syntax-leadbyte-prefilter

Conversation

@h-east
Copy link
Copy Markdown
Member

@h-east h-east commented May 29, 2026

Problem:  Every syntax pattern is run against every line, including lines
          that obviously cannot match it.
Solution: For each pattern derive the set of bytes that must occur in any
          match (a conservative analysis that bails on constructs it cannot
          model soundly), and skip the pattern on lines missing one of those
          required bytes.  Highlighting is unchanged; verify by toggling
          test_override('syn_prefilter').

@h-east
Copy link
Copy Markdown
Member Author

h-east commented May 29, 2026

Measurement results

Build: ./configure default CFLAGS (-g -O2), no profiling.

Method: force full-buffer syntax highlighting by calling synID() on every
line/column, measure wall time with reltime(), median of several runs.
Baseline = the same tree without the prefilter.

Full-buffer highlight:

file (filetype) lines baseline with prefilter change
big.c, C (concatenated) 99,192 ~6.70 s ~3.00 s ~55% faster (~2.2x)
src/evalfunc.c, C 12,919 ~0.92 s ~0.44 s ~52% faster (~2.1x)
netrw.vim, Vim script 9,717 ~5.8 s ~5.2 s ~11% faster

Mechanism: on a typical C buffer about 40% of regexp executions are patterns
that never match on that line (for example character/string-constant and
preprocessor patterns that are tried on every line). The prefilter removes
most of them; the share of regexp time spent on never-matching patterns drops
from ~40% to ~15% (measured with :syntime).

The gain depends on the filetype. For regexp-pattern-heavy syntaxes (C/C++)
with many never-matching patterns it is large. For keyword-heavy syntaxes
(Vim script), where much of the work is keyword lookup rather than regexp
matching, the overall gain is smaller, but regexp executions are still cut
substantially with identical highlighting (netrw.vim: 1,680,030 -> 1,024,966
regexp calls, about 39% fewer).

Correctness: byte-for-byte identical synID() output vs. the baseline across
C, C++, Vim script, Python, Ruby, Lua, JavaScript, shell, HTML and CSS
(millions of cells), including multibyte content, very long lines and a
reduced 'synmaxcol'.

Tests: test_syntax (including a new Test_syntax_lead_byte_prefilter regression
test), test_highlight, test_spell and test_textprop all pass.

Notes

  • Observable change: patterns skipped by the prefilter are no longer counted
    by :syntime (they genuinely do not run). Test_syntime was updated to
    highlight the whole buffer so the profiled patterns still appear.
  • The analyzer bails out (pattern is always tried, behaviour unchanged) on
    constructs it does not model: top-level "&", look-around "@", multi-line
    "_x", the case/magic flags "\c \C \v \V \m \M", "\Z", multibyte literals
    and groups nested deeper than the recursion cap.

@h-east
Copy link
Copy Markdown
Member Author

h-east commented May 29, 2026

Self-review: maintainability of the prefilter analyzer, and how I plan to address it

I want to flag the main maintenance concern with this change myself.

Concern. The lead-byte prefilter derives its first-byte / required-byte sets
by re-implementing a subset of Vim's regexp grammar in syn_compute_first_bytes()
and its helpers: magic-mode parsing, character classes, anchors, quantifiers,
groups and alternation. That means:

  • It has to be kept in sync if Vim's regexp syntax ever grows new constructs.
  • A missed or mis-modelled construct does not crash — it silently produces
    wrong highlighting by skipping a pattern that could actually match. That is
    the worst kind of failure mode for a maintainer.
  • This is not hypothetical: while developing this I hit four separate soundness
    pitfalls before the output matched byte-for-byte — top-level alternation
    (a\|b), look-around (\@<= / \@!), ignore-case (\c and :syn case
    semantics for the required-byte set), and inline flags (\c \v ...).
  • The scariest future case is a new \<char> escape — especially a zero-width
    or whole-pattern flag — being misclassified as a literal/consuming atom.

Mitigation I plan to add (commit on this branch). A differential test that
makes the optimization self-checking:

  • Add test_override('syn_prefilter', 1) to disable the prefilter at runtime
    (same mechanism as nfa_fail etc.).
  • Add a test that highlights several buffers (C, Vim script, Python, shell, and
    the tricky-construct cases) and asserts that the full-buffer synID() output
    is identical with the prefilter on (default) and off.

With that test in place, any future edit — or any new regexp construct the
analyzer fails to model — surfaces as a CI failure instead of silently wrong
highlighting. The optimization stays safe to maintain regardless of the
analyzer's complexity. The analyzer is already written to bail out (keep
trying the pattern, behaviour unchanged) on anything it does not model, so the
intended degradation is "no speed-up", never "wrong result"; the differential
test is what guarantees that property keeps holding.

Alternatives considered (not in this PR).

  • Tighten the analyzer so that any unrecognised \<char> escape bails instead
    of being treated as a literal, making it safe-by-default against future
    additions (small follow-up).
  • Longer term, derive the first-byte set from the compiled NFA start states
    instead of re-parsing the pattern text. That removes the grammar-tracking
    burden entirely (single source of truth = the regexp engine), at the cost of
    coupling to engine internals and handling the BT fallback. This would be a
    separate, larger change.

@h-east
Copy link
Copy Markdown
Member Author

h-east commented May 29, 2026

A differential test that makes the optimization self-checking:

Done.

@h-east

This comment was marked as outdated.

@h-east
Copy link
Copy Markdown
Member Author

h-east commented May 31, 2026

@chrisbra Do you have any concerns?

@chrisbra
Copy link
Copy Markdown
Member

no, this sounds like a very nice performance optimization. I'll have a closer look later, this is all a bit over my head right now :)

@h-east h-east force-pushed the syntax-leadbyte-prefilter branch 2 times, most recently from 64b37b3 to 32a75fe Compare June 1, 2026 02:13
@h-east
Copy link
Copy Markdown
Member Author

h-east commented Jun 1, 2026

Update: I've folded in the safe-by-default tightening I listed as a follow-up
in the self-review above. Unknown alphanumeric regexp escapes now bail (the
pattern is always tried) rather than being treated as ordinary atoms, so the
analyzer degrades to "no speed-up", never "wrong highlighting", even for regexp
constructs added in the future. Covered by the new
Test_syntax_prefilter_classes() (identical synID() with the prefilter on and
off).

Comment thread src/syntax.c Outdated
return p;
}
if (*q == '\\' && q[1] != NUL)
c1 = *++q;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong. It does not correctly handle \n \r \b \e inside collations and it will not handle [\s] as matching a backslash or s and I think it also does not handle \d<digit>, \u<digit>, \U<digit>, \x<digit> and \o<digit>. perhaps just bail out?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, thanks. Inside [...] the code treated "\x" as a literal x, which
is wrong for \t \e \r \b \n, the numeric \d \o \x \u \U forms, and "\s"
(= backslash-or-s) — the resulting first-byte set was too narrow and could skip
a matching line. As you suggested, it now bails on any backslash inside [...]
(so the pattern is always tried). Added regression cases ([\t]x, [\d65]x, [\s]x)
to Test_syntax_prefilter_classes that fail without the fix.

@h-east h-east force-pushed the syntax-leadbyte-prefilter branch from 32a75fe to a883733 Compare June 2, 2026 02:24
@dkearns
Copy link
Copy Markdown
Contributor

dkearns commented Jun 2, 2026

I think the first/required byte analysis should be performed by the BT engine, it already provides regstart and regmust. The new string parser does more than the BT optimisations but implementing this in the regex engine itself should afford a lot of opportunities to avoid bailing out as often as is being done here.

It's just a first impression but the potential to introduce subtle, long lasting, bugs seems high. However, the performance improvements look very nice and I'm unaware of any kittens ever being harmed by bad syntax highlighting.

@h-east
Copy link
Copy Markdown
Member Author

h-east commented Jun 2, 2026

@dkearns Thanks for the review!

One clarification: this isn't a "new string parser" that reimplements or
belongs in the regex engine. It does no regex matching and isn't an alternative
to (or optimization of) the BT/NFA engines.

It's a cheap pre-screen in the syntax layer that avoids calling the engine
when a pattern provably can't match the line. At definition time it analyzes
the pattern string once, only to derive a sound necessary condition (the set of
bytes a match may start with, and the set of bytes that must occur in the line).
At runtime it does no regex work: it builds the line's byte set once per line,
then rejects a pattern with a couple of bitset ops before entering the engine.
If it passes, the pattern goes to the engine exactly as before.

So the goal isn't a faster regexec — it's skipping the call entirely. Syntax
highlighting runs (patterns × lines) attempts, and the line byte set is built
once and shared across all patterns, so this amortization only works at the
syntax layer. regstart/regmust are a different, weaker thing (a single
start char and one literal substring, consulted inside vim_regexec()); for
the common case of keyword alternations like \<\(int\|long\|char\)\> they
give no help, while the prefilter can still skip lines containing none of
i,l,c.

Comment thread src/syntax.c
@dkearns
Copy link
Copy Markdown
Contributor

dkearns commented Jun 2, 2026

I understand, I'm suggesting the entire pre-screen pattern analysis is happening in the wrong place and an unnecessary duplication. It could be done during regex compilation and the results then exposed for use by the syntax layer. The patterns are all compiled at file load time anyway.

I missed it on my first pass but it looks you've said exactly the same thing yourself. :)

Longer term, derive the first-byte set from the compiled NFA start states
instead of re-parsing the pattern text. That removes the grammar-tracking
burden entirely (single source of truth = the regexp engine), at the cost of
coupling to engine internals and handling the BT fallback. This would be a
separate, larger change.

Why implement an intermediate solution?

@h-east h-east force-pushed the syntax-leadbyte-prefilter branch from a883733 to 04895db Compare June 2, 2026 19:04
@h-east
Copy link
Copy Markdown
Member Author

h-east commented Jun 2, 2026

Fair question, but my answer is the opposite of what that quote implies: I'm
not planning to do the engine-derived version, and this isn't a step toward it.

It wouldn't be "expose what the engine already has". regstart is a single
start char (NUL the moment there's an alternation) and regmust is a single
literal substring (BT only). The prefilter needs a set of possible first
bytes and a set of required bytes, so doing it in the engine means adding new
set-valued outputs to both the BT and the NFA compilers and keeping them in
agreement — a substantial change in execution-critical code shared by
search/substitute/match. That is a lot of work and risk for an optimization
whose whole job is to never change behaviour, and the "subtle, long-lasting
bugs" you rightly worried about would land squarely in the engines.

I don't think that cost is justified. This prefilter is deliberately shallow:
it lives entirely in the syntax layer, touches no engine code, bails on
anything it doesn't model (so the worst case is "no speed-up", never "wrong
result"), and is pinned by an on/off differential test requiring byte-for-byte
identical highlighting. Since your review I've made it bail more aggressively—
\{n,m} and all character classes are no longer modeled — which both fixes
the \{42,0} case you found and shrinks the grammar surface that could ever
drift. The guiding principle is now stated at the top of the analyzer:
correctness over coverage, widen-only, when in doubt bail.

So I'd retract that "longer term" note rather than act on it: single source of
truth would be tidier in the abstract, but it isn't worth touching two regex
engines for. A shallow, bail-happy, self-contained prefilter already gives the
measured speed-up, and that's a good place to stop. I've dropped the misleading
bullet from the description.

@h-east
Copy link
Copy Markdown
Member Author

h-east commented Jun 2, 2026

I've re-run the benchmarks. (First result: #20371 (comment))
The numbers for Part 1 have dropped compared to the initial benchmark, but this is because I changed the prefilter analyzer to be safe-by-default (bailing out to avoid regressions). Even so, it still maintains an overall 2x speed-up, so I believe this patch is well worth merging.


The "syntax highlighting performance trilogy" is now complete. This PR (#20371) is the first part; two more are lined up in my fork, to be submitted upstream after this one lands:

All three are pure speedups — the resulting highlighting is unchanged, and each one ships with a test_override() flag plus a test that asserts identical synID() output with the optimization on and off.

Benchmark

Measured on a single binary that contains all three, toggling each optimization via its test_override() flag, so "all off" reproduces the current (master) algorithm. The workload is a full-buffer synID() sweep (every line, up to end-of-line); median of 5 runs.

C source (src/eval.c, 8209 lines)

Configuration Time Speedup
baseline (all off ≈ master) 0.788s 1.00×
+ lead-byte prefilter (#20371) 0.500s 1.57×
+ in_id_list cache (#29) 0.753s 1.05×
+ saved-state hint (#30) 0.609s 1.29×
all three on 0.358s 2.20×

Vim script (netrw.vim, 9717 lines)

Configuration Time Speedup
baseline (all off ≈ master) 9.16s 1.00×
+ lead-byte prefilter (#20371) 8.21s 1.12×
+ in_id_list cache (#29) 4.73s 1.94×
+ saved-state hint (#30) 8.30s 1.10×
all three on 4.41s 2.08×

The three target different bottlenecks and are complementary: the prefilter dominates for C (many patterns with distinct lead bytes), the in_id_list() cache dominates for Vim script (large contains/cluster lists, e.g. netrw), and the saved-state hint gives a smaller, filetype-independent gain. Combined, that's roughly 2.2× on C-heavy code and 2.1× on Vim-script-heavy code, with highlighting output unchanged.

@dkearns
Copy link
Copy Markdown
Contributor

dkearns commented Jun 3, 2026

I was suggesting that it might be useful to expand upon the very basic existing optimisations as this would probably (not always easy to intuit) be generally useful as well, but let's put that aside.

It makes little sense to me that the pattern is being parsed by anything other than the regex engine. It introduces a whole class of problems that just don't need to exist. We've already identified two issues just by eyeballing the parser.

There's certainly no need to query both engines and the BT engine is probably easier to work with for current purposes. However, if we don't want to generate these results during compilation, and don't want to process the compiled pattern, then it would be better just to parse the output of the existing regdump(). It's certainly no uglier than the current approach, accurate, and we get to support all pattern constructs for free.

Problem:  Every syntax pattern is run against every line, including lines
          that obviously cannot match it.
Solution: For each pattern derive the set of bytes that must occur in any
          match (a conservative analysis that bails on constructs it cannot
          model soundly), and skip the pattern on lines missing one of those
          required bytes.  Highlighting is unchanged; verify by toggling
          test_override('syn_prefilter').

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@h-east h-east force-pushed the syntax-leadbyte-prefilter branch from 04895db to fecd3d5 Compare June 3, 2026 10:23
@h-east h-east changed the title perf: syntax highlighting tries patterns that cannot match a line perf: syntax highlighting skips patterns that cannot match a line Jun 3, 2026
@h-east
Copy link
Copy Markdown
Member Author

h-east commented Jun 3, 2026

Thanks — I took the "let the regex engine do it" suggestion seriously and
measured it before deciding, because it would clearly be the cleaner design if
it carried its weight. The result changed my mind about what to extract, so
let me share the numbers.

Benchmark: a ~25,000-line C file (Vim's own change.c/buffer.c/eval.c/
syntax.c/regexp.c/ex_docmd.c/edit.c concatenated), highlighted with
filetype=c, full-buffer synID() cold pass, min of 3 runs. Speedup vs. the
prefilter disabled (test_override('syn_prefilter', 1)):

signal used to reject a line speedup
first byte only (≈ regstart) ~1.0x (no measurable gain)
engine regmust → required bytes ~1.03x
required-byte set (union of mandatory bytes) ~1.43x
first-byte set + required bytes (the original PR) ~1.42x

Two takeaways:

  1. The first-byte set contributes essentially nothing on this workload, and
    it was also the riskiest part (alternation start-sets, \c, magic mode, the
    under-approximation completeness you were rightly worried about). So I
    removed it entirely.

  2. The entire win comes from the required-byte set — the union of bytes that
    must occur somewhere in any match. The engine's exposed analyses don't
    reproduce it: regstart is a single first character, and regmust is a
    single longest literal supplied only when the pattern is "potent enough" —
    both measure ~1.0–1.03x here. To get the required-byte union from the
    engine you'd have to add a new must-set traversal to regexp_bt.c (it isn't
    an existing field, and it isn't in regdump() output either), i.e. new
    engine code rather than reuse.

So the PR is now a required-byte prefilter only (~124 fewer lines): the
analyzer no longer tracks any start-byte set, just the conservative
required-byte set, bailing on anything it can't model soundly. Correctness is
covered by the on/off equivalence tests plus differential fuzzing over random
patterns (several thousand cases, no mismatches); and structurally, required-byte
rejection is a strict subset of the previous filter's rejections, so it can only
ever skip fewer lines, never wrongly skip one.

I'm happy to revisit an engine-side must-set traversal later if we want
correctness-by-construction, but given the data I'd rather not block this on new
engine code.

@dkearns
Copy link
Copy Markdown
Contributor

dkearns commented Jun 4, 2026

Something is getting lost in translation here...

To get the required-byte union from the engine you'd have to add a new must-set traversal to regexp_bt.c (it isn't an existing field, and it isn't in regdump() output either), i.e. new engine code rather than reuse.

I was being slightly facetious in suggesting that we parse the regdump() output. That would be silly, and I intended it more as a sign post, but it would still be better than a new pattern parser in syntax.c. I'm suggesting that this required character set be computed from an analysis of the compiled pattern. This poses no engine-side risks; it is read-only.

As I read it we have gone from 2.2x to 1.4x for the prefilter on the same big.c in the pursuit of correctness that the compiled pattern provides for free.

I can't justify the existence of the new pattern parser and wouldn't like to see it merged.

@h-east
Copy link
Copy Markdown
Member Author

h-east commented Jun 4, 2026

One number first, since I think it's skewing the trade-off: the 2.2x is the
three optimizations combined, never the prefilter alone. The prefilter on its
own is ~1.57x (eval.c) / ~1.12x (netrw), and dropping the first-byte set took it
from ~1.42x to ~1.43x on the big file — so no prefilter performance was traded
for correctness. The required-byte set was always carrying essentially the whole
prefilter win.

On deriving the required set from the compiled pattern: I should separate two
things I've conflated with it before, because neither objection actually applies
to what you're proposing now.

  • The ~1.0-1.03x numbers I measured were regstart/regmust — the engine's
    existing outputs. They don't carry the win because they're a single start
    char and a single literal, not the required-byte union. That result only
    says "reusing the existing fields doesn't work", not "an engine-derived
    required set would be slow". A read-only walk that computes the union from the
    regprog would compute the same signal the analyzer computes now, so it should
    land at the same ~1.43x. So I'm not going to argue this on speed.

  • My earlier "too risky / new engine code" was about adding set-valued outputs to
    the BT and NFA compilers — modifying execution-critical matching code. A
    read-only walk of the compiled BT program isn't that, and you're right that it
    carries no execution risk. I withdraw that framing for this proposal.

What's left is narrower but, I think, still the deciding factor: a regprog walk
couples this syntax-layer optimization to the engine's internal, undocumented
bytecode layout — the representation that is otherwise only surfaced for
debugging via regdump(). And it isn't free of the correctness burden either: a
required-byte walk still has to be written correctly per opcode (BRANCH =
intersect across alternatives, STAR/\? = contributes nothing, ANYOF = no single
required byte, the \{n,m} reversal cases, …), and a mistake there over-claims
required bytes and wrongly skips a line exactly as a source-level mistake would.
So it relocates the per-construct correctness work onto engine internals rather
than removing it.

Against that, the current analyzer is self-contained, depends on nothing private
to the engine, and fails safe: when it can't prove a construct sound it bails and
the pattern is always tried, so a gap costs a missed skip, never a wrong one. It
already bails on \{n,m}, character classes, and backslashes inside [...]; the
floor is "model less, bail more", which only ever costs speed.

I'm not opposed to a read-only regprog analysis as a later refinement if we
decide the bytecode coupling is acceptable. But given the corrected numbers and
the fail-safe behaviour, I'd rather land the self-contained version than gate it
on coupling to the compiled program.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants