perf: syntax highlighting skips patterns that cannot match a line by h-east · Pull Request #20371 · vim/vim

h-east · 2026-05-29T13:08:07Z

Problem:  Every syntax pattern is run against every line, including lines
          that obviously cannot match it.
Solution: For each pattern derive the set of bytes that must occur in any
          match (a conservative analysis that bails on constructs it cannot
          model soundly), and skip the pattern on lines missing one of those
          required bytes.  Highlighting is unchanged; verify by toggling
          test_override('syn_prefilter').

h-east · 2026-05-29T13:10:50Z

Measurement results

Build: ./configure default CFLAGS (-g -O2), no profiling.

Method: force full-buffer syntax highlighting by calling synID() on every
line/column, measure wall time with reltime(), median of several runs.
Baseline = the same tree without the prefilter.

Full-buffer highlight:

file (filetype)	lines	baseline	with prefilter	change
big.c, C (concatenated)	99,192	~6.70 s	~3.00 s	~55% faster (~2.2x)
src/evalfunc.c, C	12,919	~0.92 s	~0.44 s	~52% faster (~2.1x)
netrw.vim, Vim script	9,717	~5.8 s	~5.2 s	~11% faster

Mechanism: on a typical C buffer about 40% of regexp executions are patterns
that never match on that line (for example character/string-constant and
preprocessor patterns that are tried on every line). The prefilter removes
most of them; the share of regexp time spent on never-matching patterns drops
from ~40% to ~15% (measured with :syntime).

The gain depends on the filetype. For regexp-pattern-heavy syntaxes (C/C++)
with many never-matching patterns it is large. For keyword-heavy syntaxes
(Vim script), where much of the work is keyword lookup rather than regexp
matching, the overall gain is smaller, but regexp executions are still cut
substantially with identical highlighting (netrw.vim: 1,680,030 -> 1,024,966
regexp calls, about 39% fewer).

Correctness: byte-for-byte identical synID() output vs. the baseline across
C, C++, Vim script, Python, Ruby, Lua, JavaScript, shell, HTML and CSS
(millions of cells), including multibyte content, very long lines and a
reduced 'synmaxcol'.

Tests: test_syntax (including a new Test_syntax_lead_byte_prefilter regression
test), test_highlight, test_spell and test_textprop all pass.

Notes

Observable change: patterns skipped by the prefilter are no longer counted
by :syntime (they genuinely do not run). Test_syntime was updated to
highlight the whole buffer so the profiled patterns still appear.
The analyzer bails out (pattern is always tried, behaviour unchanged) on
constructs it does not model: top-level "&", look-around "@", multi-line
"_x", the case/magic flags "\c \C \v \V \m \M", "\Z", multibyte literals
and groups nested deeper than the recursion cap.

h-east · 2026-05-29T15:19:20Z

Self-review: maintainability of the prefilter analyzer, and how I plan to address it

I want to flag the main maintenance concern with this change myself.

Concern. The lead-byte prefilter derives its first-byte / required-byte sets
by re-implementing a subset of Vim's regexp grammar in syn_compute_first_bytes()
and its helpers: magic-mode parsing, character classes, anchors, quantifiers,
groups and alternation. That means:

It has to be kept in sync if Vim's regexp syntax ever grows new constructs.
A missed or mis-modelled construct does not crash — it silently produces
wrong highlighting by skipping a pattern that could actually match. That is
the worst kind of failure mode for a maintainer.
This is not hypothetical: while developing this I hit four separate soundness
pitfalls before the output matched byte-for-byte — top-level alternation
(a\|b), look-around (\@<= / \@!), ignore-case (\c and :syn case
semantics for the required-byte set), and inline flags (\c \v ...).
The scariest future case is a new \<char> escape — especially a zero-width
or whole-pattern flag — being misclassified as a literal/consuming atom.

Mitigation I plan to add (commit on this branch). A differential test that
makes the optimization self-checking:

Add test_override('syn_prefilter', 1) to disable the prefilter at runtime
(same mechanism as nfa_fail etc.).
Add a test that highlights several buffers (C, Vim script, Python, shell, and
the tricky-construct cases) and asserts that the full-buffer synID() output
is identical with the prefilter on (default) and off.

With that test in place, any future edit — or any new regexp construct the
analyzer fails to model — surfaces as a CI failure instead of silently wrong
highlighting. The optimization stays safe to maintain regardless of the
analyzer's complexity. The analyzer is already written to bail out (keep
trying the pattern, behaviour unchanged) on anything it does not model, so the
intended degradation is "no speed-up", never "wrong result"; the differential
test is what guarantees that property keeps holding.

Alternatives considered (not in this PR).

Tighten the analyzer so that any unrecognised \<char> escape bails instead
of being treated as a literal, making it safe-by-default against future
additions (small follow-up).
Longer term, derive the first-byte set from the compiled NFA start states
instead of re-parsing the pattern text. That removes the grammar-tracking
burden entirely (single source of truth = the regexp engine), at the cost of
coupling to engine internals and handling the BT fallback. This would be a
separate, larger change.

h-east · 2026-05-29T15:39:48Z

A differential test that makes the optimization self-checking:

Done.

h-east · 2026-05-31T19:19:33Z

@chrisbra Do you have any concerns?

chrisbra · 2026-05-31T20:26:14Z

no, this sounds like a very nice performance optimization. I'll have a closer look later, this is all a bit over my head right now :)

h-east · 2026-06-01T02:25:04Z

Update: I've folded in the safe-by-default tightening I listed as a follow-up
in the self-review above. Unknown alphanumeric regexp escapes now bail (the
pattern is always tried) rather than being treated as ordinary atoms, so the
analyzer degrades to "no speed-up", never "wrong highlighting", even for regexp
constructs added in the future. Covered by the new
Test_syntax_prefilter_classes() (identical synID() with the prefilter on and
off).

chrisbra · 2026-06-01T21:02:21Z

+		return p;
+	    }
+	    if (*q == '\\' && q[1] != NUL)
+		c1 = *++q;


This looks wrong. It does not correctly handle \n \r \b \e inside collations and it will not handle [\s] as matching a backslash or s and I think it also does not handle \d<digit>, \u<digit>, \U<digit>, \x<digit> and \o<digit>. perhaps just bail out?

You're right, thanks. Inside [...] the code treated "\x" as a literal x, which
is wrong for \t \e \r \b \n, the numeric \d \o \x \u \U forms, and "\s"
(= backslash-or-s) — the resulting first-byte set was too narrow and could skip
a matching line. As you suggested, it now bails on any backslash inside [...]
(so the pattern is always tried). Added regression cases ([\t]x, [\d65]x, [\s]x)
to Test_syntax_prefilter_classes that fail without the fix.

dkearns · 2026-06-02T12:14:59Z

I think the first/required byte analysis should be performed by the BT engine, it already provides regstart and regmust. The new string parser does more than the BT optimisations but implementing this in the regex engine itself should afford a lot of opportunities to avoid bailing out as often as is being done here.

It's just a first impression but the potential to introduce subtle, long lasting, bugs seems high. However, the performance improvements look very nice and I'm unaware of any kittens ever being harmed by bad syntax highlighting.

h-east · 2026-06-02T15:23:57Z

@dkearns Thanks for the review!

One clarification: this isn't a "new string parser" that reimplements or
belongs in the regex engine. It does no regex matching and isn't an alternative
to (or optimization of) the BT/NFA engines.

It's a cheap pre-screen in the syntax layer that avoids calling the engine
when a pattern provably can't match the line. At definition time it analyzes
the pattern string once, only to derive a sound necessary condition (the set of
bytes a match may start with, and the set of bytes that must occur in the line).
At runtime it does no regex work: it builds the line's byte set once per line,
then rejects a pattern with a couple of bitset ops before entering the engine.
If it passes, the pattern goes to the engine exactly as before.

So the goal isn't a faster regexec — it's skipping the call entirely. Syntax
highlighting runs (patterns × lines) attempts, and the line byte set is built
once and shared across all patterns, so this amortization only works at the
syntax layer. regstart/regmust are a different, weaker thing (a single
start char and one literal substring, consulted inside vim_regexec()); for
the common case of keyword alternations like \<\(int\|long\|char\)\> they
give no help, while the prefilter can still skip lines containing none of
i,l,c.

dkearns · 2026-06-02T17:24:22Z

I understand, I'm suggesting the entire pre-screen pattern analysis is happening in the wrong place and an unnecessary duplication. It could be done during regex compilation and the results then exposed for use by the syntax layer. The patterns are all compiled at file load time anyway.

I missed it on my first pass but it looks you've said exactly the same thing yourself. :)

Longer term, derive the first-byte set from the compiled NFA start states
instead of re-parsing the pattern text. That removes the grammar-tracking
burden entirely (single source of truth = the regexp engine), at the cost of
coupling to engine internals and handling the BT fallback. This would be a
separate, larger change.

Why implement an intermediate solution?

h-east · 2026-06-02T19:27:57Z

Fair question, but my answer is the opposite of what that quote implies: I'm
not planning to do the engine-derived version, and this isn't a step toward it.

It wouldn't be "expose what the engine already has". regstart is a single
start char (NUL the moment there's an alternation) and regmust is a single
literal substring (BT only). The prefilter needs a set of possible first
bytes and a set of required bytes, so doing it in the engine means adding new
set-valued outputs to both the BT and the NFA compilers and keeping them in
agreement — a substantial change in execution-critical code shared by
search/substitute/match. That is a lot of work and risk for an optimization
whose whole job is to never change behaviour, and the "subtle, long-lasting
bugs" you rightly worried about would land squarely in the engines.

I don't think that cost is justified. This prefilter is deliberately shallow:
it lives entirely in the syntax layer, touches no engine code, bails on
anything it doesn't model (so the worst case is "no speed-up", never "wrong
result"), and is pinned by an on/off differential test requiring byte-for-byte
identical highlighting. Since your review I've made it bail more aggressively—
\{n,m} and all character classes are no longer modeled — which both fixes
the \{42,0} case you found and shrinks the grammar surface that could ever
drift. The guiding principle is now stated at the top of the analyzer:
correctness over coverage, widen-only, when in doubt bail.

So I'd retract that "longer term" note rather than act on it: single source of
truth would be tidier in the abstract, but it isn't worth touching two regex
engines for. A shallow, bail-happy, self-contained prefilter already gives the
measured speed-up, and that's a good place to stop. I've dropped the misleading
bullet from the description.

h-east · 2026-06-02T20:13:56Z

I've re-run the benchmarks. (First result: #20371 (comment))
The numbers for Part 1 have dropped compared to the initial benchmark, but this is because I changed the prefilter analyzer to be safe-by-default (bailing out to avoid regressions). Even so, it still maintains an overall 2x speed-up, so I believe this patch is well worth merging.

The "syntax highlighting performance trilogy" is now complete. This PR (#20371) is the first part; two more are lined up in my fork, to be submitted upstream after this one lands:

Part 1 — perf: syntax highlighting skips patterns that cannot match a line #20371 (this PR): lead-byte prefilter. Before running a pattern's regexp, skip it when the bytes it requires are absent from the line (a per-pattern lead-byte prefilter derived at definition time).
Part 2 — perf: checking the syntax contains/cluster list is slow h-east/vim#29: in_id_list() cache. Deciding whether a group is in a contains/cluster list scans the list and expands clusters on every check. Resolve each list once into a sorted, cluster-expanded set of group IDs and use a binary search, cached per syntax block and dropped when syntax definitions change.
Part 3 — perf: syntax highlighting searches the saved state list from the start h-east/vim#30: saved-state search hint. Looking up the saved syntax state for a line scans the state list from the start every time. Remember the last found entry and start the search there; the list is sorted by line number, so the result cannot be earlier.

All three are pure speedups — the resulting highlighting is unchanged, and each one ships with a test_override() flag plus a test that asserts identical synID() output with the optimization on and off.

Benchmark

Measured on a single binary that contains all three, toggling each optimization via its test_override() flag, so "all off" reproduces the current (master) algorithm. The workload is a full-buffer synID() sweep (every line, up to end-of-line); median of 5 runs.

C source (src/eval.c, 8209 lines)

Configuration	Time	Speedup
baseline (all off ≈ master)	0.788s	1.00×
+ lead-byte prefilter (#20371)	0.500s	1.57×
+ in_id_list cache (#29)	0.753s	1.05×
+ saved-state hint (#30)	0.609s	1.29×
all three on	0.358s	2.20×

Vim script (netrw.vim, 9717 lines)

Configuration	Time	Speedup
baseline (all off ≈ master)	9.16s	1.00×
+ lead-byte prefilter (#20371)	8.21s	1.12×
+ in_id_list cache (#29)	4.73s	1.94×
+ saved-state hint (#30)	8.30s	1.10×
all three on	4.41s	2.08×

The three target different bottlenecks and are complementary: the prefilter dominates for C (many patterns with distinct lead bytes), the in_id_list() cache dominates for Vim script (large contains/cluster lists, e.g. netrw), and the saved-state hint gives a smaller, filetype-independent gain. Combined, that's roughly 2.2× on C-heavy code and 2.1× on Vim-script-heavy code, with highlighting output unchanged.

dkearns · 2026-06-03T08:35:01Z

I was suggesting that it might be useful to expand upon the very basic existing optimisations as this would probably (not always easy to intuit) be generally useful as well, but let's put that aside.

It makes little sense to me that the pattern is being parsed by anything other than the regex engine. It introduces a whole class of problems that just don't need to exist. We've already identified two issues just by eyeballing the parser.

There's certainly no need to query both engines and the BT engine is probably easier to work with for current purposes. However, if we don't want to generate these results during compilation, and don't want to process the compiled pattern, then it would be better just to parse the output of the existing regdump(). It's certainly no uglier than the current approach, accurate, and we get to support all pattern constructs for free.

Problem: Every syntax pattern is run against every line, including lines that obviously cannot match it. Solution: For each pattern derive the set of bytes that must occur in any match (a conservative analysis that bails on constructs it cannot model soundly), and skip the pattern on lines missing one of those required bytes. Highlighting is unchanged; verify by toggling test_override('syn_prefilter'). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

h-east · 2026-06-03T10:33:35Z

Thanks — I took the "let the regex engine do it" suggestion seriously and
measured it before deciding, because it would clearly be the cleaner design if
it carried its weight. The result changed my mind about what to extract, so
let me share the numbers.

Benchmark: a ~25,000-line C file (Vim's own change.c/buffer.c/eval.c/
syntax.c/regexp.c/ex_docmd.c/edit.c concatenated), highlighted with
filetype=c, full-buffer synID() cold pass, min of 3 runs. Speedup vs. the
prefilter disabled (test_override('syn_prefilter', 1)):

signal used to reject a line	speedup
first byte only (≈ `regstart`)	~1.0x (no measurable gain)
engine `regmust` → required bytes	~1.03x
required-byte set (union of mandatory bytes)	~1.43x
first-byte set + required bytes (the original PR)	~1.42x

Two takeaways:

The first-byte set contributes essentially nothing on this workload, and
it was also the riskiest part (alternation start-sets, \c, magic mode, the
under-approximation completeness you were rightly worried about). So I
removed it entirely.
The entire win comes from the required-byte set — the union of bytes that
must occur somewhere in any match. The engine's exposed analyses don't
reproduce it: regstart is a single first character, and regmust is a
single longest literal supplied only when the pattern is "potent enough" —
both measure ~1.0–1.03x here. To get the required-byte union from the
engine you'd have to add a new must-set traversal to regexp_bt.c (it isn't
an existing field, and it isn't in regdump() output either), i.e. new
engine code rather than reuse.

So the PR is now a required-byte prefilter only (~124 fewer lines): the
analyzer no longer tracks any start-byte set, just the conservative
required-byte set, bailing on anything it can't model soundly. Correctness is
covered by the on/off equivalence tests plus differential fuzzing over random
patterns (several thousand cases, no mismatches); and structurally, required-byte
rejection is a strict subset of the previous filter's rejections, so it can only
ever skip fewer lines, never wrongly skip one.

I'm happy to revisit an engine-side must-set traversal later if we want
correctness-by-construction, but given the data I'd rather not block this on new
engine code.

dkearns · 2026-06-04T18:47:26Z

Something is getting lost in translation here...

To get the required-byte union from the engine you'd have to add a new must-set traversal to regexp_bt.c (it isn't an existing field, and it isn't in regdump() output either), i.e. new engine code rather than reuse.

I was being slightly facetious in suggesting that we parse the regdump() output. That would be silly, and I intended it more as a sign post, but it would still be better than a new pattern parser in syntax.c. I'm suggesting that this required character set be computed from an analysis of the compiled pattern. This poses no engine-side risks; it is read-only.

As I read it we have gone from 2.2x to 1.4x for the prefilter on the same big.c in the pursuit of correctness that the compiled pattern provides for free.

I can't justify the existence of the new pattern parser and wouldn't like to see it merged.

h-east · 2026-06-04T20:20:06Z

One number first, since I think it's skewing the trade-off: the 2.2x is the
three optimizations combined, never the prefilter alone. The prefilter on its
own is ~1.57x (eval.c) / ~1.12x (netrw), and dropping the first-byte set took it
from ~1.42x to ~1.43x on the big file — so no prefilter performance was traded
for correctness. The required-byte set was always carrying essentially the whole
prefilter win.

On deriving the required set from the compiled pattern: I should separate two
things I've conflated with it before, because neither objection actually applies
to what you're proposing now.

The ~1.0-1.03x numbers I measured were regstart/regmust — the engine's
existing outputs. They don't carry the win because they're a single start
char and a single literal, not the required-byte union. That result only
says "reusing the existing fields doesn't work", not "an engine-derived
required set would be slow". A read-only walk that computes the union from the
regprog would compute the same signal the analyzer computes now, so it should
land at the same ~1.43x. So I'm not going to argue this on speed.
My earlier "too risky / new engine code" was about adding set-valued outputs to
the BT and NFA compilers — modifying execution-critical matching code. A
read-only walk of the compiled BT program isn't that, and you're right that it
carries no execution risk. I withdraw that framing for this proposal.

What's left is narrower but, I think, still the deciding factor: a regprog walk
couples this syntax-layer optimization to the engine's internal, undocumented
bytecode layout — the representation that is otherwise only surfaced for
debugging via regdump(). And it isn't free of the correctness burden either: a
required-byte walk still has to be written correctly per opcode (BRANCH =
intersect across alternatives, STAR/\? = contributes nothing, ANYOF = no single
required byte, the \{n,m} reversal cases, …), and a mistake there over-claims
required bytes and wrongly skips a line exactly as a source-level mistake would.
So it relocates the per-construct correctness work onto engine internals rather
than removing it.

Against that, the current analyzer is self-contained, depends on nothing private
to the engine, and fails safe: when it can't prove a construct sound it bails and
the pattern is always tried, so a gap costs a missed skip, never a wrong one. It
already bails on \{n,m}, character classes, and backslashes inside [...]; the
floor is "model less, bail more", which only ever costs speed.

I'm not opposed to a read-only regprog analysis as a later refinement if we
decide the bytecode coupling is acceptable. But given the corrected numbers and
the fail-safe behaviour, I'd rather land the self-contained version than gate it
on coupling to the compiled program.

h-east force-pushed the syntax-leadbyte-prefilter branch from 31c79fd to 7115f46 Compare May 29, 2026 15:48

h-east mentioned this pull request May 29, 2026

perf: checking the syntax contains/cluster list is slow h-east/vim#29

Open

This comment was marked as outdated.

Sign in to view

h-east added the performance label May 30, 2026

h-east force-pushed the syntax-leadbyte-prefilter branch 2 times, most recently from 64b37b3 to 32a75fe Compare June 1, 2026 02:13

chrisbra reviewed Jun 1, 2026

View reviewed changes

h-east force-pushed the syntax-leadbyte-prefilter branch from 32a75fe to a883733 Compare June 2, 2026 02:24

dkearns reviewed Jun 2, 2026

View reviewed changes

Comment thread src/syntax.c

h-east force-pushed the syntax-leadbyte-prefilter branch from a883733 to 04895db Compare June 2, 2026 19:04

h-east force-pushed the syntax-leadbyte-prefilter branch from 04895db to fecd3d5 Compare June 3, 2026 10:23

h-east changed the title ~~perf: syntax highlighting tries patterns that cannot match a line~~ perf: syntax highlighting skips patterns that cannot match a line Jun 3, 2026

Uh oh!

Conversation

h-east commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

h-east commented May 29, 2026

Measurement results

Notes

Uh oh!

h-east commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Self-review: maintainability of the prefilter analyzer, and how I plan to address it

Uh oh!

h-east commented May 29, 2026

Uh oh!

This comment was marked as outdated.

h-east commented May 31, 2026

Uh oh!

chrisbra commented May 31, 2026

Uh oh!

h-east commented Jun 1, 2026

Uh oh!

chrisbra Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

h-east Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

dkearns commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

h-east commented Jun 2, 2026

Uh oh!

Uh oh!

dkearns commented Jun 2, 2026

Uh oh!

h-east commented Jun 2, 2026

Uh oh!

h-east commented Jun 2, 2026

Benchmark

Uh oh!

dkearns commented Jun 3, 2026

Uh oh!

h-east commented Jun 3, 2026

Uh oh!

dkearns commented Jun 4, 2026

Uh oh!

h-east commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

h-east commented May 29, 2026 •

edited

Loading

h-east commented May 29, 2026 •

edited

Loading

dkearns commented Jun 2, 2026 •

edited

Loading