perf: syntax highlighting skips patterns that cannot match a line#20371
perf: syntax highlighting skips patterns that cannot match a line#20371h-east wants to merge 1 commit into
Conversation
Measurement resultsBuild: ./configure default CFLAGS (-g -O2), no profiling. Method: force full-buffer syntax highlighting by calling synID() on every Full-buffer highlight:
Mechanism: on a typical C buffer about 40% of regexp executions are patterns The gain depends on the filetype. For regexp-pattern-heavy syntaxes (C/C++) Correctness: byte-for-byte identical synID() output vs. the baseline across Tests: test_syntax (including a new Test_syntax_lead_byte_prefilter regression Notes
|
Self-review: maintainability of the prefilter analyzer, and how I plan to address itI want to flag the main maintenance concern with this change myself. Concern. The lead-byte prefilter derives its first-byte / required-byte sets
Mitigation I plan to add (commit on this branch). A differential test that
With that test in place, any future edit — or any new regexp construct the Alternatives considered (not in this PR).
|
Done. |
31c79fd to
7115f46
Compare
This comment was marked as outdated.
This comment was marked as outdated.
|
@chrisbra Do you have any concerns? |
|
no, this sounds like a very nice performance optimization. I'll have a closer look later, this is all a bit over my head right now :) |
64b37b3 to
32a75fe
Compare
|
Update: I've folded in the safe-by-default tightening I listed as a follow-up |
| return p; | ||
| } | ||
| if (*q == '\\' && q[1] != NUL) | ||
| c1 = *++q; |
There was a problem hiding this comment.
This looks wrong. It does not correctly handle \n \r \b \e inside collations and it will not handle [\s] as matching a backslash or s and I think it also does not handle \d<digit>, \u<digit>, \U<digit>, \x<digit> and \o<digit>. perhaps just bail out?
There was a problem hiding this comment.
You're right, thanks. Inside [...] the code treated "\x" as a literal x, which
is wrong for \t \e \r \b \n, the numeric \d \o \x \u \U forms, and "\s"
(= backslash-or-s) — the resulting first-byte set was too narrow and could skip
a matching line. As you suggested, it now bails on any backslash inside [...]
(so the pattern is always tried). Added regression cases ([\t]x, [\d65]x, [\s]x)
to Test_syntax_prefilter_classes that fail without the fix.
32a75fe to
a883733
Compare
|
I think the first/required byte analysis should be performed by the BT engine, it already provides It's just a first impression but the potential to introduce subtle, long lasting, bugs seems high. However, the performance improvements look very nice and I'm unaware of any kittens ever being harmed by bad syntax highlighting. |
|
@dkearns Thanks for the review! One clarification: this isn't a "new string parser" that reimplements or It's a cheap pre-screen in the syntax layer that avoids calling the engine So the goal isn't a faster regexec — it's skipping the call entirely. Syntax |
|
I understand, I'm suggesting the entire pre-screen pattern analysis is happening in the wrong place and an unnecessary duplication. It could be done during regex compilation and the results then exposed for use by the syntax layer. The patterns are all compiled at file load time anyway. I missed it on my first pass but it looks you've said exactly the same thing yourself. :)
Why implement an intermediate solution? |
a883733 to
04895db
Compare
|
Fair question, but my answer is the opposite of what that quote implies: I'm It wouldn't be "expose what the engine already has". I don't think that cost is justified. This prefilter is deliberately shallow: So I'd retract that "longer term" note rather than act on it: single source of |
|
I've re-run the benchmarks. (First result: #20371 (comment)) The "syntax highlighting performance trilogy" is now complete. This PR (#20371) is the first part; two more are lined up in my fork, to be submitted upstream after this one lands:
All three are pure speedups — the resulting highlighting is unchanged, and each one ships with a BenchmarkMeasured on a single binary that contains all three, toggling each optimization via its C source (
Vim script (
The three target different bottlenecks and are complementary: the prefilter dominates for C (many patterns with distinct lead bytes), the |
|
I was suggesting that it might be useful to expand upon the very basic existing optimisations as this would probably (not always easy to intuit) be generally useful as well, but let's put that aside. It makes little sense to me that the pattern is being parsed by anything other than the regex engine. It introduces a whole class of problems that just don't need to exist. We've already identified two issues just by eyeballing the parser. There's certainly no need to query both engines and the BT engine is probably easier to work with for current purposes. However, if we don't want to generate these results during compilation, and don't want to process the compiled pattern, then it would be better just to parse the output of the existing |
Problem: Every syntax pattern is run against every line, including lines
that obviously cannot match it.
Solution: For each pattern derive the set of bytes that must occur in any
match (a conservative analysis that bails on constructs it cannot
model soundly), and skip the pattern on lines missing one of those
required bytes. Highlighting is unchanged; verify by toggling
test_override('syn_prefilter').
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
04895db to
fecd3d5
Compare
|
Thanks — I took the "let the regex engine do it" suggestion seriously and Benchmark: a ~25,000-line C file (Vim's own
Two takeaways:
So the PR is now a required-byte prefilter only (~124 fewer lines): the I'm happy to revisit an engine-side must-set traversal later if we want |
|
Something is getting lost in translation here...
I was being slightly facetious in suggesting that we parse the As I read it we have gone from 2.2x to 1.4x for the prefilter on the same I can't justify the existence of the new pattern parser and wouldn't like to see it merged. |
|
One number first, since I think it's skewing the trade-off: the 2.2x is the On deriving the required set from the compiled pattern: I should separate two
What's left is narrower but, I think, still the deciding factor: a Against that, the current analyzer is self-contained, depends on nothing private I'm not opposed to a read-only |
Uh oh!
There was an error while loading. Please reload this page.