Fix flat YAML highlighter depth/position bugs (#23, #24) + a depth-witness gate#27
Open
johnsoncodehk wants to merge 4 commits into
…rt (#23)

A value-leading `---` / `...` (`note: --- x`, `x: ... bar`, `- --- x`) was scoped as a document marker in the flat YAML highlighter. The parser already constrains the markers to stream position structurally (DocStart / DocEnd appear only in the Stream grammar), but gen-tm emits the token pattern into every context, dropping that constraint: the flat-highlighter analogue of a position the parser gets for free. Fix: anchor the markers to line start (YAML §9.1.1 makes a marker column-0-only), which carries the same constraint into the derived grammar.

This exposed a latent lexer bug: `start()` compiled to a bare `^` under the sticky `y` matcher, which matches only at index 0 (file start), so a marker at the start of a LATER line (`# c\n---\n…`) stopped lexing, and parser-alignment dropped from 100% to 95%. `start()` means line start everywhere else (it serializes to `^`, stripped in monarch / tree-sitter), so the lexer is corrected to compile start-anchored token patterns with the `m` flag. `^` then matches at every line start, restoring 100%.

Also adds test/yaml-depth-witnesses.ts: a raw-scope regression gate for the flat highlighter's depth/position sites. The scope-gap metric reported monogramWrong=0 here because it is corpus-bound (these inputs aren't in yaml-test-suite) AND excludes lexical-floor roles (a `-` mis-painted as string is invisible). The gate constructs one witness per scanner state field and asserts the raw inner scope, so neither blind spot can hide a regression. #24 (nested compact sequence sibling vs plain-scalar fold) is tracked there as a known bug pending indent-region derivation.

Parser CST + the other six grammars byte-identical; src-coverage-yaml 100%, scope-gap-yaml monogramWrong=0, tree-sitter-yaml 97.8%, js shebang unregressed.
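The sticky/multiline interaction described in this commit can be reproduced with plain JavaScript regular expressions. The sketch below is illustrative only: `matchAt` is a hypothetical helper, not a function from `src/gen-lexer.ts`, and the patterns are simplified stand-ins for the generated token patterns.

```typescript
// Hypothetical helper (not from the repository): try `pattern` at exactly
// `index`. With the sticky `y` flag the engine never scans forward.
function matchAt(pattern: RegExp, input: string, index: number): string | null {
  pattern.lastIndex = index;
  const m = pattern.exec(input);
  return m ? m[0] : null;
}

const input = "# c\n---\nkey: v\n";
const lineStart = input.indexOf("---"); // 4: the start of the second line

const stickyOnly = /^---/y;       // `^` asserts start of input only
const stickyMultiline = /^---/my; // `^` asserts start of any line

// Without `m`, a marker on a later line cannot match at its own offset.
const withoutM = matchAt(stickyOnly, input, lineStart);

// With `m`, the same offset matches because it follows a line terminator.
const withM = matchAt(stickyMultiline, input, lineStart);

// A value-leading `---` is still rejected: offset 6 is not a line start.
const valueLeading = matchAt(stickyMultiline, "note: --- x", 6);

console.log(withoutM, withM, valueLeading);
```

The third case shows why the line-start anchor also carries the position constraint: under `my`, the marker pattern can only match where a line begins.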
…#24)
In the flat derived YAML TextMate highlighter, a nested compact sequence's
sibling item (`- - a\n - b\n- c`) was wrongly swallowed by the preceding plain
scalar's multi-line fold: the inner `- b` lost its `punctuation` (sequence
indicator) scope and read as one `string.unquoted` token. The §2a' fold region
is LINE-relative (its `\1` is the line's leading whitespace), but a YAML
continuation is NODE-relative (more indented than the enclosing dash/key). For a
single sequence or a mapping the two coincide, so those folds are correct; they
diverge only for a COMPACT nested sequence, whose inner dash sits at column 2
after the outer `- ` prefix (not whitespace) — a sibling `- b` at column 2 reads
to a `\1=""` fold as "indented past column 0" and is folded.
Derive (from the grammar's block-sequence rule + indent config, gated on
`grammar.indent`) a column-anchored COMPACT block-sequence region (gen-tm §2c):
a `\G`-anchored begin/while — re-anchored each line by the meta.stream wrapper —
that reclaims the inner sibling `- ` at the inner indicator's column before the
plain fold can swallow it. It mirrors the maintained RedCMD YAML grammar's
block-sequence but uses only a FIXED-width compact re-anchor lookbehind
(`(?<=[-?:])`), portable under Onigmo (RedCMD's variable-length
`(?<![^\t ][\t ]*+:|---)` is rejected by Onigmo / GitHub-Linguist). It triggers
ONLY on the compact case (a dash followed by another dash), so single `- a`
items, `- key: v` mappings, `- {…}`/`- "…"`/`- |` values, and plain folds are
untouched; the item body reuses the full top-level dispatch (minus the two
line-relative folds, plus a bounded plain-fold for a bare plain item value).
Gated on `grammar.indent`, so the six other grammars regenerate byte-identical.
The yaml-depth-witnesses #24 case is now an asserted pass (counter-proof
`x: hello\n - b` still folds); scope-gap stays 100% / 0 monogram-wrong (and now
also correctly scopes three real corpus compact-sibling cases: 3ALJ, 6BCT,
W42U); parser src-coverage 100%; portability, issue#12, sanity, agnostic, and
the tree-sitter target all unchanged.
…sidual)

The #24 fix's COMPACT block-sequence region (gen-tm §2c) reclaimed a dash at ANY depth as a sibling (its while arm 1 used `[\t ]*${dash}`), so a `-`-led continuation indented STRICTLY DEEPER than the inner indicator kept its `-` scoped `punctuation` instead of folding into the plain scalar. `- - a\n - b` is `[["a - b"]]` (the deeper `- b` is plain content, not an item), the one residual the witness tracked as a known-bug.

Pin the inner indicator's column PORTABLY instead of matching any depth. The begin captures the indicator run between the outer and inner dash as group 4, so the while reconstructs the inner column as `\1\2 \4` (outer indent + the dash's own column + the captured run; a multi-space compact `- - x` pins correctly too):

arm 1 reclaims a dash AT EXACTLY that column (a sibling -> punctuation);
arm 2 is a zero-width lookahead that keeps the region alive on a strictly-deeper line, so a nested deeper #block-sequence (re-opened per compact level) gets first claim on its own sibling;
a deeper line that opens no nested sequence is folded by a new body rule #block-fold (`^([\t ]+)…(plain run)`, anchored at line start so it never fires on the header line's inline inner item; excludes a comment or a deeper `key:` so a mapping item value's deeper entry keeps its structure).

No variable-length lookbehind, so it stays portable under Onigmo / GitHub-Linguist (RedCMD achieves the same semantics only with a rejected variable-length lookbehind).

The depth-witnesses deeper-irregular case is now an asserted pass (the column-aligned sibling and counter-proof still hold; a deeper-NESTED sibling `- - - a\n - b` stays punctuation). Highlighter-only and gated on `grammar.indent`: scope-gap stays 100% / 0 monogram-wrong, parser src-coverage 100%, the six other grammars + tree-sitter regenerate byte-identical, and portability / issue#12 / sanity / agnostic are unchanged.
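The column arithmetic behind the pinned `while` can be sketched outside the regex engine. This is a minimal model, assuming every whitespace character is one column wide; `innerColumn` and `classify` are hypothetical names, and the real grammar expresses the same idea with begin/while captures rather than a function.

```typescript
type LineKind = "sibling" | "deeper" | "outside";

// Hypothetical: column of the inner dash in a compact header like "- - a",
// computed as outer indent + the outer dash + the whitespace run after it.
// Returns null when the line is not a compact (dash-then-dash) header.
function innerColumn(header: string): number | null {
  const m = /^[\t ]*-[\t ]+(?=-(?:[\t ]|$))/.exec(header);
  return m ? m[0].length : null;
}

// Hypothetical: classify the line that follows a compact header.
function classify(header: string, next: string): LineKind | null {
  const col = innerColumn(header);
  if (col === null) return null; // region never opens on non-compact lines

  const rest = next.replace(/^[\t ]*/, "");
  const indent = next.length - rest.length;
  const startsWithDash = /^-(?:[\t ]|$)/.test(rest);

  // A dash at exactly the pinned column is a sibling item.
  if (indent === col && startsWithDash) return "sibling";
  // A strictly deeper line belongs to the item (nested sequence or fold).
  if (indent > col) return "deeper";
  // Anything else is not claimed by this region.
  return "outside";
}

const aligned = classify("- - a", "  - b");
const deeper = classify("- - a", "    - b");
const outer = classify("- - a", "- c");
const multiSpace = classify("-  - a", "   - b");
const notCompact = classify("- a", "  - b");

console.log(aligned, deeper, outer, multiSpace, notCompact);
```

Comparing the indent against a column derived from the header line, rather than against the current line's own leading whitespace, is what separates this from a line-relative fold.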
The flat YAML TextMate highlighter is derived from a token stream, so it loses
cross-line state the parser keeps on a stack (indentation depth) or in grammar
structure (a token's legal position). Wherever the correct scope depends on that
state, the flat derivation provably diverges somewhere — the bug is guaranteed by
the expressiveness gap, not incidental. This PR fixes the two such sites filed as
#23 and #24, and adds a gate that constructs witnesses for these sites from the
scanner's state fields so the class can't silently regress.
#23: value-leading `---` / `...` scoped as a document marker

`note: --- not a marker`, `x: ... bar`, and `- --- x` had their `---` / `...` scoped as `entity.other.document.*`. A YAML marker is column-0-only; the parser enforces this structurally (DocStart / DocEnd appear only in the Stream grammar), but gen-tm emits the token pattern into every context, dropping the constraint. Fix: anchor the markers to line start in `yaml.ts` (`start()`).

That surfaced a latent lexer bug: `start()` compiled to a bare `^` under the sticky `y` matcher, which matches only at index 0 (file start), so a marker at the start of a later line (`# c\n---\n…`) stopped lexing and parser-alignment fell from 100% to 95%. `start()` means line start everywhere else (it serializes to `^`, stripped in monarch / tree-sitter), so `src/gen-lexer.ts` now compiles start-anchored token patterns with the `m` flag. Alignment is back to 100%; the JS shebang (the only other `start()` user) is unregressed.

#24: nested compact sequence item swallowed by the plain-scalar fold
`- - a\n - b\n- c`: the inner `- b` is a sibling sequence item, but the flat plain-scalar fold swallowed it as `string.unquoted` (the `-` lost its punctuation). The fold is line-relative (`\1` = the line's leading whitespace) while a YAML continuation is node-relative (more indented than the enclosing dash/key); the two coincide for a single sequence or a mapping but diverge for a compact nested sequence, where the inner dash sits past a non-whitespace `-` prefix. So no `\1`-relative backreference or possessive quantifier can tell a sibling from a fold (and the counter-example `x: hello\n - b` is byte-identical at the front yet must keep folding).

Fix (`src/gen-tm.ts`, gated on `grammar.indent`): a new `detectBlockSequence` derives the block-sequence rule + its `-` indicator from the grammar shape, and a column-anchored block-sequence region (`\G`-anchored, re-anchored each line by the existing `meta.stream` wrapper, self-recursing per compact level) pins each level's inner column portably. The begin captures the indicator run and the `while` reconstructs the inner column as outer-indent + the dash's own column + the run, so it reclaims a dash at exactly that column as a sibling (`punctuation`) while a strictly-deeper line folds into the item's plain scalar (`string`) via a new `#block-fold` body rule.

This handles every shape correctly against the `yaml` oracle: the column-aligned sibling (`- - a\n - b` → `[["a","b"]]`), the deeper fold (`- - a\n - b` → `[["a - b"]]`), a multi-space compact (`- - a\n - b`), and a deeper-nested sibling (`- - - a\n - b` → `[[["a","b"]]]`). It uses no variable-length lookbehind (RedCMD needs one; this doesn't), so it stays portable under Onigmo / GitHub-Linguist. Single `- a` items, `- key: v` mappings, `- {…}` / `- "…"` / `- |` values, and ordinary plain folds are untouched.

A raw-scope regression gate (the durable fix)
`scope-gap:yaml` reported `monogramWrong=0` while both bugs sat in plain sight: it is corpus-bound (these inputs aren't in yaml-test-suite) AND excludes lexical-floor roles (a `-` mis-painted as string is invisible because `punctuation` is floor-excluded and the `b` beside it grades correct). So "0 wrong" never meant "no bug", only "no bug this metric can see".

`test/yaml-depth-witnesses.ts` constructs one witness per scanner state field (indent stack, flow depth, block-scalar region, marker position, node-property lead) and asserts the RAW inner scope. The gate is oracle-independent and floor-blind, so neither blind spot can hide a regression. Wired into CI; 10 asserted pass, 0 known bug.
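A gate of this kind can be sketched as a table of hand-built inputs, each asserting the scope at one probe position. Everything here is hypothetical and is not the contents of `test/yaml-depth-witnesses.ts`: the `Witness` shape, the `ScopeAt` signature, the scope name `entity.other.document.begin`, and the two stub tokenizers exist only to show the gate catching a position bug.

```typescript
// Hypothetical witness shape: one hand-built input per scanner state field.
interface Witness {
  stateField: string;    // which scanner state the input exercises
  input: string;
  probe: string;         // substring whose innermost scope is asserted
  expectedScope: string; // required prefix of the innermost scope
}

// Hypothetical tokenizer interface: innermost scope name at an offset.
type ScopeAt = (input: string, offset: number) => string;

function runGate(witnesses: Witness[], scopeAt: ScopeAt): string[] {
  const failures: string[] = [];
  for (const w of witnesses) {
    const offset = w.input.indexOf(w.probe);
    if (offset < 0) {
      failures.push(`${w.stateField}: probe not found`);
      continue;
    }
    const actual = scopeAt(w.input, offset);
    if (!actual.startsWith(w.expectedScope)) {
      failures.push(`${w.stateField}: expected ${w.expectedScope}, got ${actual}`);
    }
  }
  return failures;
}

// Stub tokenizers standing in for the real highlighter (assumptions only).
const buggy: ScopeAt = (input, offset) =>
  input.startsWith("---", offset) ? "entity.other.document.begin" : "string.unquoted";

const fixed: ScopeAt = (input, offset) => {
  const atLineStart = offset === 0 || input.charAt(offset - 1) === "\n";
  return atLineStart && input.startsWith("---", offset)
    ? "entity.other.document.begin"
    : "string.unquoted";
};

const witnesses: Witness[] = [
  { stateField: "marker position (value-leading)", input: "note: --- x", probe: "---", expectedScope: "string" },
  { stateField: "marker position (later line)", input: "# c\n---\nk: v", probe: "---", expectedScope: "entity.other.document" },
];

const buggyFailures = runGate(witnesses, buggy);
const fixedFailures = runGate(witnesses, fixed);
console.log(buggyFailures, fixedFailures);
```

Because each witness asserts a raw scope at a fixed position, the check does not depend on a corpus containing the input or on which roles a metric chooses to grade.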
Parser CST + the other six grammars byte-identical; `src-coverage-yaml` 100%, `scope-gap-yaml` monogramWrong=0 (100%), `redcmd-tm-diagnostics` Onigmo-clean, `tree-sitter-yaml` 97.8%, issue#12 10/10, sanity 15/15, agnostic 9/9.

Closes #23. Closes #24.