Fix flat YAML highlighter depth/position bugs (#23, #24) + a depth-witness gate #27

Open
johnsoncodehk wants to merge 4 commits into master from fix-yaml-flat-highlighter-depth

@johnsoncodehk johnsoncodehk commented Jun 8, 2026

The flat YAML TextMate highlighter is derived from a token stream, so it loses
cross-line state the parser keeps on a stack (indentation depth) or in grammar
structure (a token's legal position). Wherever the correct scope depends on that
state, the flat derivation provably diverges somewhere — the bug is guaranteed by
the expressiveness gap, not incidental. This PR fixes the two such sites filed as
#23 and #24, and adds a gate that constructs witnesses for these sites from the
scanner's state fields so the class can't silently regress.

#23 — value-leading --- / ... scoped as a document marker

`note: --- not a marker`, `x: ... bar`, and `- --- x` had their `---` / `...` scoped
as entity.other.document.*. A YAML marker is column-0-only; the parser enforces
this structurally (DocStart / DocEnd appear only in the Stream grammar), but gen-tm
emits the token pattern into every context, dropping the constraint. Fix: anchor
the markers to line start in yaml.ts (start()).

That surfaced a latent lexer bug: start() compiled to a bare ^ under the
sticky y matcher, which matches only at index 0 (file start), so a marker at the
start of a later line (# c\n---\n…) stopped lexing and parser-alignment fell
100% → 95%. start() means line start everywhere else (it serializes to ^,
stripped in monarch / tree-sitter), so src/gen-lexer.ts now compiles start-
anchored token patterns with the m flag. Alignment back to 100%; the JS shebang
(the only other start() user) is unregressed.
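
The sticky-versus-multiline behaviour described above can be reproduced with plain JavaScript regular expressions. The sketch below is not the code in src/gen-lexer.ts; `matchAt` is a hypothetical helper that only mimics a sticky lexer probing one token pattern at a given offset.

```typescript
// Hypothetical helper (not from the repository): try one token pattern at an
// exact offset, the way a sticky ("y") lexer does. Returns the matched text,
// or null when the pattern does not match at that offset.
function matchAt(source: string, flags: string, input: string, offset: number): string | null {
  const re = new RegExp(source, flags);
  re.lastIndex = offset; // a sticky regex only attempts a match exactly here
  const m = re.exec(input);
  return m ? m[0] : null;
}

const marker = "^---";
const laterLine = "# c\n---\nx: 1";  // the marker starts at offset 4
const valueLeading = "note: --- x";   // the "---" starts at offset 6

// Without "m", "^" asserts the start of the whole input, so offset 4 fails.
const bareCaret = matchAt(marker, "y", laterLine, 4);

// With "m", "^" also matches right after a line terminator.
const multiline = matchAt(marker, "my", laterLine, 4);

// A value-leading "---" is preceded by a space, so it is still rejected.
const midLine = matchAt(marker, "my", valueLeading, 6);

console.log(bareCaret, multiline, midLine); // null "---" null
```

The third probe is the #23 case: anchoring to line start is what keeps a value-leading `---` from being lexed as a marker, and the `m` flag is what keeps that anchor usable past the first line.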

#24 — nested compact sequence item swallowed by the plain-scalar fold

`- - a\n  - b\n- c`: the inner `- b` is a sibling sequence item, but the flat
plain-scalar fold swallowed it as string.unquoted (the - lost its punctuation).
The fold is line-relative (\1 = the line's leading whitespace) while a YAML
continuation is node-relative (more indented than the enclosing dash/key); the two
coincide for a single sequence or a mapping but diverge for a compact nested
sequence, where the inner dash sits past a non-whitespace - prefix — so no
\1-relative backreference or possessive quantifier can tell a sibling from a
fold (and the counter-example `x: hello\n  - b` is byte-identical at the front yet
must keep folding).

Fix (src/gen-tm.ts, gated on grammar.indent): a new detectBlockSequence
derives the block-sequence rule + its - indicator from the grammar shape, and a
column-anchored block-sequence region (\G-anchored, re-anchored each line by the
existing meta.stream wrapper, self-recursing per compact level) pins each level's
inner column portably — the begin captures the indicator run and the while
reconstructs the inner column as outer-indent + the dash's own column + the run,
so it reclaims a dash at exactly that column as a sibling (punctuation) while a
strictly-deeper line folds into the item's plain scalar (string) via a new
#block-fold body rule. This handles every shape correctly against the yaml
oracle: the column-aligned sibling (`- - a\n  - b` → `[["a","b"]]`), the deeper
fold (`- - a\n   - b` → `[["a - b"]]`), a multi-space compact (`-  - a\n   - b`),
and a deeper-nested sibling (`- - - a\n    - b` → `[[["a","b"]]]`), all with no
variable-length lookbehind (RedCMD needs one; this doesn't), so it stays portable
under Onigmo / GitHub-Linguist. Single `- a` items, `- key: v` mappings, `- {…}` /
`- "…"` / `- |` values, and ordinary plain folds are untouched.
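
The column arithmetic behind the sibling-versus-fold decision can be sketched in ordinary TypeScript. This is an illustration of the rule only, not the generated TextMate begin/while pattern: `innerDashColumn`, `classify`, and the `Verdict` type are hypothetical names, tabs are counted as one column, and the comment and deeper-`key:` exclusions of the real #block-fold rule are ignored.

```typescript
type Verdict = "sibling" | "fold" | "other";

// Column of the innermost dash on a compact header line such as "- - a".
// Returns null when the line does not open a compact nested sequence.
function innerDashColumn(header: string): number | null {
  const m = /^([\t ]*)((?:-[\t ]+)+)/.exec(header);
  if (m === null) return null;
  const dashes = m[2].split("-").length - 1;
  if (dashes < 2) return null; // a single "- a" item is not compact
  // outer indent + offset of the last dash inside the indicator run
  return m[1].length + m[2].lastIndexOf("-");
}

// Classify the line that follows a compact header.
function classify(header: string, next: string): Verdict | null {
  const inner = innerDashColumn(header);
  if (inner === null) return null; // the compact region does not apply
  const lead = /^[\t ]*/.exec(next);
  const indent = lead ? lead[0].length : 0;
  if (indent > inner) return "fold"; // strictly deeper: plain-scalar content
  const dashLed = /^-(?:[\t ]|$)/.test(next.slice(indent));
  return indent === inner && dashLed ? "sibling" : "other";
}

console.log(classify("- - a", "  - b"));      // "sibling"
console.log(classify("- - a", "   - b"));     // "fold"
console.log(classify("-  - a", "   - b"));    // "sibling"
console.log(classify("- - - a", "    - b"));  // "sibling"
console.log(classify("x: hello", "  - b"));   // null
```

The last call is the counter-example: the continuation line is identical to the first case, but the header opens no compact sequence, so the decision has to come from the header's dash columns rather than from the continuation line's leading whitespace.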

A raw-scope regression gate (the durable fix)

scope-gap:yaml reported monogramWrong=0 while both bugs sat in plain sight: it
is corpus-bound (these inputs aren't in yaml-test-suite) AND excludes lexical-floor
roles (a - mis-painted as string is invisible because punctuation is floor-
excluded and the b beside it grades correct). So "0 wrong" never meant "no bug",
only "no bug this metric can see". test/yaml-depth-witnesses.ts constructs one
witness per scanner state field (indent stack, flow depth, block-scalar region,
marker position, node-property lead) and asserts the RAW inner scope — oracle-
independent and floor-blind, so neither blind spot can hide a regression. Wired
into CI; 10 asserted pass, 0 known bug.
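
Why a floor-excluding metric can report zero wrong while a raw-scope check fails can be shown with a toy grading table. The sketch below is hypothetical throughout: `GradedToken`, `FLOOR_ROLES`, and both counting functions are illustrative stand-ins, not the scope-gap implementation or the contents of test/yaml-depth-witnesses.ts.

```typescript
// One graded token: the role it should have and the role it was painted with.
interface GradedToken {
  text: string;
  expected: string;
  actual: string;
}

// Roles assumed to be excluded from the metric as "lexical floor".
const FLOOR_ROLES = new Set<string>(["punctuation"]);

// Counts mismatches, skipping tokens whose expected role is floor-excluded.
function floorExcludedWrong(tokens: GradedToken[]): number {
  return tokens.filter(t => !FLOOR_ROLES.has(t.expected) && t.actual !== t.expected).length;
}

// Counts every mismatch, with no exclusions.
function rawWrong(tokens: GradedToken[]): number {
  return tokens.filter(t => t.actual !== t.expected).length;
}

// The inner "- b" of the #24 input before the fix: the dash is painted as string.
const innerItem: GradedToken[] = [
  { text: "-", expected: "punctuation", actual: "string" },
  { text: "b", expected: "string", actual: "string" },
];

console.log(floorExcludedWrong(innerItem)); // 0
console.log(rawWrong(innerItem));           // 1
```

The mis-painted dash is the only wrong token and it sits in an excluded role, so the first count stays at zero; asserting the raw scope directly is what makes the failure visible.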

Parser CST + the other six grammars byte-identical; src-coverage-yaml 100%,
scope-gap-yaml monogramWrong=0 (100%), redcmd-tm-diagnostics Onigmo-clean,
tree-sitter-yaml 97.8%, issue#12 10/10, sanity 15/15, agnostic 9/9.

Closes #23. Closes #24.

…rt (#23)

A value-leading `---` / `...` (`note: --- x`, `x: ... bar`, `- --- x`) was scoped
as a document marker in the flat YAML highlighter. The parser already constrains
the markers to stream position structurally (DocStart / DocEnd appear only in the
Stream grammar), but gen-tm emits the token pattern into every context, dropping
that constraint — the flat-highlighter analogue of a position the parser gets free.

Fix: anchor the markers to line start (YAML §9.1.1 makes a marker column-0-only),
which carries the same constraint into the derived grammar. This exposed a latent
lexer bug: `start()` compiled to a bare `^` under the sticky `y` matcher, which
matches only at index 0 (file start), so a marker at the start of a LATER line
(`# c\n---\n…`) stopped lexing — parser-alignment dropped 100% → 95%. `start()`
means line start everywhere else (it serializes to `^`, stripped in monarch /
tree-sitter), so the lexer is corrected to compile start-anchored token patterns
with the `m` flag — `^` then matches at every line start, restoring 100%.

Also adds test/yaml-depth-witnesses.ts: a raw-scope regression gate for the flat
highlighter's depth/position sites. The scope-gap metric reported monogramWrong=0
here because it is corpus-bound (these inputs aren't in yaml-test-suite) AND
excludes lexical-floor roles (a `-` mis-painted as string is invisible). The gate
constructs one witness per scanner state field and asserts the raw inner scope, so
neither blind spot can hide a regression. #24 (nested compact sequence sibling vs
plain-scalar fold) is tracked there as a known bug pending indent-region derivation.

Parser CST + the other six grammars byte-identical; src-coverage-yaml 100%,
scope-gap-yaml monogramWrong=0, tree-sitter-yaml 97.8%, js shebang unregressed.
…#24)

In the flat derived YAML TextMate highlighter, a nested compact sequence's
sibling item (`- - a\n  - b\n- c`) was wrongly swallowed by the preceding plain
scalar's multi-line fold: the inner `- b` lost its `punctuation` (sequence
indicator) scope and read as one `string.unquoted` token. The §2a' fold region
is LINE-relative (its `\1` is the line's leading whitespace), but a YAML
continuation is NODE-relative (more indented than the enclosing dash/key). For a
single sequence or a mapping the two coincide, so those folds are correct; they
diverge only for a COMPACT nested sequence, whose inner dash sits at column 2
after the outer `- ` prefix (not whitespace) — a sibling `- b` at column 2 reads
to a `\1=""` fold as "indented past column 0" and is folded.

Derive (from the grammar's block-sequence rule + indent config, gated on
`grammar.indent`) a column-anchored COMPACT block-sequence region (gen-tm §2c):
a `\G`-anchored begin/while — re-anchored each line by the meta.stream wrapper —
that reclaims the inner sibling `- ` at the inner indicator's column before the
plain fold can swallow it. It mirrors the maintained RedCMD YAML grammar's
block-sequence but uses only a FIXED-width compact re-anchor lookbehind
(`(?<=[-?:])`), portable under Onigmo (RedCMD's variable-length
`(?<![^\t ][\t ]*+:|---)` is rejected by Onigmo / GitHub-Linguist). It triggers
ONLY on the compact case (a dash followed by another dash), so single `- a`
items, `- key: v` mappings, `- {…}`/`- "…"`/`- |` values, and plain folds are
untouched; the item body reuses the full top-level dispatch (minus the two
line-relative folds, plus a bounded plain-fold for a bare plain item value).

Gated on `grammar.indent`, so the six other grammars regenerate byte-identical.
The yaml-depth-witnesses #24 case is now an asserted pass (counter-proof
`x: hello\n  - b` still folds); scope-gap stays 100% / 0 monogram-wrong (and now
also correctly scopes three real corpus compact-sibling cases: 3ALJ, 6BCT,
W42U); parser src-coverage 100%; portability, issue#12, sanity, agnostic, and
the tree-sitter target all unchanged.
@johnsoncodehk johnsoncodehk changed the title Fix value-leading YAML document markers; add a depth-witness gate (#23) Fix flat YAML highlighter depth/position bugs (#23, #24) + a depth-witness gate Jun 8, 2026
…sidual)

The #24 fix's COMPACT block-sequence region (gen-tm §2c) reclaimed a dash at ANY
depth as a sibling (its while arm 1 used `[\t ]*${dash}`), so a `-`-led
continuation indented STRICTLY DEEPER than the inner indicator kept its `-`
scoped `punctuation` instead of folding into the plain scalar — `- - a\n   - b`
is `[["a - b"]]` (the deeper `- b` is plain content, not an item), the one
residual the witness tracked as a known-bug.

Pin the inner indicator's column PORTABLY instead of matching any depth. The begin
captures the indicator run between the outer and inner dash as group 4, so the
while reconstructs the inner column as `\1\2 \4` (outer indent + the dash's own
column + the captured run — a multi-space compact `-  - x` pins correctly too):
arm 1 reclaims a dash AT EXACTLY that column (a sibling -> punctuation), arm 2 is a
zero-width lookahead that keeps the region alive on a strictly-deeper line so a
nested deeper #block-sequence (re-opened per compact level) gets first claim on its
own sibling, and a deeper line that opens no nested sequence is folded by a new
body rule #block-fold (`^([\t ]+)…(plain run)`, anchored at line start so it never
fires on the header line's inline inner item; excludes a comment or a deeper
`key:` so a mapping item value's deeper entry keeps its structure). No
variable-length lookbehind, so it stays portable under Onigmo / GitHub-Linguist
(RedCMD achieves the same semantics only with a rejected variable-length
lookbehind).

The depth-witnesses deeper-irregular case is now an asserted pass (the column-
aligned sibling and counter-proof still hold; a deeper-NESTED sibling
`- - - a\n    - b` stays punctuation). Highlighter-only and gated on
`grammar.indent`: scope-gap stays 100% / 0 monogram-wrong, parser src-coverage
100%, the six other grammars + tree-sitter regenerate byte-identical, and
portability / issue#12 / sanity / agnostic are unchanged.

Development

Successfully merging this pull request may close these issues.

YAML highlighter: nested compact sequence item swallowed by plain-scalar fold
YAML highlighter: value-leading ---/... scoped as document markers
