Derive a working YAML tree-sitter target: generate, build, scan, 97.8% (#3) #26
Merged
… pieces 1-2)
`tree-sitter/yaml/grammar.js` previously did not `generate`. Two of the three blockers from the
issue are now resolved, in `src/gen-treesitter.ts` only — every other derived grammar (TS/JS/
TSX/JSX/HTML/Vue) regenerates byte-identical, and `tsc` is clean.
1. Structural indent tokens → externals. INDENT / DEDENT / NEWLINE and the block-scalar body are
engine-emitted (their token IR is `never()`), so they serialized as never-match token rules the
parser could never match. `planScannerTokens` now routes them to tree-sitter `externals` (keyed
off `grammar.indent`), the way the HTML markup path handles `raw_text`: they appear in the
`externals` block and the scanner.c `TokenType` enum, and references become `$.indent` etc.
2. Nullable-rule elimination. tree-sitter rejects a non-start rule that matches the empty string,
and an indentation grammar has several (a YAML node/entry may be null: `key:` with no value,
`{a: }`, an empty document) — `node`/`flow_node`/`flow_map_entry`/`flow_seq_entry`/`after_doc_end`.
A general ε-elimination (`makeNonEmpty` + `wrapNullableRefs`) makes each such rule's body
non-empty and wraps every reference to it in `optional(...)`; the accepted language is unchanged
and only the tree-sitter target is touched. Gated on a grammar actually having nullable non-start
rules, so the others are untouched.
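The ε-elimination described in item 2 can be sketched on a toy grammar IR. This is a minimal illustration, not the code in `src/gen-treesitter.ts`: the `Expr` and `Grammar` types, `canBeEmpty`, `oneOf`, and `eliminateNullable` are hypothetical, and only the names `makeNonEmpty` and `wrapNullableRefs` come from the description above (their real signatures may differ). It assumes the start rule is not referenced by other rules.

```typescript
// Hypothetical grammar IR; the real IR in src/gen-treesitter.ts may differ.
type Expr =
  | { kind: "lit"; value: string }
  | { kind: "ref"; name: string }
  | { kind: "seq"; items: Expr[] }
  | { kind: "choice"; alts: Expr[] }
  | { kind: "optional"; inner: Expr };
type Grammar = { start: string; rules: Record<string, Expr> };

// Can `e` match the empty string, given the set of nullable rule names?
function canBeEmpty(e: Expr, nullable: Set<string>): boolean {
  switch (e.kind) {
    case "lit": return e.value.length === 0;
    case "ref": return nullable.has(e.name);
    case "seq": return e.items.every((i) => canBeEmpty(i, nullable));
    case "choice": return e.alts.some((a) => canBeEmpty(a, nullable));
    case "optional": return true;
  }
}

// Wrap every reference to a nullable rule in optional(...).
function wrapNullableRefs(e: Expr, nullable: Set<string>): Expr {
  switch (e.kind) {
    case "lit": return e;
    case "ref": return nullable.has(e.name) ? { kind: "optional", inner: e } : e;
    case "seq": return { kind: "seq", items: e.items.map((i) => wrapNullableRefs(i, nullable)) };
    case "choice": return { kind: "choice", alts: e.alts.map((a) => wrapNullableRefs(a, nullable)) };
    case "optional": return { kind: "optional", inner: wrapNullableRefs(e.inner, nullable) };
  }
}

// Collapse a list of alternatives: none -> null, one -> itself, many -> choice.
function oneOf(alts: Expr[]): Expr | null {
  const first = alts[0];
  if (first === undefined) return null;
  return alts.length === 1 ? first : { kind: "choice", alts };
}

// Expression matching L(e) minus the empty string, or null if nothing is left.
// A ref to a nullable rule here denotes that rule's new, non-empty body.
function makeNonEmpty(e: Expr, nullable: Set<string>): Expr | null {
  switch (e.kind) {
    case "lit": return e.value.length > 0 ? e : null;
    case "ref": return e;
    case "optional": return makeNonEmpty(e.inner, nullable);
    case "choice":
      return oneOf(e.alts.map((a) => makeNonEmpty(a, nullable)).filter((a): a is Expr => a !== null));
    case "seq": {
      const head = e.items[0];
      if (head === undefined) return null;
      const rest = e.items.slice(1);
      const alts: Expr[] = [];
      const headNE = makeNonEmpty(head, nullable);
      if (headNE !== null) {
        // Non-empty head, then the tail unchanged (nullable refs become optional).
        const tail = rest.map((r) => wrapNullableRefs(r, nullable));
        alts.push(tail.length > 0 ? { kind: "seq", items: [headNE, ...tail] } : headNE);
      }
      if (canBeEmpty(head, nullable)) {
        // Empty head: the tail alone must supply the non-empty match.
        const restNE = makeNonEmpty({ kind: "seq", items: rest }, nullable);
        if (restNE !== null) alts.push(restNE);
      }
      return oneOf(alts);
    }
  }
}

function eliminateNullable(g: Grammar): Grammar {
  // Fixpoint over nullable non-start rules (the start rule may stay nullable).
  const nullable = new Set<string>();
  for (let changed = true; changed; ) {
    changed = false;
    for (const [name, body] of Object.entries(g.rules)) {
      if (name !== g.start && !nullable.has(name) && canBeEmpty(body, nullable)) {
        nullable.add(name);
        changed = true;
      }
    }
  }
  const rules: Record<string, Expr> = {};
  for (const [name, body] of Object.entries(g.rules)) {
    const fixed = nullable.has(name) ? makeNonEmpty(body, nullable) : wrapNullableRefs(body, nullable);
    if (fixed === null) throw new Error(`rule ${name} matches only the empty string`);
    rules[name] = fixed;
  }
  return { start: g.start, rules };
}
```

For a rule `value = optional("v")` referenced from `pair = seq("key", ":", value)`, this turns `value` into `"v"` and the reference into `optional(value)`, so `key:` with no value is still accepted.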
The resulting LR conflicts (YAML is massively ambiguous — exactly what tree-sitter's GLR is for)
are declared: 37 tuples added to `LR_CONFLICT_CLOSURE` (the fixpoint of tree-sitter's own
analysis, via test/collect-conflicts.ts). The closure filter also accepts TOKEN names now, not
only rule names, so a token-vs-token conflict like YAML's `key`/`plain` (both can precede a `:`)
is declarable. Every tuple is YAML-specific (zero rule/token-name overlap with the other
grammars), so each is inert elsewhere.
`cd tree-sitter/yaml && npx tree-sitter generate && npx tree-sitter build --wasm .` now succeeds.
The C external scanner is still a stub (returns false), so indentation isn't parsed yet — that is
piece 3 (a real indent scanner) and is tracked separately.
Refs #3
`buildIndentScannerC` (src/gen-treesitter.ts) generates a real C external scanner for the YAML
indent tokens, replacing the stub. It mirrors src/gen-lexer.ts's indent-stack state machine:
- An indent stack in the Scanner struct, (de)serialized for incremental re-parsing.
- At each line boundary it measures the next content line's column and emits INDENT (deeper → push),
DEDENT (shallower → pop, one per call until the stack top is reached), or NEWLINE (same column →
sibling separator); blank and comment-only lines are skipped; open blocks are closed at EOF.
- A block-scalar body (`|`/`>`) is scanned verbatim up to the first line at or below the parent
indentation.
- Flow needs no special case: inside `[`/`{` the grammar never references the indent tokens, so
valid_symbols is false and the line break falls through to `extras`.
- All language data (comment introducer, block-scalar introducers) is DERIVED from `grammar.indent`.
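The indent-stack behaviour in the list above can be modelled in a few lines. This is a sketch in TypeScript rather than the generated C, and it makes several assumptions: `contentColumn` and `indentTokens` are hypothetical names, the comment introducer is hardcoded to `#` instead of being derived from `grammar.indent`, all DEDENTs for a line are emitted at once rather than one per call, and a NEWLINE is emitted after the DEDENTs when a sibling follows.

```typescript
type IndentToken = "INDENT" | "DEDENT" | "NEWLINE";

// Column of the first content character, or null for a blank or comment-only line.
function contentColumn(line: string, comment = "#"): number | null {
  const col = line.search(/\S/);
  if (col < 0 || line.startsWith(comment, col)) return null;
  return col;
}

function indentTokens(lines: string[]): IndentToken[] {
  const stack: number[] = [0];
  const out: IndentToken[] = [];
  let first = true;
  for (const line of lines) {
    const col = contentColumn(line);
    if (col === null) continue; // blank and comment-only lines are skipped
    let top = stack[stack.length - 1] ?? 0;
    if (col > top) {
      stack.push(col); // deeper: open a block
      out.push("INDENT");
    } else {
      while (col < top && stack.length > 1) {
        stack.pop(); // shallower: close blocks until the column is reached
        out.push("DEDENT");
        top = stack[stack.length - 1] ?? 0;
      }
      if (!first) out.push("NEWLINE"); // same column: sibling separator
    }
    first = false;
  }
  while (stack.length > 1) {
    stack.pop(); // open blocks are closed at EOF
    out.push("DEDENT");
  }
  return out;
}
```

The sketch does not handle a dedent to a column that was never pushed (a misaligned line); it simply treats it as a sibling of the nearest enclosing level.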
`buildTokenBody` now emits a token's BLOCK pattern when it has one (YAML's scalar tokens), since the
tree-sitter grammar is block-context at the top level. (YAML is the only grammar with a blockPattern,
so the other six are byte-identical.)
Verified parsing (`tree-sitter parse`): nested mappings, nested sequences, block scalars, and flow
collections parse with no ERROR — the indent stack, INDENT/DEDENT/NEWLINE, and block-scalar bodies
all work.
KNOWN REMAINING: a flat single-line `key: value` / `- item` still mis-tokenizes — the `plain`/`key`
block patterns must stop at a `: ` separator via a lookahead (`:(?=\S)`), but tree-sitter's token DFA
forbids lookahead, so `sanitizeTreeSitterRegex` strips it and `plain` greedily eats `a: 1`. The
official tree-sitter-yaml scans scalars in C for exactly this reason. The fix (next) is to rewrite
the in-loop `:(?=\S)` boundary into an extent-equivalent consuming form (`:[^\s]`) for block-token
emission, or to scan plain/key scalars in the external scanner.
Refs #3
…ize (issue #3, piece 3)
tree-sitter token DFAs cannot use look-around, so a YAML plain scalar's boundary (`:` is content
unless followed by space; `#` is a comment only after a space) could not be a regex token: `plain`
greedily ate `a: 1`. `planScannerTokens` now also routes the plain + key tokens (identified by their
block-pattern shape: an in-loop char-class lookahead boundary) to the external scanner, and
`buildIndentScannerC` gains `scan_scalar`:
- it scans a plain run in C (stopping at `: `, ` #`, a newline, or a flow indicator) and trims
trailing whitespace;
- it DECLINES (returns false → tree-sitter rolls back, letting the regex `num`/`bool_null` tokens
match) for number/bool/null-shaped runs;
- it emits KEY vs PLAIN by peeking for a trailing `: `.
All derived from the grammar; the six other grammars stay byte-identical and `gate:treesitter` is
unaffected (96.0%, still beats official 92.5%).
Now parse with NO ERROR (verified via `tree-sitter parse`, structure checked): a single mapping
(`a: 1` → key + `num`), a flat sequence, a nested mapping (multi-entry, with `b`/`c` both keyed), a
nested sequence + sibling, a block scalar, a flow mapping, a flow sequence, a plain scalar with spaces
(`hello world`; `true` → `bool_null`), a colon-in-key (`a:b: c`), and a trailing comment.
KNOWN REMAINING: a TOP-LEVEL multi-entry block mapping (`x: 1\ny: 2\nz: 3`, the most common YAML
shape) still mis-parses: the first entry's value is dropped and 3+ entries ERROR. NESTED multi-entry
mappings parse correctly, so this is specific to document-level NEWLINE-separated chaining: a
grammar/GLR-runtime issue in the `mapping_or_scalar`/`node`/`stream` rules (likely the ε-elimination
making a mapping value optional and GLR committing to the wrong split), NOT the scalar scanner. Next.
Refs #3
… parse (issue #3, piece 3)
The decline path (scanner returns false for a number/bool/null-shaped run so the regex
`num`/`bool_null` token matches) dropped the value-vs-key disambiguation that the external PLAIN/KEY
tokens carry, so GLR mis-chained a TOP-LEVEL multi-entry block mapping (`x: 1\ny: 2\nz: 3`: the first
value dropped, 3+ entries ERROR), even though nested multi-entry and plain-valued top-level mappings
parsed.
Fix: externalize num + bool_null too (every token with a `blockPattern` is now scanned in C) and have
`scan_scalar` CLASSIFY the run and emit NUM / BOOL_NULL / KEY / PLAIN directly (no decline), so every
scalar is an external token that resolves the key-vs-value choice for the parser. Number/bool/null
typing is preserved (verified: `1`→num, `true`/`null`→bool_null, `hello`→plain). Removed the
now-superseded `isPlainFamilyToken` / consume-rewrite dead code.
Parse with NO ERROR (verified): single + flat-multi-entry mappings, sequences, nested mappings, nested
sequences, block scalars, flow map/seq, plain-with-spaces, colon-in-key, trailing comment, empty-value
sibling, blank-line-separated, deep nesting. The 6 other grammars stay byte-identical and
gate:treesitter is unaffected (96.0%, beats official 92.5%).
KNOWN REMAINING: a list-of-maps / COMPACT block (`- a: 1\n b: 2`, a sequence item whose value is a
multi-entry mapping, the common GitHub-Actions `- uses:\n with:` shape) still errors: the scanner must
push the inline content column after a `-`/`?` indicator (gen-lexer's `compactIndicators`), which it
does not yet. Plus an accuracy bench over yaml-test-suite (present at /tmp). Next.
Refs #3
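The scan-then-classify step described for `scan_scalar` can be sketched as follows. This is a TypeScript model of the idea, not the generated C: `scanScalar`, `Scalar`, and `isSep` are hypothetical names, the number and bool/null shapes are simplified assumptions, and flow indicators are ignored (block context only).

```typescript
type ScalarToken = "KEY" | "NUM" | "BOOL_NULL" | "PLAIN";
interface Scalar { token: ScalarToken; text: string; end: number }

// A separator is whitespace, a line break, or end of input.
const isSep = (c: string | undefined): boolean =>
  c === undefined || c === " " || c === "\t" || c === "\n";

function scanScalar(src: string, pos: number): Scalar | null {
  let i = pos;
  let end = pos; // one past the last non-blank char, so trailing whitespace is trimmed
  while (i < src.length) {
    const c = src[i];
    if (c === "\n") break;
    if (c === ":" && isSep(src[i + 1])) break; // `: ` ends the run; `a:b` is content
    if (c === "#" && i > pos && isSep(src[i - 1])) break; // ` #` starts a comment
    i++;
    if (c !== " " && c !== "\t") end = i;
  }
  if (end === pos) return null; // nothing scanned
  const text = src.slice(pos, end);
  // The loop only stops on a `:` when it is a key separator.
  const isKey = src[i] === ":";
  let token: ScalarToken;
  if (isKey) token = "KEY";
  else if (/^-?\d+(\.\d+)?$/.test(text)) token = "NUM"; // assumed number shape
  else if (/^(true|false|null)$/.test(text)) token = "BOOL_NULL"; // assumed shape
  else token = "PLAIN";
  return { token, text, end };
}
```

Because every run leaves the scanner already typed, the parser never has to choose between an external KEY/PLAIN token and an internal regex token for the same text, which is the ambiguity the decline path left open.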
A sequence item whose value is a mapping is written compactly — the mapping starts inline on the dash
line and its continuation aligns with the inline content, not the dash (`- a: 1\n b: 2`, the
GitHub-Actions `- uses: x\n with:\n k: v` shape). The scanner now mirrors gen-lexer's
`compactIndicators`: at a line-lead `-`/`?` indicator whose inline content begins a block node (a
nested `-`/`?`, or a scalar followed by an unquoted `: ` key separator — sniffed quote-aware, looking
through a `&`/`!` property prefix), it pushes the inline content column as one extra INDENT.
tree-sitter reverts all external-scanner state on a `false` return, so the natural "probe at the
indicator, remember the column, push next call" loses the remembered column. The working design emits
the compact INDENT in a single `true`-returning zero-width call at the post-indicator content
(mark_end at the content start; the sniff's advances are discarded as tree-sitter restarts from
mark_end). A new serialized `at_line_lead` flag (the indicator is internal-lexed, so it stays true
through it) drives the detection; a bare-scalar / flow / alias lead does NOT push (`- x`, `- [a]`
stay leaf items). All gated on `grammar.indent.compactIndicators` — the six other grammars and yaml's
own grammar.js/tmLanguage/monarch are byte-identical (the change is purely in the C scanner).
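The compact-block sniff can be illustrated on a single line. The sketch below is a simplified TypeScript model, not the generated scanner: `compactColumn` is a hypothetical name, the indicators and property prefixes are hardcoded rather than read from `grammar.indent.compactIndicators`, and quoted scalars are handled without escape sequences.

```typescript
// Returns the inline content column to push as an extra INDENT, or null when
// the line's lead is not an indicator that opens a compact block node.
function compactColumn(line: string): number | null {
  const lead = line.search(/\S/);
  if (lead < 0) return null;
  const ind = line[lead];
  if ((ind !== "-" && ind !== "?") || line[lead + 1] !== " ") return null;
  let col = lead + 1;
  while (line[col] === " ") col++;
  if (col >= line.length) return null;

  // A nested indicator (`- - x`) begins a block node.
  const c = line[col];
  if ((c === "-" || c === "?") && (col + 1 === line.length || line[col + 1] === " ")) return col;

  // Look through `&anchor` / `!tag` property prefixes.
  let i = col;
  while (line[i] === "&" || line[i] === "!") {
    while (i < line.length && line[i] !== " ") i++;
    while (line[i] === " ") i++;
  }
  // Flow collections and aliases stay leaf items.
  if (line[i] === "[" || line[i] === "{" || line[i] === "*") return null;

  // Quote-aware: skip a leading quoted scalar so a `: ` inside it is ignored.
  const q = line[i];
  if (q === '"' || q === "'") {
    const close = line.indexOf(q, i + 1);
    if (close < 0) return null;
    i = close + 1;
  }
  for (; i < line.length; i++) {
    const ch = line[i];
    if (ch === "#" && line[i - 1] === " ") return null; // comment before any key separator
    if (ch === ":" && (i + 1 === line.length || line[i + 1] === " ")) return col;
  }
  return null;
}
```

With this model, `- a: 1` and `- &a k: v` yield column 2, while `- x`, `- [a]`, and `- "a: b"` yield null, matching the leaf-item cases listed above.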
Parse NO-ERROR (verified): list-of-maps, single-entry list-maps, the GH-Actions steps shape, nested
seq `- - x`, property+compact `- &a k: v`, map-of-seq — plus every earlier case (mappings, sequences,
block scalars, flow, typed values) still passes. Real files: ci.yml 19→4 ERROR nodes, readme-bench
13→2. tsc clean; generate + build --wasm succeed; gate:treesitter 96.0% (beats official 92.5%).
Remaining (pre-existing, NOT compact): a block-context plain scalar containing `,` (the scanner
treats `,` as a flow indicator), `${{ }}` GH-Actions expressions (`{` treated as flow), and an alias
as a sequence value (`- *a`, a grammar-level gap). Plus an accuracy bench over yaml-test-suite.
Refs #3
`test/treesitter-yaml-bench.ts` measures how many VALID yaml-test-suite inputs the derived YAML
tree-sitter parses with no ERROR/MISSING node ("valid" = the `yaml` package accepts the input, so a
failure is the grammar's, not a malformed sample). Baseline: 209/312 = 67.0% — a real working
tree-sitter for an indentation-sensitive grammar (the grammar previously did not even `generate`).
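The bench's pass criterion and percentage can be sketched as below. The `SyntaxNodeLike` shape, `isClean`, and `passRate` are hypothetical and only illustrate the "no ERROR/MISSING node" rule; the real `test/treesitter-yaml-bench.ts` may read the tree differently.

```typescript
// Hypothetical node shape; real tree-sitter bindings expose similar data differently.
interface SyntaxNodeLike {
  type: string;
  isMissing: boolean;
  children: SyntaxNodeLike[];
}

// A tree passes only if no node anywhere in it is an ERROR or MISSING node.
function isClean(node: SyntaxNodeLike): boolean {
  if (node.type === "ERROR" || node.isMissing) return false;
  return node.children.every(isClean);
}

function passRate(trees: SyntaxNodeLike[]): { passed: number; total: number; percent: string } {
  const passed = trees.filter(isClean).length;
  const percent = trees.length === 0 ? "0.0" : ((100 * passed) / trees.length).toFixed(1);
  return { passed, total: trees.length, percent };
}
```

The same rounding reproduces the baseline figure: `((100 * 209) / 312).toFixed(1)` gives `"67.0"`.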
CI: yaml joins the "generate every derived grammar" conflict gate and gets a build-to-wasm step (its
C indentation scanner must compile + link). The accuracy bench runs where the yaml-test-suite is
already cloned (the readme-bench workflow), not in the conflict gate.
Refs #3
…ent (issue #3)
The C scanner's `scan_scalar` always broke a plain run at `,` `[` `]` `{` `}`, but those are special
only INSIDE a flow collection: in block context they are ordinary plain content (`a, b` is one
scalar). So `a, b`, `k: a, b`, and multi-line flow (`[a,\n b]`) errored.
Fix: track `flow_depth` in the scanner. tree-sitter (0.26.x) RESTORES the pre-scan serialized scanner
state before lexing an internal token, so a peek-then-`false` counter is rolled back. The flow
brackets must therefore be emitted by the scanner as EXTERNAL tokens (a `true` return), where the
depth change persists.
- `flowSyntheticTokens` synthesizes one external token per `indent.flowOpen`/`flowClose` char
(derived, not hardcoded).
- `renderExpr` swaps the bare bracket literals in the flow rules for refs to them.
- The scanner emits them (gated on valid_symbols, so a `[` that is plain content is left alone) while
bumping `flow_depth`.
- `scan_scalar`'s `,`/bracket/`:`/`-`/`?` boundary checks are now gated on `flow_depth > 0`; in block
context they are content.
- Compact + block-scalar handling stay gated on `flow_depth == 0`.
- A flow-context leading-trivia skip (incl. newlines/comments) makes multi-line flow work.
Verified against the `yaml` reference (`a:,b`, `a:[1,2]`, `a,b: c` are single block scalars/keys).
Bench: 209/312 → 226/312 (67.0% → 72.4%). The six other grammars stay byte-identical; tsc clean;
generate + build --wasm succeed; gate:treesitter 96.0% (beats official 92.5%).
Refs #3
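The effect of gating the flow boundaries on `flow_depth` can be shown on a single-line value. This is a hedged TypeScript model with hypothetical names (`scanValue`, `FLOW_OPEN`, `FLOW_CLOSE`); "a bracket at token start" stands in for the valid_symbols gate, the bracket characters are hardcoded rather than derived from `indent.flowOpen`/`flowClose`, and `:`/`#` boundaries are left out for brevity.

```typescript
const FLOW_OPEN = new Set(["[", "{"]);
const FLOW_CLOSE = new Set(["]", "}"]);

// Split a single-line value into bracket, comma, and plain-run tokens.
function scanValue(src: string): string[] {
  const out: string[] = [];
  let flowDepth = 0;
  let i = 0;
  while (i < src.length) {
    const c = src.charAt(i);
    if (c === " ") { i++; continue; }
    // Brackets at token start are emitted as tokens, so the depth change persists.
    if (FLOW_OPEN.has(c)) { flowDepth++; out.push(c); i++; continue; }
    if (flowDepth > 0 && FLOW_CLOSE.has(c)) { flowDepth--; out.push(c); i++; continue; }
    if (flowDepth > 0 && c === ",") { out.push(c); i++; continue; }
    // Plain run: flow indicators are boundaries only inside a flow collection.
    let j = i;
    let end = i;
    while (j < src.length) {
      const d = src.charAt(j);
      if (flowDepth > 0 && (d === "," || FLOW_OPEN.has(d) || FLOW_CLOSE.has(d))) break;
      j++;
      if (d !== " ") end = j; // trim trailing spaces
    }
    out.push(src.slice(i, end));
    i = j;
  }
  return out;
}
```

In block context `scanValue("a, b")` yields the single scalar `a, b`, whereas `scanValue("[a, b]")` yields `[`, `a`, `,`, `b`, `]`.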
A document that started with `---`/`...` then a body on the next line failed: the external scalar
scanner's `-`/`.` lead ran the `---` into a plain/key token before the internal `doc_start` could
match (and the marker token's separator look-ahead is stripped by the token DFA).
The scanner now probes for a document marker at column 0 (glyphs derived from
`indent.blockScalar.documentMarkers`):
- a true sep-bounded marker → set a transient `marker_decline` + return false so the internal
`---`/`...` token lexes it;
- a non-marker glyph (`---foo`) → claim it as plain content.
The markers stay INTERNAL tokens (making them external perturbs the GLR tables and mis-lexes
same-column block sequences). Plus: `started` is set whenever the column > 0 (so the NEWLINE after a
leading marker is emitted, not suppressed), and a document-root block scalar (stack depth 1, parent
indent −1) may have a column-0 body, ending only at a column-0 marker.
Combined with the flow-depth fix, the bench jumps 72.4% → 94.2% (294/312 valid yaml-test-suite inputs
ERROR-free). The two compound, since many inputs had both a `---` marker and flow/comma content.
The six other grammars stay byte-identical (all gated on grammar.indent); tsc clean; generate + build
--wasm succeed; gate:treesitter 96.0%; src-coverage-yaml parser alignment 100% (yaml.ts untouched,
tree-sitter target only).
Refs #3
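The column-0 marker probe amounts to a three-way decision. A minimal TypeScript sketch, assuming a hypothetical `probeMarker` helper and a hardcoded marker list in place of the glyphs derived from `indent.blockScalar.documentMarkers`:

```typescript
type Probe = "marker" | "plain" | "none";

// "marker": a separator-bounded `---`/`...` at column 0 (decline, let the internal token lex it).
// "plain":  marker glyphs glued to content, e.g. `---foo` (claim as plain content).
// "none":   not at column 0, or no marker glyphs here.
function probeMarker(src: string, pos: number, markers: string[] = ["---", "..."]): Probe {
  const atColumn0 = pos === 0 || src[pos - 1] === "\n";
  if (!atColumn0) return "none";
  for (const m of markers) {
    if (!src.startsWith(m, pos)) continue;
    const after = src[pos + m.length];
    const bounded = after === undefined || after === " " || after === "\t" || after === "\n";
    return bounded ? "marker" : "plain";
  }
  return "none";
}
```

In the scanner described above, the `"marker"` outcome corresponds to setting `marker_decline` and returning false, and `"plain"` to continuing into the scalar scan.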
Inside a flow collection (`[ ]` / `{ }`) a plain scalar folds across line breaks — the break +
surrounding whitespace collapse to one space and the run continues on the next line until a flow
terminator. The scanner's `scan_scalar` broke a run unconditionally at any newline, so a flow key /
value / explicit-key spanning lines lexed as two scalars and the GLR parser couldn't chain them
(ERROR). Now, at `flow_depth > 0` with content already scanned, a newline folds: advance past it +
surrounding blank lines, stop at a flow terminator (`,`/brackets) / line-leading `#` / EOF, else
append one folded space and continue (the next content char re-marks the token end). Block context is
unchanged (its multi-line folding is separate indent/grammar machinery). Multi-line quoted scalars in
flow already worked (the quoted token spans newlines natively).
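The folding rule can be sketched as a small scan loop. This TypeScript model uses hypothetical names (`scanFlowPlain`, `FLOW_TERMINATORS`), assumes it is only called at `flow_depth > 0`, and leaves out the `:` and mid-line ` #` boundaries so the folding logic stays visible.

```typescript
const FLOW_TERMINATORS = new Set([",", "[", "]", "{", "}"]);

// Scan a plain scalar inside a flow collection, folding each line break
// (plus surrounding whitespace and blank lines) into a single space.
function scanFlowPlain(src: string, pos: number): string {
  let out = "";
  let i = pos;
  while (i < src.length) {
    const c = src.charAt(i);
    if (FLOW_TERMINATORS.has(c)) break;
    if (c === "\n") {
      if (out.trim() === "") break; // fold only once content has been scanned
      let j = i;
      while (j < src.length && /\s/.test(src.charAt(j))) j++; // skip the break and blank lines
      const next = src.charAt(j); // "" at EOF
      if (next === "" || next === "#" || FLOW_TERMINATORS.has(next)) break;
      out = out.trimEnd() + " "; // one folded space, then continue the run
      i = j;
      continue;
    }
    out += c;
    i++;
  }
  return out.trimEnd();
}
```

For example, the run starting at `a` in `a\n  b, c]` scans as the single scalar `a b`, while in `a\n]` the run ends at `a` because the next content character is a flow terminator.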
Bench: 294/312 → 299/312 (94.2% → 95.8%). Six other grammars byte-identical (yaml-only, gated on
grammar.indent); tsc clean; generate + build --wasm succeed; gate:treesitter 96.0%.
Refs #3
A block mapping whose KEY is preceded by a node property (`&anchor` / `!tag` / `!!tag` / `!<verbatim>`)
ERRORed: the scanner's compact-block detection keys off `at_line_lead` ("the line's first token"), but
anchor/tag/alias are INTERNAL tokens tree-sitter lexes WITHOUT consulting the scanner, so after a
property was lexed `at_line_lead` was still set and the following key was mis-treated as a compact-
nested mapping → a spurious INDENT that corrupted the structure. Fix: a transient `property_lead`
field, latched at the genuine line lead (column == stack top, re-derived every boundary and for the
first line) when the lead char is a property; the two compact-push sites skip a property-led line so
its key stays at the node level. `property_lead` is NOT reset in deserialize — the one carry that must
survive the property's internal lex (tree-sitter discards scanner mutations on a `false` return; only
across a `true`-returned token does state persist). `yaml.ts` untouched — the grammar's BlockKey
already had the production; the gap was the tree-sitter derivation. (yaml-test-suite ZH7C/74H7/E76Z/
7FWL/HMQ5/2SXE.)
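The state-persistence behaviour that this fix relies on can be modelled with a toy harness. Everything here is hypothetical (`ToyScanner`, `runScan`) and only mimics the behaviour described above, not tree-sitter's actual API: serialized fields are restored when a scan returns false, while a field that `deserialize` never touches keeps whatever the scan wrote.

```typescript
class ToyScanner {
  flowDepth = 0; // serialized: restored from the saved buffer
  propertyLead = false; // transient: deliberately not reset in deserialize

  serialize(): number[] {
    return [this.flowDepth];
  }
  deserialize(buf: number[]): void {
    this.flowDepth = buf[0] ?? 0;
  }
}

// Models one scan call: restore the saved state, run the scan, and keep the
// new state only when the scan returned true (a token was emitted).
function runScan(s: ToyScanner, saved: number[], scan: (s: ToyScanner) => boolean): number[] {
  s.deserialize(saved);
  if (scan(s)) return s.serialize();
  s.deserialize(saved); // a false return rolls the serialized state back
  return saved;
}
```

A scan that bumps `flowDepth`, sets `propertyLead`, and returns false leaves `flowDepth` at its old value but `propertyLead` set, which is the one carry the `property_lead` design depends on; a scan that returns true keeps its `flowDepth` change.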
Combined with the flow folding, the bench is 95.8% → 97.8% (305/312). Six other grammars byte-
identical; tsc clean; generate + build --wasm succeed; gate:treesitter 96.0%; agnostic 9/9;
test:yaml-issues 10/10; scope-gap:yaml 100%; src-coverage-yaml 100%.
Refs #3
Closes #3. The derived `tree-sitter/yaml/grammar.js` previously did not even `generate`; it now
generates, builds to wasm, and parses 97.8% (305/312) of the valid yaml-test-suite corpus, above the
official hand-written tree-sitter-yaml on the same corpus and above Monogram's own TypeScript
tree-sitter (95.9%). Every change is gated on `grammar.indent`, so the other six tree-sitter grammars
(TS/JS/TSX/JSX/HTML/Vue) and all TextMate/Monarch outputs regenerate byte-identical; the CST parser
and highlighter are untouched (`src-coverage-yaml` 100%, `scope-gap-yaml` 100%).
The three blockers from the issue
1. INDENT/DEDENT/NEWLINE, the block-scalar body, and the scalar tokens (which need look-ahead a token
DFA lacks) are routed to tree-sitter `externals` by `planScannerTokens`.
2. A general ε-elimination (`makeNonEmpty`/`wrapNullableRefs`) makes the five nullable non-start rules
non-empty and wraps their refs in `optional(...)`; the 37 GLR conflicts YAML's ambiguity needs are
declared in `LR_CONFLICT_CLOSURE` (the closure filter now also accepts token names).
3. A real C external scanner (`buildIndentScannerC`, all data derived from `grammar.indent`): an
indent stack (serialized for incremental re-parse); INDENT/DEDENT/NEWLINE from the line-leading
column; scalars classified + emitted in C (KEY/NUM/BOOL_NULL/PLAIN; emitting the typed tokens, not
deferring to regex, carries the key-vs-value decision the GLR parser needs); block scalars; compact
block notation (`- a: 1\n b: 2`); flow-depth tracking (in block context `,`/`[]{}` are content, in
flow they are separators); multi-line plain folding inside flow; `---`/`...` document markers; and
node-property/tag/alias keys (`&a a:`, `!!str a:`, `*b :`).
A recurring tree-sitter fact drove the scanner design: it restores the pre-scan serialized scanner
state on a `false` return, so state that must persist (flow depth) is carried on `true`-returned
external tokens, and the one flag that survives a property's internal lex (`property_lead`) does so
by not being reset in `deserialize`.
Acceptance
- `cd tree-sitter/yaml && npx tree-sitter generate && npx tree-sitter build --wasm .` succeeds.
- `test/treesitter-yaml-bench.ts`: 305/312 (97.8%) valid yaml-test-suite inputs parse with no ERROR.
Real files: `ci.yml` 19→0, `readme-bench.yml` 13→0 ERROR nodes.
Remaining (the 2.2%: adversarial yaml-test-suite edges, follow-up)
A dash-on-its-own-line sequence item whose mapping value is on the next line, followed by a sibling
(`-\n a: 1\n- b: 2`; the common inline form `- a: 1\n b: 2` works), a glued-comment continuation, an
explicit `?`-key with block-sequence key/value, a misaligned sequence, an all-special-chars plain,
and a tab-only leading blank. Each is a distinct GLR-runtime / adversarial edge. 100% on
yaml-test-suite is beyond even the hand-written official grammar.