Skip to content

Infer decimal-int-string/non-decimal-int-string for regex capturing groups matching (non-)digits#5793

Open
phpstan-bot wants to merge 7 commits into
phpstan:2.2.xfrom
phpstan-bot:create-pull-request/patch-0m2x77i
Open

Infer decimal-int-string/non-decimal-int-string for regex capturing groups matching (non-)digits#5793
phpstan-bot wants to merge 7 commits into
phpstan:2.2.xfrom
phpstan-bot:create-pull-request/patch-0m2x77i

Conversation

@phpstan-bot
Copy link
Copy Markdown
Collaborator

Summary

PHPStan's regex array-shape matcher inferred numeric-string for capturing groups that match digit sequences (\d+, [0-9]+, …). Following the recent ctype_digit() work (#5311), these are better described as decimal-int-string, and groups that can only match non-digits as non-decimal-int-string. This PR teaches the regex AST walker to produce both accessory types.

Changes

  • src/Type/Regex/RegexGroupParser.php
    • createGroupType() now emits AccessoryDecimalIntegerStringType (instead of AccessoryNumericStringType) for digit-only groups, and AccessoryDecimalIntegerStringType(true) (non-decimal-int-string) for groups guaranteed to contain no digit.
    • The per-token walker tracks a single leading minus sign positionally, so -?[0-9]+ and -[0-9]+ remain decimal integers while a non-leading - (\d+-\d+) or a + ([+-]?\d+) disqualifies the inference.
    • Added negatedClassExcludesAllDigits() to detect negated classes ([^0-9], [^\d]) that exclude every digit; only those become non-decimal-int-string. A new inNegativeClass flag prevents the excluded digit tokens from disturbing the flags.
    • Decimal/non-decimal flags are AND-combined across alternation branches (with the sign flag reset per branch).
  • src/Type/Regex/RegexGroupWalkResult.php
    • Replaced isNumeric with isDecimalInteger, added isNonDecimalInteger and a seenDecimalIntegerSign flag.
  • Updated preg_match/preg_match_all/preg_replace_callback shape test expectations where digit captures now infer decimal-int-string (preg_match_shapes*.php, preg_match_all_shapes.php, bug-11293, bug-11311, bug-14177).

Root cause

The regex walker only modeled "matches digits" as numeric-string. numeric-string is broader than what a pure-digit regex matches (it also covers floats, exponents, + signs), so it under-described these groups. The walker already only set the numeric flag for pure-digit content, so the refinement to decimal-int-string is a strict improvement on the same code path. The inverse — a regex that can only match non-digit characters — was previously left as a plain (non-empty-)string; it is now non-decimal-int-string.

Test

  • New tests/PHPStan/Analyser/nsrt/bug-14750.php covers the reported group cases (-?[0-9]+, [3-9]+, \d+, [3-9], [^0-9]) plus edge cases: \d* (''|decimal-int-string), required leading minus, non-leading minus (\d+-\d+), +-sign rejection, quantified negated digit class, a negated class that doesn't exclude all digits ([^1-4]), and a digit/non-digit alternation. Confirmed failing before the fix and passing after.
  • Subject-type narrowing of the matched variable (lines 10/12 of the playground) is not implemented here; it requires full ^...$ anchor detection plus inverse narrowing in the falsey branch and is left as a follow-up.

Fixes phpstan/phpstan#14750

Comment thread src/Type/Regex/RegexGroupParser.php Outdated
Comment thread src/Type/Regex/RegexGroupParser.php Outdated
Copy link
Copy Markdown
Contributor

@staabm staabm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are missing support for

if (preg_match('/^-?[0-9]+$/', $x)) {
		assertType('decimal-int-string', $x);
	}

@phpstan-bot
Copy link
Copy Markdown
Collaborator Author

I've addressed the review. Here's a summary of what I did.

What the review asked for

staabm pointed out that subject narrowing was missing:

if (preg_match('/^-?[0-9]+$/', $x)) {
    assertType('decimal-int-string', $x);  // was inferred as non-empty-string
}

This is narrowing of the subject ($x itself), which the original PR explicitly deferred as a follow-up.

Changes (src/Type/Regex/RegexGroupParser.php)

  1. Anchored-subject narrowing — in parseGroups(), when the whole-pattern walk reports isDecimalInteger()->yes() and the pattern is anchored at both ends, the subject base type now becomes decimal-int-string (&non-falsy-string when a leading minus is required, or ''|decimal-int-string when zero-or-more digits allow an empty match).

  2. New isAnchored() helper (+ isStartAnchor/isEndAnchor) — detects ^/\A at the start and $/\z/\Z at the end, recursing through #expression/#concatenation/#alternation. Crucially, it ignores ^/$ under the m (multiline) modifier, since those become line anchors and the subject could then be a multi-line string.

  3. Empty literals made neutral — an empty literal value (anchors like ^/$, \b, \K) previously disqualified the decimal-integer inference; it's now treated as neutral, so the walk sees the digit body through the surrounding anchors. The non-empty guard is unchanged.

Anchoring is required because /[0-9]+/ (unanchored) matches a digit substring of e.g. "abc123", so only fully-anchored patterns may narrow the subject.

Tests

  • bug-14750.php — added a pregMatchSubjectNarrowing() covering the reviewer's exact case plus \A...\z, \d* (empty subject), required leading minus, unanchored (no narrowing), start-only anchoring (no narrowing), and multiline mode (no narrowing). Verified these 5 new assertions fail before the source change and pass after.
  • preg_match_shapes.php — two pre-existing assertions improved automatically (~^([157])$~ and /^[0-9]$/): the full-match element $matches[0] is the whole anchored subject and now correctly infers decimal-int-string. The latter's // could be numeric-string comment is now realized.

Verification

  • make phpstan: no errors.
  • Full NodeScopeResolverTest (1631 tests), Regex type/rule tests, and Php type extension tests: all green.

Committed and pushed as f6264fe9f.

I scoped this to the reviewer's request. One adjacent case — a top-level alternation of anchored decimal branches like /^\d+$|^-\d+$/ — is not narrowed, because the alternation AND-combine seeds from Maybe; fixing that would require touching the combine logic that capturing-group inference also relies on, so I left it out rather than risk regressions. The idiomatic /^-?\d+$/ form is fully covered.

Copy link
Copy Markdown
Contributor

@staabm staabm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need support for \A and \Z. just ^ and $ is enough

@phpstan-bot
Copy link
Copy Markdown
Collaborator Author

Done. I addressed staabm's review ("we don't need support for \A and \Z. just ^ and $ is enough").

Changes

src/Type/Regex/RegexGroupParser.php

  • isStartAnchor() now only recognizes ^ (dropped the \A branch).
  • isEndAnchor() now only recognizes $ (dropped the \z/\Z branch).
  • Updated the isAnchored() doc comment to say ^/$ instead of ^/\A and $/\z/\Z.
  • The multiline-mode guard (^/$ are line anchors under the m modifier) is preserved.

tests/PHPStan/Analyser/nsrt/bug-14750.php

  • Removed the \A\d+\z subject-narrowing test case, since those anchors are no longer treated as subject anchors.

Verification

  • NodeScopeResolverTest: 1631/1631 pass.
  • make phpstan: no errors.

Committed as 6f2549fde and pushed to create-pull-request/patch-0m2x77i.

@staabm staabm force-pushed the create-pull-request/patch-0m2x77i branch from 6f2549f to e41e798 Compare June 2, 2026 18:15
phpstan-bot and others added 3 commits June 3, 2026 11:08
…ng groups matching (non-)digits

- `RegexGroupParser` now emits `AccessoryDecimalIntegerStringType` for capturing groups that match only digits (e.g. `\d+`, `[0-9]+`, `[3-9]`), replacing the previous `numeric-string` inference, consistent with the recent `ctype_digit()` narrowing.
- A single leading minus sign is tracked positionally so `-?[0-9]+` / `-[0-9]+` stay decimal integers, while a non-leading `-` (`\d+-\d+`) or a `+` (`[+-]?\d+`) correctly disqualifies the decimal-integer inference.
- Negated character classes that exclude every digit (`[^0-9]`, `[^\d]`) now yield `non-decimal-int-string`, since they can only match non-digit characters. Negated classes that do not exclude all digits (`[^1-4]`) stay unrefined.
- `RegexGroupWalkResult` replaces `isNumeric` with `isDecimalInteger`, adds `isNonDecimalInteger` and a leading-sign flag, all combined across alternations.
- Updated existing preg shape tests where digit captures now infer `decimal-int-string`.
- Subject-type narrowing of the matched string itself (`preg_match('/^\d+$/', $x)` narrowing `$x`) is intentionally left out; it requires full-anchor detection and is tracked separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Removes the negatedClassExcludesAllDigits() helper (and the now-unused
$inNegativeClass parameter / non-decimal walk-result plumbing) per review.
Regex capturing groups that can only match non-digits are no longer
narrowed to non-decimal-int-string; decimal-int-string inference for
digit groups is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@staabm staabm force-pushed the create-pull-request/patch-0m2x77i branch from e41e798 to 4fdc652 Compare June 3, 2026 09:08
@staabm staabm requested a review from VincentLanglet June 3, 2026 09:23
Copy link
Copy Markdown
Contributor

@staabm staabm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to investigate this more

@phpstan-bot
Copy link
Copy Markdown
Collaborator Author

I've investigated the current review and the branch state. Here is an honest assessment.

The review is not actionable as a code change

The current review from staabm has the body "I need to investigate this more" with no specific change request. This is the reviewer signalling they're still investigating — not a concrete instruction to implement. Note that staabm had already APPROVED this PR at 2026-06-03T09:22:41Z; this follow-up comment reads as the reviewer wanting more time before merge, not a request for me to modify code.

I deliberately did not fabricate changes, since speculative edits would interfere with the reviewer's own investigation.

What I verified about the current branch

The branch (create-pull-request/patch-0m2x77i, HEAD 47c31ba75) is clean and green:

  • Regex type tests: 8/8 pass
  • Working tree clean, up to date with origin

One substantive fact the reviewer may be investigating

The branch currently does capturing-group inference only. The subject-narrowing case that staabm explicitly requested in an earlier CHANGES_REQUESTED review:

if (preg_match('/^-?[0-9]+$/', $x)) {
    assertType('decimal-int-string', $x);   // ← reviewer's requested behavior
}

…is not in the current branch. I confirmed live:

Pattern Expression Inferred now
/^-?[0-9]+$/ $x (subject) non-empty-string
/^(-?[0-9]+)$/ $m[1] (capture group) decimal-int-string

Per the PR comments, subject narrowing was implemented (commits f6264fe9f, 6f2549fde) but those commits are no longer in the branch — they were dropped during the rebase that removed the non-decimal-int-string inference. So the branch regressed against the reviewer's earlier explicit request, then was approved without it. This mismatch is very likely what "I need to investigate this more" is about.

Why I'm not just re-adding it

The reviewer's last word is "I need to investigate this more," not "re-add subject narrowing." Whether they want it back (and in what form — e.g. the /^\d+$|^-\d+$/ top-level-alternation case was deferred) is exactly the open question they're investigating. Guessing and pushing commits now would pre-empt that.

My recommendation: wait for staabm's investigation to conclude. If the conclusion is "subject narrowing should be restored," I can re-apply the isAnchored() / start-and-end-anchor logic from the earlier commits (without the non-decimal parts) and add back the pregMatchSubjectNarrowing() test cases. I've left the tree untouched and made no commits.

Copy link
Copy Markdown
Contributor

@staabm staabm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are good to go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support decimal-int-string in regex matching

2 participants