Skip to content

Infer decimal-int-string/non-decimal-int-string for regex capturing groups matching (non-)digits#5793

Open
phpstan-bot wants to merge 3 commits into
phpstan:2.2.xfrom
phpstan-bot:create-pull-request/patch-0m2x77i
Open

Infer decimal-int-string/non-decimal-int-string for regex capturing groups matching (non-)digits#5793
phpstan-bot wants to merge 3 commits into
phpstan:2.2.xfrom
phpstan-bot:create-pull-request/patch-0m2x77i

Conversation

@phpstan-bot
Copy link
Copy Markdown
Collaborator

Summary

PHPStan's regex array-shape matcher inferred numeric-string for capturing groups that match digit sequences (\d+, [0-9]+, …). Following the recent ctype_digit() work (#5311), these are better described as decimal-int-string, and groups that can only match non-digits as non-decimal-int-string. This PR teaches the regex AST walker to produce both accessory types.

Changes

  • src/Type/Regex/RegexGroupParser.php
    • createGroupType() now emits AccessoryDecimalIntegerStringType (instead of AccessoryNumericStringType) for digit-only groups, and AccessoryDecimalIntegerStringType(true) (non-decimal-int-string) for groups guaranteed to contain no digit.
    • The per-token walker tracks a single leading minus sign positionally, so -?[0-9]+ and -[0-9]+ remain decimal integers while a non-leading - (\d+-\d+) or a + ([+-]?\d+) disqualifies the inference.
    • Added negatedClassExcludesAllDigits() to detect negated classes ([^0-9], [^\d]) that exclude every digit; only those become non-decimal-int-string. A new inNegativeClass flag prevents the excluded digit tokens from disturbing the flags.
    • Decimal/non-decimal flags are AND-combined across alternation branches (with the sign flag reset per branch).
  • src/Type/Regex/RegexGroupWalkResult.php
    • Replaced isNumeric with isDecimalInteger, added isNonDecimalInteger and a seenDecimalIntegerSign flag.
  • Updated preg_match/preg_match_all/preg_replace_callback shape test expectations where digit captures now infer decimal-int-string (preg_match_shapes*.php, preg_match_all_shapes.php, bug-11293, bug-11311, bug-14177).

Root cause

The regex walker only modeled "matches digits" as numeric-string. numeric-string is broader than what a pure-digit regex matches (it also covers floats, exponents, + signs), so it under-described these groups. The walker already only set the numeric flag for pure-digit content, so the refinement to decimal-int-string is a strict improvement on the same code path. The inverse — a regex that can only match non-digit characters — was previously left as a plain (non-empty-)string; it is now non-decimal-int-string.

Test

  • New tests/PHPStan/Analyser/nsrt/bug-14750.php covers the reported group cases (-?[0-9]+, [3-9]+, \d+, [3-9], [^0-9]) plus edge cases: \d* (''|decimal-int-string), required leading minus, non-leading minus (\d+-\d+), +-sign rejection, quantified negated digit class, a negated class that doesn't exclude all digits ([^1-4]), and a digit/non-digit alternation. Confirmed failing before the fix and passing after.
  • Subject-type narrowing of the matched variable (lines 10/12 of the playground) is not implemented here; it requires full ^...$ anchor detection plus inverse narrowing in the falsey branch and is left as a follow-up.

Fixes phpstan/phpstan#14750

…ng groups matching (non-)digits

- `RegexGroupParser` now emits `AccessoryDecimalIntegerStringType` for capturing groups that match only digits (e.g. `\d+`, `[0-9]+`, `[3-9]`), replacing the previous `numeric-string` inference, consistent with the recent `ctype_digit()` narrowing.
- A single leading minus sign is tracked positionally so `-?[0-9]+` / `-[0-9]+` stay decimal integers, while a non-leading `-` (`\d+-\d+`) or a `+` (`[+-]?\d+`) correctly disqualifies the decimal-integer inference.
- Negated character classes that exclude every digit (`[^0-9]`, `[^\d]`) now yield `non-decimal-int-string`, since they can only match non-digit characters. Negated classes that do not exclude all digits (`[^1-4]`) stay unrefined.
- `RegexGroupWalkResult` replaces `isNumeric` with `isDecimalInteger`, adds `isNonDecimalInteger` and a leading-sign flag, all combined across alternations.
- Updated existing preg shape tests where digit captures now infer `decimal-int-string`.
- Subject-type narrowing of the matched string itself (`preg_match('/^\d+$/', $x)` narrowing `$x`) is intentionally left out; it requires full-anchor detection and is tracked separately.
Comment thread src/Type/Regex/RegexGroupParser.php Outdated
Comment thread src/Type/Regex/RegexGroupParser.php Outdated
phpstan-bot and others added 2 commits June 2, 2026 17:34
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Removes the negatedClassExcludesAllDigits() helper (and the now-unused
$inNegativeClass parameter / non-decimal walk-result plumbing) per review.
Regex capturing groups that can only match non-digits are no longer
narrowed to non-decimal-int-string; decimal-int-string inference for
digit groups is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@staabm staabm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are missing support for

if (preg_match('/^-?[0-9]+$/', $x)) {
		assertType('decimal-int-string', $x);
	}

@phpstan-bot
Copy link
Copy Markdown
Collaborator Author

I've addressed the review. Here's a summary of what I did.

What the review asked for

staabm pointed out that subject narrowing was missing:

if (preg_match('/^-?[0-9]+$/', $x)) {
    assertType('decimal-int-string', $x);  // was inferred as non-empty-string
}

This is narrowing of the subject ($x itself), which the original PR explicitly deferred as a follow-up.

Changes (src/Type/Regex/RegexGroupParser.php)

  1. Anchored-subject narrowing — in parseGroups(), when the whole-pattern walk reports isDecimalInteger()->yes() and the pattern is anchored at both ends, the subject base type now becomes decimal-int-string (&non-falsy-string when a leading minus is required, or ''|decimal-int-string when zero-or-more digits allow an empty match).

  2. New isAnchored() helper (+ isStartAnchor/isEndAnchor) — detects ^/\A at the start and $/\z/\Z at the end, recursing through #expression/#concatenation/#alternation. Crucially, it ignores ^/$ under the m (multiline) modifier, since those become line anchors and the subject could then be a multi-line string.

  3. Empty literals made neutral — an empty literal value (anchors like ^/$, \b, \K) previously disqualified the decimal-integer inference; it's now treated as neutral, so the walk sees the digit body through the surrounding anchors. The non-empty guard is unchanged.

Anchoring is required because /[0-9]+/ (unanchored) matches a digit substring of e.g. "abc123", so only fully-anchored patterns may narrow the subject.

Tests

  • bug-14750.php — added a pregMatchSubjectNarrowing() covering the reviewer's exact case plus \A...\z, \d* (empty subject), required leading minus, unanchored (no narrowing), start-only anchoring (no narrowing), and multiline mode (no narrowing). Verified these 5 new assertions fail before the source change and pass after.
  • preg_match_shapes.php — two pre-existing assertions improved automatically (~^([157])$~ and /^[0-9]$/): the full-match element $matches[0] is the whole anchored subject and now correctly infers decimal-int-string. The latter's // could be numeric-string comment is now realized.

Verification

  • make phpstan: no errors.
  • Full NodeScopeResolverTest (1631 tests), Regex type/rule tests, and Php type extension tests: all green.

Committed and pushed as f6264fe9f.

I scoped this to the reviewer's request. One adjacent case — a top-level alternation of anchored decimal branches like /^\d+$|^-\d+$/ — is not narrowed, because the alternation AND-combine seeds from Maybe; fixing that would require touching the combine logic that capturing-group inference also relies on, so I left it out rather than risk regressions. The idiomatic /^-?\d+$/ form is fully covered.

Copy link
Copy Markdown
Contributor

@staabm staabm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need support for \A and \Z. just ^ and $ is enough

@phpstan-bot
Copy link
Copy Markdown
Collaborator Author

Done. I addressed staabm's review ("we don't need support for \A and \Z. just ^ and $ is enough").

Changes

src/Type/Regex/RegexGroupParser.php

  • isStartAnchor() now only recognizes ^ (dropped the \A branch).
  • isEndAnchor() now only recognizes $ (dropped the \z/\Z branch).
  • Updated the isAnchored() doc comment to say ^/$ instead of ^/\A and $/\z/\Z.
  • The multiline-mode guard (^/$ are line anchors under the m modifier) is preserved.

tests/PHPStan/Analyser/nsrt/bug-14750.php

  • Removed the \A\d+\z subject-narrowing test case, since those anchors are no longer treated as subject anchors.

Verification

  • NodeScopeResolverTest: 1631/1631 pass.
  • make phpstan: no errors.

Committed as 6f2549fde and pushed to create-pull-request/patch-0m2x77i.

@staabm staabm force-pushed the create-pull-request/patch-0m2x77i branch from 6f2549f to e41e798 Compare June 2, 2026 18:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support decimal-int-string in regex matching

2 participants