Skip to content

[java] Tighten JSON number lexer to match RFC 8259#17739

Merged
shs96c merged 3 commits into
SeleniumHQ:trunkfrom
shs96c:shs-json-number-state-machine
Jul 3, 2026
Merged

[java] Tighten JSON number lexer to match RFC 8259#17739
shs96c merged 3 commits into
SeleniumHQ:trunkfrom
shs96c:shs-json-number-state-machine

Conversation

@shs96c

@shs96c shs96c commented Jul 2, 2026

Copy link
Copy Markdown
Member

nextNumber() collected any of - + 0-9 . e E and handed the string to Long.valueOf / BigDecimal. It accepted several spec-invalid forms:

  • +5 (leading plus)
  • 01, 007 (leading zeros)
  • .5 (no digit before decimal)
  • 5. (no digit after decimal)
  • 1e, - (sign / exponent without digits)

Very large exponents (1e9999) also silently produced Double.POSITIVE_INFINITY, which JSON cannot represent.

We now:

  • Have replaced the collector with a state machine that follows number = [minus] int [frac] [exp] (RFC 8259 §6).
  • Reject non-finite double values with a clear message.
  • Drop + from peek()'s NUMBER classifier — it never was a valid JSON number start.

@selenium-ci selenium-ci added the C-java Java Bindings label Jul 2, 2026
@shs96c shs96c force-pushed the shs-json-number-state-machine branch from 53d5bf5 to ee9fa0c Compare July 2, 2026 12:47
@shs96c shs96c marked this pull request as ready for review July 2, 2026 13:12
@qodo-code-review

Copy link
Copy Markdown
Contributor

PR Summary by Qodo

Tighten JsonInput number lexing to RFC 8259 grammar and reject non-finite doubles

🐞 Bug fix 🧪 Tests 🕐 20-40 Minutes

Grey Divider

AI Description

• Replace permissive number token collection with an RFC 8259 §6 state machine.
• Reject spec-invalid numeric forms (leading '+', leading zeros, missing digits).
• Fail fast on double overflow/NaN and add regression coverage for valid/invalid cases.
Diagram

graph TD
  T["JsonInputTest"] --> N["JsonInput.nextNumber()"] --> SM{"RFC 8259 number grammar"} --> OUT["Number (Long/double)"]
  N --> IN[("CharInput peek/read")]
  SM --> EX{{"JsonException"}}

  subgraph Legend
    direction LR
    _code["Code"] ~~~ _io[("Input")] ~~~ _decision{"Decision"} ~~~ _err{{"Error"}}
  end
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Regex/grammar pre-check before parsing
  • ➕ Compact validation logic that can be easier to audit against RFC grammar
  • ➕ Keeps parsing (Long/BigDecimal) separate from validation concerns
  • ➖ Harder to produce precise, localized error messages during streaming reads
  • ➖ Still needs careful integration with incremental input consumption
2. Always parse as BigDecimal and validate by re-serializing
  • ➕ Avoids double overflow concerns by keeping arbitrary precision longer
  • ➕ Potentially simplifies numeric type decisions
  • ➖ Behavioral change: callers currently receive Long for integers and double for decimals/exponents
  • ➖ Re-serialization/normalization does not reliably detect all original lexical invalid forms (e.g., leading zeros) without extra checks
3. Delegate to a standard JSON parser for tokenization
  • ➕ Leverages mature RFC-compliant number lexer implementations
  • ➕ Reduces maintenance surface area for lexer correctness
  • ➖ Likely invasive change to JsonInput streaming API and existing call sites
  • ➖ Adds dependency/behavioral differences not aligned with current lightweight parser

Recommendation: The PR’s state-machine approach is the best fit for a streaming lexer: it enforces RFC 8259 at read time, preserves existing return-type behavior (Long fast-path vs double), and provides targeted failures for invalid forms and non-finite doubles. Alternatives either complicate streaming consumption (regex) or require broader API/behavior changes (BigDecimal-only or external parser).

Files changed (2) +115 / -36

Bug fix (1) +67 / -36
JsonInput.javaEnforce RFC 8259 number grammar and reject non-finite doubles +67/-36

Enforce RFC 8259 number grammar and reject non-finite doubles

• Removes '+' from NUMBER detection in peek(). Replaces the permissive character collector in nextNumber() with a grammar-driven state machine that rejects leading zeros, missing digits around '.'/exponent, and missing integer parts. Adds a finite-check for BigDecimal-to-double conversion and introduces a small digit helper.

java/src/org/openqa/selenium/json/JsonInput.java

Tests (1) +48 / -0
JsonInputTest.javaAdd coverage for spec-invalid numbers and double overflow +48/-0

Add coverage for spec-invalid numbers and double overflow

• Introduces tests that assert JsonInput rejects previously-accepted RFC-invalid numeric lexemes and still accepts valid forms. Adds an explicit regression test ensuring extreme exponents that overflow to Infinity are rejected with an out-of-range error message.

java/test/org/openqa/selenium/json/JsonInputTest.java

@qodo-code-review

qodo-code-review Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (1) 📜 Skill insights (0)

Context used
✅ Compliance rules (platform): 18 rules

Grey Divider


Action required

1. nextNumber() diverges from Python 📘 Rule violation ≡ Correctness
Description
This PR changes Java JsonInput.nextNumber() to reject non-finite double results (e.g., 1e9999)
by throwing JsonException, which is a user-visible behavior change. There is no evidence in the
codebase of a cross-language binding comparison/alignment (e.g., Python still uses json.loads
directly), violating the cross-binding consistency requirement.
Code

java/src/org/openqa/selenium/json/JsonInput.java[R289-293]

+      double value = new BigDecimal(builder.toString()).doubleValue();
+      if (Double.isInfinite(value) || Double.isNaN(value)) {
+        throw new JsonException("Number is out of range for a double: " + builder + ". " + input);
+      }
+      return value;
Evidence
The Java change explicitly throws when BigDecimal(...).doubleValue() becomes Infinity/NaN,
making 1e9999 and similar inputs fail in Java. Python’s binding helper load_json delegates
directly to json.loads with no additional guardrails, indicating this behavior change has not been
cross-checked/aligned or documented across bindings as required.

Rule 389265: Compare cross-language bindings when changing user-visible behavior
java/src/org/openqa/selenium/json/JsonInput.java[284-293]
py/selenium/webdriver/remote/utils.py[22-27]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Java `JsonInput.nextNumber()` now rejects values that overflow to non-finite `double` (Infinity/NaN). This is a user-visible behavior change and should be compared against at least one other Selenium language binding; any intentional divergence must be documented and tests/docs updated accordingly.

## Issue Context
The Java binding now throws on non-finite `double` results derived from valid JSON number syntax with very large exponents. Other bindings (e.g., Python) appear to parse JSON via their standard libraries without an equivalent non-finite rejection, which can lead to cross-binding inconsistency.

## Fix Focus Areas
- java/src/org/openqa/selenium/json/JsonInput.java[284-293]
- py/selenium/webdriver/remote/utils.py[22-27]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Plus-signed strings rejected 🐞 Bug ≡ Correctness
Description
NumberCoercer reparses numeric STRING values via a nested JsonInput.nextNumber(); after this PR
drops '+' as a NUMBER start, quoted values like "+5" that previously coerced successfully will now
throw JsonException and fail coercion. This changes behavior beyond JSON-number tokens and can break
existing inputs that relied on permissive string-to-number coercion.
Code

java/src/org/openqa/selenium/json/JsonInput.java[R135-138]

      case '-':
-      case '+':
      case '0':
      case '1':
      case '2':
Evidence
JsonInput.peek() now only returns NUMBER for '-' or digits, so a nested JsonInput created from a
string starting with '+' will throw before nextNumber() can parse anything. NumberCoercer
explicitly uses this nested-JsonInput strategy for STRING→NUMBER coercion, so the stricter lexer
impacts that coercion path as well.

java/src/org/openqa/selenium/json/JsonInput.java[124-169]
java/src/org/openqa/selenium/json/JsonInput.java[223-247]
java/src/org/openqa/selenium/json/NumberCoercer.java[59-84]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`NumberCoercer` supports coercing JSON *strings* into numbers by reparsing the string with a nested `JsonInput` and calling `nextNumber()`. With this PR, `JsonInput.peek()` no longer classifies `'+'` as a number start, so numeric strings like `"+5"` (previously accepted) now fail coercion.

### Issue Context
This is a behavior change in the STRING→NUMBER coercion path (not in JSON-number token parsing). If backward compatibility is desired, keep strict RFC 8259 parsing for actual JSON numbers, but allow a leading `'+'` when coercing from a *string*.

### Fix Focus Areas
- java/src/org/openqa/selenium/json/NumberCoercer.java[59-84]

### Suggested fix
In the `case STRING:` branch, normalize the raw string before nested parsing, e.g. strip a single leading `'+'` (and optionally trim, if that was historically accepted), then parse using the nested `JsonInput`. Add/adjust a small unit test around `NumberCoercer` (or an existing coercion test) to lock in the intended behavior.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Informational

3. EOF shown as U+FFFF ✓ Resolved 🐞 Bug ◔ Observability
Description
In nextNumber(), several exception messages cast input.peek() (an int) to char; when the stream is
exhausted this renders EOF (-1) as '\uFFFF', producing misleading diagnostics for inputs like "-",
"5.", or "1e".
This is especially confusing because the codebase explicitly treats U+FFFF as a valid literal
character, so the error text can look like it saw a real character rather than end-of-input.
Code

java/src/org/openqa/selenium/json/JsonInput.java[R234-247]

+    int first = input.peek();
+    if (first == '0') {
+      builder.append((char) input.read());
+      // Leading zeros ("00", "01", ...) are not allowed.
+      if (isDigit(input.peek())) {
+        throw new JsonException("Leading zeros are not permitted in JSON numbers. " + input);
      }
-    } while (read);
+    } else if (first >= '1' && first <= '9') {
+      while (isDigit(input.peek())) {
+        builder.append((char) input.read());
+      }
+    } else {
+      throw new JsonException("Expected digit but saw '" + (char) first + "'. " + input);
+    }
Evidence
Input.peek() returns Input.EOF (-1) at end-of-input, and nextNumber() now casts these ints to
char inside error messages, which turns -1 into \uFFFF. The test suite also explicitly asserts
that literal U+FFFF is valid input, so reporting EOF as \uFFFF is ambiguous and confusing during
debugging.

java/src/org/openqa/selenium/json/JsonInput.java[223-281]
java/src/org/openqa/selenium/json/Input.java[31-38]
java/test/org/openqa/selenium/json/JsonInputTest.java[294-309]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`JsonInput.nextNumber()` builds error messages by casting `input.peek()` (an `int`) to `char`. When `peek()` returns `Input.EOF` (`-1`), the cast becomes `\uFFFF`, so errors at end-of-input misleadingly report that a U+FFFF character was seen.

### Issue Context
`Input.EOF` is intentionally `-1` to avoid colliding with any UTF-16 code unit, including U+FFFF; and there is a unit test ensuring U+FFFF is treated as a literal character.

### Fix Focus Areas
- java/src/org/openqa/selenium/json/JsonInput.java[223-281]
- java/src/org/openqa/selenium/json/Input.java[31-38]

### Suggested fix
- Add a small helper (e.g., `describeChar(int c)`) that returns `"<EOF>"` when `c == Input.EOF`, otherwise returns a quoted printable character.
- Use that helper in the three `JsonException` messages added in `nextNumber()` (expected digit, expected digit after '.', expected exponent digit), so EOF is reported clearly as EOF rather than `\uFFFF`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread java/src/org/openqa/selenium/json/JsonInput.java
@qodo-code-review

Copy link
Copy Markdown
Contributor

Code review by qodo was updated up to the latest commit e1f3ed2

shs96c added 3 commits July 3, 2026 11:22
The previous nextNumber() collected any subset of '- + 0-9 . e E' and
then handed the string to Long.valueOf or BigDecimal. That accepted
several spec-invalid forms: '+5' (leading plus), '01' (leading zero),
'.5' (no digit before decimal), '5.' (no digit after decimal), '1e'
(exponent without digits), '-' alone. Very large exponents also
silently produced Double.POSITIVE_INFINITY, for which JSON has no
representation.

Replace the collector with a proper state machine that follows the
grammar 'number = [minus] int [frac] [exp]', and reject non-finite
doubles. Also drop '+' from the peek() classifier so it no longer
claims to see a JSON number.
Casting Input.peek() to (char) turned end-of-input (-1) into U+FFFF,
which is misleading now that U+FFFF is treated as a legitimate string
character. Route the three sites in nextNumber() through a small helper
that renders EOF as <EOF> and printable code units in quotes.
@shs96c shs96c force-pushed the shs-json-number-state-machine branch from e1f3ed2 to 064268c Compare July 3, 2026 10:25
Comment thread java/src/org/openqa/selenium/json/JsonInput.java
Comment thread java/src/org/openqa/selenium/json/JsonInput.java
@qodo-code-review

Copy link
Copy Markdown
Contributor

Code review by qodo was updated up to the latest commit 064268c

@shs96c shs96c merged commit 740b692 into SeleniumHQ:trunk Jul 3, 2026
43 of 44 checks passed
@shs96c shs96c deleted the shs-json-number-state-machine branch July 3, 2026 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C-java Java Bindings

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants