gh-143196: fix heap-buffer-overflow in JSON encoder indent cache #143246
Conversation
- Validate that `_current_indent_level` is 0 when indent is set, preventing uninitialized cache access and re-entrant `__mul__` attacks
- Convert integer indent to a string of spaces in `encoder_new`, matching the Python-level behavior
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.
I suspect this PR has been generated by an LLM, or at least that the description has been. Can you confirm this? Because it's not a security issue. In addition, please read our policy on when using an LLM is acceptable: https://devguide.python.org/getting-started/generative-ai/ (if English isn't your first language, that's fine, but please indicate whether an LLM has been used and how).
@picnixz I'm sorry, I always considered a heap-buffer-overflow to be an exploitable issue. But I changed the language to "bug", which is perhaps more aligned with the reported issue. Is that ok?
Bugs like that only become security issues when they are exploitable, and we need to assess how feasible it is for an adversary to exploit them. In our case, the attack surface is very small and relies on invoking "semi-public" functions (and even then it would still need to be actually exploitable; usually there is just a hard crash, which could be considered a DoS in some sense if it's serving a web app with auto-reload, for instance). Anyway, I'd now like you to confirm whether this has been generated by an LLM or not (and, if so, which part). I'm sorry to be doubtful, but with the emergence of LLMs we have many new contributors who simply use an LLM to generate their PRs, and that is unacceptable.
@picnixz The PR description was proofread with ChatGPT, but if that represents a violation I'm ok with closing the PR.
If it's only the description, it's fine. But in the future, you don't need to have an LLM generate the description. We don't really care which files were modified, for instance (though it's good to know if breaking changes were introduced), or whether tests were added (we expect them to be added!). Affected versions are already mentioned on the issue as well.
```c
}

/* Convert indent to str if it's not None or already a string */
if (indent != Py_None && !PyUnicode_Check(indent)) {
```
Instead of these complex checks (argument 1 is also checked), I'd suggest that we convert __new__ to Argument Clinic. However, I'm not entirely sure that we won't lose perf. How many times do we call this function?
**Performance considerations**
For context, my original proposal was to use a simpler change:

```c
static PyObject *
encoder_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    ...
    if (indent != Py_None && !PyUnicode_Check(indent)) {
        PyErr_Format(PyExc_TypeError,
                     "make_encoder() argument 4 must be str or None, "
                     "not %.200s", Py_TYPE(indent)->tp_name);
        return NULL;
    }
    ...
}
```

Let me call that fix-v1.
After a comment/suggestion in the reported issue, I implemented the unicode check with the additional logical statements (that is the current branch); let me call that fix.
And finally we have main. I ran some tests against all three branches: basically a warm-up of the following snippet, then 100 timed runs to measure the average execution time.

```python
start = time.perf_counter()
json.dumps(data, indent=2)
end = time.perf_counter()
```

These are the results:
| Test | fix | fix-v1 | main |
|---|---|---|---|
| 128KB.json | 1.97 ms | 2.46 ms | 1.95 ms |
| 256KB.json | 2.04 ms | 2.03 ms | 2.03 ms |
| 512KB.json | 4.04 ms | 4.05 ms | 4.07 ms |
| 1MB.json | 8.83 ms | 8.00 ms | 8.14 ms |
| 5MB.json | 42.02 ms | 41.48 ms | 42.02 ms |
My local machine is a bit outdated, and I'm pretty confident these numbers would be lower on a faster CPU, but I do believe that performance is not impacted, or at least that my changes are statistically consistent with main.
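For reference, a minimal harness implementing that protocol might look like the following. This is a sketch: the payload file name, loading step, and loop shape are assumptions; only the warm-up-then-100-runs protocol comes from the comment above.

```python
import json
import time

# Hypothetical harness: load a payload, do one warm-up run, then
# average 100 timed runs. "128KB.json" is an assumed file name.
with open("128KB.json") as f:
    data = json.load(f)

json.dumps(data, indent=2)  # warm-up run

RUNS = 100
start = time.perf_counter()
for _ in range(RUNS):
    json.dumps(data, indent=2)
elapsed_ms = (time.perf_counter() - start) / RUNS * 1000
print(f"average: {elapsed_ms:.2f} ms")
```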
**Using the argument clinic**
I'm rabbit-holing a bit into that approach and am not entirely sure yet how it works. But my take on it: it would require some major changes to the module that I don't feel comfortable messing with. If that is the desired path, I will be happy to document my findings in the reported issue and leave it to more expert/capable hands to complete.
"How many times do we call this function?"
I'm not entirely sure what you mean by this question, sorry 😢. encoder_new is called once per JSONEncoder instance creation. In typical usage (json.dumps()):

```
json.dumps(data, indent=2)
|-- JSONEncoder.encode()
    |-- JSONEncoder.iterencode()
        |-- c_make_encoder()
        |   |-- encoder_new()   <= called only once
        |-- _iterencode(obj, 0)
            |-- encoder_listencode_obj()
                |-- encoder_listencode_list()
                |-- encoder_listencode_dict()
```

So it's called once per encoding operation. "We" don't call it; the user does, and, as I showed above, they are not going to experience any measurable change in performance. My apologies again if I misunderstood your question.
The "we" was a generic "we" (academical one). But you answered my question well. I wondered whether c_make_encoder was called multiple times when doing a single json.dumps().
Conversion to AC may introduce an overhead in how the arguments are pre-processed but if you're not comfortable with this, we can first fix this (and backport it) and then switch to AC for main.
```c
if (indent_level > 0) {
    memset(PyUnicode_1BYTE_DATA(indent), ' ', indent_level);
}
```
Suggested change:

```diff
-if (indent_level > 0) {
-    memset(PyUnicode_1BYTE_DATA(indent), ' ', indent_level);
-}
+memset(PyUnicode_1BYTE_DATA(indent), ' ', sizeof(Py_UCS1) * indent_level);
```
memset allows for a 0 size (and does nothing) as long as the pointer is valid. Alternatively, we could use PyUnicode_Fill but I think we usually bypass this and use memset directly.
Summary
This PR fixes two related bugs in the JSON encoder's indentation cache (introduced in gh-95382):

1. **Heap-buffer-overflow via uninitialized cache**: When `c_make_encoder` is called with `_current_indent_level > 0`, the cache is created with only 1 element, but `update_indent_cache` expects the cache to be built incrementally starting from level 0, causing out-of-bounds memory access.
2. **Use-after-free via re-entrant `__mul__`**: `PySequence_Repeat(s->indent, indent_level)` in `create_indent_cache` can execute arbitrary Python code through a custom `__mul__` method, potentially causing a use-after-free.
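For illustration, a call of roughly this shape reaches the uninitialized cache. This is a hedged sketch: it assumes the semi-public `c_make_encoder` binding re-exported by `json.encoder` and the argument order used in `JSONEncoder.iterencode`; on an unpatched build it may crash the interpreter.

```python
import json.encoder

# Build a C encoder object directly, the way JSONEncoder.iterencode()
# does internally (argument order assumed from json/encoder.py).
enc = json.encoder.c_make_encoder(
    None,                                     # markers
    lambda o: o,                              # default
    json.encoder.py_encode_basestring_ascii,  # string encoder
    "  ",                                     # indent: enables the cache
    ": ", ", ",                               # key/item separators
    False, False, True)                       # sort_keys, skipkeys, allow_nan

# Normal callers always start at _current_indent_level == 0; a nonzero
# level skips the incremental cache build and, on unpatched builds, can
# read past the one-element indent cache.
enc([[1]], 3)
```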
Changes

`Modules/_json.c`

- Indent validation and conversion in `encoder_new`:
  - accepts `str`, `int`, or `None` for indent
  - converts an `int` indent to a string of spaces (matching `encoder.py`)
  - raises `TypeError` or `ValueError` for invalid indent arguments
- Indent level validation in `encoder_call`:
  - requires `_current_indent_level == 0` when indent is set
  - removes the `__mul__` attack vector, since `PySequence_Repeat` is only called when `indent_level != 0`

`Lib/test/test_json/test_speedups.py`

- `test_indent_argument_to_encoder`: tests indent type validation and conversion (illustrated below)
- `test_nonzero_indent_level_with_indent`: tests indent level validation
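As a rough illustration of the behavior those tests pin down (an assumed sketch, not the actual test code from the PR):

```python
import json

# An int indent must behave exactly like the equivalent string of spaces.
assert json.dumps([1], indent=2) == json.dumps([1], indent="  ")

# Indent types other than str, int, or None are rejected.
try:
    json.dumps([1], indent=2.5)
except TypeError:
    pass
else:
    raise AssertionError("non-int/str indent should raise TypeError")
```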
Performance

No performance impact on the hot path. The indentation cache optimization from gh-95382 remains fully intact. Our changes only add:

- a one-time indent type check and conversion (`int` → `str`) in `encoder_new`
- a cheap `indent_level != 0` check in `encoder_call`

The performance-critical recursive encoding path (`encoder_listencode_list`, `encoder_listencode_dict`) is completely unchanged. The cache is still built incrementally and reused across recursive calls exactly as before.

Affected Versions

The fix applies cleanly to both `main` and `3.14` branches.

Related issue: `json.encoder` indentation cache via re-entrant `__mul__` #143196