
Conversation


@caverac caverac commented Dec 28, 2025

Summary

This PR fixes two related bugs in the JSON encoder's indentation cache (introduced in gh-95382):

  1. Heap-buffer-overflow via uninitialized cache: When c_make_encoder is called with _current_indent_level > 0, the cache is created with only 1 element, but update_indent_cache expects the cache to be incrementally built starting from level 0, causing out-of-bounds memory access.

  2. Use-after-free via re-entrant __mul__: PySequence_Repeat(s->indent, indent_level) in create_indent_cache can execute arbitrary Python code through a custom __mul__ method, potentially causing use-after-free.
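The `__mul__` vector can be illustrated with a small self-contained sketch (the `EvilIndent` class and the `calls` list are hypothetical, purely for illustration): repeating a `str` subclass dispatches to its `__mul__`, which is exactly the kind of arbitrary Python code `PySequence_Repeat` can end up running mid-encode.

```python
# Hypothetical illustration of the re-entrancy window: repeating a str
# subclass runs its __mul__, i.e. arbitrary Python code. In the C
# encoder, PySequence_Repeat(s->indent, indent_level) dispatches the
# same way, partway through encoding.
calls = []

class EvilIndent(str):  # hypothetical name, for illustration only
    def __mul__(self, n):
        calls.append(n)              # arbitrary code runs here; in the
        return str.__mul__(self, n)  # real bug it can free encoder state

indent = EvilIndent("  ")
result = indent * 3

print(repr(result))  # '      ' (6 spaces)
print(calls)         # [3] -- __mul__ observed the repeat
```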

Changes

Modules/_json.c

  1. Indent validation and conversion in encoder_new:

    • Accept str, int, or None for indent
    • Convert integer indent to string of spaces (matching Python-level behavior in encoder.py)
    • Reject other types with TypeError
    • Reject negative integers with ValueError
  2. Indent level validation in encoder_call:

    • Require _current_indent_level == 0 when indent is set
    • This prevents the uninitialized cache access
    • This also eliminates the __mul__ attack vector since PySequence_Repeat is only called when indent_level != 0

Lib/test/test_json/test_speedups.py

  • Added test_indent_argument_to_encoder: Tests indent type validation and conversion
  • Added test_nonzero_indent_level_with_indent: Tests indent level validation
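The validation added to `encoder_new` can be sketched at the Python level (a hypothetical `normalize_indent` helper; the exact C error messages may differ):

```python
def normalize_indent(indent):
    # Hypothetical Python mirror of the checks encoder_new now performs:
    # accept str/int/None, convert an int to a string of spaces, reject
    # other types (TypeError) and negative integers (ValueError).
    if indent is None or isinstance(indent, str):
        return indent
    if isinstance(indent, int):
        if indent < 0:
            raise ValueError("indent must be a non-negative integer")
        return " " * indent
    raise TypeError("indent must be str, int or None, "
                    "not %s" % type(indent).__name__)

print(repr(normalize_indent(2)))     # '  '
print(repr(normalize_indent(None)))  # None
```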

Performance

No performance impact on the hot path. The indentation cache optimization from gh-95382 remains fully intact.

Our changes only add:

| Addition | When it runs | Cost |
| --- | --- | --- |
| Indent conversion (int → str) | Once, at encoder creation | O(indent_level), typically 2-4 |
| `indent_level != 0` check | Once, at start of encoder_call | O(1), single integer comparison |

The performance-critical recursive encoding path (encoder_listencode_list, encoder_listencode_dict) is completely unchanged. The cache is still built incrementally and reused across recursive calls exactly as before.
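Conceptually, the cache works like this hypothetical Python rendering (not the C code): entry *k* holds the separator for nesting depth *k*, and levels are filled in order from 0, which is exactly the invariant the new `indent_level == 0` check protects.

```python
def cached_indent(cache, indent, level):
    # Hypothetical Python rendering of the incremental indent cache
    # from gh-95382. Each missing level is appended in order, so the
    # cache is only valid if it was built up starting from level 0.
    while len(cache) <= level:
        cache.append("\n" + indent * len(cache))
    return cache[level]

cache = []
print(repr(cached_indent(cache, "  ", 2)))  # '\n    '
print(len(cache))                           # 3 (levels 0, 1, 2)
```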

Affected Versions

| Python Version | Affected |
| --- | --- |
| 3.9 - 3.13 | No (cache feature not present) |
| 3.14+ | Yes |
| 3.15 (main) | Yes |

The fix applies cleanly to both main and 3.14 branches.

- Validate that _current_indent_level is 0 when indent is set, preventing uninitialized cache access and re-entrant __mul__ attacks
- Convert integer indent to string of spaces in encoder_new, matching the Python-level behavior

python-cla-bot bot commented Dec 28, 2025

All commit authors signed the Contributor License Agreement.

CLA signed


bedevere-app bot commented Dec 28, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Member

picnixz commented Dec 28, 2025

I suspect this PR has been generated by an LLM, or at least the description has. Can you confirm this? Because it's not a security issue. In addition, please read our policy on when using an LLM is acceptable: https://devguide.python.org/getting-started/generative-ai/ (if English isn't your first language, that's fine, but please indicate whether an LLM has been used and how).

Author

caverac commented Dec 28, 2025

> I suspect this PR has been generated by an LLM, or at least the description has. Can you confirm this? Because it's not a security issue. In addition, please read our policy on when using an LLM is acceptable: https://devguide.python.org/getting-started/generative-ai/ (if English isn't your first language, that's fine, but please indicate whether an LLM has been used and how).

@picnixz I'm sorry, I always considered a heap-buffer-overflow an exploitable issue. But I changed the language to "bug", which is perhaps more aligned with the reported issue. Is that ok?

Member

picnixz commented Dec 28, 2025

Bugs like that only become security issues when they are exploitable, and we need to assess how feasible it is for an adversary to exploit them. In our case, the attack surface is very small and relies on invoking "semi-public" functions (and even then it would need to be exploitable; usually there is just a hard crash, which could be considered a DoS in some sense, e.g. if it's serving a web app with auto-reload).

Anyway, I'd now like you to confirm whether this has been generated by an LLM or not (and if so, which part). I'm sorry to be doubtful, but because of the emergence of LLMs, we have many new contributors that simply use LLMs to generate their PRs, and that is unacceptable.

Author

caverac commented Dec 28, 2025

> Bugs like that only become security issues when they are exploitable, and we need to assess how feasible it is for an adversary to exploit them. In our case, the attack surface is very small and relies on invoking "semi-public" functions (and even then it would need to be exploitable; usually there is just a hard crash, which could be considered a DoS in some sense, e.g. if it's serving a web app with auto-reload).
>
> Anyway, I'd now like you to confirm whether this has been generated by an LLM or not (and if so, which part). I'm sorry to be doubtful, but because of the emergence of LLMs, we have many new contributors that simply use LLMs to generate their PRs, and that is unacceptable.

@picnixz The PR description was proofread with ChatGPT, but if that represents a violation I'm ok with closing the PR.

Member

picnixz commented Dec 28, 2025

If it's only the description, it's fine. But in the future, you don't need to have an LLM generate the description. We don't really care which files were modified, for instance (though it's good to know if breaking changes were introduced), or whether tests were added (we expect them to be added!). Affected versions are already mentioned on the issue as well.

}

/* Convert indent to str if it's not None or already a string */
if (indent != Py_None && !PyUnicode_Check(indent)) {
Member

Instead of these complex checks (argument 1 is also checked), I'd suggest that we convert __new__ to Argument Clinic. However, I'm not entirely sure that we won't lose performance. How many times do we call this function?

Author

@caverac caverac Dec 28, 2025

Performance considerations

For context, my original proposal was to use a simpler change

static PyObject *
encoder_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    ...
    if (indent != Py_None && !PyUnicode_Check(indent)) {
        PyErr_Format(PyExc_TypeError,
                     "make_encoder() argument 4 must be str or None, "
                     "not %.200s", Py_TYPE(indent)->tp_name);
        return NULL;
    }
    ...
}

Let me call that fix-v1.

After a comment/suggestion in the reported issue, I implemented the unicode check with the additional logical statements (that is the current branch); let me call that fix.

And finally we have main. I ran some tests against all three branches: basically a warm-up of the following snippet, then 100 timed runs to measure the average execution time

import json, time  # (data is the test payload, loaded beforehand)

start = time.perf_counter()
json.dumps(data, indent=2)
end = time.perf_counter()

These are the results

| Test | fix | fix-v1 | main |
| --- | --- | --- | --- |
| 128KB.json | 1.97 ms | 2.46 ms | 1.95 ms |
| 256KB.json | 2.04 ms | 2.03 ms | 2.03 ms |
| 512KB.json | 4.04 ms | 4.05 ms | 4.07 ms |
| 1MB.json | 8.83 ms | 8.00 ms | 8.14 ms |
| 5MB.json | 42.02 ms | 41.48 ms | 42.02 ms |

My local machine is a bit outdated, and I'm fairly confident these numbers would be lower on a faster CPU, but I believe performance is not impacted; at least, my changes are statistically consistent with main.

Using the argument clinic

I'm rabbit-holing a bit into that approach and am not entirely sure yet how it works. My take: it would require some major changes to the module that I don't feel comfortable making. If that is the desired path, I'll be happy to document my findings in the reported issue and leave it to more expert/capable hands to complete.

"How many times do we call this function?"

I am not entirely sure what you mean by this question, sorry 😢 encoder_new is called once per JSONEncoder instance creation. In typical usage (json.dumps()):

json.dumps(data, indent=2)
    |-- JSONEncoder.encode()
          |-- JSONEncoder.iterencode()
                |-- c_make_encoder()
                      |-- encoder_new()  <= called only once
                |-- _iterencode(obj, 0)
                      |-- encoder_listencode_obj()
                            |-- encoder_listencode_list()
                            |-- encoder_listencode_dict()

it's called once per encoding operation. "We" don't call it; the user does, and, as I showed above, they are not going to experience any measurable change in performance. My apologies again if I misunderstood your question.

Member

The "we" was a generic "we" (an academic one). But you answered my question well. I wondered whether c_make_encoder was called multiple times when doing a single json.dumps().

Conversion to AC may introduce an overhead in how the arguments are pre-processed but if you're not comfortable with this, we can first fix this (and backport it) and then switch to AC for main.

Comment on lines +1345 to +1347
if (indent_level > 0) {
memset(PyUnicode_1BYTE_DATA(indent), ' ', indent_level);
}
Member

Suggested change
if (indent_level > 0) {
memset(PyUnicode_1BYTE_DATA(indent), ' ', indent_level);
}
memset(PyUnicode_1BYTE_DATA(indent), ' ', sizeof(Py_UCS1) * indent_level);

memset allows for a 0 size (and does nothing) as long as the pointer is valid. Alternatively, we could use PyUnicode_Fill but I think we usually bypass this and use memset directly.
