gh-143196: fix heap-buffer-overflow in JSON encoder indent cache #143246
Conversation
- Validate that `_current_indent_level` is 0 when indent is set, preventing uninitialized cache access and re-entrant `__mul__` attacks
- Convert integer indent to a string of spaces in `encoder_new`, matching the Python-level behavior
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.
I suspect this PR has been generated by an LLM, or at least that the description has been. Can you confirm this? Because it's not a security issue. In addition, please read our policy on when using an LLM is acceptable: https://devguide.python.org/getting-started/generative-ai/ (if English isn't your first language, that's fine, but please indicate whether an LLM has been used and how).
@picnixz I'm sorry, I always considered a heap-buffer-overflow to be an exploitable issue. But I changed the language to "bug", which is perhaps more aligned with the reported issue. Is that ok?
Bugs like that only become security issues when they are exploitable, and we need to assess how feasible it is for an adversary to exploit them. In our case, the attack surface is very small and relies on invoking "semi-public" functions (and even then it would still need to be actually exploitable; usually there is just a hard crash, which could be considered a DoS in some sense if it's serving a web app with auto-reload, for instance). Anyway, I'd now like you to confirm whether this has been generated by an LLM or not (and, if so, which part). I'm sorry to be doubtful, but with the emergence of LLMs we have many new contributors who simply use an LLM to generate their PRs, and that is unacceptable.
@picnixz The PR description was proofread with ChatGPT, but if that represents a violation I'm ok with closing the PR.
If it's only the description, it's fine. But in the future, you don't need to have an LLM generate the description. We don't really care which files were modified, for instance (though it's good to know if breaking changes were introduced), or whether tests were added (we expect them to be added!). Affected versions are already mentioned on the issue as well.
```c
}

/* Convert indent to str if it's not None or already a string */
if (indent != Py_None && !PyUnicode_Check(indent)) {
```
Instead of these complex checks (argument 1 is also checked), I'd suggest that we convert __new__ to Argument Clinic. However, I'm not entirely sure that we won't lose perf. How many times do we call this function?
**Performance considerations**
For context, my original proposal was to use a simpler change:

```c
static PyObject *
encoder_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    ...
    if (indent != Py_None && !PyUnicode_Check(indent)) {
        PyErr_Format(PyExc_TypeError,
                     "make_encoder() argument 4 must be str or None, "
                     "not %.200s", Py_TYPE(indent)->tp_name);
        return NULL;
    }
    ...
}
```

Let me call that fix-v1.
After a comment/suggestion in the reported issue, I implemented the unicode check with the additional logical statements (that is the current branch); let me call that fix.
And finally we have main. I ran some tests against all three branches: basically a warm-up of the following snippet, then 100 timed runs to measure the average execution time.

```python
start = time.perf_counter()
json.dumps(data, indent=2)
end = time.perf_counter()
```

These are the results:
| Test | fix | fix-v1 | main |
|---|---|---|---|
| 128KB.json | 1.97 ms | 2.46 ms | 1.95 ms |
| 256KB.json | 2.04 ms | 2.03 ms | 2.03 ms |
| 512KB.json | 4.04 ms | 4.05 ms | 4.07 ms |
| 1MB.json | 8.83 ms | 8.00 ms | 8.14 ms |
| 5MB.json | 42.02 ms | 41.48 ms | 42.02 ms |
My local machine is a bit outdated, and I'm pretty confident these numbers would be lower on a faster CPU, but I do believe that performance is not impacted, or at least that my changes are statistically consistent with main.
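For reference, a minimal harness implementing that protocol might look like the following. This is a sketch: the payload file name, loading step, and loop shape are assumptions; only the warm-up-then-100-runs protocol comes from the comment above.

```python
import json
import time

# Hypothetical harness: load a payload, do one warm-up run, then
# average 100 timed runs. "128KB.json" is an assumed file name.
with open("128KB.json") as f:
    data = json.load(f)

json.dumps(data, indent=2)  # warm-up run

RUNS = 100
start = time.perf_counter()
for _ in range(RUNS):
    json.dumps(data, indent=2)
elapsed_ms = (time.perf_counter() - start) / RUNS * 1000
print(f"average: {elapsed_ms:.2f} ms")
```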
**Using the argument clinic**
I'm rabbit-holing a bit into that approach and am not entirely sure yet how it works. But my take on it: it would require some major changes to the module that I don't feel comfortable messing with. If that is the desired path, I will be happy to document my findings in the reported issue and leave it to more expert/capable hands to complete.
"How many times do we call this function?"
I'm not entirely sure what you mean by this question, sorry 😢. encoder_new is called once per JSONEncoder instance creation. In typical usage (json.dumps()):

```
json.dumps(data, indent=2)
|-- JSONEncoder.encode()
    |-- JSONEncoder.iterencode()
        |-- c_make_encoder()
        |   |-- encoder_new()   <= called only once
        |-- _iterencode(obj, 0)
            |-- encoder_listencode_obj()
                |-- encoder_listencode_list()
                |-- encoder_listencode_dict()
```

So it's called once per encoding operation. "We" don't call it; the user does, and, as I showed above, they are not going to experience any measurable change in performance. My apologies again if I misunderstood your question.
The "we" was a generic "we" (academical one). But you answered my question well. I wondered whether c_make_encoder was called multiple times when doing a single json.dumps().
Conversion to AC may introduce an overhead in how the arguments are pre-processed but if you're not comfortable with this, we can first fix this (and backport it) and then switch to AC for main.
```c
if (indent_level > 0) {
    memset(PyUnicode_1BYTE_DATA(indent), ' ', indent_level);
}
```
Suggested change:

```diff
-if (indent_level > 0) {
-    memset(PyUnicode_1BYTE_DATA(indent), ' ', indent_level);
-}
+memset(PyUnicode_1BYTE_DATA(indent), ' ', sizeof(Py_UCS1) * indent_level);
```
memset allows for a 0 size (and does nothing) as long as the pointer is valid. Alternatively, we could use PyUnicode_Fill but I think we usually bypass this and use memset directly.
Summary
This PR fixes two related bugs in the JSON encoder's indentation cache (introduced in gh-95382):

1. **Heap-buffer-overflow via uninitialized cache**: When `c_make_encoder` is called with `_current_indent_level > 0`, the cache is created with only 1 element, but `update_indent_cache` expects the cache to be built incrementally starting from level 0, causing out-of-bounds memory access.
2. **Use-after-free via re-entrant `__mul__`**: `PySequence_Repeat(s->indent, indent_level)` in `create_indent_cache` can execute arbitrary Python code through a custom `__mul__` method, potentially causing a use-after-free.
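For illustration, a call of roughly this shape reaches the uninitialized cache. This is a hedged sketch: it assumes the semi-public `c_make_encoder` binding re-exported by `json.encoder` and the argument order used in `JSONEncoder.iterencode`; on an unpatched build it may crash the interpreter.

```python
import json.encoder

# Build a C encoder object directly, the way JSONEncoder.iterencode()
# does internally (argument order assumed from json/encoder.py).
enc = json.encoder.c_make_encoder(
    None,                                     # markers
    lambda o: o,                              # default
    json.encoder.py_encode_basestring_ascii,  # string encoder
    "  ",                                     # indent: enables the cache
    ": ", ", ",                               # key/item separators
    False, False, True)                       # sort_keys, skipkeys, allow_nan

# Normal callers always start at _current_indent_level == 0; a nonzero
# level skips the incremental cache build and, on unpatched builds, can
# read past the one-element indent cache.
enc([[1]], 3)
```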
Changes

`Modules/_json.c`

- Indent validation and conversion in `encoder_new`:
  - accepts `str`, `int`, or `None` for indent
  - converts an `int` indent to a string of spaces (matching `encoder.py`)
  - raises `TypeError` or `ValueError` for invalid indent arguments
- Indent level validation in `encoder_call`:
  - requires `_current_indent_level == 0` when indent is set
  - removes the `__mul__` attack vector, since `PySequence_Repeat` is only called when `indent_level != 0`

`Lib/test/test_json/test_speedups.py`

- `test_indent_argument_to_encoder`: tests indent type validation and conversion (illustrated below)
- `test_nonzero_indent_level_with_indent`: tests indent level validation
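As a rough illustration of the behavior those tests pin down (an assumed sketch, not the actual test code from the PR):

```python
import json

# An int indent must behave exactly like the equivalent string of spaces.
assert json.dumps([1], indent=2) == json.dumps([1], indent="  ")

# Indent types other than str, int, or None are rejected.
try:
    json.dumps([1], indent=2.5)
except TypeError:
    pass
else:
    raise AssertionError("non-int/str indent should raise TypeError")
```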
Performance

No performance impact on the hot path. The indentation cache optimization from gh-95382 remains fully intact. Our changes only add:

- a one-time indent type check and conversion (`int` → `str`) in `encoder_new`
- a cheap `indent_level != 0` check in `encoder_call`

The performance-critical recursive encoding path (`encoder_listencode_list`, `encoder_listencode_dict`) is completely unchanged. The cache is still built incrementally and reused across recursive calls exactly as before.

Affected Versions

The fix applies cleanly to both `main` and `3.14` branches.

Related issue: `json.encoder` indentation cache via re-entrant `__mul__` #143196