Fix production timeouts: stop uv re-syncing deps at container startup #54

Merged
ChiragAgg5k merged 2 commits into main from fix/runtime-uv-resync on Jul 3, 2026

@ChiragAgg5k

Member

Problem

Production (mcp.appwrite.io) was timing out on tool requests. The pods on do-fra1-assets-fra1-prod were stuck in a restart storm after the 0.8.2 rollout.

Root cause: the image CMD is uv run mcp-server-appwrite without --no-sync, so every container start re-resolves the environment — downloading the dev group (ruff, black, pyright) from PyPI, rebuilding the package, and bytecode-compiling ~3900 files:

Building mcp-server-appwrite @ file:///app
Downloading ruff / black / pyright ...
Bytecode compiled 3908 files in 1.57s

That burns 30–60s of full-core CPU before port 8000 binds. In the cluster this cascaded:

  1. Liveness probe (period=10s, timeout=3s, failureThreshold=3, no startupProbe) killed pods mid-startup → restart → recompile → loop.
  2. The compile pegged a core on the small assets nodes, starving the healthy sibling pod so its /healthz exceeded the 3s probe timeout → kubelet killed it too (context deadline exceeded events).
  3. The CPU spike tripped the HPA, adding pods and more compile storms.
  4. Every kill dropped in-flight MCP requests (ASGI callable returned without completing response) → client-side tool-call timeouts.
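
Steps 1 and 3 of the cascade can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough model, not the kubelet's exact scheduling: it assumes initialDelaySeconds is 0 (the PR does not state it) and that the first probe fires somewhere within one period of container start. The HPA helper uses the replica formula from the Kubernetes documentation and ignores its tolerance band. All function names here are illustrative.

```python
import math


def liveness_kill_window(period_s, failure_threshold, timeout_s, initial_delay_s=0.0):
    """Rough (earliest, latest) seconds after container start at which the
    final consecutive liveness failure is recorded and the pod is restarted.

    Simplified model (an assumption, not kubelet source): the first probe
    fires within one period of initial_delay_s, probes repeat once per
    period, and each failing probe takes at most timeout_s to fail.
    """
    earliest = initial_delay_s + (failure_threshold - 1) * period_s
    latest = initial_delay_s + failure_threshold * period_s + timeout_s
    return earliest, latest


def survives_startup(startup_s, window):
    """True only if the server is up before the earliest possible kill."""
    earliest, _latest = window
    return startup_s < earliest


def hpa_desired_replicas(current_replicas, current_utilization, target_utilization):
    """HPA scaling rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)
```

With the probe settings quoted above, liveness_kill_window(10, 3, 3) gives a window of roughly 20 to 33 seconds, so a 30–60s startup cannot reliably finish first, while a ~1s startup can. The PR does not give the HPA target, so as a purely hypothetical example, hpa_desired_replicas(2, 150, 75) returns 4: two pods at twice the target utilization double the replica count.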

Fix

Add --no-sync so the entrypoint runs from the venv baked at build time (uv sync --frozen --no-dev). Verified locally: container cold start drops to ~1s with no network access, and /healthz responds immediately. Also removes the runtime dependency on PyPI availability.
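
The PR does not describe how the local verification was done. One way to measure cold start, sketched here with only the standard library, is to start the container and poll /healthz until it answers. The function name, default timeout, and poll interval are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url, timeout_s=90.0, interval_s=0.1):
    """Poll `url` until it answers HTTP 200 and return the seconds waited.

    Raises TimeoutError if the endpoint is still unhealthy after timeout_s.
    """
    start = time.monotonic()
    while True:
        try:
            with urllib.request.urlopen(url, timeout=1.0) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            # Port not bound yet, connection dropped, or an HTTP error status.
            pass
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"{url} not healthy after {timeout_s}s")
        time.sleep(interval_s)
```

Calling wait_for_health("http://localhost:8000/healthz") immediately after starting the container with port 8000 published should return about a second with --no-sync in the CMD, and tens of seconds without it.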

Follow-ups (Helm chart, separate repo)

  • Add a startupProbe so slow starts can never be liveness-killed.
  • Add a CPU limit / raise the request so a starting pod can't starve node-mates.
  • Consider session affinity on the HTTPRoute, since streamable-HTTP session state is per-pod.
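
For the first follow-up, Kubernetes gives a container failureThreshold × periodSeconds to pass its startupProbe, and holds off liveness checks until it does. A minimal sketch of sizing that budget follows. The Helm chart lives in a separate repo, so the helper names, the safety factor, and the worst-case startup figure are assumptions. The probe path and port are the /healthz endpoint and port 8000 mentioned in this PR.

```python
import math


def min_startup_failure_threshold(worst_case_startup_s, period_s, safety_factor=2.0):
    """Smallest failureThreshold whose budget (failureThreshold * periodSeconds)
    covers the worst-case startup time multiplied by a safety factor."""
    return math.ceil(worst_case_startup_s * safety_factor / period_s)


def startup_probe_spec(worst_case_startup_s, period_s=10, path="/healthz", port=8000):
    """Build a startupProbe mapping using standard Kubernetes probe field names."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_s,
        "failureThreshold": min_startup_failure_threshold(worst_case_startup_s, period_s),
    }
```

For example, startup_probe_spec(60) budgets 12 failures at a 10s period, which is 120s of grace even if a pod ever falls back to the slow 60s startup path.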

The image CMD used 'uv run' without --no-sync, so every container start
re-resolved the environment — installing the dev group (ruff, black,
pyright) from PyPI and re-bytecode-compiling ~3900 files. This burned
30-60s of full-core CPU before the server bound port 8000, which in
production caused liveness-probe kills during startup, CPU starvation
of sibling pods (their /healthz exceeded the 3s probe timeout), HPA
scale-ups from the compile CPU spike, and dropped in-flight MCP
requests on every kill.

With --no-sync, uv runs the entrypoint from the venv baked at build
time (uv sync --frozen --no-dev), so cold start is ~1s and the runtime
no longer depends on PyPI availability.

The console 'all' scope already covers project and organization access,
so advertising the finer-grained aliases is redundant.

greptile-apps Bot commented Jul 3, 2026


Greptile Summary

This PR fixes production container restart storms by adding --no-sync to the uv run CMD, preventing uv from re-resolving and bytecode-compiling dependencies on every cold start. It also removes project:all and organization:all from the PREFERRED_SCOPES list, which is a separate behavioral change to OAuth scope advertising.

  • Dockerfile: --no-sync correctly makes the container use the venv baked during uv sync --frozen --no-dev at image build time, dropping cold-start time from ~30–60s to ~1s.
  • constants.py: project:all and organization:all are silently dropped from the advertised scope catalog — this is an unexplained auth behavior change bundled into a runtime fix.
  • tests/unit/test_auth.py: Assertion updated to match the narrowed preferred-scopes list.

Confidence Score: 4/5

Safe to merge for the container startup fix; the scope removal in constants.py is unexplained and warrants a quick confirmation before shipping.

The Dockerfile change is correct and well-motivated — --no-sync with a build-time uv sync --frozen --no-dev is exactly the right pattern. The scope removal from PREFERRED_SCOPES is a functional change to OAuth discovery with no justification in the PR description; existing clients that relied on project:all or organization:all may need to re-authorize when their tokens expire.

src/mcp_server_appwrite/constants.py — the PREFERRED_SCOPES change is the only part that warrants a second look before merging.

Important Files Changed

Filename | Overview
Dockerfile | Adds --no-sync to uv run CMD so the container uses the venv baked at build time (uv sync --frozen --no-dev) instead of re-resolving and recompiling dependencies on every cold start. Correct and targeted fix.
src/mcp_server_appwrite/constants.py | Removes project:all and organization:all from PREFERRED_SCOPES with no explanation in the PR description, a behavioral auth change bundled into a container startup fix that warrants justification.
tests/unit/test_auth.py | Updates test_advertised_scopes_prefer_curated_subset assertion to match the narrowed PREFERRED_SCOPES; the test correctly verifies the new filtering behavior.
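
The filtering behavior that the test covers can be pictured roughly as follows. This is a guess at the shape of the logic, not the repository's code: the contents of PREFERRED_SCOPES, the function name, and the fall-back to the full catalog are all assumptions, and the scope string 'all' is taken from the wording of the second commit message.

```python
# Hypothetical value: the PR does not show the real contents of PREFERRED_SCOPES.
PREFERRED_SCOPES = ("all",)


def advertised_scopes(server_catalog, preferred=PREFERRED_SCOPES):
    """Return the preferred scopes that the authorization server supports,
    in preferred order. Falls back to the full catalog if none match
    (the fall-back is an assumption made for this sketch)."""
    supported = set(server_catalog)
    curated = [scope for scope in preferred if scope in supported]
    return curated if curated else list(server_catalog)
```

Under these assumptions, a server catalog of ["all", "project:all", "organization:all"] is advertised as just ["all"], which is the narrowing the review comment below asks about.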

Comments Outside Diff (1)

  1. src/mcp_server_appwrite/constants.py, line 53-58 (link)

    P2 Unexplained removal of project:all and organization:all from PREFERRED_SCOPES

    This is a functional auth change bundled into a container startup fix PR, with no explanation in the description. After this change, the MCP server will no longer advertise project:all or organization:all to OAuth clients during discovery, even when the Appwrite authorization server includes them in its scope catalog. Any MCP client that previously received tokens covering those scopes will need to re-authorize with a narrower grant when its token expires, and any tooling that relied on project- or organization-scoped access may silently lose permissions.

    Why are project:all and organization:all being removed from the preferred scopes? Are these scopes deprecated by the Appwrite auth server, were they causing issues with token/scope length, or is this an intentional narrowing of the default grant?

Reviews (1): Last reviewed commit: "(fix): drop project:all and organization..."

@ChiragAgg5k ChiragAgg5k merged commit fbea49b into main Jul 3, 2026
5 checks passed
@ChiragAgg5k ChiragAgg5k deleted the fix/runtime-uv-resync branch July 3, 2026 04:43