Until there is an official API with a way to prevent DoS-ing, we scrape the SUMO (support.mozilla.org) API by driving a real browser. Since ~June 2026 the API has sat behind a JavaScript challenge that blocks non-browser HTTP requests (see thunderbird/github-action-thunderbird-aaq#34). A real browser passes the challenge, and we then call the JSON API from inside the browser's authenticated context.
uv sync
uv run playwright install chromium
uv run python poc.py # headed — most likely to pass the challenge
uv run python poc.py --headless # try headless (closer to CI)
uv run python poc.py --dump # also write the raw first API page to poc-sample.json

Success = the script prints a non-zero count and real question records (not
challenge HTML), and reports where taken_by / operating_system /
thunderbird_version live in the API response.
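The success check described above can be sketched roughly as follows. This is an illustration, not the contents of poc.py: the endpoint URL, the `count`/`results` envelope, and the helper names `classify_body` and `fetch_in_browser` are all assumptions. The browser part uses Playwright's sync API and runs a plain `fetch` inside the page once the site has loaded.

```python
import json

# Assumed endpoint; the real scripts may use a different URL or query string.
API_URL = "https://support.mozilla.org/api/2/question/?product=thunderbird"


def classify_body(body: str) -> tuple[str, int]:
    """Return ("json", count) for an API page, or ("challenge", 0) otherwise.

    Assumes the API wraps records in a {"count": ..., "results": [...]} envelope.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        # Challenge pages are HTML, so they fail to parse as JSON.
        return ("challenge", 0)
    if not isinstance(payload, dict) or not isinstance(payload.get("results"), list):
        return ("challenge", 0)
    count = payload.get("count")
    if not isinstance(count, int):
        # Fall back to the number of records on this page.
        count = len(payload["results"])
    return ("json", count)


def fetch_in_browser(url: str = API_URL, headless: bool = False) -> str:
    """Load the site in Chromium, then fetch the API from inside the page."""
    # Imported lazily so classify_body works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page()
        # Visiting the site first gives the challenge a chance to run.
        page.goto("https://support.mozilla.org/")
        page.wait_for_load_state("networkidle")
        body = page.evaluate(
            "async (u) => { const r = await fetch(u); return await r.text(); }",
            url,
        )
        browser.close()
        return body
```

Calling `classify_body(fetch_in_browser())` would then report either a record count or a challenge page. Whether waiting for network idle is enough for the challenge to finish is untested here.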
# Questions for a UTC date window (single day = same date twice).
uv run python scrape_questions.py 2026 6 10 2026 6 10 --headless
# Answers for those questions (defaults to the matching answers-... filename).
uv run python scrape_answers.py \
--questions 2026/questions-thunderbird-desktop-2026-06-10.csv --headless
# Thunderbird for Android: same tools, --product thunderbird-android.
uv run python scrape_questions.py 2026 6 10 2026 6 10 \
--product thunderbird-android --headless

Output is written to <year>/<questions|answers>-<product>-<dates>.csv, e.g.
2026/questions-thunderbird-desktop-2026-06-10.csv. CSVs are sorted by ascending
id. Both scrapers add a polite delay between API calls — a fixed --sleep 2
seconds by default, or --random-delay to vary it between --min-delay and
--max-delay (2–10s).
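A minimal sketch of the delay behaviour described above, assuming the flags map onto keyword arguments with the stated defaults. The function names `pick_delay` and `polite_calls` are hypothetical and not taken from the scrapers.

```python
import random
import time


def pick_delay(sleep=2.0, random_delay=False, min_delay=2.0, max_delay=10.0, rng=random):
    """Return the number of seconds to wait before the next API call."""
    if not random_delay:
        # Fixed delay, mirroring --sleep.
        return float(sleep)
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    # Uniformly random delay, mirroring --random-delay with --min-delay/--max-delay.
    return rng.uniform(min_delay, max_delay)


def polite_calls(fetch, urls, **delay_options):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No sleep before the first call, only between calls.
            time.sleep(pick_delay(**delay_options))
        results.append(fetch(url))
    return results
```

Passing a seeded `random.Random` as `rng` makes the random delays reproducible, which can help when testing.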
Questions keep the original columns/flattening plus operating_system,
thunderbird_version, and taken_by. Answers use the original columns:
id, question_id, created, updated, content, creator, is_spam, num_helpful, num_unhelpful.
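As an illustration of the answers layout and the ascending-id ordering, here is a small sketch that writes records with exactly those columns. It assumes each record is already a flat dict; how the real scraper flattens nested API fields such as `creator` is not shown here, and `write_answers_csv` is a hypothetical name.

```python
import csv
import io

# Column order as listed for the answers CSV.
ANSWER_COLUMNS = [
    "id", "question_id", "created", "updated", "content",
    "creator", "is_spam", "num_helpful", "num_unhelpful",
]


def write_answers_csv(records, out):
    """Write flat answer dicts to `out`, sorted by ascending numeric id."""
    writer = csv.DictWriter(out, fieldnames=ANSWER_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in sorted(records, key=lambda r: int(r["id"])):
        # Keep only the known columns; missing values become empty cells.
        writer.writerow({col: record.get(col, "") for col in ANSWER_COLUMNS})


# Usage with an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_answers_csv(
    [{"id": 20, "question_id": 7}, {"id": 3, "question_id": 7, "extra": "ignored"}],
    buffer,
)
```

Sorting on `int(r["id"])` rather than the raw value avoids string ordering, where "20" would sort before "3".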
The 2026/ directory holds committed fixtures from a verification run against
2026-06-10 (a pre-challenge day), reconciled against the public website.
check_schema.py samples the live API and compares its JSON fields against the
committed baseline schema/expected-fields.json:
uv run python check_schema.py --headless # exit 1 on drift
uv run python check_schema.py --headless --update-baseline # manual bump

A daily workflow (.github/workflows/schema-check.yml) runs the check and opens
(or comments on) a schema-change issue when fields are added or removed. The
baseline is only updated manually: when the API legitimately changes, review
the drift, re-run with --update-baseline, and commit — and if a field was
removed, update the scrapers so the affected CSV columns don't silently blank.
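The drift check can be pictured as a set comparison between baseline and observed field names. The sketch below assumes the baseline is a flat list of top-level field names; the actual format of schema/expected-fields.json and the internals of check_schema.py are not shown in this README, so `diff_fields` and `check` are hypothetical.

```python
import json


def diff_fields(expected, observed):
    """Return the fields added to and removed from the baseline."""
    expected_set, observed_set = set(expected), set(observed)
    return {
        "added": sorted(observed_set - expected_set),
        "removed": sorted(expected_set - observed_set),
    }


def check(expected, sample_record):
    """Return 1 if the sampled record's fields drift from the baseline, else 0."""
    drift = diff_fields(expected, sample_record.keys())
    if drift["added"] or drift["removed"]:
        # Print the drift so a workflow log or issue body can include it.
        print(json.dumps(drift, indent=2))
        return 1
    return 0
```

The return value of `check` could be passed to `sys.exit` to get the "exit 1 on drift" behaviour. A removed field is the riskier case, since a scraper reading it would otherwise write blank CSV cells.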
- We require everyone who participates in this repo to agree to and adhere to the Mozilla Community Participation Guidelines.