aaq-scraper

Until there is an official API with built-in protection against DoS-ing, we scrape the SUMO (support.mozilla.org) API by driving a real browser. Since ~June 2026 the API has sat behind a JavaScript challenge that blocks plain, non-browser HTTP clients (see thunderbird/github-action-thunderbird-aaq#34). A real browser passes the challenge, and we then call the JSON API from inside that browser's authenticated context.
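The "pass the challenge, then fetch from inside the page" idea can be sketched roughly as follows. This is an illustration, not the code in sumo.py or poc.py: the endpoint path `/api/2/question/`, the query parameters, and the helper names `build_api_url` and `fetch_in_browser` are assumptions made for the example.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameters, for illustration only.
API_BASE = "https://support.mozilla.org/api/2/question/"


def build_api_url(product, page=1):
    """Build a question-list URL for one product and page (hypothetical helper)."""
    query = urlencode({"product": product, "page": page, "format": "json"})
    return f"{API_BASE}?{query}"


def fetch_in_browser(api_url, headless=False):
    """Load the site in a real browser, then fetch api_url from inside the page.

    Returns a dict with the HTTP status and the raw response body.
    """
    # Imported lazily so the URL helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page()
        # Visit the site first so the challenge (if any) runs in a real browser.
        page.goto("https://support.mozilla.org/", wait_until="networkidle")
        # fetch() runs in the page, so it reuses the cookies the browser now holds.
        result = page.evaluate(
            """async (url) => {
                const resp = await fetch(url, {headers: {Accept: "application/json"}});
                return {status: resp.status, body: await resp.text()};
            }""",
            api_url,
        )
        browser.close()
        return result
```

Calling `fetch_in_browser(build_api_url("thunderbird-android"))` would open Chromium and return the status and body. Whether the body is JSON or challenge HTML still has to be checked by the caller.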

Proof of concept (Bucket 0)

uv sync
uv run playwright install chromium
uv run python poc.py            # headed — most likely to pass the challenge
uv run python poc.py --headless # try headless (closer to CI)
uv run python poc.py --dump     # also write the raw first API page to poc-sample.json

Success = the script prints a non-zero count and real question records (not challenge HTML), and reports where taken_by / operating_system / thunderbird_version live in the API response.
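A rough sketch of that success check, assuming the API page is a JSON object with `count` and `results` keys (a common paginated shape, not confirmed here). The helper names and the sample record are made up for the example and are not taken from poc.py.

```python
import json


def parse_api_page(text):
    """Return the parsed page if text is a JSON API page, else None.

    Challenge HTML fails JSON parsing and therefore yields None.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and "results" in data:
        return data
    return None


def find_key_paths(obj, wanted, path=""):
    """List the dotted paths at which any key in `wanted` appears."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{path}.{key}" if path else key
            if key in wanted:
                found.append(child)
            found.extend(find_key_paths(value, wanted, child))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            found.extend(find_key_paths(item, wanted, f"{path}[{index}]"))
    return found


# Hypothetical sample; the real API layout may differ.
sample = {
    "count": 1,
    "results": [{"id": 7, "taken_by": None, "extra": {"operating_system": "Linux"}}],
}
fields = {"taken_by", "operating_system", "thunderbird_version"}
print(find_key_paths(sample, fields))
```

A field that never shows up in the returned paths (here `thunderbird_version`) is a hint that it lives under a different name or inside a value rather than a key.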

Scraping questions and answers

# Questions for a UTC date window (single day = same date twice).
uv run python scrape_questions.py 2026 6 10 2026 6 10 --headless

# Answers for those questions (defaults to the matching answers-... filename).
uv run python scrape_answers.py \
    --questions 2026/questions-thunderbird-desktop-2026-06-10.csv --headless

# Thunderbird for Android: same tools, --product thunderbird-android.
uv run python scrape_questions.py 2026 6 10 2026 6 10 \
    --product thunderbird-android --headless

Output is written to <year>/<questions|answers>-<product>-<dates>.csv, e.g. 2026/questions-thunderbird-desktop-2026-06-10.csv. CSVs are sorted by ascending id. Both scrapers add a polite delay between API calls: by default a fixed 2 seconds (--sleep 2), or, with --random-delay, a random value between --min-delay and --max-delay (2–10 s).
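The delay choice can be pictured with a small sketch. The function name `pick_delay` and its parameters mirror the command-line flags but are assumptions; the scrapers' actual implementation may differ.

```python
import random


def pick_delay(sleep=2.0, random_delay=False, min_delay=2.0, max_delay=10.0, rng=random):
    """Return the number of seconds to wait before the next API call.

    Fixed `sleep` by default; a uniform random value in
    [min_delay, max_delay] when random_delay is set.
    """
    if not random_delay:
        return sleep
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    return rng.uniform(min_delay, max_delay)


print(pick_delay())                    # fixed delay
print(pick_delay(random_delay=True))   # varies between 2 and 10 seconds
```

A caller would pass the result to `time.sleep()` between requests. Passing a seeded `random.Random` as `rng` makes the sequence reproducible in tests.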

Questions keep the original columns/flattening plus operating_system, thunderbird_version, and taken_by. Answers use the original columns: id, question_id, created, updated, content, creator, is_spam, num_helpful, num_unhelpful.
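As an illustration of the answers layout, the sketch below writes records with the columns listed above, sorted by ascending id. The function name `answers_to_csv` and the sample records are hypothetical; scrape_answers.py may handle flattening and escaping differently.

```python
import csv
import io

# Column order as listed for the answers CSV.
ANSWER_COLUMNS = [
    "id", "question_id", "created", "updated", "content",
    "creator", "is_spam", "num_helpful", "num_unhelpful",
]


def answers_to_csv(records):
    """Render answer records as CSV text, sorted by ascending numeric id."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=ANSWER_COLUMNS,
        extrasaction="ignore",   # drop API fields that are not CSV columns
        lineterminator="\n",
    )
    writer.writeheader()
    for record in sorted(records, key=lambda r: int(r["id"])):
        writer.writerow(record)  # missing columns are written as empty cells
    return buffer.getvalue()


# Made-up records for the example.
records = [
    {"id": 20, "question_id": 5, "content": "second", "unused": "x"},
    {"id": 3, "question_id": 5, "content": "first"},
]
print(answers_to_csv(records))
```

Sorting on `int(r["id"])` rather than the raw value keeps the order numeric even if ids arrive as strings.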

The 2026/ directory holds committed fixtures from a verification run against 2026-06-10 (a pre-challenge day), reconciled against the public website.

Schema drift check

check_schema.py samples the live API and compares its JSON fields against the committed baseline schema/expected-fields.json:

uv run python check_schema.py --headless                  # exit 1 on drift
uv run python check_schema.py --headless --update-baseline  # manual bump

A daily workflow (.github/workflows/schema-check.yml) runs the check and opens (or comments on) a schema-change issue when fields are added or removed. The baseline is only updated manually: when the API legitimately changes, review the drift, re-run with --update-baseline, and commit. If a field was removed, also update the scrapers so the affected CSV columns don't silently go blank.
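The core of such a drift check is a set comparison. The sketch below assumes the baseline is a flat list of field names, which may not match the real layout of schema/expected-fields.json; the function names are hypothetical.

```python
def diff_fields(baseline, live):
    """Compare two collections of field names and report added and removed ones."""
    baseline_set, live_set = set(baseline), set(live)
    return {
        "added": sorted(live_set - baseline_set),
        "removed": sorted(baseline_set - live_set),
    }


def exit_code(drift):
    """Return 1 if any field was added or removed, else 0."""
    return 1 if drift["added"] or drift["removed"] else 0


# Made-up field lists for the example.
baseline = ["id", "title", "taken_by"]
live = ["id", "title", "locale"]
drift = diff_fields(baseline, live)
print(drift, exit_code(drift))
```

In this example `locale` counts as added and `taken_by` as removed, so the check would exit 1. A removed field is the case that calls for a scraper update as well as a baseline bump.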

We require all those who participate in this repo to agree to and adhere to the Mozilla Community Participation Guidelines.

