Chat smarter, speak kinder.
Production-ready FastAPI service for toxic language detection using a fine‑tuned XLM‑RoBERTa model hosted on Hugging Face Hub. Modular, type‑safe, and deployable on CPU or GPU.
- Loads a Hugging Face model at startup and serves low-latency inference
- Clean modular layout (config, model, schemas, routes)
- Single and batch prediction with optional probability thresholding
- JSON responses with stable probability keys (`clean`, `toxic`)
- OpenAPI docs via `/docs` and `/redoc`
- Health endpoint at `/health`
```
toxicity-api/
├─ app/
│  ├─ main.py              # FastAPI entry point
│  ├─ core/
│  │  ├─ config.py         # Env + settings
│  │  └─ model.py          # Model loader + inference logic
│  ├─ api/
│  │  └─ routes.py         # HTTP endpoints
│  └─ schemas/
│     └─ predict.py        # Pydantic I/O models
├─ requirements.txt
└─ README.md
```
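As a rough sketch of what `app/core/model.py` might look like (the class and method names here are illustrative, not the project's actual API):

```python
# app/core/model.py -- illustrative sketch, not the repo's actual code
import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class ToxicityModel:
    def __init__(self, model_id: str, device: str = "cpu", max_length: int = 256):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_id)
        self.device = device
        self.max_length = max_length
        self.model.to(device).eval()

    @torch.inference_mode()
    def predict(self, texts: list[str], threshold: float = 0.5) -> list[dict]:
        start = time.perf_counter()
        batch = self.tokenizer(
            texts,
            truncation=True,
            max_length=self.max_length,
            padding=True,
            return_tensors="pt",
        ).to(self.device)
        # Softmax over the two classes; assumes index 1 = "toxic".
        # In practice, read id2label from the model config instead.
        probs = self.model(**batch).logits.softmax(dim=-1)
        latency_ms = (time.perf_counter() - start) * 1000 / len(texts)
        return [
            {
                "label": "toxic" if p[1] >= threshold else "clean",
                "probs": {"clean": round(p[0].item(), 4), "toxic": round(p[1].item(), 4)},
                "latency_ms": round(latency_ms, 2),
            }
            for p in probs
        ]
```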
```bash
pip install -r requirements.txt
export MODEL_ID=your-username/toxic-xlmr

# optional
export DEVICE=cuda    # or cpu
export MAX_LENGTH=256
```

Or use the default model at https://huggingface.co/cedrugs/toxic-xlmr (`cedrugs/toxic-xlmr`).

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

Open the docs at http://localhost:8000/docs.
| Variable | Description | Default |
|---|---|---|
| `MODEL_ID` | Hugging Face model repo (e.g. `user/model`) | `your-username/toxic-xlmr` |
| `DEVICE` | `cpu` or `cuda` (auto-detect if unset) | `auto` |
| `MAX_LENGTH` | Max token length per input | `256` |
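A minimal sketch of how `app/core/config.py` could read these variables (simplified; the real module may use pydantic settings instead, and the defaults below mirror the table above):

```python
# app/core/config.py -- simplified sketch of reading the environment
import os
from dataclasses import dataclass

import torch


@dataclass(frozen=True)
class Settings:
    model_id: str = os.getenv("MODEL_ID", "your-username/toxic-xlmr")
    # "auto" behavior: pick CUDA when available, otherwise CPU
    device: str = os.getenv("DEVICE", "cuda" if torch.cuda.is_available() else "cpu")
    max_length: int = int(os.getenv("MAX_LENGTH", "256"))


settings = Settings()
```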
`GET /health`

Response:

```json
{ "status": "ok", "model_loaded": true, "device": "cuda" }
```

`GET /`
`POST /v1/predict`

Request:

```json
{ "text": "you are so dumb", "threshold": 0.5 }
```

Response:

```json
{
  "label": "toxic",
  "probs": { "clean": 0.12, "toxic": 0.88 },
  "latency_ms": 4.2
}
```
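For example, calling the endpoint from Python (assumes the server is running locally on port 8000):

```python
import requests

# Send one text for classification and print the parsed JSON response
resp = requests.post(
    "http://localhost:8000/v1/predict",
    json={"text": "you are so dumb", "threshold": 0.5},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # {"label": "toxic", "probs": {...}, "latency_ms": ...}
```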
`POST /v1/batch_predict`

Request:

```json
{ "texts": ["you suck", "have a nice day"], "threshold": 0.5 }
```

Response:

```json
{
  "results": [
    { "label": "toxic", "probs": { "clean": 0.09, "toxic": 0.91 }, "latency_ms": 2.1 },
    { "label": "clean", "probs": { "clean": 0.97, "toxic": 0.03 }, "latency_ms": 2.1 }
  ],
  "latency_ms": 4.2
}
```

Build & run:
```bash
docker build -t chatguard-api .
docker run --rm -p 8000:8000 \
  -e MODEL_ID=your-username/toxic-xlmr \
  -e DEVICE=cpu \
  chatguard-api
```

- For GPU: set `DEVICE=cuda` and ensure CUDA drivers are available.
- Prefer one worker per GPU. For CPU-bound scaling:
```bash
gunicorn -k uvicorn.workers.UvicornWorker -w 4 app.main:app --bind 0.0.0.0:8000
```
- Pin model revisions in `MODEL_ID` for reproducible deployments (e.g., `user/model@sha`); see the sketch below.
- Consider enabling request timeouts and reverse proxying behind Traefik/Caddy.
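The `@revision` suffix is not parsed by `from_pretrained` itself; a loader has to split it out and pass `revision=` explicitly. A minimal sketch, assuming the `MODEL_ID` format described above:

```python
from transformers import AutoModelForSequenceClassification

model_id = "your-username/toxic-xlmr@abc1234"  # example pinned revision (hypothetical)

# Split the optional "@revision" suffix and pass it to from_pretrained;
# with no suffix, revision falls back to the default branch.
repo_id, _, revision = model_id.partition("@")
model = AutoModelForSequenceClassification.from_pretrained(
    repo_id, revision=revision or None
)
```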
This API expects a Hugging Face repo containing a binary classifier with standard files:
- `pytorch_model.bin`
- `config.json`
- `tokenizer.json`
- `tokenizer_config.json`
- `special_tokens_map.json`
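A quick way to check that a repo is compatible before pointing the API at it (sketch; assumes a two-label classifier mapped to clean/toxic):

```python
from transformers import AutoConfig

# Load only the config to confirm the repo is a binary classifier
config = AutoConfig.from_pretrained("cedrugs/toxic-xlmr")
assert config.num_labels == 2, "expected a binary classifier"
print(config.id2label)  # e.g. {0: "clean", 1: "toxic"}
```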
Pushing to Hub example:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your fine-tuned checkpoint (example local path; adjust as needed)
model = AutoModelForSequenceClassification.from_pretrained("./toxic-xlmr-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./toxic-xlmr-finetuned")

model.push_to_hub("toxic-xlmr")
tokenizer.push_to_hub("toxic-xlmr")
```

MIT
Built with FastAPI, Transformers, and PyTorch. Deployed anywhere from laptops to GPUs in the cloud.