Simple Python scripts to train Isolation Forest and XGBoost models on the UNSW-NB15 dataset for a DMZ Gateway proof-of-concept.
Train two ML models for malicious traffic detection:
- Isolation Forest - Anomaly detection (unsupervised)
- XGBoost - Binary classification (supervised)
Export both to ONNX format for deployment in Rust on a Raspberry Pi.
.
├── 1_preprocess_data.py           # Data loading and preprocessing
├── 2_train_isolation_forest.py    # Train anomaly detection model
├── 3_train_xgboost.py             # Train classification model
├── 4_export_to_onnx.py            # Export models to ONNX
├── requirements.txt               # Python dependencies
├── data/                          # Dataset directory (create this)
│   └── [UNSW-NB15 CSV files here]
└── models/                        # Trained models (created automatically)
    ├── *.pkl                      # Pickle models
    ├── *.json                     # XGBoost native format
    └── *.onnx                     # ONNX models for Rust
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Download UNSW-NB15 from: https://research.unsw.edu.au/projects/unsw-nb15-dataset
You need these files:
- UNSW_NB15_training-set.csv
- UNSW_NB15_testing-set.csv
Place them in the ./data/ directory:
mkdir -p data models
# Copy CSV files to ./data/
Execute scripts in order:
# Step 1: Preprocess data
python 1_preprocess_data.py
# Step 2: Train Isolation Forest
python 2_train_isolation_forest.py
# Step 3: Train XGBoost
python 3_train_xgboost.py
# Step 4: Export to ONNX
python 4_export_to_onnx.py
After training, you'll have:
models/
├── isolation_forest.pkl       # Scikit-learn model
├── isolation_forest.onnx      # ONNX model for Rust
├── xgboost_classifier.pkl     # XGBoost pickle
├── xgboost_classifier.json    # XGBoost native
└── xgboost_classifier.onnx    # ONNX model for Rust
The .onnx files are ready for Rust deployment!
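Before copying anything to the Raspberry Pi, it can help to confirm that every expected file was actually written. The following is a minimal sketch, not part of the scripts themselves: the helper name `missing_artifacts` is hypothetical, and the file names are assumed to match the listing above.

```python
from pathlib import Path

# File names assumed from the listing above; adjust them if your
# scripts write different names.
EXPECTED_ARTIFACTS = (
    "isolation_forest.pkl",
    "isolation_forest.onnx",
    "xgboost_classifier.pkl",
    "xgboost_classifier.json",
    "xgboost_classifier.onnx",
)


def missing_artifacts(models_dir="models"):
    """Return the expected files that are absent or empty in models_dir."""
    root = Path(models_dir)
    missing = []
    for name in EXPECTED_ARTIFACTS:
        path = root / name
        # A zero-byte file usually means an export step failed midway.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_artifacts("models")` after step 4 returns an empty list when all five files exist and are non-empty.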
Based on UNSW-NB15 research papers:
| Model | Expected Accuracy | Notes |
|---|---|---|
| Isolation Forest | 85-95% | Unsupervised anomaly detection |
| XGBoost | 95-99% | Supervised classification |
Note: These are baseline models for hardware demonstration. Production deployments should train on actual network traffic.
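To compare your own runs against the table, both models need to be scored the same way. The sketch below assumes the UNSW-NB15 convention of label 1 = attack and 0 = normal, and relies on scikit-learn's `IsolationForest.predict` returning +1 for inliers and -1 for outliers. The helper names are hypothetical, and precision and recall are included because accuracy alone can hide missed attacks.

```python
def to_binary_labels(iforest_predictions):
    """Map IsolationForest.predict output (+1 inlier, -1 outlier)
    to the assumed dataset convention: 0 = normal, 1 = attack."""
    return [1 if p == -1 else 0 for p in iforest_predictions]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

For the Isolation Forest, pass its raw predictions through `to_binary_labels` first; XGBoost's 0/1 predictions can go into `binary_metrics` directly.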
Edit 2_train_isolation_forest.py:
model = IsolationForest(
    n_estimators=100,      # Number of trees (increase for better accuracy)
    contamination=0.1,     # Expected fraction of anomalies, here 10% (tune this)
    max_samples='auto',    # Samples per tree
    random_state=42
)
Edit 3_train_xgboost.py:
model = xgb.XGBClassifier(
    n_estimators=100,      # Number of boosting rounds
    max_depth=6,           # Tree depth (increase carefully)
    learning_rate=0.3,     # Step size (lower = slower but more accurate)
    random_state=42
)
Add to Cargo.toml:
[dependencies]
ort = "1.16"
ndarray = "0.15"
Example Rust code (written against the ort 1.16 API; ort 2.x renames several of these calls):
use ndarray::{Array2, CowArray};
use ort::{Environment, SessionBuilder, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load models (sessions take the environment as an Arc)
    let env = Environment::builder().build()?.into_arc();
    let iforest = SessionBuilder::new(&env)?
        .with_model_from_file("isolation_forest.onnx")?;
    let xgboost = SessionBuilder::new(&env)?
        .with_model_from_file("xgboost_classifier.onnx")?;

    // Prepare input (example: 1 sample with 43 features;
    // the width must match the feature count used in training)
    let input = CowArray::from(Array2::<f32>::zeros((1, 43))).into_dyn();

    // Run inference; each session builds its tensor with its own allocator
    let iforest_outputs =
        iforest.run(vec![Value::from_array(iforest.allocator(), &input)?])?;
    let xgboost_outputs =
        xgboost.run(vec![Value::from_array(xgboost.allocator(), &input)?])?;

    // Process results...
    println!("Inference complete!");
    Ok(())
}
The Isolation Forest ONNX conversion may have compatibility issues with some scikit-learn versions. If export fails:
- Try downgrading: pip install scikit-learn==1.3.0
- Or use the pickle model directly in a Python service
- The XGBoost model should export successfully
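When an export fails, the first thing worth knowing is which package versions are actually installed in the active environment. The sketch below uses only the standard library. The helper name `installed_versions` is hypothetical, and the package list is an assumption based on the converters this project installs.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used with pip; assumed to cover the export toolchain.
PACKAGES = (
    "scikit-learn",
    "xgboost",
    "onnx",
    "onnxruntime",
    "skl2onnx",
    "onnxmltools",
)


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def format_report(versions):
    """Render one 'name==version' line per package for bug reports."""
    lines = []
    for name, ver in versions.items():
        lines.append(f"{name}=={ver}" if ver else f"{name} (not installed)")
    return "\n".join(lines)
```

Running `print(format_report(installed_versions()))` inside the virtual environment gives a short report to compare against the pinned versions in requirements.txt.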
XGBoost ONNX models can be large (10-50 MB). For Raspberry Pi:
- Reduce n_estimators (e.g., 50 instead of 100)
- Reduce max_depth (e.g., 4 instead of 6)
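To see why both settings matter, it helps to bound the number of tree nodes the model can contain. A binary tree of depth d has at most 2^(d+1) - 1 nodes, and a binary classifier adds one tree per boosting round. The sketch below computes only this upper bound. Real trees are usually smaller, and the actual file size depends on how the model is serialized, so treat the numbers as a relative guide. The function names are hypothetical.

```python
def max_tree_nodes(max_depth):
    """Upper bound on nodes in one binary tree of the given depth."""
    return 2 ** (max_depth + 1) - 1


def max_ensemble_nodes(n_estimators, max_depth):
    """Upper bound on total nodes, assuming one tree per boosting round."""
    return n_estimators * max_tree_nodes(max_depth)


baseline = max_ensemble_nodes(n_estimators=100, max_depth=6)
reduced = max_ensemble_nodes(n_estimators=50, max_depth=4)
print(f"baseline bound: {baseline} nodes, reduced bound: {reduced} nodes")
```

Under these assumptions the bound drops from 12,700 nodes to 1,550, so halving the rounds and trimming two levels of depth shrinks the worst case by roughly a factor of eight.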
If you use the UNSW-NB15 dataset, please cite:
Moustafa, Nour, and Jill Slay. "UNSW-NB15: a comprehensive data set for
network intrusion detection systems (UNSW-NB15 network data set)."
Military Communications and Information Systems Conference (MilCIS), 2015.
This is a proof-of-concept!
- Models are trained on UNSW-NB15 (2015 synthetic data)
- Real deployments need training on actual network traffic
- Customers should train their own models for their specific environment
- Goal: Demonstrate hardware capability, not production-ready models
# Make sure CSV files are in ./data/
ls -la ./data/
# Reduce data size in preprocessing script
train_df = train_df.sample(n=50000) # Use subset
# Update packages
pip install --upgrade onnx onnxruntime skl2onnx onnxmltools
- UNSW-NB15 Dataset: https://research.unsw.edu.au/projects/unsw-nb15-dataset
- ONNX Runtime: https://onnxruntime.ai/
- sklearn-onnx: https://onnx.ai/sklearn-onnx/
- Rust ort crate: https://github.com/pykeio/ort
Next Steps: Deploy these ONNX models in your Rust DMZ Gateway!