A machine learning project for classifying Kepler Objects of Interest (KOI) using multiple approaches including deep learning (GP+CNN pipeline), traditional ML (Random Forest, XGBoost), and neural networks.
This project aims to identify potential exoplanets from NASA's Kepler mission data by analyzing light curves and extracted features. The project implements a complete ML pipeline from Gaussian Process denoising to deep learning classification.
- GP Denoising: Gaussian Process regression for light curve preprocessing
- TLS Search: Transit Least Squares for period detection
- Deep Learning: CNN-based classification pipeline (GP+CNN)
- Traditional ML: Random Forest, XGBoost with GPU acceleration
- Neural Networks: Multiple architectures (MLP, 1D-CNN, GP+CNN)
- Comprehensive Benchmarking: CPU vs GPU performance comparison
- `tsfresh_features.csv` (21.6 MB): time-series features extracted using the TSFresh library
  - Contains ~3,500 samples with extracted statistical features
  - Used for traditional ML models (Random Forest, XGBoost, MLP)
- `q1_q17_dr25_koi.csv`: Kepler Objects of Interest catalog (Quarters 1-17, Data Release 25)
  - Contains KOI metadata: period, t0, duration, disposition
  - Fields: `kepid`, `kepoi_name`, `kepler_name`, `koi_disposition`, `koi_pdisposition`
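As an illustration of how the catalog's `koi_disposition` field could be turned into binary training labels, here is a minimal sketch using only the standard library. It is not the project's loader: the function name `load_labels`, the sample rows, and the choice to drop (or optionally keep) `CANDIDATE` rows are assumptions, as is the set of disposition values (`CONFIRMED`, `CANDIDATE`, `FALSE POSITIVE`).

```python
import csv
import io

# Tiny in-memory stand-in for q1_q17_dr25_koi.csv; every row here is made up.
SAMPLE_CSV = """\
# archive downloads may start with comment lines like this one
kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition
1000001,K00001.01,Kepler-X b,CONFIRMED,CANDIDATE
1000002,K00002.01,,FALSE POSITIVE,FALSE POSITIVE
1000003,K00003.01,,CANDIDATE,CANDIDATE
"""


def load_labels(handle, keep_candidates=False):
    """Map kepoi_name -> 1 (planet) or 0 (false positive).

    Hypothetical helper: by default CANDIDATE rows are left out, because
    their true class is not yet known.
    """
    # Skip '#' comment lines before handing the rows to DictReader.
    rows = csv.DictReader(line for line in handle if not line.startswith("#"))
    labels = {}
    for row in rows:
        disposition = row["koi_disposition"].strip().upper()
        if disposition == "CONFIRMED":
            labels[row["kepoi_name"]] = 1
        elif disposition == "FALSE POSITIVE":
            labels[row["kepoi_name"]] = 0
        elif disposition == "CANDIDATE" and keep_candidates:
            labels[row["kepoi_name"]] = 1
    return labels


labels = load_labels(io.StringIO(SAMPLE_CSV))
print(labels)
```

To use it on the real file, pass an open file handle instead of the `io.StringIO` object. Whether candidates belong in the positive class depends on the experiment, so the flag is left off by default.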
- Remove columns with single unique values
- Filter out samples with infinity values
- Fill NaN values with zero or column mean
- StandardScaler normalization
- Train/Val/Test split: varies by model (typically 90%/5%/5%)
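The preprocessing steps above can be sketched with NumPy as follows. This is a minimal illustration, not the code used by the project's scripts: the function name `preprocess` and its parameters are hypothetical, and the sketch fits the fill values and scaling statistics on the training split only (a common way to avoid leakage, which may differ from what the scripts do). The standardization mirrors what `StandardScaler` computes (zero mean, unit population standard deviation).

```python
import numpy as np


def preprocess(X, y, fill="mean", seed=0):
    """Clean a feature matrix and return standardized train/val/test splits."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # 1. Remove columns with a single unique value (NaNs ignored).
    keep = [j for j in range(X.shape[1])
            if np.unique(X[~np.isnan(X[:, j]), j]).size > 1]
    X = X[:, keep]

    # 2. Filter out samples (rows) that contain +/- infinity.
    finite = ~np.isinf(X).any(axis=1)
    X, y = X[finite], y[finite]

    # 3. Shuffle once, then split 90% / 5% / 5%.
    order = np.random.default_rng(seed).permutation(len(X))
    n_train = int(0.90 * len(X))
    n_val = int(0.05 * len(X))
    train, val, test = np.split(order, [n_train, n_train + n_val])

    # 4. Fill NaNs with the training-split column mean (or with zero).
    col_mean = np.nanmean(X[train], axis=0)
    fill_value = col_mean if fill == "mean" else np.zeros_like(col_mean)
    X = np.where(np.isnan(X), fill_value, X)

    # 5. Standardize with training-split statistics.
    mu = X[train].mean(axis=0)
    sigma = X[train].std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against columns constant within the split
    X = (X - mu) / sigma

    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])


# Synthetic demo data (not project data).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 2] = 7.0      # constant column, dropped in step 1
X[3, 0] = np.inf   # row with infinity, dropped in step 2
X[5, 1] = np.nan   # missing value, filled in step 4
y = rng.integers(0, 2, size=200)

(train_X, train_y), (val_X, val_y), (test_X, test_y) = preprocess(X, y)
print(train_X.shape, val_X.shape, test_X.shape)
```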
model/
├── app/ # Application modules
│ ├── models/
│ │ └── cnn1d.py # Two-Branch 1D-CNN implementation
│ ├── data/
│ │ └── fold.py # Phase folding & view construction
│ ├── trainers/
│ │ ├── cnn1d_trainer.py # Training loop for CNN
│ │ └── utils.py # Utility functions
│ ├── calibration/
│ │ └── calibrate.py # Model calibration utilities
│ ├── denoise/ # GP denoising modules
│ ├── search/ # TLS period search
│ └── validation/ # Validation utilities
├── notebooks/ # Jupyter notebooks
│ ├── 03b_cnn_train.ipynb # CNN training
│ └── 04_newdata_inference.ipynb # Inference pipeline
├── scripts/ # Executable scripts
│ ├── benchmarks/ # Performance benchmarking
│ │ ├── complete_gpcnn_benchmark.py # Complete GP+CNN benchmark
│ │ ├── ultraoptimized_benchmark.py # Ultra-optimized comparison
│ │ ├── ultraoptimized_cpu_models.py # CPU-optimized models
│ │ ├── ultraoptimized_gpu_models.py # GPU-optimized models
│ │ └── visualize_gpcnn_comparison.py # Visualization tools
│ └── legacy/ # Legacy training scripts
│ ├── koi_project_nn.py # Simple neural network (MLP)
│ ├── train_rf_v1.py # Random Forest classifier
│ └── xgboost_koi.py # XGBoost classifier
├── data/ # Data files (gitignored if large)
│ ├── tsfresh_features.csv # Extracted features
│ └── q1_q17_dr25_koi.csv # KOI catalog
├── reports/ # Generated reports & results
│ ├── figures/ # Plots and visualizations (PDF/PNG)
│ ├── results/ # Model results (JSON)
│ ├── FINAL_GPU_BENCHMARK_REPORT.txt
│ ├── FINAL_GPU_RESULTS_REPORT.md
│ ├── GP_CNN_COMPLETE_ANALYSIS.md
│ └── ULTRA_OPTIMIZATION_FINAL_REPORT.md
├── SPECS/ # Technical specifications
│ ├── 1D_CNN_SPEC.md # CNN architecture spec
│ ├── PIPELINE_SPEC.md # Full pipeline specification
│ └── INTEGRATION_PLAN.md # Integration guidelines
├── prompts/ # Development workflow prompts
│ └── claude-commands.md # Claude Code automation
├── docs/ # Documentation
├── patches/ # Code patches for upgrades
├── .claude/ # Claude Code configuration
├── CLAUDE.md # Development guide
├── CITATIONS.md # References
├── README_UPGRADE.md # Upgrade instructions
├── requirements.txt # Python dependencies
└── README.md # This file
`pip install -r requirements.txt`

Key dependencies:
- `torch`: deep learning framework
- `numpy`, `pandas`: data processing
- `scikit-learn`: traditional ML
- `xgboost`: gradient boosting
- `transitleastsquares`: period detection
- `celerite2` or `starry_process`: GP denoising (optional)
`python scripts/benchmarks/complete_gpcnn_benchmark.py`

Runs a comprehensive benchmark covering:
- GP+CNN pipeline
- Neural Networks (Simple MLP, Heavy NN)
- XGBoost (GPU & CPU)
- Random Forest (CPU)

`python scripts/benchmarks/ultraoptimized_benchmark.py`

Tests 2025 best practices:
- GPU optimizations (cuDNN, TF32, Mixed Precision)
- CPU optimizations (Intel MKL, OpenMP)
- Comparison across all model types

`python scripts/benchmarks/visualize_gpcnn_comparison.py`

Generates comparison plots and reports.
`python scripts/legacy/koi_project_nn.py`

- 3-layer MLP: 256→64→1
- BatchNorm + GELU activation
- AdamW optimizer (lr=3e-5)

`python scripts/legacy/train_rf_v1.py`

- Tuned hyperparameters: depth=8, n_estimators=200
- Grid search with cross-validation

`python scripts/legacy/xgboost_koi.py`

- Gradient boosting with tree-based learning
Pipeline:
- GP Denoising: Remove systematics using Gaussian Process regression
- TLS Search: Detect periods with Transit Least Squares
- CNN Classification: Deep learning on denoised light curves
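Between the period search and the CNN, a light curve is typically phase-folded on the detected period and binned into a fixed-length view. The sketch below illustrates that idea with the standard library only. It is not the code in `app/data/fold.py`; the function names, the bin count, and the synthetic box-shaped transit are all assumptions made for the example.

```python
import math


def phase_fold(times, flux, period, t0):
    """Fold a light curve so the transit sits at phase 0 (phases near [-0.5, 0.5))."""
    folded = []
    for t, f in zip(times, flux):
        phase = ((t - t0) / period + 0.5) % 1.0 - 0.5
        folded.append((phase, f))
    folded.sort(key=lambda pair: pair[0])
    return folded


def binned_view(folded, n_bins):
    """Average the folded flux in equal-width phase bins (a fixed-length view)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for phase, f in folded:
        index = min(int((phase + 0.5) * n_bins), n_bins - 1)
        sums[index] += f
        counts[index] += 1
    # Empty bins are reported as NaN rather than silently filled.
    return [s / c if c else math.nan for s, c in zip(sums, counts)]


# Synthetic box-shaped transit: period 2 d, epoch 0.5 d, 1% depth, 0.2 d duration.
period, t0, depth, duration = 2.0, 0.5, 0.01, 0.2


def in_transit(t):
    phase = ((t - t0) / period + 0.5) % 1.0 - 0.5
    return abs(phase) * period < duration / 2


times = [0.02 * i for i in range(1000)]
flux = [1.0 - depth if in_transit(t) else 1.0 for t in times]

view = binned_view(phase_fold(times, flux, period, t0), n_bins=21)
print(f"center bin: {view[10]:.4f}, edge bin: {view[0]:.4f}")
```

With an odd number of bins the transit lands in the central bin, which makes the dip easy to locate in the resulting view.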
Architecture:
- GP simulator: Linear(input) → 1024 → 2048
- CNN layers: Conv1D blocks with BatchNorm
- Classifier: FC layers with LayerNorm and GELU
- Optimizations: Mixed precision (AMP), GPU acceleration
Key Features:
- Handles raw light curves (no manual feature engineering)
- Multi-scale pattern recognition
- GPU-optimized for fast training
- Best for transit morphology analysis
Random Forest:
- Best for TSFresh features
- No GPU required
- Excellent interpretability
- Achieves ~0.88 ROC-AUC
XGBoost:
- GPU acceleration available
- Fast training on large datasets
- Good balance of speed and accuracy
- Achieves ~0.87 ROC-AUC
Simple MLP:
- Lightweight baseline
- Fast training
- Good for feature-based data
Heavy NN:
- 5-layer deep network
- GPU-optimized
- Mixed precision training
| Model | ROC-AUC | Accuracy | Device | Training Time | GPU Util |
|---|---|---|---|---|---|
| Random Forest | 0.881 | 81.6% | CPU | 14.2s | N/A |
| XGBoost (GPU) | 0.871 | 79.9% | GPU | 3.4s | 84% |
| GP+CNN | 0.734 | 62.9% | GPU | 25.8s | 100% |
| Heavy NN | 0.683 | 65.0% | GPU | 8.9s | 7% |
| Simple MLP | 0.667 | 61.8% | GPU | 2.1s | 0% |
Key Findings:
- Random Forest (CPU) achieves best performance on TSFresh features
- XGBoost shows excellent GPU utilization (84%) with 4x speedup
- GP+CNN is designed for raw light curves and underperforms on extracted features
- Tree-based models excel on tabular data
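The table reports both ROC-AUC and accuracy, and the two do not always rank models the same way (compare GP+CNN and Heavy NN). The sketch below shows why: ROC-AUC depends only on how scores are ordered, while accuracy depends on a decision threshold. It is a plain-Python illustration with made-up scores, not the evaluation code used by the benchmarks; the 0.5 threshold is an assumption.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at a fixed score threshold."""
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


# Made-up scores: perfectly ordered, but all below the 0.5 threshold.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10]

print("ROC-AUC :", roc_auc(labels, scores))
print("Accuracy:", accuracy(labels, scores))
```

A model like this one ranks every planet above every false positive yet still misclassifies all positives at the default threshold, which is one reason calibration or threshold tuning can matter when comparing accuracy figures.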
GPU Models:
- ✅ cuDNN benchmark mode
- ✅ TF32 for Tensor Cores
- ✅ Mixed Precision (AMP)
- ✅ Pinned memory transfers
- ✅ Non-blocking data loading
- ✅ Batch size tuning (divisible by 8)
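For readers unfamiliar with these switches, the following sketch shows how they are commonly enabled in PyTorch. It is an illustration under stated assumptions, not the contents of `ultraoptimized_gpu_models.py`: the helper names are hypothetical, the `scaler` argument is assumed to be a PyTorch `GradScaler` instance, and the import is guarded so the snippet still loads on machines without PyTorch.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable without PyTorch installed
    torch = None


def round_up_to_multiple(batch_size, multiple=8):
    """Smallest multiple of `multiple` that is >= batch_size."""
    return ((batch_size + multiple - 1) // multiple) * multiple


def configure_gpu_backend():
    """Enable cuDNN autotuning and TF32; returns True if settings were applied."""
    if torch is None or not torch.cuda.is_available():
        return False
    torch.backends.cudnn.benchmark = True         # cuDNN benchmark mode
    torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matmuls
    torch.backends.cudnn.allow_tf32 = True        # TF32 for convolutions
    return True


def train_step(model, batch, optimizer, scaler, loss_fn):
    """One mixed-precision step (requires a CUDA device).

    `batch` is assumed to come from a DataLoader created with pin_memory=True,
    so the non_blocking copies below can overlap with compute.
    """
    inputs, targets = (t.to("cuda", non_blocking=True) for t in batch)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


applied = configure_gpu_backend()
print("GPU settings applied:", applied)
print("batch size 250 rounded up:", round_up_to_multiple(250))
```

Rounding the batch size to a multiple of 8 follows the "divisible by 8" item above; whether it helps in practice depends on the GPU and the layer shapes.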
CPU Models:
- ✅ Physical core allocation
- ✅ MKL/OpenMP threading
- ✅ Memory-aligned arrays
- ✅ Intel Extension (if available)
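Thread counts for MKL and OpenMP are usually controlled through environment variables that must be set before NumPy, scikit-learn, or PyTorch is imported. The sketch below shows one way to do that. The helper names are hypothetical, and the physical-core estimate (half the logical core count) is a rough assumption that does not hold on every machine.

```python
import os

# Environment variables read by OpenMP, Intel MKL, and OpenBLAS at load time.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def guess_physical_cores():
    """Rough estimate that assumes two hardware threads per physical core."""
    logical = os.cpu_count() or 1
    return max(1, logical // 2)


def pin_thread_counts(n_threads=None):
    """Set the threading variables; call this before importing numeric libraries."""
    value = str(n_threads or guess_physical_cores())
    for name in THREAD_VARS:
        os.environ[name] = value
    return {name: os.environ[name] for name in THREAD_VARS}


settings = pin_thread_counts(4)
print(settings)
```

Calling `pin_thread_counts()` without an argument falls back to the estimate from `guess_physical_cores()`.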
This project includes automation for development with Claude:
- Setup: Read `CLAUDE.md` for guidelines
- Specs: Review detailed specifications in `SPECS/`
- Prompts: Execute workflows from `prompts/claude-commands.md`
- Benchmarks: Run scripts in `scripts/benchmarks/` for performance testing
To add a new model:
- Implement in `app/models/`
- Add trainer in `app/trainers/`
- Create benchmark in `scripts/benchmarks/`
- Document performance in `reports/`
All benchmark results and reports are stored in reports/:
- Figures: `reports/figures/*.{png,pdf}` (visualizations)
- Results: `reports/results/*.json` (numerical results)
- Reports: `reports/*.md` (analysis and findings)
The transit method detects periodic dips in stellar brightness that occur when an orbiting planet passes in front of its star.
Kepler was a NASA space telescope that monitored more than 150,000 stars (2009-2018).
KOI classification distinguishes true planetary transits from false positives (eclipsing binaries, stellar variability).
If you use this code, please cite:
- Kepler Mission: https://www.nasa.gov/kepler
- NASA Exoplanet Archive
- See `CITATIONS.md` for detailed references
This project analyzes public NASA Kepler mission data.
- Review specifications in `SPECS/`
- Follow coding patterns in `app/`
app/ - Add tests for new features
- Update documentation and benchmarks
Note: This project demonstrates a complete ML pipeline for exoplanet detection, from GP denoising to deep learning classification. The GP+CNN pipeline represents the modern approach for raw light curve analysis, while feature-based models (RF, XGBoost) provide strong baseline performance on extracted features.