A complete, automated pipeline for Next-Generation Sequencing (NGS) exome analysis. Process FASTQ files to annotated variants with a single command.
# 1. Clone the repository (run once)
git clone https://github.com/Babajan-B/Exome-Analysis-End-to-END.git ~/NGS
cd ~/NGS
# 2. Install every dependency (tools, reference genome, databases)
bash install_all.sh
# 3. Run the full pipeline on all FASTQ pairs in ~/NGS/data/
bash ULTIMATE_MASTER_PIPELINE.sh ~/NGS/data 16
After the first run you only need step 3.
The pipeline creates a timestamped ZIP in ~/NGS/ with every report, annotation, and summary.
Input: Paired-end FASTQ files
Output: Annotated variants, functional classifications, quality reports
Automated Steps:
- Quality Control (FastQC)
- Read Trimming (fastp)
- Alignment (BWA-MEM to hg19)
- BAM Processing (SAMtools, GATK)
- Variant Calling (GATK HaplotypeCaller)
- Variant Filtering (PASS only)
- ANNOVAR Annotation (5 databases)
  - refGene
  - ClinVar
  - gnomAD
  - dbSNP (avsnp150)
  - Prediction scores (dbnsfp42a)
- snpEff Annotation
- Zygosity Classification (Heterozygous/Homozygous)
- Functional Separation
  - By Type: SNPs, Insertions, Deletions
  - By Location: Exonic, Non-Exonic
  - By Effect: Nonsynonymous, Synonymous, Stopgain, Frameshift
- VCF Compression & Indexing (for IGV)
- ZIP Archive with all essential results
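The zygosity step in the list above can be pictured as a lookup on the genotype (GT) value of the sample column in the VCF. The sketch below only illustrates that idea and is not taken from the pipeline scripts: the function name `zygosity_from_gt` and the output labels are hypothetical, and it assumes a diploid genotype such as `0/1` or `1|1`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): classify a diploid
# VCF genotype. Accepts a bare GT ("0/1") or a full sample field
# ("0/1:12,9:21"), in which case only the part before the first ':' is used.
zygosity_from_gt() {
    local gt="${1%%:*}"
    local a b
    # Split on '/' (unphased) or '|' (phased).
    IFS='/|' read -r a b <<< "$gt"
    if [[ -z "$a" || -z "$b" || "$a" == "." || "$b" == "." ]]; then
        echo "Unknown"                  # missing or non-diploid call
    elif [[ "$a" != "$b" ]]; then
        echo "Heterozygous"             # e.g. 0/1 or 1/2
    elif [[ "$a" == "0" ]]; then
        echo "Homozygous_Reference"     # 0/0, no variant allele
    else
        echo "Homozygous"               # e.g. 1/1
    fi
}
```

For example, `zygosity_from_gt "0/1"` prints `Heterozygous` and `zygosity_from_gt "1|1"` prints `Homozygous`.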
- One-Command Installation - Everything installed automatically
- One-Command Analysis - From FASTQ to results in one run
- Dual Annotation - Both ANNOVAR and snpEff
- Smart Filtering - PASS variants only, reduces file sizes by 90%
- Functional Classification - Automatic separation by variant type and effect
- Zygosity Information - Het/Hom status for each variant
- IGV-Ready - Compressed, indexed VCFs for genome browser
- Excel-Ready - Tab-delimited TXT files for easy analysis
- Production-Ready - Used in clinical research labs
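PASS-only filtering comes down to keeping the VCF header plus every record whose FILTER column (the 7th tab-separated field) reads `PASS`. A minimal sketch of that idea with plain awk follows; the function name `filter_pass_vcf` is hypothetical, and the pipeline itself may use a different tool for this step.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): keep header lines
# (starting with '#') and records whose FILTER column is exactly PASS.
# Reads the given files, or stdin when called without arguments.
filter_pass_vcf() {
    awk -F'\t' '/^#/ || $7 == "PASS"' "$@"
}

# Usage (uncompressed VCF; pipe through "gzip -dc" for .vcf.gz input):
#   filter_pass_vcf sample.vcf > sample.PASS.vcf
```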
- OS: Linux (Ubuntu 20.04+, CentOS 7+, or similar)
- RAM: 16 GB minimum, 32 GB recommended
- Storage: 100 GB free space
- CPU: 8+ cores (16+ recommended)
- Internet: Required for initial setup
- Python 3.7+
- Java 21 (for GATK, snpEff)
- Perl (for ANNOVAR)
- All bioinformatics tools are installed by install_all.sh
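Since the Java requirement is a specific major version, it can help to check it before installing. The sketch below shows one way to extract the major version from a Java version string; the function names are hypothetical and not part of the repository. It handles both the legacy `1.8.0_292` scheme and the current `21.0.2` scheme.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the major version for a Java version string.
java_major_version() {
    local v="$1"
    v="${v#1.}"             # legacy scheme: 1.8.0_292 -> 8.0_292
    echo "${v%%[^0-9]*}"    # keep the leading digits: 21.0.2 -> 21
}

# Hypothetical helper: major version of the java found on PATH.
# java -version prints a line like: openjdk version "21.0.2" ... (on stderr).
installed_java_major() {
    command -v java >/dev/null 2>&1 || return 1
    java_major_version "$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
}

# Usage:
#   major="$(installed_java_major)" && [ "$major" -ge 21 ] || echo "Java 21+ not found"
```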
git clone https://github.com/Babajan-B/Exome-Analysis-End-to-END.git ~/NGS
cd ~/NGS
bash install_all.sh
This installs:
- FastQC, fastp, BWA, SAMtools, GATK, snpEff
- Reference genome (hg19) with indices
- ANNOVAR with 9 clinical databases
- All dependencies
Time: 45-60 minutes
Storage: ~15 GB
bash CHECK_INSTALLATION.sh
Confirms all tools and databases are ready.
# 1. Add FASTQ files to data directory
mkdir -p ~/NGS/data
# Copy your *_R1.fastq.gz and *_R2.fastq.gz files here
# 2. Run complete pipeline
cd ~/NGS
bash ULTIMATE_MASTER_PIPELINE.sh ~/NGS/data 16
Parameters:
- First argument: Directory containing FASTQ files
- Second argument: Number of CPU threads (default: 16)
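A script with this interface would typically read its two positional parameters along the following lines. This is only a sketch of the pattern, not the code in ULTIMATE_MASTER_PIPELINE.sh; the function name `parse_pipeline_args` and the error messages are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: validate "<fastq_dir> [threads]" and print both
# values tab-separated. The thread count falls back to 16 when omitted.
parse_pipeline_args() {
    local data_dir="${1:-}"
    local threads="${2:-16}"
    if [[ -z "$data_dir" || ! -d "$data_dir" ]]; then
        echo "error: FASTQ directory not found: '${data_dir}'" >&2
        return 1
    fi
    if ! [[ "$threads" =~ ^[1-9][0-9]*$ ]]; then
        echo "error: thread count must be a positive integer, got '${threads}'" >&2
        return 1
    fi
    printf '%s\t%s\n' "$data_dir" "$threads"
}
```

Rejecting a bad directory or thread count up front avoids discovering the mistake hours into a run.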
The pipeline will:
- Auto-detect all sample pairs
- Process each sample sequentially
- Generate comprehensive results
- Create final ZIP archive
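Auto-detection of sample pairs relies on the `*_R1.fastq.gz` / `*_R2.fastq.gz` naming convention. To preview which samples a directory would yield, a sketch like the one below can be used. It assumes that exact suffix convention, and the function name `find_fastq_pairs` is hypothetical rather than taken from the pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one sample name per complete R1/R2 pair.
find_fastq_pairs() {
    local dir="$1" r1 r2 sample
    for r1 in "$dir"/*_R1.fastq.gz; do
        [[ -e "$r1" ]] || continue                 # glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"        # expected mate file
        sample="$(basename "$r1" _R1.fastq.gz)"
        if [[ -e "$r2" ]]; then
            echo "$sample"
        else
            echo "warning: no R2 mate for ${sample}, skipping" >&2
        fi
    done
}

# Usage:
#   find_fastq_pairs ~/NGS/data
```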
One ZIP File: NGS_Results_Complete_[timestamp].zip (~400-600 MB)
Contains:
results/
├── Sample1/
│   ├── annovar/
│   │   ├── annotated_Sample1.hg19_multianno.txt       # ANNOVAR annotation
│   │   ├── annotated_Sample1_with_zygosity.txt        # With Het/Hom status
│   │   ├── annotated_Sample1.hg19_multianno.vcf.gz    # ANNOVAR VCF
│   │   ├── separated_by_type/
│   │   │   ├── SNPs.txt
│   │   │   ├── Insertions.txt
│   │   │   └── Deletions.txt
│   │   ├── functional_classification/
│   │   │   ├── SNPs_Exonic.txt
│   │   │   ├── SNPs_NonExonic.txt
│   │   │   ├── Exonic_Nonsynonymous.txt
│   │   │   ├── Exonic_Synonymous.txt
│   │   │   ├── Exonic_Stopgain.txt
│   │   │   └── Exonic_Frameshift.txt
│   │   └── snpeff/
│   │       ├── Sample1_snpEff_annotated.vcf.gz        # snpEff VCF
│   │       ├── Sample1_snpEff_summary.html            # snpEff report
│   │       └── Sample1_snpEff_summary.csv
│   ├── fastqc/
│   │   ├── Sample1_R1_fastqc.html                     # QC reports
│   │   └── Sample1_R2_fastqc.html
│   └── trimmed/
│       └── fastp_report.html                          # Trimming stats
└── MASTER_ANALYSIS_SUMMARY.txt                        # Overall summary
Excluded from ZIP (to keep size small):
- BAM files (large, 15-20 GB each)
- FASTQ files (large, input data)
- Intermediate files
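One way to build an archive that leaves out the large files is to list the results directory with `find` and skip alignment and read files by extension. The sketch below shows that approach under stated assumptions: the function name `list_archive_files` is hypothetical, the extension list is a guess at what counts as "large", and which intermediate files the real pipeline drops is not specified here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list every file under a results directory except
# BAM files, BAM indexes, and FASTQ files (extensions are assumptions).
list_archive_files() {
    local results_dir="$1"
    find "$results_dir" -type f \
        ! -name '*.bam' ! -name '*.bai' \
        ! -name '*.fastq' ! -name '*.fastq.gz' \
        | sort
}

# Usage (the timestamp format is an assumption; "zip -@" reads the file
# list from stdin):
#   list_archive_files results | zip "NGS_Results_Complete_$(date +%Y%m%d_%H%M%S).zip" -@
```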
- annotated_[SAMPLE]_with_zygosity.txt
  - Main annotation file with zygosity
  - Open in Excel/LibreOffice
  - Filter by:
    - ClinVar significance (Pathogenic/Likely Pathogenic)
    - gnomAD frequency (rare: < 0.01)
    - Functional effect (nonsynonymous, stopgain)
    - Zygosity (Heterozygous/Homozygous)
- functional_classification/Exonic_Nonsynonymous.txt
  - Coding variants that change amino acids
  - High priority for pathogenicity analysis
- functional_classification/Exonic_Stopgain.txt
  - Variants creating premature stop codons
  - Often deleterious
- snpeff/[SAMPLE]_snpEff_summary.html
  - Visual summary of variant effects
  - Statistics and charts
- VCF Files (.vcf.gz)
  - Load into IGV (Integrative Genomics Viewer)
  - Visualize variants in genomic context
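The rare-variant filter described above can also be applied on the command line instead of in a spreadsheet. The sketch below looks up a frequency column by its header name in a tab-delimited annotation file and keeps rows below a threshold. The function name `filter_rare_variants` is hypothetical. The column name has to be read from the header of your own file, because it depends on which gnomAD table ANNOVAR used (for instance `gnomAD_exome_ALL` or `AF`). The sketch also assumes missing values appear as `.` or as an empty field, and it treats them as rare.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Usage: filter_rare_variants <annotation.txt> <freq_column> [max_freq]
filter_rare_variants() {
    local file="$1" col="$2" max="${3:-0.01}"
    awk -F'\t' -v col="$col" -v max="$max" '
        NR == 1 {
            # Locate the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        # Keep variants absent from the database ("." or empty) and
        # variants with a frequency below the threshold.
        $idx == "." || $idx == "" || ($idx + 0) < (max + 0)
    ' "$file"
}
```

The same header-lookup pattern could be reused for other columns, such as a ClinVar significance column, once you have confirmed its name in the header row.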
ANNOVAR (5 databases):
- refGene: Gene-based annotation
- ClinVar: Clinical significance
- gnomAD: Population frequencies (exomes)
- dbSNP (avsnp150): rsIDs
- dbnsfp42a: Prediction scores (SIFT, PolyPhen, CADD, etc.)
snpEff:
- Functional consequences (HIGH/MODERATE/LOW impact)
- Protein change predictions
- Transcript-level annotations
By Variant Type:
- SNPs (Single Nucleotide Polymorphisms)
- Insertions
- Deletions
- MNPs (Multiple Nucleotide Polymorphisms)
By Location:
- Exonic (in coding regions) vs Non-Exonic
- UTR, intronic, intergenic
By Effect:
- Nonsynonymous (amino acid change)
- Synonymous (silent mutation)
- Stopgain (premature termination)
- Frameshift (indel changing reading frame)
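Classification by variant type can be derived from the reference and alternate alleles alone. The sketch below shows one possible rule set, not the pipeline's actual logic; the function name `classify_variant_type` is hypothetical. It accepts both VCF-style alleles and the ANNOVAR convention where `-` stands for an empty allele.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a variant from its REF and ALT alleles.
classify_variant_type() {
    local ref="$1" alt="$2"
    # ANNOVAR writes "-" for an empty allele; treat it as length 0.
    if [[ "$ref" == "-" ]]; then ref=""; fi
    if [[ "$alt" == "-" ]]; then alt=""; fi
    local rl=${#ref} al=${#alt}
    if (( rl == al )); then
        if (( rl == 1 )); then echo "SNP"; else echo "MNP"; fi
    elif (( al > rl )); then
        echo "Insertion"
    else
        echo "Deletion"
    fi
}
```

For example, `classify_variant_type A G` prints `SNP`, and `classify_variant_type - AT` prints `Insertion`. The By Effect categories cannot be derived this way, since they depend on the gene annotation rather than on allele lengths.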
Per Sample (Typical Exome):
- Raw FASTQ: 6-8 GB
- Analysis time: 2.5-3 hours (16 CPUs)
- Storage used: 20-25 GB
- Final ZIP: 150-200 MB
3 Samples:
- Total time: 7-9 hours
- Runs sequentially (each sample starts after the previous one completes)
# Verify installation
bash CHECK_INSTALLATION.sh
# Check disk space
df -h ~
# Check Java version (needs 21+)
java -version

| Error | Solution |
|---|---|
| command not found | Re-run install_all.sh or check PATH |
| Out of memory | Reduce thread count or add more RAM |
| No FASTQ pairs found | Check file naming: *_R1.fastq.gz + *_R2.fastq.gz |
| snpEff class version error | Install Java 21: sudo apt install openjdk-21-jdk |
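For the `command not found` case, a quick check of which executables are visible on PATH narrows the problem down. The sketch below is a generic helper and not the contents of CHECK_INSTALLATION.sh. The function name `check_tools` is hypothetical, and the executable names in the usage line are assumptions about how the tools are installed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether each named executable is on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'ok       %s\n' "$tool"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_tools fastqc fastp bwa samtools gatk java perl
```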
# Pipeline logs for each sample
cat ~/NGS/results/[SAMPLE]/pipeline.log
# Summary of all samples
cat ~/NGS/MASTER_ANALYSIS_SUMMARY.txt
- PROJECT_STRUCTURE.md - Detailed project layout
- RUN_ULTIMATE_PIPELINE.txt - Quick reference guide
- LICENSE - MIT License
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
For major changes, open an issue first.
This pipeline integrates these excellent tools:
- FastQC - Quality control
- fastp - Read preprocessing
- BWA - Read alignment
- SAMtools - BAM/SAM manipulation
- GATK - Variant calling
- ANNOVAR - Variant annotation
- snpEff - Functional annotation
MIT License - see LICENSE file for details.
- Issues: GitHub Issues
- Documentation: Check the docs/ folder
- Updates: Watch the repository for new releases
If you use this pipeline in your research, please cite:
NGS Exome Analysis Pipeline
https://github.com/Babajan-B/Exome-Analysis-End-to-END
Three scripts. One command. Complete analysis.
🧬 Made for researchers, by researchers.