These Python scripts create metrics and graphs from genome assemblies.
To use them, first install the dependencies with:
conda install biopython matplotlib
After that, edit the input.csv file so that it lists your own assemblies.
Then run the script that generates the image and the statistics with:
python AssemblyStats.py input.fasta
or
python AssemblyStats.py input.fasta input.fasta.gz input.fasta.bz2
or
python AssemblyStats.py input.csv
The CSV (comma-separated) file has two columns: Seq_Name and Seq_File.
Spaces are allowed.
The order of the rows in the file defines the order of the entries in the graph legend.
Lines can be commented out with #.
The first line HAS to be the header:
Seq_Name,Seq_File
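The rules above can be sketched in a few lines of Python. This is only an illustration of the file format, not the parser used by AssemblyStats.py: the function name read_assembly_list and the sample names are hypothetical, and it assumes that whitespace around a value is ignored while spaces inside a value are kept.

```python
import csv
import io


def read_assembly_list(handle):
    """Return (name, path) pairs, in file order, from a Seq_Name,Seq_File listing."""
    # Drop blank lines and lines commented out with '#'.
    rows = (line for line in handle
            if line.strip() and not line.lstrip().startswith("#"))
    reader = csv.reader(rows)
    header = [cell.strip() for cell in next(reader)]
    if header != ["Seq_Name", "Seq_File"]:
        raise ValueError("first line must be: Seq_Name,Seq_File")
    # Assumption: surrounding whitespace is stripped, inner spaces are kept.
    # A row without exactly two fields raises ValueError when unpacked.
    return [(name.strip(), path.strip()) for name, path in reader]


# Hypothetical sample names; the second entry is commented out and skipped.
example = io.StringIO(
    "Seq_Name,Seq_File\n"
    "First run,103/assembly.fasta\n"
    "# Final run,103_final/assembly.fasta\n"
    "Combined run,103_pbont/assembly.fasta\n"
)
for name, path in read_assembly_list(example):
    print(name, "->", path)
```

Because the pairs come back in file order, a legend built from them would follow the order of the rows in the CSV.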
To change the graph output, you can set several environment variables:
MAX_CONTIGS
MIN_LENGTH
TITLE
TYPE
EXPECTED_GENOME_SIZE
An example looks like this:
export TITLE="103"
export MAX_CONTIGS=4000
export MIN_LENGTH=100
python ~/A50-plot/AssemblyStats.py 103_assemblies.csv > 103.stats
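For readers wondering how exported settings like these can reach a Python script, here is a minimal sketch using os.environ. It is not taken from AssemblyStats.py: the function name read_plot_settings is hypothetical, and it assumes that MAX_CONTIGS, MIN_LENGTH and EXPECTED_GENOME_SIZE are whole numbers while TITLE and TYPE are free text.

```python
import os


def read_plot_settings(environ=os.environ):
    """Collect the plot-related settings from environment variables."""
    def optional_int(key):
        # Unset or empty variables become None instead of a guessed default.
        value = environ.get(key, "")
        return int(value) if value.strip() else None

    return {
        # Assumption: these three are integers.
        "MAX_CONTIGS": optional_int("MAX_CONTIGS"),
        "MIN_LENGTH": optional_int("MIN_LENGTH"),
        "EXPECTED_GENOME_SIZE": optional_int("EXPECTED_GENOME_SIZE"),
        # Assumption: these two are plain strings.
        "TITLE": environ.get("TITLE"),
        "TYPE": environ.get("TYPE"),
    }


# Same values as the exports above, passed as a plain dict for illustration.
settings = read_plot_settings(
    {"TITLE": "103", "MAX_CONTIGS": "4000", "MIN_LENGTH": "100"}
)
print(settings)
```

Variables that are not exported simply come back as None here, so it is easy to tell a missing setting from one that was set on purpose.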
To convert the output into a human-readable table, you can use:
cat 103.stats | tr ',' '\t' | numfmt --header --field 2-5,7-18,20 --grouping | numfmt --header --field 6,19,21 --format '%.1f' | column -t
This then results in:
Name                      Count  Sum            Max         Min  Average   Median  N50         L50  NG50  LG50  N90        L90  N95        L95  Count>1000  Count>10000  #GC            GC    #N     N
103/assembly.fasta        3,467  3,002,559,922  53,397,245  104  866039.8  6,152   10,128,795  77   0     0     2,373,600  302  1,366,482  382  2,897       1,327        1,045,685,741  34.9  5,100  0.1
103_final/assembly.fasta  3,636  3,006,248,922  40,557,336  104  826801.2  5,803   11,571,573  76   0     0     2,427,154  296  1,406,690  374  3,019       1,323        1,047,101,522  34.9  4,800  0.1
103_pbont/assembly.fasta  3,547  3,007,150,960  38,923,932  100  847801.3  5,785   11,975,311  76   0     0     2,910,579  273  1,394,760  344  2,870       1,297        1,047,384,588  34.9  5,500  0.1
Because EXPECTED_GENOME_SIZE was not set for this run, NG50 and LG50 are reported as zero.
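To see why NG50 and LG50 depend on the expected genome size, here is a minimal sketch of the usual definitions. It is not the code from AssemblyStats.py, and the function name n_and_l is hypothetical. N50 is the length of the contig at which the cumulative length, counted from the longest contig down, reaches half of the assembly total; L50 is the number of contigs needed to get there. NG50 and LG50 are the same, except that the target is half of the expected genome size.

```python
def n_and_l(lengths, percent=50, genome_size=None):
    """Return (Nx, Lx) for a list of contig lengths.

    With genome_size=None the assembly total is the basis (N50/L50, N90/L90, ...).
    With an expected genome size the result is the NGx/LGx variant.
    """
    basis = genome_size if genome_size is not None else sum(lengths)
    target = basis * percent / 100
    running = 0
    # Walk from the longest contig down until the target is covered.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    # The contigs never cover the target (or there are no contigs).
    return 0, 0


contigs = [80, 70, 50, 40, 30, 20, 10]            # toy lengths, total 300
print(n_and_l(contigs))                            # N50, L50   -> (70, 2)
print(n_and_l(contigs, percent=90))                # N90, L90   -> (30, 5)
print(n_and_l(contigs, genome_size=400))           # NG50, LG50 -> (50, 3)
```

Without an expected genome size there is no target to measure against for the NG/LG variants, which is consistent with the zeros in the table above.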
The image output of the script looks like this:
