ArraySplitter decomposes satellite DNA arrays into monomers within telomere-to-telomere (T2T) assemblies. Ideal for analyzing centromeric and pericentromeric regions at the monomer level.
Status: Production ready. Successfully handles arrays from kilobase to megabase scale.
Key Features:
- De novo monomer identification without prior knowledge
- Autocorrelation-based period detection for robust periodicity analysis
- Automatic orientation to canonical form (A>T, C>G)
- Deterministic output sorted by genomic coordinates
- Multi-threaded processing
Performance: CHM13v2.0 assembly (~1300 alpha satellite arrays) processes in ~3.5 minutes (16 threads)
pip install arraysplitter

Or build from source:

cd src/rust/arraysplitter
cargo build --release

# Basic decomposition
arraysplitter -i arrays.fa -o output_prefix -t 16
# With predefined cut sequences
arraysplitter -i arrays.fa -o output_prefix -c ATG,CGCG -t 16
# Show version
arraysplitter --version

- fwd = sequence kept as-is (canonical form: more A's than T's)
- rev = sequence was reverse-complemented to canonical form
Why it matters: Satellite arrays on opposite strands appear as different sequences. Canonical orientation ensures consistent comparison across the genome.
ed_tmpl measures how different each monomer is from the consensus sequence of all monomers in the array.
- Low ed_tmpl (0-5) = monomer is very similar to consensus → typical/canonical monomer
- High ed_tmpl (>20) = monomer is divergent → possibly a variant, mutation hotspot, or misalignment
Use case: Filter out divergent monomers, identify conserved vs variable positions.
ed_prev and ed_next measure how different each monomer is from its adjacent monomers (previous and next).
- Low ed_prev/ed_next = adjacent monomers are similar → homogeneous region
- High ed_prev/ed_next = sudden change → possible HOR boundary, structural variant, or sequence transition
Use case: Detect HOR structure boundaries, identify recombination breakpoints.
autocorr measures periodicity strength (0.0 to 1.0):
- High autocorr (>0.7) = strong, regular periodicity → well-conserved tandem repeat
- Medium autocorr (0.5-0.7) = detectable but irregular periodicity
- Low autocorr (<0.5) = weak or no periodicity → degenerate or non-repetitive
cv (coefficient of variation) is the standard deviation of monomer lengths divided by the mean length:
- Low cv (<0.1) = uniform monomer sizes → canonical repeat structure
- High cv (>0.2) = variable sizes → insertions/deletions, heterogeneous array
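As a rough illustration of the ratio described above, the following Python sketch computes a coefficient of variation from a list of monomer lengths. The function name `length_cv` is hypothetical, and the use of the population standard deviation is an assumption; ArraySplitter may use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def length_cv(lengths):
    """Coefficient of variation: standard deviation / mean.

    Assumption: population standard deviation (pstdev), not sample.
    """
    if not lengths:
        raise ValueError("need at least one length")
    return pstdev(lengths) / mean(lengths)


# Illustrative length lists, not taken from a real run.
uniform = length_cv([171, 171, 170, 171])
variable = length_cv([171, 120, 230, 171])
print(f"uniform: {uniform:.3f}, variable: {variable:.3f}")
```

With these made-up lengths, the first list falls below the 0.1 cutoff and the second above the 0.2 cutoff quoted in the list.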
ed_per_bp is the edit distance divided by sequence length, which allows comparison across different monomer sizes:
- 0.00-0.02 = highly conserved
- 0.02-0.05 = moderately divergent
- >0.05 = significantly divergent
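To make the normalization concrete, here is a minimal Python sketch. It assumes the edit distance is plain Levenshtein distance and that the divisor is the monomer's own length; neither detail is confirmed by this README. The names `levenshtein`, `ed_per_bp`, and `divergence_class` are hypothetical helpers, and the handling of values exactly on a boundary is a guess.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def ed_per_bp(monomer, template):
    """Edit distance divided by monomer length (assumed divisor)."""
    return levenshtein(monomer, template) / len(monomer)


def divergence_class(value):
    """Map a normalized distance onto the three bands listed above."""
    if value <= 0.02:
        return "highly conserved"
    if value <= 0.05:
        return "moderately divergent"
    return "significantly divergent"


# One substitution in a 10 bp toy monomer gives 0.1.
score = ed_per_bp("ACGTACGTAC", "ACGTACGTAA")
print(score, divergence_class(score))
```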
All output is deterministically sorted by chromosome and genomic position (chr1 → chr22 → chrX → chrY → chrM).
| File | Description |
|---|---|
| `.decomposed.fasta` | Monomers with orientation info in headers |
| `.hors.tsv` | HOR-level decomposition (16 columns) |
| `.monomers.tsv` | Base-level monomers from recursive decomposition (17 columns) |
| `.summary.tsv` | One-row-per-array summary with HOR and monomer statistics (23 columns) |
| `.lengths` | Fragment lengths for each array |
.summary.tsv has one row per array, combining HOR-level and monomer-level statistics. Useful for overview analysis.
| Column | Description |
|---|---|
| `array_id` | Array identifier (chr_start_end_len_period_type) |
| `array_length` | Total array length in bp |
| `orientation` | fwd or rev (reverse complemented to canonical) |
| `method` | Detection method used (autocorr, classic) |
| HOR-level stats | |
| `hor_period` | Detected HOR period in bp |
| `hor_autocorr` | Autocorrelation at HOR period |
| `hor_n_monomers` | Number of HOR-level monomers |
| `hor_mean_ed_tmpl` | Mean edit distance to HOR consensus |
| `hor_mean_ed_prev` | Mean edit distance between adjacent HORs |
| `hor_cv` | Coefficient of variation for HOR lengths |
| `hor_consensus` | Consensus sequence at HOR level |
| `hor_iupac` | IUPAC ambiguity codes (bases ≥20% frequency) |
| `hor_quality` | Per-position support (digit 0-9, 9=90-100%) |
| Monomer-level stats | |
| `mono_period` | Median base monomer period |
| `mono_autocorr` | Mean autocorrelation at monomer level |
| `mono_n_monomers` | Total number of base monomers |
| `mono_mean_ed_tmpl` | Mean edit distance to monomer consensus |
| `mono_mean_ed_prev` | Mean edit distance between adjacent monomers |
| `mono_cv` | Mean coefficient of variation |
| `mono_consensus` | Consensus sequence at monomer level |
| `mono_iupac` | IUPAC ambiguity codes |
| `mono_quality` | Per-position support |
| `cut_sequence` | Anchor k-mer used for splitting |
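The `*_consensus`, `*_iupac`, and `*_quality` columns describe the same alignment column in three ways. The sketch below shows one plausible way such values could be derived for a single position. The 20% threshold and the 0-9 digit scale come from the table; the function name `summarize_column`, the tie-breaking rule, and the restriction to A/C/G/T are assumptions made for illustration.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def summarize_column(bases, min_freq=0.20):
    """Return (consensus base, IUPAC code, quality digit) for one position.

    Assumes `bases` is a non-empty string containing only A, C, G, T.
    """
    counts = Counter(bases)
    total = len(bases)
    # Most frequent base; ties go to the alphabetically first base (assumption).
    consensus, top = max(sorted(counts.items()), key=lambda kv: kv[1])
    # Keep every base that reaches the frequency threshold.
    kept = "".join(sorted(b for b, c in counts.items() if c / total >= min_freq))
    # Digit 0-9 where 9 covers 90-100% support.
    quality = min(9, int(10 * top / total))
    return consensus, IUPAC[kept], str(quality)


print(summarize_column("AAAAAAAGGT"))
```

Here A is seen 7 times out of 10 and G twice, so G just reaches the 20% threshold and the position is reported as R (A or G) with support digit 7.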
.hors.tsv contains the primary decomposition into HOR (Higher Order Repeat) monomers, with multiple rows per array.
Row types (in order):
- `pred_array` - Array-level prediction/header row
- `flank` - Terminal fragments <70% of period
- `monomer` - Full HOR monomers (sorted by idx)
- `array` - Summary statistics row
- `consensus` - Consensus sequence row
| Column | Description |
|---|---|
| `array_id` | Array identifier (chr_start_end_len_period_type) |
| `type` | pred_array, monomer, flank, array, consensus |
| `idx` | Monomer index within array (0-based) |
| `length` | Sequence length in bp |
| `source` | Detection method: anchor, split_2x, split_3x, left_flank, right_flank |
| `ed_tmpl` | Edit distance to consensus template |
| `ed_prev` | Edit distance to previous monomer |
| `ed_next` | Edit distance to next monomer |
| `period` | Detected repeat period in bp |
| `autocorr` | Autocorrelation value at detected period |
| `n_expected` | Expected count of monomers (array_len / period) |
| `ed_per_bp` | Normalized edit distance (ed / length) |
| `cv` | Coefficient of variation for lengths |
| `cut_sequence` | Anchor sequence used for splitting |
| `orientation` | fwd or rev (reverse complemented) |
| `sequence` | Actual DNA sequence (or - for pred_array/array rows) |
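For downstream filtering, a file with this layout can be read with Python's standard `csv` module. The sketch below is illustrative only: it assumes the file starts with a header row using the column names from the table, that missing numeric values appear as `-`, and it uses a tiny synthetic sample with a subset of the columns in place of a real `.hors.tsv`. The helper names and the threshold applied to `ed_prev` are hypothetical.

```python
import csv
import io

# Synthetic rows for illustration only; real files carry all 16 columns.
SAMPLE = "\n".join("\t".join(r) for r in [
    ("array_id", "type", "idx", "length", "ed_tmpl", "ed_prev"),
    ("demo_array", "pred_array", "3", "1536", "-", "-"),
    ("demo_array", "monomer", "0", "512", "2", "-"),
    ("demo_array", "monomer", "1", "512", "3", "4"),
    ("demo_array", "monomer", "2", "512", "31", "35"),
])


def load_monomers(handle):
    """Read a tab-separated table and keep only rows of type 'monomer'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["type"] == "monomer"]


def as_number(value):
    """Convert a cell to float, or None for placeholders such as '-'."""
    try:
        return float(value)
    except ValueError:
        return None


def divergent(rows, column, threshold):
    """Indices of monomers whose value in `column` exceeds `threshold`."""
    hits = []
    for row in rows:
        value = as_number(row[column])
        if value is not None and value > threshold:
            hits.append(int(row["idx"]))
    return hits


rows = load_monomers(io.StringIO(SAMPLE))
print(divergent(rows, "ed_tmpl", 20))  # divergent from the consensus
print(divergent(rows, "ed_prev", 20))  # sudden change from the previous monomer
```

To run this on real output, replace `io.StringIO(SAMPLE)` with an open file handle for the `.hors.tsv` produced under your output prefix.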
.monomers.tsv contains base-level monomers after recursive HOR decomposition. The format matches .hors.tsv, with an added parent_idx column.
Each HOR is recursively decomposed until:
- No further periodicity detected (autocorrelation ≤ 0.5)
- Minimum length (5bp) reached
Row types (in order):
- `pred_array` - Array-level summary row
- `base_monomer` - Base-level monomers from recursive decomposition
- `monomer` - Non-decomposable monomers (e.g., telomeres)
| Column | Description |
|---|---|
| `array_id` | Array identifier |
| `type` | pred_array, base_monomer, monomer |
| `idx` | Global index within array (0-based) |
| `length` | Sequence length in bp |
| `source` | recursive_anchor, recursive_split, base, recursive_flank |
| `ed_tmpl` | Edit distance to submonomer consensus |
| `ed_prev` | Edit distance to previous base monomer |
| `ed_next` | Edit distance to next base monomer |
| `period` | Detected period at this level (0 if base) |
| `autocorr` | Autocorrelation value |
| `n_expected` | Always 1 for individual monomers |
| `ed_per_bp` | Normalized edit distance |
| `cv` | Coefficient of variation within parent group |
| `cut_sequence` | Inherited anchor sequence |
| `orientation` | Inherited from array (fwd/rev) |
| `parent_idx` | Index of parent HOR from .hors.tsv |
| `sequence` | Actual DNA sequence |
For a typical α-satellite HOR (512 bp → 3 × ~171 bp monomers):
.hors.tsv - 10 HOR monomers (~512bp each):
array_id type idx length period ...
chr1_centromere pred_array 10 5120 512 ...
chr1_centromere monomer 0 512 512 ...
chr1_centromere monomer 1 512 512 ...
...
chr1_centromere array 10 5120 512 ...
chr1_centromere consensus 10 512 512 ... [consensus seq]
.monomers.tsv - 30 base monomers (~171bp each):
array_id type idx length parent_idx ...
chr1_centromere pred_array 30 5120 - ...
chr1_centromere base_monomer 0 171 0 ...
chr1_centromere base_monomer 1 171 0 ...
chr1_centromere base_monomer 2 170 0 ...
chr1_centromere base_monomer 3 171 1 ...
...
.summary.tsv - Single row with both levels:
array_id length hor_period hor_n_monomers mono_period mono_n_monomers ...
chr1_centromere 5120 512 10 171 30 ...
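The `parent_idx` column is what ties the two levels together. As a small sketch, the first four base monomers from the example above can be grouped by parent HOR to check that their lengths add up to the HOR length. The rows are typed in by hand as dictionaries here, and `lengths_by_parent` is a hypothetical helper, not part of ArraySplitter.

```python
from collections import defaultdict

# The four base_monomer rows shown in the .monomers.tsv example.
base_monomers = [
    {"type": "base_monomer", "idx": 0, "length": 171, "parent_idx": 0},
    {"type": "base_monomer", "idx": 1, "length": 171, "parent_idx": 0},
    {"type": "base_monomer", "idx": 2, "length": 170, "parent_idx": 0},
    {"type": "base_monomer", "idx": 3, "length": 171, "parent_idx": 1},
]


def lengths_by_parent(rows):
    """Group base monomer lengths by the HOR they were cut from."""
    groups = defaultdict(list)
    for row in rows:
        if row["type"] == "base_monomer":
            groups[int(row["parent_idx"])].append(int(row["length"]))
    return dict(groups)


groups = lengths_by_parent(base_monomers)
print(groups[0], sum(groups[0]))  # three monomers that together span HOR 0
```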
ArraySplitter employs an autocorrelation-based algorithm for detecting repeat periods and decomposing satellite DNA arrays.
Arrays are oriented to canonical form:
- Primary rule: A > T (more A's than T's)
- Secondary rule: C > G (if A=T)
- Non-canonical arrays are reverse complemented
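The orientation rules above can be sketched in a few lines of Python. This is an illustration of the stated rules, not the tool's actual code: the function names are hypothetical, and keeping the forward strand when both A=T and C=G is an assumption, since the rules do not cover that case.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string; other characters pass through."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_canonical(seq):
    """Return (sequence, orientation) following the A>T, then C>G rule."""
    upper = seq.upper()
    a, t, c, g = (upper.count(base) for base in "ATCG")
    # Assumption: a full tie (A == T and C == G) keeps the forward strand.
    if a > t or (a == t and c >= g):
        return seq, "fwd"
    return reverse_complement(seq), "rev"


print(to_canonical("AAAT"))  # already canonical
print(to_canonical("TTTA"))  # flipped to the A-rich strand
print(to_canonical("ATGG"))  # A == T, so the C/G rule decides
```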
The algorithm computes sequence autocorrelation to detect periodicity:
autocorr(offset) = matches / comparisons
Where matches counts identical nucleotides at positions i and i + offset.
Key innovations:
- Random expectation correction: Subtracts expected random match rate based on nucleotide composition
- Refined period search: Uses FFT-like peak detection to find true period vs harmonics
- Confidence scoring: Autocorrelation excess over random indicates detection confidence
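A minimal Python sketch of the autocorrelation formula and the random-expectation correction follows. The match ratio is taken directly from the formula above. The expected random match rate is assumed here to be the sum of squared base frequencies, which is one common choice; the README does not spell out the exact correction, and the function names are hypothetical. Harmonic handling and peak refinement are not shown.

```python
from collections import Counter


def autocorr(seq, offset):
    """Fraction of positions i where seq[i] == seq[i + offset]."""
    comparisons = len(seq) - offset
    if offset <= 0 or comparisons <= 0:
        raise ValueError("offset must be between 1 and len(seq) - 1")
    matches = sum(1 for i in range(comparisons) if seq[i] == seq[i + offset])
    return matches / comparisons


def expected_random_match(seq):
    """Chance that two random positions match, given base composition.

    Assumption: sum of squared base frequencies.
    """
    total = len(seq)
    return sum((count / total) ** 2 for count in Counter(seq).values())


def autocorr_excess(seq, offset):
    """Autocorrelation minus the composition-based random expectation."""
    return autocorr(seq, offset) - expected_random_match(seq)


toy = "ACGT" * 5  # perfectly periodic toy sequence with period 4
print(autocorr(toy, 4), autocorr(toy, 1), autocorr_excess(toy, 4))
```

For the toy sequence, every base matches at offset 4 and none match at offset 1, so the period stands out clearly even after the correction.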
For the detected period, finds optimal anchor (cut sequence) using:
- K-mer enumeration: Extract all k-mers (k=10 by default) from the sequence
- Position analysis: For each k-mer, record all occurrence positions
- Scoring metrics:
  - Uniqueness: Fraction of occurrences exactly `period` apart
  - Regularity: How evenly spaced the occurrences are
- Combined score: `uniqueness × regularity`
- Deterministic selection: K-mers sorted lexicographically for reproducible tie-breaking
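The anchor search can be sketched roughly as below. The overall flow (enumerate k-mers, record positions, score, break ties lexicographically) follows the list above, but the exact formulas are assumptions: uniqueness is computed over gaps between consecutive occurrences, and regularity is taken as one minus the coefficient of variation of those gaps. `pick_anchor` is a hypothetical name, and the toy example uses k=4 rather than the default k=10.

```python
from collections import defaultdict
from statistics import mean, pstdev


def pick_anchor(seq, period, k=10):
    """Return (score, kmer) for the best-scoring k-mer, or None."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    best = None
    # Sorted iteration plus a strict '>' keeps the lexicographically
    # smallest k-mer when scores tie.
    for kmer in sorted(positions):
        pos = positions[kmer]
        if len(pos) < 2:
            continue
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        # Assumed definition: share of gaps equal to the period.
        uniqueness = sum(g == period for g in gaps) / len(gaps)
        # Assumed definition: 1 - CV of gaps, floored at 0.
        regularity = max(0.0, 1.0 - pstdev(gaps) / mean(gaps))
        score = uniqueness * regularity
        if best is None or score > best[0]:
            best = (score, kmer)
    return best


toy = "ACGTTGCA" * 4  # toy array with period 8
print(pick_anchor(toy, period=8, k=4))
```

In the toy array every 4-mer recurs exactly 8 bp apart, so all candidates score 1.0 and the lexicographic rule alone decides the winner.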
Using the selected anchor:
- Split array at all anchor occurrences
- First fragment → left flank (if < 70% of period)
- Middle fragments → monomers
- Last fragment → right flank (if < 70% of period)
- Apply heuristics for multiplet splitting (doublets, triplets, etc.)
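The basic splitting step can be illustrated as follows. This sketch cuts at every anchor occurrence and applies the 70% flank rule to the first and last fragments; the multiplet-splitting heuristics are not covered. The function name and the `flank_fraction` parameter are hypothetical.

```python
def split_at_anchor(seq, anchor, period, flank_fraction=0.7):
    """Cut `seq` at each anchor occurrence and label the fragments."""
    cuts = []
    pos = seq.find(anchor)
    while pos != -1:
        cuts.append(pos)
        pos = seq.find(anchor, pos + 1)

    # Fragment boundaries: array start, every cut, array end.
    bounds = [0] + [c for c in cuts if c > 0] + [len(seq)]
    fragments = [seq[a:b] for a, b in zip(bounds, bounds[1:])]

    last = len(fragments) - 1
    labelled = []
    for i, frag in enumerate(fragments):
        short = len(frag) < flank_fraction * period
        if i == 0 and short:
            kind = "left_flank"
        elif i == last and short:
            kind = "right_flank"
        else:
            kind = "monomer"
        labelled.append((kind, frag))
    return labelled


toy = "GG" + "ACGTTT" * 3 + "ACG"
for kind, frag in split_at_anchor(toy, anchor="ACG", period=6):
    print(kind, frag)
```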
Results are:
- Sorted by chromosome (natural order: 1, 2, ..., 22, X, Y, M)
- Within chromosome, sorted by start position
- Fully deterministic across runs
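A natural chromosome order like the one described can be obtained with a sort key along these lines. The helper name `chrom_sort_key` is hypothetical, and placing unrecognized contig names after chrM in alphabetical order is an assumption.

```python
# Rank of the non-numeric chromosomes, after all numbered ones.
SPECIAL = {"X": 0, "Y": 1, "M": 2, "MT": 2}


def chrom_sort_key(chrom, start):
    """Sort key: numbered chromosomes, then X, Y, M, then anything else."""
    label = chrom[3:] if chrom.startswith("chr") else chrom
    if label.isdigit():
        rank = (0, int(label), "")
    elif label in SPECIAL:
        rank = (1, SPECIAL[label], "")
    else:
        rank = (2, 0, label)  # assumption: unknown names sort last
    return rank + (start,)


records = [("chr10", 5), ("chr2", 100), ("chrX", 1), ("chr2", 50),
           ("chrM", 0), ("chrY", 3), ("chr1", 9)]
ordered = sorted(records, key=lambda r: chrom_sort_key(*r))
print(ordered)
```

Because the key depends only on the chromosome name and start position, repeated runs over the same records produce the same order.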
The autocorr method uses autocorrelation for period detection. Best for:
- Regular tandem repeats
- Alpha satellite arrays
- HOR (Higher Order Repeat) structures
The classic method uses a frequency suffix tree approach. Better for:
- Irregular or degenerate repeats
- Very short arrays
- Arrays with high mutation rates
The both method tries autocorrelation first and falls back to classic if autocorrelation fails.
arraysplitter --help
Options:
-i, --input <FILE> Input FASTA file
-o, --output <PREFIX> Output prefix
-t, --threads <N> Number of threads [default: all cores]
-c, --cuts <SEQ,SEQ> Predefined cut sequences (comma-separated)
-d, --depth <N> Max depth for cut search [default: 100]
--method <METHOD> Detection method: autocorr, classic, both [default: autocorr]
--max-ed-len <N> Max monomer length for edit distance [default: 10000]
--stats Print detailed statistics
--top-outliers <N> Number of outliers to show [default: 10]
-V, --version Print version
If you use ArraySplitter in your research, please cite: [Publication pending]
For questions or support: [email protected]