Work in progress

Chains and nets

Kent and coll. (2003) computed chained alignments with the AXTCHAIN program.

LAST

See LAST for a detailed bibliography.

Cactus

  • Paten and coll. (March 2011) describe cactus graphs where nodes are sets of adjacencies and edges are aligned blocks of sequences. A genome can be represented as a path in these graphs.

  • Armstrong and coll. (2020) describe progressive cactus, an iterative approach where ancestral genomes are reconstituted using 2-5 pairs of in- and out-group comparisons, and then progressively aligned to each other.

Consumers of multiple genome sequence alignments

PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions.

Lin MF, Jungreis I, Kellis M.

Bioinformatics. 2011 Jul 1;27(13):i275-82. doi: 10.1093/bioinformatics/btr209

Classifies coding and non-coding regions using multiple genome sequence alignments and the fit with separate codon score matrices for coding and non-coding.

Progressive Cactus is a multiple-genome aligner for the thousand-genome era.

Armstrong J, Hickey G, Diekhans M, Fiddes IT, Novak AM, Deran A, Fang Q, Xie D, Feng S, Stiller J, Genereux D, Johnson J, Marinescu VD, Alföldi J, Harris RS, Lindblad-Toh K, Haussler D, Karlsson E, Jarvis ED, Zhang G, Paten B.

Nature. 2020 Nov;587(7833):246-251. doi: 10.1038/s41586-020-2871-y

Aligns with cactus 2~5 genomes, in- and out-group, and reconstitutes an ancestral genome. Recurses the phylogenetic tree progressively. A ‘best-hit-filtering’ step is added to catch duplications that are not seen in the outgroups. Also runs a step ‘removing recoverable chains’ to allow for corrections and mitigate error propagation. Aligning 600 amniotes took ~2 months.
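
The progressive recursion can be pictured as a post-order traversal of the guide tree: each internal node is processed only after its children, so every alignment step works on extant or already reconstructed genomes. The sketch below only records the order of the steps. The nested-tuple tree encoding, the function name `progressive_order` and the `anc(...)` labels are illustrative assumptions, and outgroup selection is left out entirely.

```python
def progressive_order(tree, steps=None):
    """Return (label of the genome at the root of `tree`, list of steps).

    `tree` is either a string (an extant genome) or a tuple of subtrees.
    Each step is (ancestor_label, labels_of_the_genomes_aligned).
    """
    if steps is None:
        steps = []
    if isinstance(tree, str):  # leaf: nothing to reconstruct
        return tree, steps
    # Post-order: process every child before the node itself.
    children = [progressive_order(child, steps)[0] for child in tree]
    ancestor = "anc(" + ",".join(children) + ")"
    steps.append((ancestor, children))
    return ancestor, steps


# Hypothetical three-genome tree.
root, steps = progressive_order((("human", "mouse"), "chicken"))
for ancestor, children in steps:
    print(ancestor, "<-", children)
```

In this toy tree the human/mouse ancestor is reconstructed first and is then aligned with chicken.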

Cactus graphs for genome comparisons.

Paten B, Diekhans M, Earl D, John JS, Ma J, Suh B, Haussler D.

J Comput Biol. 2011 Mar;18(3):469-81. doi: 10.1089/cmb.2010.0252

Alignments are transformed into sequence graphs where nodes are connected by either alignment blocks or adjacencies between blocks, and this graph is progressively transformed into a cactus graph where the nodes are sets of adjacencies connected together without crossing a block.

Muscle5: High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny.

Edgar RC.

Nat Commun. 2022 Nov 15;13(1):6968. doi:10.1038/s41467-022-34630-w

Randomly varies HMM parameters and guide trees across replicates to explore systematic biases, and removes these biases by averaging the replicates. “Unlike typical conservation-based metrics, a column with many gaps or with biochemically dissimilar amino acids will be assigned high [confidence] if it is consistently reproduced.”
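
A minimal sketch of this reproducibility idea: the confidence of a column is taken as the fraction of replicate alignments that contain exactly the same column. It assumes that a column is identified by the position of each residue within its own ungapped sequence, with `None` for a gap. The function names and the toy alignments are illustrative and are not taken from Muscle5.

```python
from typing import List, Optional, Tuple

Column = Tuple[Optional[int], ...]


def columns(alignment: List[str]) -> List[Column]:
    """Encode each column as per-row residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    result = []
    for chars in zip(*alignment):
        column = []
        for row, char in enumerate(chars):
            if char == "-":
                column.append(None)
            else:
                column.append(counters[row])
                counters[row] += 1
        result.append(tuple(column))
    return result


def column_confidence(reference: List[str],
                      replicates: List[List[str]]) -> List[float]:
    """Fraction of replicates that reproduce each reference column."""
    replicate_columns = [set(columns(rep)) for rep in replicates]
    return [
        sum(col in cols for cols in replicate_columns) / len(replicates)
        for col in columns(reference)
    ]


reference = ["AC-GT", "ACGGT"]
replicates = [["AC-GT", "ACGGT"], ["ACG-T", "ACGGT"]]
confidence = column_confidence(reference, replicates)
print(confidence)
```

The gapped third column is scored only by how often it is reproduced, not by its conservation, which is the point made in the quote.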

Parameters for accurate genome alignment.

Frith MC, Hamada M, Horton P.

BMC Bioinformatics. 2010 Feb 9;11:80. doi: 10.1186/1471-2105-11-80

Aligned genomes after reversing (not reverse-complementing) them as negative controls. In these comparisons, all alignments are spurious. A large number of spurious alignments were found, and this could be reduced by masking tandem repeats. Spurious alignments in tandem repeats get abnormally high scores. “Bad” scoring matrices tend to extend alignments with spurious low-quality arms. The X-drop parameter prevents the aligner from extending alignments too far, but high X-drop values can cause small alignments to be discarded by some software because the score becomes negative.
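
The distinction between reversing and reverse-complementing matters here: the reverse complement is simply the other strand of the same molecule, whereas the plain reversal keeps the base composition but is not expected to be homologous to any real sequence. A small sketch, with illustrative function names:

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq: str) -> str:
    """Reverse the sequence without complementing it (negative control)."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


site = "AAGCTT"  # a palindromic restriction site
print(reverse(site))             # TTCGAA
print(reverse_complement(site))  # AAGCTT
```

On the palindromic site the reverse complement gives back the original sequence, while the plain reversal gives a different sequence with the same base counts.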

AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication.

Song B, Marco-Sola S, Moreto M, Johnson L, Buckler ES, Stitzer MC.

Proc Natl Acad Sci U S A. 2022 Jan 4;119(1):e2113075119. doi:10.1073/pnas.2113075119

1) Maps a transcriptome to its reference genome. 2) Extracts “anchor” coding sequences. 3) Searches for homologous sequences in the query genome. 4) Realigns the sequences between homologous anchors. The so-called comparison to LAST is actually a comparison with LAST + AxtChain, that is: it does not use last-split.

Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes.

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D.

Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. doi:10.1073/pnas.1932072100

Primary paper for chains and nets, built with the BLASTZ and AXTCHAIN programs. Chains are one-to-many alignments and allow skipping over local inversions. In human/mouse comparisons, 2.0 inversions per Mbp, median length 814. Double gaps ≥ 100 per Mbp: 398.6, median length 411. Chains are called “short” when their span is <100,000 bases (span distribution of short chains apparently bimodal). 579 “long” chains (average length 983 kb) cover 32.9% of the bases in the human genome. Collectively all chains span 96.3% of the human genome and align to 34.6% of it. The authors note that the observed distribution of gap lengths violates the usual affine model of aligners.

“A chained alignment [is] an ordered sequence of traditional pairwise nucleotide alignments (“blocks”) separated by larger gaps, some of which may be simultaneous gaps in both species. [...] intervening DNA in one species that does not align with the other because it is locally inverted or has been inserted in by lineage-specific translocation or duplication is skipped”

“The chains are then put into a list sorted with the highest-scoring chain first. [...] each iteration taking the next chain off of the list, throwing out the parts of the chain that intersect with bases already covered by previously taken chains, and then marking the bases that are left in the chain as covered. [...] If a chain covers bases that are in a gap in a previously taken chain, it is marked as a child of the previous chain. In this way, a hierarchy of chains is formed that we call a net.”
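
The greedy procedure quoted above can be sketched on toy coordinates. This is a simplified illustration, not the original netting program: the `Chain` class, the use of sets of positions (instead of interval structures) and the rule of picking the most recently taken chain whose gap is hit as the parent are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Chain:
    name: str
    score: int
    blocks: List[Tuple[int, int]]                # half-open reference intervals
    kept: Set[int] = field(default_factory=set)  # bases left after trimming
    parent: Optional[str] = None


def gaps(chain: Chain) -> Set[int]:
    """Positions inside the kept span of a chain that it does not align."""
    return set(range(min(chain.kept), max(chain.kept) + 1)) - chain.kept


def build_net(chains: List[Chain]) -> List[Chain]:
    covered: Set[int] = set()
    taken: List[Chain] = []
    # Highest-scoring chain first.
    for chain in sorted(chains, key=lambda c: c.score, reverse=True):
        bases = {p for start, end in chain.blocks for p in range(start, end)}
        chain.kept = bases - covered  # throw out already covered parts
        if not chain.kept:
            continue                  # nothing left of this chain
        for previous in reversed(taken):
            if chain.kept & gaps(previous):
                chain.parent = previous.name  # it fills a gap: child chain
                break
        covered |= chain.kept         # mark the remaining bases as covered
        taken.append(chain)
    return taken


net = build_net([
    Chain("A", 100, [(0, 10), (30, 40)]),
    Chain("B", 50, [(5, 15), (20, 25)]),
    Chain("C", 10, [(2, 8)]),
    Chain("D", 5, [(50, 60)]),
])
for chain in net:
    print(chain.name, chain.parent, len(chain.kept))
```

In this toy input, chain B is trimmed where it overlaps A and becomes a child of A because its remaining bases fall in A's gap. Chain C is entirely covered and dropped, and D stays at the top level.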

“To be considered syntenic, a chain has to either have a very high score itself or be embedded in a larger chain, on the same chromosome, and come from the same region as the larger chain. Thus, inversions and tandem duplications are considered syntenic.”

“We define the (human) span of a chain to be the distance in bases in the human genome from the first to the last human base in the chain, including gaps, and we define the size of the chain as the number of aligning bases in it, not including gaps.”
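
The two definitions can be made concrete with a short sketch, assuming a chain is given as a list of non-overlapping, half-open (start, end) blocks on the human genome. The function names and the coordinates are made up for the example.

```python
def chain_span(blocks):
    """Distance from the first to the last aligned base, gaps included."""
    return max(end for _, end in blocks) - min(start for start, _ in blocks)


def chain_size(blocks):
    """Number of aligning bases, gaps excluded."""
    return sum(end - start for start, end in blocks)


blocks = [(100, 150), (400, 420), (1000, 1100)]  # hypothetical chain
print(chain_span(blocks), chain_size(blocks))
```

Here the span is 1000 bases while the size is only 170, which shows how a chain can span far more sequence than it aligns.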

Improved search heuristics find 20,000 new alignments between human and mouse genomes.

Frith MC, Noé L.

Nucleic Acids Res. 2014 Apr;42(7):e59. doi:10.1093/nar/gku104

“using more codesigned seed patterns makes the alignment more sensitive but slower. The interesting point, though, is that using more seeds beats increasing the rareness threshold. For example, using four seeds with m = 10 is both faster and more sensitive than one seed with m = 100. The downside is that more seeds require more memory.” “We also tried aligning 10 000 random 1-kb chunks of the melanogaster genome to the pseudoobscura genome. In this case, the 1:1 [transitions:transversions] seeds perform better than the 3:2 seeds, as expected.” “Mammals have a greater excess than Drosophila, presumably because they have more methylcytosine, which mutates rapidly to thymine. Less-similar genomes have a lower excess of transitions: this is as expected because the transitions cannot keep increasing linearly but instead tend to an asymptote.”

Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads.

Mitsuhashi S, Frith MC, Mizuguchi T, Miyatake S, Toyota T, Adachi H, Oma Y, Kino Y, Mitsuhashi H, Matsumoto N.

Genome Biol. 2019 Mar 19;20(1):58. doi:10.1186/s13059-019-1667-6

Primary paper for the tandem-genotypes tool. Be sure to compare both strands to prevent confusion between alleles and sequencer-specific biases.

The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes.

Treangen TJ, Ondov BD, Koren S, Phillippy AM.

Genome Biol. 2014;15(11):524.

Requires > 97% similarity between genomes. Genomes whose Maximal Unique Match index (MUMi) distance is too high are not incorporated in the core genome alignment.

Split-alignment of genomes finds orthologies more accurately.

Frith MC, Kawaguchi R.

Genome Biol. 2015 May 21;16:106. doi:10.1186/s13059-015-0670-9

Optimal set of local alignments. Striking example of intra-chromosomal loss of synteny between D. melanogaster and D. pseudoobscura. Heuristic approach inspired by the “repeated matches algorithm” of Durbin and coll., 1998.

Variation graph toolkit improves read mapping by representing genetic variation in the reference.

Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, Jones W, Garg S, Markello C, Lin MF, Paten B, Durbin R.

Nat Biotechnol. 2018 Oct;36(9):875-879. doi:10.1038/nbt.4227

“Using the vg toolkit, we can construct or import a graph, modify it, visualize it, and use it as a reference.”
