bibliography in progress...

  • Whole-genome alignments with reversed sequences as negative controls showed that e-value filtering is not enough to remove spurious alignments of tandem repeats, which therefore need to be masked (Frith MC, Hamada M and Horton P, 2010).

  • lastdb can use various seeding schemes to build its index. Frith and Noé (2014) discuss some of them. The RY seeds are made of non-overlapping words using the two-letter alphabet R = A|G, Y = C|T, to increase speed with a good tradeoff in sensitivity (Frith MC, Noé L, Kucherov G, 2020).

  • last-postmask (Frith, 2011): discards alignments that contain a significant amount of lower-case-masked sequences.

  • last-split (Frith and Kawaguchi, 2015): heuristic algorithm inspired by the “repeated matches algorithm” of Durbin and coll. (1998). It searches for an optimal set of local alignments (as opposed to a set of optimal local alignments). Its output is also used by the third-party tool NanoSV (Cretu and coll., 2017).

  • last-train (Hamada, Ono, Asai and Frith, 2017): estimation of alignment parameters.

  • local-rearrangements (Frith and Khan, 2018): detection and display of rearrangements supported by multiple long reads and by the ancestrality of the reference sequence.

  • tandem-genotypes (Mitsuhashi and coll., 2019): detection of expansion of tandem repeats, after alignment with last-split.

  • LAST can align DNA sequences to protein databases using a 64 x 21 substitution matrix (Yao and Frith, 2020).

  • JRA (Joint Read Alignment) uses LAST (Shrestha and coll., 2018).

  • A tutorial for the use of dnarrange is published in Frith and Mitsuhashi, 2022.
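
The two-letter RY alphabet mentioned in the seeding bullet above can be illustrated with a short Python sketch. The helper name `to_ry` is hypothetical and is not part of LAST; it only shows how purines and pyrimidines collapse to R and Y, so that two sequences differing only by transitions become identical.

```python
# Hypothetical helper, not part of LAST: reduce DNA to the two-letter RY alphabet.
# Upper/lower case is preserved so that soft-masking survives the conversion.
RY_TABLE = str.maketrans("AGCTagct", "RRYYrryy")


def to_ry(seq: str) -> str:
    """Map purines (A, G) to R and pyrimidines (C, T) to Y; leave other letters unchanged."""
    return seq.translate(RY_TABLE)


print(to_ry("GATTACA"))  # RRYYRYR
# ACGT and GTAC differ by a transition at every position, yet reduce to the same word.
print(to_ry("ACGT") == to_ry("GTAC"))  # True
```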

Parameters for accurate genome alignment.

Frith MC, Hamada M, Horton P.

BMC Bioinformatics. 2010 Feb 9;11:80. doi: 10.1186/1471-2105-11-80

Genomes were aligned after reversing (not reverse-complementing) them, as negative controls. In these comparisons, all alignments are spurious. A large number of spurious alignments were found, and their number could be reduced by masking tandem repeats. Spurious alignments in tandem repeats get abnormally high scores. “Bad” scoring matrices tend to extend alignments with spurious low-quality arms. The X-drop parameter prevents the aligner from extending alignments too far, but high X-drop values can cause small alignments to be discarded by some software because the score becomes negative.
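
The difference between reversing and reverse-complementing, which the negative control relies on, can be sketched in Python as follows. The function names are illustrative only and do not come from the paper or from LAST. Reversal keeps the base composition of the sequence but does not produce the opposite strand.

```python
# Illustrative helpers (assumed names): contrast plain reversal with reverse-complementing.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq: str) -> str:
    """Reverse the sequence without complementing it (the negative-control transformation)."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


seq = "AACGT"
print(reverse(seq))             # TGCAA
print(reverse_complement(seq))  # ACGTT
```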

Improved DNA-versus-Protein Homology Search for Protein Fossils

Yin Yao, Martin C. Frith

In: Martín-Vide C., Vega-Rodríguez M.A., Wheeler T. (eds) Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science, vol 12715. Springer, Cham. DOI:10.1007/978-3-030-74432-8_11

Uses a 64 x 21 substitution matrix and automatically learns the genetic code. Detected fossils of the polinton and DIRS/Ngaro repeat elements in the human genome. 10 times faster than blastx.
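
To make the 64 x 21 shape concrete, here is a minimal Python sketch. It assumes that the 64 rows are the DNA codons and that the 21 columns are the 20 standard amino acids plus a stop symbol `*`; that column layout is an assumption of this note, not something stated above. The zero scores are placeholders, not values from the paper.

```python
from itertools import product

# All 4^3 = 64 DNA codons.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

# Assumption: 20 standard amino acids plus "*" for stop, giving 21 columns.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["*"]

# Placeholder matrix: every codon-versus-residue score is set to 0.
matrix = {codon: {aa: 0 for aa in AMINO_ACIDS} for codon in CODONS}

print(len(matrix), len(matrix["ATG"]))  # 64 21
```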

Improved search heuristics find 20,000 new alignments between human and mouse genomes.

Frith MC, Noé L.

Nucleic Acids Res. 2014 Apr;42(7):e59. doi:10.1093/nar/gku104

“using more codesigned seed patterns makes the alignment more sensitive but slower. The interesting point, though, is that using more seeds beats increasing the rareness threshold. For example, using four seeds with m = 10 is both faster and more sensitive than one seed with m = 100. The downside is that more seeds require more memory.” “We also tried aligning 10 000 random 1-kb chunks of the melanogaster genome to the pseudoobscura genome. In this case, the 1:1 [transitions:transversions] seeds perform better than the 3:2 seeds, as expected.” “Mammals have a greater excess than Drosophila, presumably because they have more methylcytosine, which mutates rapidly to thymine. Less-similar genomes have a lower excess of transitions: this is as expected because the transitions cannot keep increasing linearly but instead tend to an asymptote.”
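
The transition and transversion categories discussed in these quotes can be counted in a pairwise alignment with a short Python sketch. The function names and the toy alignment are hypothetical and are not taken from the paper; columns containing gaps or ambiguous letters are simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a: str, b: str) -> str:
    """Classify one aligned pair of bases as match, transition or transversion."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"  # A<->G or C<->T
    return "transversion"    # purine <-> pyrimidine


def count_substitutions(x: str, y: str) -> dict:
    """Count column types over two aligned rows of equal length, skipping non-ACGT columns."""
    counts = {"match": 0, "transition": 0, "transversion": 0}
    for a, b in zip(x, y):
        if a in "ACGTacgt" and b in "ACGTacgt":
            counts[classify(a, b)] += 1
    return counts


# Toy alignment: one transition (A/G), one transversion (G/T), one gap column.
print(count_substitutions("ACGT-A", "GCTTAA"))
# {'match': 3, 'transition': 1, 'transversion': 1}
```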

Mapping and phasing of structural variation in patient genomes using nanopore sequencing.

Cretu Stancu M, van Roosmalen MJ, Renkens I, Nieboer MM, Middelkamp S, de Ligt J, Pregno G, Giachino D, Mandrile G, Espejo Valle-Inclan J, Korzelius J, de Bruijn E, Cuppen E, Talkowski ME, Marschall T, de Ridder J, Kloosterman WP.

Nat Commun. 2017 Nov 6;8(1):1326. doi:10.1038/s41467-017-01343-4

Primary paper for NanoSV. Fed with last-split alignments.

Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads.

Mitsuhashi S, Frith MC, Mizuguchi T, Miyatake S, Toyota T, Adachi H, Oma Y, Kino Y, Mitsuhashi H, Matsumoto N.

Genome Biol. 2019 Mar 19;20(1):58. doi:10.1186/s13059-019-1667-6

Primary paper for the tandem-genotypes tool. Be sure to compare both strands to prevent confusion between alleles and sequencer-specific biases.

Split-alignment of genomes finds orthologies more accurately.

Frith MC, Kawaguchi R.

Genome Biol. 2015 May 21;16:106. doi:10.1186/s13059-015-0670-9

Optimal set of local alignments. Striking example of intra-chromosomal loss of synteny between D. melanogaster and D. pseudoobscura. Heuristic approach inspired by the “repeated matches algorithm” of Durbin and coll., 1998.