Genome assembly software
In progress...
Prior to assembly, MinIONQC (Lanfear and coll., 2018) allows the comparison of multiple Nanopore runs on the same plot, to assess whether read length is satisfactory.
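A common single-number summary of a read-length distribution is the N50. The sketch below is only an illustration of that statistic, not MinIONQC's own code; the function name `n50` and the example lengths are made up for this note.

```python
def n50(lengths):
    """Return the N50 of a collection of lengths.

    The N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest read to the shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical read lengths (bp) for illustration.
read_lengths = [2000, 3000, 4000, 5000, 6000]
print(n50(read_lengths))  # prints 5000
```

Comparing this value between runs gives a quick indication of whether one library produced markedly shorter reads than another.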
The Flye assembler (Kolmogorov and coll., 2018) creates an A-Bruijn (assembly) graph from draft contigs using long error-prone reads, untangles the graph by resolving repeats, and then uses it to refine the contigs and increase their accuracy. (The predecessor of Flye, ABruijn, was reported by Istace and coll. (2017) to be able to assemble mitochondrial genomes, unlike Canu and other assemblers.)
The Shasta assembler (Shafin and coll., 2020) is designed for Nanopore data. Reads are run-length encoded before assembly, to mitigate the impact of errors in homopolymer tracts. The assembly runs entirely in memory; it needs terabytes of memory for a human genome, but runs fast as a consequence. Shasta assemblies tend to be more fragmented, but show fewer disagreements with the reference. Shasta also comes with polishing modules similar to Racon and Medaka, but intended to be faster.
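To illustrate why run-length encoding helps with homopolymer errors, here is a minimal sketch. It is not Shasta's implementation; the function names and the example sequences are assumptions made for this note. Two reads that differ only in the length of a homopolymer run collapse to the same base sequence, while the run lengths are kept separately.

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse each homopolymer run to one base and record its length."""
    bases = []
    counts = []
    for base, run in groupby(seq):
        bases.append(base)
        counts.append(sum(1 for _ in run))
    return "".join(bases), counts


def run_length_decode(bases, counts):
    """Rebuild the original sequence from collapsed bases and run lengths."""
    return "".join(base * n for base, n in zip(bases, counts))


# Hypothetical reads: the second one is missing one T in the homopolymer.
true_read = "GATTTTACA"
noisy_read = "GATTTACA"

print(run_length_encode(true_read))   # prints ('GATACA', [1, 1, 4, 1, 1, 1])
print(run_length_encode(noisy_read))  # prints ('GATACA', [1, 1, 3, 1, 1, 1])
```

Because both reads share the collapsed sequence `GATACA`, they can be compared as if identical, and the disagreement is confined to the run-length counts.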
The HiCanu assembler (Nurk, Walenz and coll., 2020) can take advantage of high-accuracy sequences such as those from the PacBio HiFi platform, to assemble multiple variants of the same locus.
Some genome assemblers produce a graph file that can be visualised with tools such as Bandage (Wick and coll., 2015).
After assembly, the contigs can be further polished with Racon (Vaser, Sović, Nagarajan and Šikić, 2017).
When coverage is too low for efficient reference-free assembly, related references can be used as a guide. The Ragout software (Kolmogorov and coll., 2014, Kolmogorov and coll., 2018) can take multiple reference genomes to guide the scaffolding of a target assembly. Polymorphisms unique to the target genome can be recovered, but chromosome fusions are typically hard to detect. Compared to version 1, version 2 infers phylogenetic relationships between the reference genomes automatically. An alternative reference-guided scaffolder, RaGOO (Alonge and coll., 2019), is reported to be faster, but can only take a single reference.
The HaploMerger2 pipeline (Huang and coll., 2017) takes a diploid assembly and outputs a reference and an alternative sub-assembly for each haplotype. However, they are not phased: “If only one allele is available for a locus (often due to haplotype collapsing or the allele is simply discarded by the de novo assembler), HM2 puts this allele into both sub-assemblies. In the sub-assemblies, the allelic scaffolds are given the same scaffold name. Finally, because there are switches between haplotypes in the rebuilt haploid sub-assemblies, the sub-assemblies are not haplotype phased.” Release notes of HM2 version 20180603 suggest using “HapCUT2 or other phasing tools to get the high-quality haplotype assembly based on the reference haploid assembly”.
Purge Haplotigs (Roach, Schmidt and Borneman, 2018) is an alternative to HaploMerger2 that takes read coverage into account when detecting potential haplotigs. However, it does not merge haplotypes.
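The intuition behind using read coverage can be sketched as follows: when both alleles of a region are assembled as separate contigs, reads are split between them, so each contig receives roughly half the depth of regions where the two haplotypes are collapsed. The code below is a toy illustration of that idea only; it is not the Purge Haplotigs algorithm, and the function name, the `tolerance` parameter and the depth values are all hypothetical.

```python
def flag_haplotig_candidates(mean_depth, diploid_depth, tolerance=0.25):
    """Return contigs whose mean depth is close to half the diploid depth.

    mean_depth    -- dict mapping contig name to mean read depth
    diploid_depth -- depth expected where both haplotypes are collapsed
    tolerance     -- accepted relative deviation from the haploid depth
                     (hypothetical parameter, chosen for illustration)
    """
    haploid_depth = diploid_depth / 2
    low = haploid_depth * (1 - tolerance)
    high = haploid_depth * (1 + tolerance)
    return sorted(
        name for name, depth in mean_depth.items() if low <= depth <= high
    )


# Made-up depths: contig_2 and contig_3 sit near half of the 60x diploid depth.
depths = {"contig_1": 61.0, "contig_2": 29.5, "contig_3": 32.0, "contig_4": 58.0}
print(flag_haplotig_candidates(depths, diploid_depth=60))
# prints ['contig_2', 'contig_3']
```

Depth alone only nominates candidates; deciding which contigs are true haplotigs requires further evidence, such as their similarity to another contig.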
purge_dups (Guan and coll., 2020) is another alternative to HaploMerger2. Like Purge Haplotigs, it does not attempt to merge contigs. purge_dups performed well on Flye 2.5 assemblies (Guiglielmoni and coll., 2020).
SALSA (Simple AssembLy ScAffolder, Ghurye and coll., 2017) takes Hi-C data and contigs as input and scaffolds them under the hypothesis that most contact points are due to local (same-chromosome) proximity. Version 2 of SALSA uses unitigs and the assembly graph as input (Ghurye and coll., 2019).
The Hi-C data can also be used to call the location of centromeres (Varoquaux and coll., 2015; Marie-Nelly and coll., 2014, not read).
Alignments between assemblies can be plotted with last-dotplot or, for SVG export and interactive browsing, with D-GENIES (Cabanettes and Klopp, 2018). The CNEr package (Tan, Polychronopoulos and Lenhard, 2019) can be used to search for conserved non-coding elements.
BUSCO (Simão and coll., 2015, Waterhouse and coll., 2017) assesses the presence of evolutionarily conserved single-copy genes in the assemblies. Seppey, Manni and Zdobnov (2019) wrote a good introduction in Methods Mol Biol. BUSCO v5 is based on OrthoDB v10, and supports MetaEuk (default) and AUGUSTUS for eukaryotes (Manni and coll., 2021).
AUGUSTUS can be trained for a new species with transcriptome data, as explained by Hoff and Stanke (2018).
A reference assembly can be used to search for structural variants in a different individual, for instance with NanoSV (Cretu Stancu and coll., 2017) or SyRI (Goel and coll., 2019).
In 2003, Kent and coll. aligned the human and mouse genomes to each other using the BLASTZ and AXTCHAIN software.
Stevens L, Moya ND, Tanny RE, Gibson SB, Tracey A, Na H, Chitrakar R, Dekker J, Walhout AJM, Baugh LR, Andersen EC.
Genome Biol Evol. 2022 Apr 10;14(4):evac042. doi:10.1093/gbe/evac042
Chromosome-Level Reference Genomes for Two Strains of Caenorhabditis briggsae: An Improved Platform for Comparative Genomics.
Rhie and many collaborators.
Nature. 2021 Apr;592(7856):737-746. doi:10.1038/s41586-021-03451-0
Towards complete and error-free genome assemblies of all vertebrate species
Nadège Guiglielmoni, Antoine Houtain, Alessandro Derzelle, Karine van Doninck, Jean-François Flot
BMC Bioinformatics. 2021 Jun 5;22(1):303. doi:10.1186/s12859-021-04118-3
Overcoming uncollapsed haplotypes in long-read assemblies of non-model organisms
Sergey Nurk, Brian P Walenz, Arang Rhie, Mitchell R Vollger, Glennis A Logsdon, Robert Grothe, Karen H Miga, Evan E Eichler, Adam M Phillippy, Sergey Koren
Genome Res. 2020 Sep;30(9):1291-1305. doi:10.1101/gr.263566.120
HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads
Wick RR, Schultz MB, Zobel J, Holt KE.
Bioinformatics. 2015 Oct 15;31(20):3350-2. doi:10.1093/bioinformatics/btv383
Bandage: interactive visualization of de novo genome assemblies.
Interactive visualisation and command-line generation of reports.
Guan D, McCarthy SA, Wood J, Howe K, Wang Y, Durbin R.
Bioinformatics. 2020 May 1;36(9):2896-2898. doi:10.1093/bioinformatics/btaa025
Identifying and removing haplotypic duplication in primary genome assemblies.
Shafin K, Pesout T, Lorig-Roach R, Haukness M, Olsen HE, Bosworth C, Armstrong J, Tigyi K, Maurer N, Koren S, Sedlazeck FJ, Marschall T, Mayes S, Costa V, Zook JM, Liu KJ, Kilburn D, Sorensen M, Munson KM, Vollger MR, Monlong J, Garrison E, Eichler EE, Salama S, Haussler D, Green RE, Akeson M, Phillippy A, Miga KH, Carnevali P, Jain M, Paten B.
Nat Biotechnol. 2020 Sep;38(9):1044-1053. doi:10.1038/s41587-020-0503-6
Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes.
Varoquaux N, Liachko I, Ay F, Burton JN, Shendure J, Dunham MJ, Vert JP, Noble WS.
Nucleic Acids Res. 2015 Jun 23;43(11):5331-9. doi:10.1093/nar/gkv424
Accurate identification of centromere locations in yeast genomes using Hi-C.
Cabanettes F, Klopp C.
PeerJ. 2018 Jun 4;6:e4958. doi:10.7717/peerj.4958
D-GENIES: dot plot large genomes in an interactive, efficient and simple way.
Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, Sedlazeck FJ, Lippman ZB, Schatz MC.
Genome Biol. 2019 Oct 28;20(1):224. doi:10.1186/s13059-019-1829-6
RaGOO: fast and accurate reference-guided scaffolding of draft genomes.
Roach MJ, Schmidt SA, Borneman AR.
BMC Bioinformatics. 2018 Nov 29;19(1):460. doi:10.1186/s12859-018-2485-7
Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies.
Pettersson ME, Rochus CM, Han F, Chen J, Hill J, Wallerman O, Fan G, Hong X, Xu Q, Zhang H, Liu S, Liu X, Haggerty L, Hunt T, Martin FJ, Flicek P, Bunikis I, Folkvord A, Andersson L.
Genome Res. 2019 Nov;29(11):1919-1928. doi:10.1101/gr.253435.119
A chromosome-level assembly of the Atlantic herring genome-detection of a supergene and other signals of selection.
Jay Ghurye, Arang Rhie, Brian P. Walenz, Anthony Schmitt, Siddarth Selvaraj, Mihai Pop, Adam M. Phillippy, Sergey Koren
PLoS Comput Biol. 2019 Aug 21;15(8):e1007273. doi:10.1371/journal.pcbi.1007273
Integrating Hi-C links with assembly graphs for chromosome-scale assembly
Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, Pavel Pevzner
Nat Biotechnol. 2019 May;37(5):540-546. doi:10.1038/s41587-019-0072-8
Assembly of Long Error-Prone Reads Using Repeat Graphs
Vaser R, Sović I, Nagarajan N, Šikić M.
Genome Res. 2017 May;27(5):737-746. doi:10.1101/gr.214270.116
Fast and accurate de novo genome assembly from long uncorrected reads.
Racon can be used to correct contigs or to correct raw reads.
Istace B, Friedrich A, d'Agata L, Faye S, Payen E, Beluche O, Caradec C, Davidas S, Cruaud C, Liti G, Lemainque A, Engelen S, Wincker P, Schacherer J, Aury JM.
Gigascience. 2017 Feb 1;6(2):1-13. doi:10.1093/gigascience/giw018
de novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer.
Ghurye J, Pop M, Koren S, Bickhart D, Chin CS.
BMC Genomics. 2017 Jul 12;18(1):527. doi:10.1186/s12864-017-3879-z
Scaffolding of long read assemblies using long range contact information.
Kolmogorov M, Raney B, Paten B, Pham S.
Bioinformatics. 2014 Jun 15;30(12):i302-9. doi:10.1093/bioinformatics/btu280
Ragout-a reference-assisted assembly tool for bacterial genomes.
Kolmogorov M, Armstrong J, Raney BJ, Streeter I, Dunn M, Yang F, Odom D, Flicek P, Keane TM, Thybert D, Paten B, Pham S.
Genome Res. 2018 Nov;28(11):1720-1732. doi:10.1101/gr.236273.118
Chromosome assembly of large and complex genomes using multiple references.
Lanfear R, Schalamun M, Kainer D, Wang W, Schwessinger B.
Bioinformatics. 2018 Jul 23. doi:10.1093/bioinformatics/bty654
MinIONQC: fast and simple quality control for MinION sequencing data.
Huang S, Kang M, Xu A.
Bioinformatics. 2017 Aug 15;33(16):2577-2579. doi:10.1093/bioinformatics/btx220
HaploMerger2: rebuilding both haploid sub-assemblies from high-heterozygosity diploid genome assembly.
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM.
Bioinformatics. 2015 Oct 1;31(19):3210-2. doi:10.1093/bioinformatics/btv351
BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.
Benchmarking Universal Single-Copy Orthologs. Can be used to train predictors such as Augustus.
Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV, Zdobnov EM.
Mol Biol Evol. 2017 Dec 6. doi:10.1093/molbev/msx319
BUSCO applications from quality assessments to gene prediction and phylogenomics.
Miller DE, Staber C, Zeitlinger J, Hawley RS.
G3 (Bethesda). 2018 Aug 7. pii: g3.200160.2018. doi:10.1534/g3.118.200160
Highly Contiguous Genome Assemblies of 15 Drosophila Species Generated Using Nanopore Sequencing.
Putnam NH, O'Connell BL, Stites JC, Rice BJ, Blanchette M, Calef R, Troll CJ, Fields A, Hartley PD, Sugnet CW, Haussler D, Rokhsar DS, Green RE.
Genome Res. 2016 Mar;26(3):342-50. doi:10.1101/gr.193474.115
Chromosome-scale shotgun assembly using an in vitro method for long-range linkage.
Solares EA, Chakraborty M, Miller DE, Kalsow S, Hall K, Perera AG, Emerson JJ, Hawley RS.
G3 (Bethesda). 2018 Jul 17. pii: g3.200162.2018. doi:10.1534/g3.118.200162
Rapid Low-Cost Assembly of the Drosophila melanogaster Reference Genome Using Low-Coverage, Long-Read Sequencing.
In silico proof of principle aimed at short (35–50 bp) reads.