Genome assembly software

In progress...

Prior to assembly, MinIONQC (Lanfear and coll., 2018) can plot multiple Nanopore runs on the same figure, to assess whether read length is satisfactory.

The Flye assembler (Kolmogorov and coll., 2019) builds an A-Bruijn (assembly) graph from draft contigs made of long error-prone reads, untangles the graph by resolving repeats, and then uses it to refine the contigs and increase their accuracy. (The predecessor of Flye, ABruijn, was reported by Istace and coll. (2017) to be able to assemble mitochondrial genomes, unlike Canu and other assemblers.)

The Shasta assembler (Shafin and coll., 2020) is designed for Nanopore data. Reads are run-length encoded before assembly, to mitigate the impact of errors in homopolymer tracts. The assembly runs entirely in memory: it needs terabytes of RAM for a human genome, but as a consequence it runs fast. Shasta assemblies tend to be more fragmented, but disagree less with the reference. Shasta also comes with polishing modules that are similar to Racon and Medaka, but reported to be faster.
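
As a rough illustration of run-length encoding (a minimal sketch, not Shasta's actual implementation; the function name and the example reads are made up), a read can be split into a sequence of non-repeated bases plus a list of run lengths. Two reads that differ only by a homopolymer length error then share the same collapsed sequence:

```python
from itertools import groupby


def run_length_encode(read):
    """Split a read into its collapsed bases and the length of each run."""
    bases = []
    counts = []
    for base, run in groupby(read):
        bases.append(base)
        counts.append(len(list(run)))
    return "".join(bases), counts


# Two hypothetical reads that differ only in the length of the A homopolymer.
bases_1, counts_1 = run_length_encode("GATTTACAAAAG")
bases_2, counts_2 = run_length_encode("GATTTACAAAG")

print(bases_1, counts_1)
print(bases_2, counts_2)
print("Same collapsed sequence:", bases_1 == bases_2)
```

The run lengths are kept alongside the collapsed bases, so the original sequence can still be restored after assembly.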

The HiCanu assembler (Nurk, Walenz and coll., 2020) can take advantage of high-accuracy sequences such as the ones of the PacBio HiFi platform, to assemble multiple variants of the same locus.

Some genome assemblers produce a graph file that can be visualised with tools such as Bandage (Wick and coll., 2015).

After assembly, the contigs can be further polished with Racon (Vaser, Sović, Nagarajan and Šikić, 2017).

When coverage is too low for efficient reference-free assembly, related references can be used as a guide. The Ragout software (Kolmogorov and coll., 2014; Kolmogorov and coll., 2018) can take multiple reference genomes to guide the scaffolding of a target assembly. Polymorphisms unique to the target genome can be recovered, but chromosome fusions are typically hard to detect. Unlike version 1, version 2 infers the phylogenetic relationships between the reference genomes automatically. An alternative reference-guided scaffolder, RaGOO (Alonge and coll., 2019), is reported to be faster, but can only take a single reference.

The HaploMerger2 pipeline (Huang and coll., 2017) takes a diploid assembly and outputs a reference and an alternative sub-assembly for each haplotype. However, they are not phased: “If only one allele is available for a locus (often due to haplotype collapsing or the allele is simply discarded by the de novo assembler), HM2 puts this allele into both sub-assemblies. In the sub-assemblies, the allelic scaffolds are given the same scaffold name. Finally, because there are switches between haplotypes in the rebuilt haploid sub-assemblies, the sub-assemblies are not haplotype phased.” The release notes of HM2 version 20180603 suggest using “HapCUT2 or other phasing tools to get the high-quality haplotype assembly based on the reference haploid assembly”.

Purge Haplotigs (Roach, Schmidt and Borneman, 2018) is an alternative to HaploMerger2 that takes read coverage into account when detecting potential haplotigs. However, it does not merge haplotypes.

purge_dups (Guan and coll., 2020) is another alternative to HaploMerger2. Like Purge Haplotigs, it does not attempt to merge contigs. purge_dups performed well on Flye 2.5 assemblies (Guiglielmoni and coll., 2021).

SALSA (Simple AssembLy ScAffolder, Ghurye and coll., 2017) takes Hi-C data and contigs as input and scaffolds them under the hypothesis that most contact points are due to local (same-chromosome) proximity. Version 2 of SALSA uses unitigs and the assembly graph as input (Ghurye and coll., 2019).

Hi-C data can also be used to call the location of centromeres (Varoquaux and coll., 2015; Marie-Nelly and coll., 2014 (not read)).

Assemblies can be compared as dot plots with last-dotplot or, for SVG export and interactive browsing, with D-GENIES (Cabanettes and Klopp, 2018). The CNEr package (Tan, Polychronopoulos and Lenhard, 2019) can be used to search for conserved non-coding elements.

BUSCO (Simão and coll., 2015; Waterhouse and coll., 2017) assesses the presence of evolutionarily conserved single-copy genes in the assemblies. Seppey, Manni and Zdobnov (2019) wrote a good introduction in Methods Mol Biol. BUSCO v5 is based on OrthoDB v10, and supports MetaEuk (default) and AUGUSTUS for eukaryotes (Manni and coll., 2021).
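
To make the score concrete, here is a minimal sketch of how completeness percentages follow from the number of genes in each category. The function name and the counts are invented for illustration, and this is not BUSCO's own code:

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Turn per-category gene counts into percentages of the searched set."""
    total = single + duplicated + fragmented + missing

    def pct(count):
        return round(100 * count / total, 1)

    return {
        "complete": pct(single + duplicated),  # complete = single-copy + duplicated
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        "n": total,
    }


# Made-up counts for a set of 1,000 searched orthologs.
summary = busco_percentages(single=940, duplicated=20, fragmented=15, missing=25)
print(summary)
```

A high duplicated fraction in an otherwise complete assembly can hint at uncollapsed haplotypes, which is where tools such as Purge Haplotigs or purge_dups come in.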

AUGUSTUS can be trained for a new species with transcriptome data, as explained by Hoff and Stanke (2018).

A reference assembly can be used to search for structural variants in a different individual, for instance with NanoSV (Cretu Stancu and coll., 2017) or SyRI (Goel and coll., 2019).

In 2003, Kent and coll. aligned the human and mouse genomes to each other using the BLASTZ and AXTCHAIN software.

Chromosome-Level Reference Genomes for Two Strains of Caenorhabditis briggsae: An Improved Platform for Comparative Genomics.

Stevens L, Moya ND, Tanny RE, Gibson SB, Tracey A, Na H, Chitrakar R, Dekker J, Walhout AJM, Baugh LR, Andersen EC.

Genome Biol Evol. 2022 Apr 10;14(4):evac042. doi:10.1093/gbe/evac042

“the genomes of C. elegans and C. briggsae are more highly rearranged than their outcrossing sister species, C. inopinata and C. nigoni (17.1% of neighboring genes are rearranged in the selfers compared with 15.0% in the outcrossers)”

Overcoming uncollapsed haplotypes in long-read assemblies of non-model organisms

Nadège Guiglielmoni, Antoine Houtain, Alessandro Derzelle, Karine van Doninck, Jean-François Flot

BMC Bioinformatics. 2021 Jun 5;22(1):303. doi:10.1186/s12859-021-04118-3

Benchmark using a bdelloid rotifer. The performance of most software plateaus above 50× depth. purge_dups performed well on Flye assemblies. Filtering out shorter reads did not dramatically change the N50 of Flye 2.5 assemblies.

HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads

Sergey Nurk, Brian P Walenz, Arang Rhie, Mitchell R Vollger, Glennis A Logsdon, Robert Grothe, Karen H Miga, Evan E Eichler, Adam M Phillippy, Sergey Koren

Genome Res. 2020 Sep;30(9):1291-1305. doi:10.1101/gr.263566.120

“HiCanu modifies the input reads by compressing every homopolymer to a single nucleotide.” “Outputs contigs as ‘pseudo-haplotypes’ that preserve local allelic phasing but may switch between haplotypes.”

Identifying and removing haplotypic duplication in primary genome assemblies.

Guan D, McCarthy SA, Wood J, Howe K, Wang Y, Durbin R.

Bioinformatics. 2020 May 1;36(9):2896-2898. doi: 10.1093/bioinformatics/btaa025.

Used by the Vertebrate Genomes Project assembly pipeline. Remaps the reads onto the assembly to evaluate the heterozygosity of regions where the genome maps to itself, and removes these regions where necessary.

Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes.

Shafin K, Pesout T, Lorig-Roach R, Haukness M, Olsen HE, Bosworth C, Armstrong J, Tigyi K, Maurer N, Koren S, Sedlazeck FJ, Marschall T, Mayes S, Costa V, Zook JM, Liu KJ, Kilburn D, Sorensen M, Munson KM, Vollger MR, Monlong J, Garrison E, Eichler EE, Salama S, Haussler D, Green RE, Akeson M, Phillippy A, Miga KH, Carnevali P, Jain M, Paten B.

Nat Biotechnol. 2020 Sep;38(9):1044-1053. doi:10.1038/s41587-020-0503-6

Runs in memory (no disk I/O) and requires terabytes of RAM for a human genome. Designed for Nanopore data. Reads are run-length encoded before assembly. Assemblies are more fragmented, but have fewer disagreements with the reference. The estimated running cost is lower than for competitors.

Accurate identification of centromere locations in yeast genomes using Hi-C.

Varoquaux N, Liachko I, Ay F, Burton JN, Shendure J, Dunham MJ, Vert JP, Noble WS.

Nucleic Acids Res. 2015 Jun 23;43(11):5331-9. doi:10.1093/nar/gkv424.

Tested on yeast and Plasmodium falciparum. First, detects regions enriched in Hi-C contacts. Then, prioritises local maxima enriched in trans-contacts.

RaGOO: fast and accurate reference-guided scaffolding of draft genomes.

Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, Sedlazeck FJ, Lippman ZB, Schatz MC.

Genome Biol. 2019 Oct 28;20(1):224. doi:10.1186/s13059-019-1829-6

Aligns contigs to a reference genome with minimap2, and resolves structural variants with a modified version of Assemblytics that uses minimap2.
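
The general idea of ordering contigs along a single reference can be sketched as follows. This is a deliberately simplified illustration, not RaGOO's actual algorithm: the function name, the tuple layout and the alignment values are all hypothetical, and a real tool would parse minimap2 output and handle many more cases. Each contig is assigned to the reference chromosome that covers most of its aligned bases, then placed and oriented according to its longest alignment there:

```python
from collections import defaultdict


def order_contigs(alignments):
    """Group contigs per reference chromosome and sort them by position.

    alignments: iterable of (contig, ref_chrom, ref_start, aligned_bases, strand).
    Returns {ref_chrom: [(contig, strand), ...]} in reference order.
    """
    # Total aligned bases of each contig on each reference chromosome.
    covered = defaultdict(lambda: defaultdict(int))
    for contig, chrom, _start, length, _strand in alignments:
        covered[contig][chrom] += length

    # Each contig goes to the chromosome where it has the most aligned bases.
    assigned = {c: max(per_chrom, key=per_chrom.get) for c, per_chrom in covered.items()}

    # Keep the longest alignment on the assigned chromosome as the anchor.
    anchor = {}
    for contig, chrom, start, length, strand in alignments:
        if chrom != assigned[contig]:
            continue
        if contig not in anchor or length > anchor[contig][1]:
            anchor[contig] = (start, length, strand)

    # Sort contigs along each chromosome by the start of their anchor.
    scaffolds = defaultdict(list)
    for contig, (start, _length, strand) in anchor.items():
        scaffolds[assigned[contig]].append((start, contig, strand))
    return {chrom: [(c, s) for _, c, s in sorted(items)] for chrom, items in scaffolds.items()}


# Made-up alignments: ctg1 has a small spurious hit on chr2.
example = [
    ("ctg1", "chr1", 500_000, 80_000, "+"),
    ("ctg1", "chr2", 10_000, 5_000, "+"),
    ("ctg2", "chr1", 20_000, 60_000, "-"),
    ("ctg3", "chr2", 100_000, 40_000, "+"),
]
layout = order_contigs(example)
print(layout)
```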

Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies.

Roach MJ, Schmidt SA, Borneman AR.

BMC Bioinformatics. 2018 Nov 29;19(1):460. doi:10.1186/s12859-018-2485-7

Aligns the reads to the draft genome in order to estimate coverage. A bimodal distribution is expected: the 0.5× peak represents areas where both alleles are present in the assembly.
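
The use of the two coverage peaks can be illustrated with a toy classifier. It is only a sketch under strong assumptions: it uses a single mean depth per contig, whereas the real tool works from the read-depth distribution, and the cutoffs, depths and category names below are invented for the example:

```python
def classify_contig(mean_depth, low, mid, high):
    """Classify a contig from its mean read depth and three depth cutoffs.

    low and high bound the usable depth range; mid separates the
    half-depth (0.5x) peak from the full-depth (1x) peak.
    """
    if mean_depth < low or mean_depth > high:
        return "outlier"
    if mean_depth < mid:
        # Half depth: the other allele is probably assembled elsewhere.
        return "candidate haplotig"
    # Full depth: both alleles are collapsed into this contig.
    return "collapsed"


# Hypothetical library with peaks around 30x (0.5x) and 60x (1x).
depths = {"ctg1": 58.2, "ctg2": 31.5, "ctg3": 3.1}
classes = {name: classify_contig(d, low=10, mid=45, high=120) for name, d in depths.items()}
print(classes)
```

Candidates flagged this way would still need to be aligned against the rest of the assembly before being reassigned, since a half-depth contig can equally be the primary copy of its region.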

A chromosome-level assembly of the Atlantic herring genome-detection of a supergene and other signals of selection.

Pettersson ME, Rochus CM, Han F, Chen J, Hill J, Wallerman O, Fan G, Hong X, Xu Q, Zhang H, Liu S, Liu X, Haggerty L, Hunt T, Martin FJ, Flicek P, Bunikis I, Folkvord A, Andersson L.

Genome Res. 2019 Nov;29(11):1919-1928. doi:10.1101/gr.253435.119

Linkage analysis of 45,000 markers from 2 crosses with ~50 offspring each confirmed that there are 26 linkage groups, and suggests ~1 recombination per chromosome pair at meiosis. Recombination rate is lower towards centromeres. This is in line with the known fact that 3 chromosomes are metacentric and the others are acrocentric. When comparing with other fish species, genes tend to stay on the same chromosomes, but move within them (like birds and invertebrates, but unlike mammals). A 7.8-Mb region on chr12 with an unusual linkage disequilibrium pattern was shown to be an inversion between southern and northern individuals. It may act as a supergene. Genetic exchange between the two haplotypes is reduced by the inversion.

Assembly of Long Error-Prone Reads Using Repeat Graphs

Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, Pavel Pevzner

Nat Biotechnol. 2019 May;37(5):540-546. doi:10.1038/s41587-019-0072-8

“Flye constructs (overlapping) contigs with possible assembly errors at the initial stage, combines them into an accurate assembly graph, resolves repeats in the assembly graph using small variations between various repeat instances that were left unresolved during the initial assembly stage, constructs a new, less tangled assembly graph based on resolved repeats, and finally outputs accurate contigs as paths in this graph.”

de novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer.

Istace B, Friedrich A, d'Agata L, Faye S, Payen E, Beluche O, Caradec C, Davidas S, Cruaud C, Liti G, Lemainque A, Engelen S, Wincker P, Schacherer J, Aury JM.

Gigascience. 2017 Feb 1;6(2):1-13. doi:10.1093/gigascience/giw018

Assembled mitochondrial genomes with ABruijn.

Chromosome assembly of large and complex genomes using multiple references.

Kolmogorov M, Armstrong J, Raney BJ, Streeter I, Dunn M, Yang F, Odom D, Flicek P, Keane TM, Thybert D, Paten B, Pham S.

Genome Res. 2018 Nov;28(11):1720-1732. doi:10.1101/gr.236273.118

Ragout 2 can take multiple reference genomes as input and automatically infers the phylogenetic relationships between them. Polymorphisms unique to the target genome can be recovered, but chromosome fusions are typically hard to detect.

Highly Contiguous Genome Assemblies of 15 Drosophila Species Generated Using Nanopore Sequencing.

Miller DE, Staber C, Zeitlinger J, Hawley RS.

G3 (Bethesda). 2018 Aug 7. pii: g3.200160.2018. doi:10.1534/g3.118.200160

29× coverage and an N50 of 4.4 Mb on average. A multiplexed NextSeq 500 run was used for polishing. Nanopore throughput was optimised by periodically reorganising pore groups, extracting high-molecular-weight DNA with phenol/chloroform, and using more DNA in the library preparation. Benchmark of various tools, including minimap/miniasm and Canu.
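
For reference, the N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch, with made-up contig lengths and a hypothetical function name:

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


# Made-up assembly of five contigs, 17 Mb in total.
contigs = [8_000_000, 4_400_000, 3_000_000, 1_000_000, 600_000]
print(n50(contigs))
```

Because it depends only on the length distribution, N50 says nothing about correctness: wrongly joined contigs raise it, which is why it is usually reported together with measures such as BUSCO completeness.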

Chromosome-scale shotgun assembly using an in vitro method for long-range linkage.

Putnam NH, O'Connell BL, Stites JC, Rice BJ, Blanchette M, Calef R, Troll CJ, Fields A, Hartley PD, Sugnet CW, Haussler D, Rokhsar DS, Green RE.

Genome Res. 2016 Mar;26(3):342-50. doi:10.1101/gr.193474.115

“Chicago”: in vitro reconstitution of chromatin, followed by Hi-C. Sold as a kit by Dovetail Genomics.

Rapid Low-Cost Assembly of the Drosophila melanogaster Reference Genome Using Low-Coverage, Long-Read Sequencing.

Solares EA, Chakraborty M, Miller DE, Kalsow S, Hall K, Perera AG, Emerson JJ, Hawley RS.

G3 (Bethesda). 2018 Jul 17. pii: g3.200162.2018. doi:10.1534/g3.118.200162

Hybrid de novo assembly (Nanopore / Illumina / optical maps) of the Drosophila genome reaches a high (>98%) BUSCO score, typical of high-quality mainstream reference assemblies. (BUSCO stands for “Benchmarking Universal Single-Copy Orthologs”.)