In collaboration with Dr. Casey Bergman at the University of Manchester and Drs. Susan Celniker and Roger Hoskins of the Berkeley Drosophila Genome Project (BDGP) at Lawrence Berkeley National Laboratory, we have sequenced adult males from a subline of the ISO1 (y; cn, bw, sp) strain of D. melanogaster. This is the same stock used in the official BDGP reference assemblies since the first genome sequence release in 2000. The DNA was size-selected for >15 kb elution using the BluePippinTM system (Sage Science), and in total, ~15 Gb of sequence was generated from a 20 kb library using P5-C3 sequencing chemistry on the PacBio® RS II.
Total number of bases: 15,208,567,933 bp
Total number of reads: 1,514,730
Average read length: 10,040 bp
Half of sequenced bases in reads greater than: 14,214 bp
PacBio RS II instrument time for sequencing: 6 days
Number of SMRT® Cells: 42
Some preliminary analyses and step-by-step instructions for downloading, mapping, and visualizing these raw data are described on the Bergman lab blog. This analysis shows that the depth of coverage for this dataset is >90x for reads mapping to autosomes in the D. melanogaster Release 5 reference genome sequence. Dr. Bergman also shows that individual PacBio long reads can uniquely localize repetitive transposable elements up to ~10 kb in size, and can be used to fill at least one of the rare but persistent gaps remaining in the euchromatic portion of the reference genome.
Most assembly algorithms collapse contigs into a single copy of the genome (haploid); however this does not reflect the true underlying state in diploid genomes such as flies or human, which have two copies of each chromosome - inheriting one copy from the mother and another copy from the father. In the case of this inbred subline, with limited allelic variation, both assembly strategies are feasible. We attempted both a traditional haploid assembly as well as a first-ever diploid assembly of the D. melanogaster genome, with preliminary results summarized below. The maximum contig length in both haploid (25 Mb) and diploid (21 Mb) assemblies produced contigs that span almost the entire length of chromosome arm 3L:
The preliminary haploid assembly using the PacBio Corrected Reads (PBcR) pipeline in Celera® Assembler 8.1 was carried out in collaboration with Drs. Sergey Koren and Adam Phillippy at the University of Maryland. They were able to assemble entire chromosome arms de novo into fewer pieces than the current version of the reference genome (Release 5) – an effort that has spanned two decades, cost millions of dollars, and involved laborious BAC design and sequencing as well as manual finishing and optical mapping. This level of completeness in a de novo assembly is unprecedented in a metazoan genome, and required a total of 6 weeks from initial fly collection and sorting to final analysis and assembly. The actual sequencing time was 6 days. The contig count for each chromosome is tabulated below.
The larger number of contigs for chromosome X can be explained because only adult males were sequenced here (with a 50:50 ratio of chrX to chrY), whereas the Release 5 genome is from mixed-sex embryos. Chromosome X and Y assemblies can further be improved using a higher coverage of reads, and this analysis is underway. Additional analysis and data associated with the haploid assembly are provided here by Dr. Sergey Koren and Dr. Adam Phillippy, including 70x of high-quality, pre-assembled reads using the Celera Assembler PBcR pipeline. Dr. Koren will be presenting the full results at the International Plant and Animal Genome XXII meeting on Tuesday, January 14th.
We also attempted a diploid assembly using an early version of FALCON (Fast Alignment and CONsensus), a new assembly algorithm that is currently being developed at PacBio to explore de novo assembly of diploid genomes. The FALCON assembler extends the Hierarchical Genome Assembly Process (HGAP), with faster implementations of the aligner and consensus algorithms for the pre-assembly and overlap steps of the process. A string-graph is used for the layout stage, which preserves structural and phasing information in polymorphic and heterozygous diploid genomes. The algorithm outputs “primary contigs” and their “associated contigs”, capturing alternative local variants associated with a primary contig. Details were presented at the Genome Informatics Conference in Cold Spring Harbor in November 2013 with slides available here. We assessed the results of our preliminary FALCON assembly by aligning the assembled contigs (light green) to the euchromatic arms (dark green) of the D. melanogaster reference genome using nucmer from the Mummer3 package. Alignments to 2L, 2R, 3L, 3R, 4, X and Y are shown below with sequence identity of each alignment block color-coded in blue indicating >99.96% identity and yellow indicating >99.9% identity. The blue arches represent “unused edges” in the FALCON string graph, and can be interpreted as potential connection points if a more aggressive contig joining threshold were desired.
A development version of FALCON is available on github to developers and bioinformaticians interested in this new assembly process.
We are also releasing an updated dataset of the Arabidopsis thaliana genome using the latest P5-C3 chemistry for comparison to our initial data release using P4-C2 chemistry. In total, we have now released four datasets from three important model organisms: S. cerevisiae (yeast), A. thaliana (flowering plant), and D. melanogaster (fruit fly). This data is freely available and we invite the research community to explore the collection:
S. cerevisiae: P4-C2 Chemistry
A. thaliana: P4-C2 Chemistry, P5-C3 Chemistry
D. melanogaster: P5-C3 Chemistry