Data From: Pseudo-De Novo Assembly and Analysis of Unmapped Genome Sequence Reads in Wild Zebrafish Reveals Novel Gene Content


Zebra danio -- Genetics, Zebra danio -- Mitochondrial DNA -- Analysis, Zebra danio -- Development


Zebrafish is the third vertebrate with an officially completed genome, yet the assembly remains incomplete: additions and corrections continue, and in the current release, GRCz10, 13% of zebrafish cDNA sequences are still unmapped. This disparity may result from population differences, given that the reference was generated from clonal individuals with limited genetic diversity. This is supported by the recent analysis of a single wild zebrafish, which identified over 5.2 million SNPs and 1.6 million indels relative to the previous genome build, Zv9. Re-examination of this sequence dataset indicated that 13.8% of quality sequence reads failed to align to GRCz10. Using a novel bioinformatics de novo assembly pipeline on these unmappable reads, we identified 1,514,491 novel contigs covering ~224 Mb of genomic sequence. Among these, 1,083 contigs were found to contain potential gene-coding sequence. Comparison with RNA-seq data confirmed that 362 contigs contained transcribed DNA sequence, suggesting that a large amount of functional genomic sequence remains unannotated in zebrafish. The bioinformatics pipeline developed in this study will help bolster the zebrafish genome as a model for human disease research. Adaptation of the pipeline described here also offers a cost-efficient and effective method to identify and map novel genetic content across any genome and will ultimately aid in the completion of additional genomes for a broad range of species.


This dataset is associated with a manuscript published in Zebrafish, March 2016, 13(2): 95-102 (http://dx.doi.org/10.1089/zeb.2015.1154).

The supplementary data sets in this file contain Excel tables (.xlsx), text file tables (.txt), sequence files (.fasta), and sequence chromatogram files. Chromatogram files must be viewed in programs such as "Sequence Scanner Software" or "TraceViewer", which are available for free download on the internet.

For preservation purposes, the .xlsx files (Supplementary Tables 2-5) were converted to OpenDocument Spreadsheet (.ods) files. These files are available and marked accordingly.




Supplementary Table 1.txt (41514 kB)
Supplementary Table 1 - Approximate mapping coordinates for all contigs; text file, can be opened in Notepad or Excel

Supplementary Table 2.xlsx (5858 kB)
Supplementary Table 2 - High-quality mapping coordinates; Excel file

Supplementary Table 3.xlsx (170 kB)
Supplementary Table 3 - BLASTx sequence identities for 1,083 contigs; Excel file

Supplementary Table 4.xlsx (73 kB)
Supplementary Table 4 - BLASTx gene hits for 1,083 contigs; Excel file

Supplementary Table 5.xlsx (16 kB)
Supplementary Table 5 - Gene names for 95%+ BLASTx hits; Excel file

Supplementary_Data_1.fasta (95159 kB)
Supplementary Data 1 - FASTA sequences for all contigs; FASTA file

Supplementary_Data_2.tar.gz (1760 kB)
Supplementary Data 2 - sequence chromatograms; must be opened in a compatible program such as TraceViewer

Supplementary Table 2.ods (4387 kB)
Preservation copy - Supplementary Table 2

Supplementary Table 3.ods (173 kB)
Preservation copy - Supplementary Table 3

Supplementary Table 4.ods (62 kB)
Preservation copy - Supplementary Table 4

Supplementary Table 5.ods (8 kB)
Preservation copy - Supplementary Table 5
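
As a quick sanity check after download, the contig count and total sequence length in Supplementary_Data_1.fasta can be tallied with a short script. The following is a minimal sketch in Python, assuming the file uses standard FASTA layout (a header line beginning with ">" followed by one or more sequence lines); the function names read_fasta and summarize are illustrative and are not part of the dataset or the published pipeline.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle: TextIO) -> Dict[str, int]:
    """Count records, total bases, and the longest sequence in a FASTA stream."""
    count = 0
    total = 0
    longest = 0
    for _, seq in read_fasta(handle):
        count += 1
        total += len(seq)
        longest = max(longest, len(seq))
    return {"contigs": count, "total_bases": total, "longest": longest}


# Small in-memory example with made-up records (not taken from the dataset).
demo = io.StringIO(">contig_1\nACGT\nAC\n>contig_2\nGGG\n")
demo_stats = summarize(demo)

# To run on the downloaded file (path assumed to be the working directory):
# with open("Supplementary_Data_1.fasta") as fh:
#     print(summarize(fh))
```

The reader streams the file line by line rather than loading it whole, which keeps memory use low for a file of this size (about 95 MB). The resulting counts can be compared against the figures reported in the abstract.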