Data From: Pseudo-De Novo Assembly and Analysis of Unmapped Genome Sequence Reads in Wild Zebrafish Reveals Novel Gene Content
Document Type
Dataset
Publication Date
2015
Subjects
Zebra danio -- Genetics, Zebra danio -- Mitochondrial DNA -- Analysis, Zebra danio -- Development
Abstract
Zebrafish represents the third vertebrate with an officially completed genome, yet the genome remains incomplete: additions and corrections continue, and in the current release, GRCz10, 13% of zebrafish cDNA sequences remain unmapped. This disparity may result from population differences, given that the reference was generated from clonal individuals with limited genetic diversity. This is supported by the recent analysis of a single wild zebrafish, which identified over 5.2 million SNPs and 1.6 million indels relative to the previous genome build, Zv9. Re-examination of this sequence dataset indicated that 13.8% of quality sequence reads failed to align to GRCz10. Using a novel bioinformatics de novo assembly pipeline on these unmappable reads, we identified 1,514,491 novel contigs covering ~224 Mb of genomic sequence. Among these, 1,083 contigs were found to contain potential gene coding sequence. RNA-seq data comparison confirmed that 362 contigs contained transcribed DNA sequence, suggesting that a large amount of functional genomic sequence remains unannotated in zebrafish. Applying the bioinformatics pipeline developed in this study will bolster the zebrafish genome as a model for human disease research. Adaptation of the pipeline described here also offers a cost-efficient and effective method to identify and map novel genetic content across any genome and will ultimately aid in the completion of additional genomes for a broad range of species.
Rights
This work is marked with CC0 1.0 Universal
DOI
10.15760/data.2
Persistent Identifier
http://archives.pdx.edu/ds/psu/16241
Recommended Citation
Faber-Hammond, Joshua J. and Brown, Kim H., "Data From: Pseudo-De Novo Assembly and Analysis of Unmapped Genome Sequence Reads in Wild Zebrafish Reveals Novel Gene Content" (2015). Dataset. https://doi.org/10.15760/data.2
Supplementary Table 1 - Approximate mapping coordinates for all contigs; text file, can be opened in Notepad or Excel
Supplementary Table 2.xlsx (5858 kB)
Supplementary Table 2 - High quality mapping coordinates; Excel file
Supplementary Table 3.xlsx (170 kB)
Supplementary Table 3 - BLASTx sequence identities for 1083 contigs; Excel file
Supplementary Table 4.xlsx (73 kB)
Supplementary Table 4 - BLASTx gene hits for 1083 contigs; Excel file
Supplementary Table 5.xlsx (16 kB)
Supplementary Table 5 - Gene names for 95%+ BLASTx hits; Excel file
Supplementary_Data_1.fasta (95159 kB)
Supplementary Data 1 - FASTA sequences for all contigs; FASTA file
Supplementary_Data_2.tar.gz (1760 kB)
Supplementary Data 2 - sequence chromatograms; must be opened in compatible program such as TraceViewer
Supplementary Table 2.ods (4387 kB)
Preservation copy - Supplementary Table 2
Supplementary Table 3.ods (173 kB)
Preservation copy - Supplementary Table 3
Supplementary Table 4.ods (62 kB)
Preservation copy - Supplementary Table 4
Supplementary Table 5.ods (8 kB)
Preservation copy - Supplementary Table 5
Description
This dataset is associated with a manuscript published in Zebrafish. March 2016, 13(2): 95-102 (http://dx.doi.org/10.1089/zeb.2015.1154)
The supplementary data sets in this record contain Excel tables (.xlsx), text file tables (.txt), sequence files (.fasta) and sequence chromatogram files. Chromatogram files must be viewed in programs such as "Sequence Scanner Software" or "TraceViewer", which are available for free download on the internet.
For preservation purposes, the .xlsx files (Supplementary Tables 2-5) were converted to OpenDocument Spreadsheet (.ods) files. The .ods files are listed above as preservation copies.
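The contig sequences are distributed as a plain FASTA file, so a quick sanity check after download is to count the records and sum their lengths. The following is a minimal sketch, not part of the original dataset: the function name `summarize_fasta` is hypothetical, and the path assumes that Supplementary_Data_1.fasta has been saved to the current working directory.

```python
from pathlib import Path


def summarize_fasta(path):
    """Return (number of records, total bases) for a FASTA file.

    Header lines start with '>'; every other non-empty line is
    treated as sequence and its length is added to the total.
    """
    n_records = 0
    total_bases = 0
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                n_records += 1
            else:
                total_bases += len(line)
    return n_records, total_bases


if __name__ == "__main__":
    # Assumed local download location; adjust as needed.
    fasta_path = Path("Supplementary_Data_1.fasta")
    if fasta_path.exists():
        contigs, bases = summarize_fasta(fasta_path)
        print(f"{contigs} contigs, {bases} bases")
    else:
        print(f"{fasta_path} not found; download it from the dataset record first")
```

The file is read line by line rather than loaded whole, which keeps memory use low for a file of this size. The resulting counts can be compared with the contig total reported in the abstract.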