Title

Data From: Anchored Pseudo-De Novo Assembly of Human Genomes Identifies Extensive Sequence Variation from Unmapped Sequence Reads

Document Type

Dataset

Publication Date

2015

Subjects

Human genome

Abstract

The Human Genome Reference (HGR) completion marked the genomics era beginning, yet despite its utility universal application is limited by the small number of individuals used in its development. This is highlighted by the presence of high quality sequence reads failing to map within the HGR. Sequences failing to map generally represent 2-5% of total reads, which may harbor regions that would enhance our understanding of population variation, evolution, and disease. Alternatively, complete de novo assemblies can be created, but these effectively ignore the groundwork of the HGR. In an effort to find a middle ground we developed a bioinformatic pipeline that maps paired-end reads to the HGR as separate single reads, exports unmappable reads, de novo assembles these reads per individual, then combines assemblies into a secondary reference assembly used for comparative analysis. Using 45 diverse 1000 Genomes Project individuals, we identified 351,361 contigs covering 195.5 Mb of sequence unincorporated in GRCh38. 30,879 contigs are represented in multiple individuals with ~40% showing high sequence complexity. Genomic coordinates were generated for 99.9%, with 52.5% exhibiting high quality mapping scores. Comparative genomic analyses with archaic humans and primates revealed significant sequence alignments and comparisons with model organism RefSeq gene datasets identified novel human genes. If incorporated, these sequences will expand the HGR, but more importantly, our data highlights that with this method low coverage (~10-20X) next generation sequencing can still be used to identify novel unmapped sequences to explore biological functions contributing to human phenotypic variation, disease and functionality for personal genomic medicine.

Description

The data supports a manuscript published in Human Genetics titled "Anchored Pseudo-De Novo Assembly of Human Genomes Identifies Extensive Sequence Variation From Unmapped Sequence Reads" (2016). https://doi.org/10.1007/s00439-016-1667-5

The supplementary data sets in this file contain Excel tables (.xlsx), text file tables (.txt), sequence files (.fasta), and a compressed text file (.gz).
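
As a quick sanity check after downloading the sequence files, a reader may want to count the contigs and total bases in a .fasta file. The following is a minimal sketch, not part of the published pipeline; the function name `summarize_fasta` and the example records are hypothetical, and it assumes standard FASTA layout (header lines starting with ">" followed by one or more sequence lines).

```python
from typing import Iterable, Tuple


def summarize_fasta(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of records, total sequence length) for FASTA lines."""
    n_records = 0
    total_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_records += 1  # each header line starts a new record
        else:
            total_bases += len(line)  # sequence may wrap over several lines
    return n_records, total_bases


# Hypothetical in-memory example; real headers in the data set may differ.
example = [">contig_1", "ACGTAC", "GT", ">contig_2", "NNNA"]
print(summarize_fasta(example))  # (2, 12)
```

Because an open file handle is itself an iterable of lines, the same function could be applied to a downloaded file by passing the handle returned by `open(...)` on the local path of the .fasta file.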

DOI

10.15760/data.1

Persistent Identifier

http://archives.pdx.edu/ds/psu/16928

SupplementalData1.fasta (63156 kB)
Supplemental Data 1. Fasta file containing secondary assembly contigs not present within GRCh38

SupplementalData2.fasta (143715 kB)
Supplemental Data 2. Fasta file containing contigs from 1000 Genomes individual primary assemblies that are not represented in the secondary assembly and not present within GRCh38

SupplementalData3.xlsx (7000 kB)
Supplemental Data 3. Predicted presence/absence data for secondary assembly contigs

SupplementalData4.gz (57839 kB)
Supplemental Data 4. List of predicted loci for secondary assembly contigs based on mapping one-end anchored read pairs in both the genome and assembly

SupplementalData5.txt (9377 kB)
Supplemental Data 5. List of high quality predicted loci for secondary assembly contigs based on mapping one-end anchored read pairs in both the genome and assembly

SupplementalData6.xlsx (737 kB)
Supplemental Data 6. Representation of secondary assembly contigs in related primate species and archaic hominids

SupplementalData7.xlsx (104 kB)
Supplemental Data 7. Annotation results for secondary assembly contigs
