To access the ENCODE III PCR resequencing data, please visit the BCM-HGSC public ftp site at ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Encode
This is draft release 1.0 for the third phase of the HapMap project (HapMap III). Data consist of:
(i) SNP genotype data generated from HapMap Phase III samples (including the original HapMap samples used in Phase I and II of the International HapMap Project). In total, this release contains genotype data from 1115 individuals from 11 populations, collected using two platforms: the Illumina Human1M (WTSI) and the Affymetrix SNP 6.0 (BI). Data from the two platforms have been merged for this release.
(ii) DNA variation data developed from PCR resequencing (BCM-HGSC) of ten 100 kb regions from 712 individuals from the HapMap Phase III sample collection. Five of the ten 100 kb regions are within the ENCODE I regions; the other five are randomly selected new regions.
HapMap 3 Release 3
POP        Num_samples  Num_SNPs_QC  Num_SNPs_QC_poly
-------------------------------------------------------------------
ASW                 87      1623986           1543115
CEU                165      1623122           1397814
CHB                137      1626122           1341772
CHD                109      1620198           1311767
GIH                101      1630857           1408904
JPT                113      1634041           1294406
LWK                110      1625159           1526783
MEX                 86      1604948           1453054
MKK                184      1611733           1532002
TSI                102      1632607           1419970
YRI                203      1625669           1493761
Consensus         1397      1481135           1457897

99.3% platform concordance
99.7% call rate
1198 founders and 199 non-founders
683 males, 714 females
23238 monomorphic SNPs removed from consensus
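As a quick consistency check on the table above, the per-population sample counts should add up to the consensus sample count, and the polymorphic consensus SNP count should equal the QC-passed consensus count minus the monomorphic SNPs removed. The short sketch below only re-does that arithmetic with the numbers quoted in the table; the variable names are illustrative and not part of the release.

```python
# Per-population sample counts copied from the table above.
samples = {
    "ASW": 87, "CEU": 165, "CHB": 137, "CHD": 109, "GIH": 101, "JPT": 113,
    "LWK": 110, "MEX": 86, "MKK": 184, "TSI": 102, "YRI": 203,
}

# Consensus row and summary figures, also copied from the table.
consensus_samples = 1397
consensus_snps_qc = 1481135
consensus_snps_poly = 1457897
monomorphic_removed = 23238

total_samples = sum(samples.values())
poly_from_subtraction = consensus_snps_qc - monomorphic_removed

# Each comparison prints True when the quoted figures agree with each other.
print(total_samples == consensus_samples)
print(poly_from_subtraction == consensus_snps_poly)
print(1198 + 199 == consensus_samples)  # founders + non-founders
print(683 + 714 == consensus_samples)   # males + females
```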
(i). Population samples for genotyping
For each data set, the number of individuals with HapMap 3 genotypes in this release (of the total number of individuals) is shown, followed by the number of SNPs included in this release (after QC):
Consensus (polymorphic) data set of this release (35023 monomorphic SNPs removed): 1115 (of 1261) individuals; 1 490 422 SNPs
(ii). Population Samples for PCR Resequencing
For each population the number of individuals for whom sequence was generated is shown:
ASW      55
CEU     119
CHB      90
CHD      30
GIH      60
JPT      91
LWK      60
MEX      27
MKK       0
TSI      60
YRI     120
Total   712
(i). GENOTYPING: Genotyping concordance between the two platforms was 0.9931 (computed over 249889 overlapping SNPs).
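A concordance figure of this kind can be illustrated with a minimal sketch. It assumes that concordance is the fraction of identical genotype calls among sample/SNP pairs where both platforms produced a call; the release notes do not spell out the exact definition, so this is an assumption. The function names and the example calls are hypothetical.

```python
from typing import Optional, Sequence

Genotype = Optional[str]  # e.g. "AG"; None means a missing call


def normalize(call: Genotype) -> Genotype:
    """Sort the two alleles so that "GA" and "AG" compare equal."""
    return None if call is None else "".join(sorted(call))


def concordance(calls_a: Sequence[Genotype], calls_b: Sequence[Genotype]) -> float:
    """Fraction of identical calls among pairs called on both platforms.

    Assumption: pairs with a missing call on either platform are skipped.
    """
    compared = 0
    matching = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue
        compared += 1
        if normalize(a) == normalize(b):
            matching += 1
    if compared == 0:
        raise ValueError("no genotype pairs with calls on both platforms")
    return matching / compared


# Hypothetical calls for five sample/SNP pairs on each platform.
illumina = ["AG", "AA", None, "CT", "GG"]
affymetrix = ["GA", "AA", "CC", "CC", None]
print(concordance(illumina, affymetrix))
```

In this toy input three pairs are called on both platforms and two of them agree.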
Data from the two platforms were merged using PLINK (--merge-mode 1), keeping a genotype call only when the non-missing calls from the two platforms agree (that is, the merged genotype is set to missing if the two platforms give different, non-missing calls).
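The consensus rule described above can be sketched per genotype as follows. This is an illustration of the rule, not PLINK's code. The release notes state only the discordant case, so carrying over a call that is present on just one platform is an assumption here. The function name is hypothetical.

```python
from typing import Optional

Genotype = Optional[str]  # e.g. "AG"; None means a missing call


def merge_consensus(call_a: Genotype, call_b: Genotype) -> Genotype:
    """Merge one genotype from two platforms using a consensus rule."""
    # Sort alleles so that allele order does not create false discordance.
    a = None if call_a is None else "".join(sorted(call_a))
    b = None if call_b is None else "".join(sorted(call_b))
    if a is None:
        return b  # assumption: keep the single available call (or None)
    if b is None:
        return a  # assumption: keep the single available call
    # Both platforms called: keep on agreement, set to missing otherwise.
    return a if a == b else None


print(merge_consensus("AG", "GA"))  # same genotype, different allele order
print(merge_consensus("AA", None))  # called on one platform only
print(merge_consensus("AA", "AG"))  # discordant, non-missing calls
```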
Quality control at the individual level was performed separately for each platform. Only individuals whose genotype data passed QC on both platforms were kept in this release. The following criteria were used to keep SNPs in the data sets of this release:
Hardy-Weinberg p > 0.000001 (per population)
missingness < 0.05 (per population)
< 3 Mendel errors (per population; only applies to YRI, CEU, ASW, MEX, MKK)
SNP must have an rsID and map to a unique genomic location

The "consensus" data set contains data for all individuals (558 males, 557 females; 924 founders and 191 non-founders), only keeping SNPs that passed QC in all populations (overall call rate is 0.998). The "consensus/polymorphic" data set additionally has the 35023 monomorphic SNPs removed.

In all genotype files, alleles are expressed as being on the (+/fwd) strand of NCBI build 36.
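The SNP-level criteria above can be sketched as a pair of small predicates. This is a minimal illustration with the thresholds taken from the text; the class, the function names, the input layout and the example rsID are hypothetical, and the per-population statistics are assumed to have been computed beforehand.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Populations to which the Mendel error criterion applies, per the text.
MENDEL_POPULATIONS = {"YRI", "CEU", "ASW", "MEX", "MKK"}


@dataclass
class SnpStats:
    """Per-population summary statistics for one SNP (hypothetical layout)."""
    hwe_p: float
    missingness: float
    mendel_errors: int


def passes_population_qc(pop: str, stats: SnpStats) -> bool:
    """Apply the per-population thresholds listed above."""
    if stats.hwe_p <= 1e-6:
        return False
    if stats.missingness >= 0.05:
        return False
    if pop in MENDEL_POPULATIONS and stats.mendel_errors >= 3:
        return False
    return True


def in_consensus(rsid: Optional[str], unique_location: bool,
                 per_pop: Dict[str, SnpStats]) -> bool:
    """Keep a SNP in the consensus set only if it passes QC in every population."""
    if not rsid or not unique_location:
        return False
    return all(passes_population_qc(pop, s) for pop, s in per_pop.items())


# Hypothetical SNP: the JPT Mendel error count is ignored because the
# Mendel criterion does not apply to JPT.
example = {
    "CEU": SnpStats(hwe_p=0.4, missingness=0.01, mendel_errors=0),
    "JPT": SnpStats(hwe_p=0.2, missingness=0.02, mendel_errors=5),
}
print(in_consensus("rs123", True, example))
```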
(ii). PCR RESEQUENCING
The sequence based variant calls were generated by tiling with PCR primer sets spaced approximately 800 bases apart across the following regions:
Following filtration of low quality reads, the data were analyzed with SNP Detector version 3 for polymorphic site discovery and individual genotype calling. Various QC filters were then applied. Specifically, we filtered out PCR amplicons with too many SNPs, and SNPs with discordant allele calls in multiple amplicons. We also filtered out SNPs with low completeness in samples, or with too many conflicting genotype calls on the two strands.
In the "QC+" data set, we applied the HapMap QC parameters: we filtered out samples with low completeness, and filtered out SNPs with a low call rate in each population (<80%) and SNPs not in HWE (P<0.001). In the QC+ data set, the overall false positive rate is ~3.2%, based on a limited number of validation assays.
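To make the SNP-level QC+ thresholds concrete, here is a minimal sketch for one SNP in one population. It uses a one-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium purely for illustration. The release notes do not say which HWE test was used, so the choice of test is an assumption, as are the function names and the example counts.

```python
import math


def hwe_chi2_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 d.f.) p-value for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no called genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no departure can be measured
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc_plus(n_aa: int, n_ab: int, n_bb: int, n_samples: int,
                   min_call_rate: float = 0.80,
                   min_hwe_p: float = 0.001) -> bool:
    """Keep a SNP unless its call rate is <80% or its HWE p-value is <0.001."""
    called = n_aa + n_ab + n_bb
    if called == 0 or called / n_samples < min_call_rate:
        return False
    return hwe_chi2_p(n_aa, n_ab, n_bb) >= min_hwe_p


# Hypothetical genotype counts in a population of 110 samples.
print(passes_qc_plus(25, 50, 25, 110))  # well called, in equilibrium
print(passes_qc_plus(50, 0, 50, 110))   # no heterozygotes at all
print(passes_qc_plus(10, 20, 10, 60))   # call rate below 80%
```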
A. GENOTYPING
Illumina SNPs that are A/T or C/G are missing from this release because of strandedness issues. Illumina mitochondrial SNPs are also missing, as they do not have rsIDs. A few Illumina SNPs in this release may still be on the (-/rev) strand of NCBI build 36, but they are not A/T or C/G SNPs, so they are easy to identify downstream.
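The reason A/T and C/G SNPs are problematic, and why other reverse-strand SNPs are easy to spot, can be shown with a short sketch. It assumes the forward-strand alleles of each SNP are available from a reference annotation. The function names and the status labels are illustrative only.

```python
from typing import Sequence, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1: str, a2: str) -> bool:
    """A/T and C/G SNPs read the same on both strands."""
    return COMPLEMENT[a1] == a2


def flip(alleles: Sequence[str]) -> Tuple[str, ...]:
    """Express alleles on the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def strand_status(observed: Tuple[str, str], reference: Tuple[str, str]) -> str:
    """Classify a biallelic SNP against its forward-strand reference alleles."""
    if is_strand_ambiguous(*observed):
        return "ambiguous"  # strand cannot be inferred from the alleles
    if set(observed) == set(reference):
        return "forward"
    if set(flip(observed)) == set(reference):
        return "reverse"    # flipping the alleles recovers the reference
    return "mismatch"


print(strand_status(("A", "G"), ("A", "G")))
print(strand_status(("T", "C"), ("A", "G")))
print(strand_status(("A", "T"), ("A", "T")))
```

A SNP reported as "reverse" can be put on the forward strand with `flip`, while an "ambiguous" SNP cannot be resolved from its alleles alone.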
B. PCR RESEQUENCING
Not all variant calls have been validated: we estimate that there is currently a false positive rate of ~12% among all calls, with a slightly higher rate (~14%) if considering just the singletons. Additional validation is ongoing. PCR sequencing of additional samples (Masai) is also ongoing.
The data can be downloaded from the WTSI ftp site.
To access ENCODE III PCR resequencing data:
Please visit the BCM-HGSC public ftp site (ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Encode).
coriell-lookup.xls - list of 712 unrelated samples sequenced [60 KB]
bcm-encode3-submission.txt.gz - genotypes of 10,076 SNP sites by 712 samples [626 KB]
bcm-encode3-QC.txt - QC+ genotypes of 6,223 SNP sites by 692 samples [8700 KB]
Below are the analysis plans that the consortium is pursuing:
The release of pre-publication data from large resource-generating scientific projects was the subject of a meeting held in January 2003, the "Fort Lauderdale" meeting. The report from that meeting can be viewed here.
The recommendations of the Fort Lauderdale meeting address the roles and responsibilities of data producers, data users, and funders of "community resource projects", with the aim of establishing and maintaining an appropriate balance between the interests of data users in rapid access to data and the needs of data producers to receive recognition for their work. The conclusion of the attendees at the meeting was that responsible use of the data is necessary to ensure that first-rate data producers will continue to participate in such projects and produce and quickly release valuable large-scale data sets. "Responsible use" was defined as allowing the data producers to have the opportunity to publish the initial global analyses of the data, as articulated at the outset of the project. Doing so also will ensure that the data generated are fully described.
Last Modified Fri Sep 4 23:17:58 2009