HapMap 3
HapMap 3 release 3

SNP GENOTYPE DATA

  • hapmap3_r3_b36_fwd.qc.poly.tar.gz - tarball of QC+ polymorphic genotype data per population, formatted as PLINK PED and MAP files [1.4 GB]
  • hapmap3_r3_b36_fwd.consensus.qc.poly.ped.gz - PED file of QC+ polymorphic genotype data (consensus) [1.2 GB]
  • hapmap3_r3_b36_fwd.consensus.qc.poly.map.gz - MAP file of QC+ polymorphic genotype data (consensus) [11 MB]
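
The PED and MAP files listed above follow the standard PLINK text layout: each MAP line has four whitespace-separated columns (chromosome, SNP identifier, genetic distance, base-pair position), and each PED line has six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two allele columns per SNP. The sketch below shows one way to read them in Python. The class and function names are illustrative, not part of any HapMap or PLINK tool, and the example lines in the usage note are invented.

```python
import gzip
from typing import List, NamedTuple, Tuple


class MapRecord(NamedTuple):
    chrom: str
    snp_id: str
    genetic_distance: float
    position: int


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str
    phenotype: str
    genotypes: List[Tuple[str, str]]  # one (allele1, allele2) pair per SNP


def parse_map_line(line: str) -> MapRecord:
    """Parse one four-column PLINK MAP line."""
    chrom, snp_id, distance, position = line.split()
    return MapRecord(chrom, snp_id, float(distance), int(position))


def parse_ped_line(line: str) -> PedRecord:
    """Parse one PLINK PED line into its six header fields and allele pairs."""
    fields = line.split()
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("PED line has an odd number of allele columns")
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(*fields[:6], genotypes)


def read_map(path: str) -> List[MapRecord]:
    """Read a gzip-compressed MAP file (the release 3 files are .gz)."""
    with gzip.open(path, "rt") as handle:
        return [parse_map_line(line) for line in handle if line.strip()]
```

For example, `parse_ped_line("FAM1 IND1 0 0 1 -9 A A G T")` yields two genotype pairs, `("A", "A")` and `("G", "T")`. The release 2 files are bzip2-compressed, so `bz2.open` would replace `gzip.open` there. The consensus PED file is large, so it is better to stream it line by line than to load it whole.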

HapMap 3 release 2

A. SNP GENOTYPE DATA

  • hapmap3_r2_b36_fwd.qc.poly.tar.bz2 - tarball of QC+ polymorphic genotype data per population, formatted as PLINK PED and MAP files [889 MB]
  • hapmap3_r2_b36_fwd.consensus.qc.poly.ped.bz2 - PED file of QC+ polymorphic genotype data (consensus) [757 MB]
  • hapmap3_r2_b36_fwd.consensus.qc.poly.map.bz2 - MAP file of QC+ polymorphic genotype data (consensus) [11 MB]
  • relationships_w_pops_051208.txt - family (pedigree) relationships and population labels for 1,301 HapMap 3 samples [37 KB]
  • hapmap_270_samples.txt - list of the 270 samples used in Phase I and II of the International HapMap Project [2 KB]

B. PCR RESEQUENCING DATA

To access the ENCODE III PCR resequencing data, please visit the BCM-HGSC public ftp site at ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Encode


HapMap 3 draft release

This is draft release 1.0 for the third phase of the HapMap project (HapMap III). Data consist of:

(i) SNP genotype data generated from HapMap Phase III samples (including the original HapMap samples used in Phase I and II of the International HapMap Project). In total, this release contains genotype data from 1115 individuals from 11 populations, collected using two platforms: the Illumina Human1M (WTSI) and the Affymetrix SNP 6.0 (BI). Data from the two platforms have been merged for this release.

(ii) DNA variation data developed from PCR resequencing (BCM-HGSC) of ten 100 kb regions from 712 individuals from the HapMap Phase III sample collection. Five of the ten 100 kb regions are within the ENCODE I regions; the other five are randomly selected new regions.


Data Production Institutions
  • Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC)
  • Broad Institute (BI)
  • Wellcome Trust Sanger Institute (WTSI)

Funding Agencies
  • National Institutes of Health – National Human Genome Research Institute (NHGRI)
  • Wellcome Trust

Populations and Abbreviations
  • ASW African ancestry in Southwest USA
  • CEU Utah residents with Northern and Western European ancestry from the CEPH collection
  • CHB Han Chinese in Beijing, China
  • CHD Chinese in Metropolitan Denver, Colorado
  • GIH Gujarati Indians in Houston, Texas
  • JPT Japanese in Tokyo, Japan
  • LWK Luhya in Webuye, Kenya
  • MEX Mexican ancestry in Los Angeles, California
  • MKK Maasai in Kinyawa, Kenya
  • TSI Toscani in Italia
  • YRI Yoruba in Ibadan, Nigeria

Data content of this release

HapMap 3 Release 3

POP             Num_samples     Num_SNPs_QC     Num_SNPs_QC_poly
-------------------------------------------------------------------
ASW             87              1623986         1543115
CEU             165             1623122         1397814
CHB             137             1626122         1341772
CHD             109             1620198         1311767
GIH             101             1630857         1408904
JPT             113             1634041         1294406
LWK             110             1625159         1526783
MEX             86              1604948         1453054
MKK             184             1611733         1532002
TSI             102             1632607         1419970
YRI             203             1625669         1493761

Consensus       1397            1481135         1457897

99.3% platform concordance
99.7% call rate

1198 founders and 199 non-founders
683 males, 714 females
23238 monomorphic SNPs removed from consensus
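
The Release 3 figures above are internally consistent, and a few lines of Python are enough to confirm that after copying the numbers by hand. This is only a sanity-check sketch; the variable names are illustrative.

```python
# Per-population sample counts copied from the HapMap 3 Release 3 table.
samples = {
    "ASW": 87, "CEU": 165, "CHB": 137, "CHD": 109, "GIH": 101, "JPT": 113,
    "LWK": 110, "MEX": 86, "MKK": 184, "TSI": 102, "YRI": 203,
}
consensus_samples = 1397
consensus_snps_qc = 1481135
consensus_snps_qc_poly = 1457897
monomorphic_removed = 23238

checks = {
    "population counts sum to consensus": sum(samples.values()) == consensus_samples,
    "founders + non-founders": 1198 + 199 == consensus_samples,
    "males + females": 683 + 714 == consensus_samples,
    "QC SNPs minus monomorphic SNPs": (
        consensus_snps_qc - monomorphic_removed == consensus_snps_qc_poly
    ),
}
for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'MISMATCH'}")
```

All four checks pass with the numbers as listed.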

HapMap 3 draft release

(i). Population samples for genotyping

For each population: number of individuals with HapMap 3 genotypes in this release, total number of individuals, and number of SNPs included in this release (after QC).

POP             Num_genotyped   Num_total       Num_SNPs_QC
-------------------------------------------------------------------
ASW             71              90              1632186
CEU             162             180             1634020
CHB             82              92              1637672
CHD             70              90              1619203
GIH             83              90              1631060
JPT             82              89              1637610
LWK             83              90              1631688
MEX             71              90              1614892
MKK             171             180             1621427
TSI             77              90              1629957
YRI             163             180             1634666

Total           1115            1261            1525445

Consensus (polymorphic) data set of this release (35023 monomorphic SNPs removed): 1115 (of 1261) individuals, 1490422 SNPs

(ii). Population Samples for PCR Resequencing

For each population the number of individuals for whom sequence was generated is shown:

ASW    55
CEU   119
CHB    90
CHD    30
GIH    60
JPT    91
LWK    60
MEX    27
MKK     0
TSI    60
YRI   120
Total 712

Data Production and Quality control for this release

(i). GENOTYPING: Genotyping concordance between the two platforms was 0.9931 (computed over 249889 overlapping SNPs).

Data from the two platforms were merged using PLINK (--merge-mode 1), which keeps a genotype call only when the non-missing calls agree; the merged genotype is set to missing if the two platforms give different, non-missing calls.
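
The consensus rule and the concordance figure described above can be illustrated with a short Python sketch. It is not PLINK's implementation. It assumes genotypes are unordered allele pairs with `None` for a missing call, that a call present on only one platform is kept, and that concordance is the fraction of matching calls among SNP/sample pairs called on both platforms. The release notes do not spell out these details, and all names are illustrative.

```python
from typing import Optional, Sequence, Tuple

Genotype = Optional[Tuple[str, str]]  # None represents a missing call


def same_genotype(a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """Compare two calls while ignoring allele order (A/G equals G/A)."""
    return sorted(a) == sorted(b)


def consensus_call(a: Genotype, b: Genotype) -> Genotype:
    """Merge two platform calls: disagreeing non-missing calls become missing."""
    if a is None:
        return b
    if b is None:
        return a
    return a if same_genotype(a, b) else None


def concordance(calls_a: Sequence[Genotype], calls_b: Sequence[Genotype]) -> float:
    """Fraction of matching calls among positions called on both platforms."""
    compared = 0
    matched = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # skip positions missing on either platform
        compared += 1
        if same_genotype(a, b):
            matched += 1
    if compared == 0:
        raise ValueError("no positions were called on both platforms")
    return matched / compared
```

With this rule, `consensus_call(("A", "A"), ("A", "G"))` returns `None`, while `consensus_call(None, ("C", "C"))` keeps the single available call.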

Quality control at the individual level was performed separately for each platform. Only individuals whose genotype data passed QC on both platforms were kept in this release. The following criteria were used to keep SNPs in the data sets of this release:

  • Hardy-Weinberg p > 0.000001 (per population)
  • missingness < 0.05 (per population)
  • < 3 Mendel errors (per population; only applies to YRI, CEU, ASW, MEX, MKK)
  • SNP must have an rsID and map to a unique genomic location

The "consensus" data set contains data for all individuals (558 males, 557 females; 924 founders and 191 non-founders), only keeping SNPs that passed QC in all populations (overall call rate is 0.998). The "consensus/polymorphic" data set additionally has monomorphic SNPs removed.

In all genotype files, alleles are expressed as being on the (+/fwd) strand of NCBI build 36.
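
The SNP-level criteria above can be written as a simple predicate. The sketch below assumes the per-population statistics (Hardy-Weinberg p-value, missingness, Mendel error count) have already been computed by some other tool. The `SnpStats` record and the function names are hypothetical and do not come from the HapMap pipeline.

```python
from dataclasses import dataclass
from typing import Iterable

# Populations to which the Mendel error criterion applies, per the release notes.
MENDEL_CHECK_POPULATIONS = frozenset({"YRI", "CEU", "ASW", "MEX", "MKK"})


@dataclass(frozen=True)
class SnpStats:
    """Precomputed statistics for one SNP in one population (hypothetical record)."""
    population: str
    hwe_p: float
    missingness: float
    mendel_errors: int
    has_rsid: bool
    unique_location: bool


def passes_qc(stats: SnpStats) -> bool:
    """Apply the per-population SNP criteria listed in the release notes."""
    if not (stats.has_rsid and stats.unique_location):
        return False
    if stats.hwe_p <= 0.000001:
        return False
    if stats.missingness >= 0.05:
        return False
    if stats.population in MENDEL_CHECK_POPULATIONS and stats.mendel_errors >= 3:
        return False
    return True


def passes_consensus(per_population: Iterable[SnpStats]) -> bool:
    """A SNP enters the consensus data set only if it passes QC in every population."""
    return all(passes_qc(stats) for stats in per_population)
```

Note that a SNP with three Mendel errors would fail in CEU but not in CHB, because the Mendel criterion is not applied to CHB.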

(ii). PCR RESEQUENCING

The sequence based variant calls were generated by tiling with PCR primer sets spaced approximately 800 bases apart across the following regions:

Region          Chromosome      Coordinates                     Status
-------------------------------------------------------------------------
ENm010          7               27,124,046-27,224,045           ENCODE I
ENr321          8               119,082,221-119,182,220         ENCODE I
ENr232          9               130,925,123-131,025,122         ENCODE I
ENr123          12              38,826,477-38,926,476           ENCODE I
ENr213          18              23,919,232-24,019,231           ENCODE I
ENr331          2               220,185,590-220,285,589         New
ENr221          5               56,071,007-56,171,006           New
ENr233          15              41,720,089-41,820,088           New
ENr313          16              61,033,950-61,133,949           New
ENr133          21              39,444,467-39,544,466           New
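
As a quick check on the region table, the sketch below parses each coordinate range and confirms that every region spans exactly 100 kb, assuming the coordinates are 1-based and inclusive. It also gives a rough amplicon count per region for a primer spacing of about 800 bases. That count is a back-of-the-envelope estimate, not a figure from the release notes, and the function names are illustrative.

```python
import math

# (chromosome, "start-end") pairs read from the region table above.
REGIONS = [
    ("7", "27,124,046-27,224,045"),
    ("8", "119,082,221-119,182,220"),
    ("9", "130,925,123-131,025,122"),
    ("12", "38,826,477-38,926,476"),
    ("18", "23,919,232-24,019,231"),
    ("2", "220,185,590-220,285,589"),
    ("5", "56,071,007-56,171,006"),
    ("15", "41,720,089-41,820,088"),
    ("16", "61,033,950-61,133,949"),
    ("21", "39,444,467-39,544,466"),
]


def region_length(coords: str) -> int:
    """Length in bases of a 'start-end' range, treating both ends as inclusive."""
    start_text, end_text = coords.split("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    return end - start + 1


def approx_amplicons(length_bp: int, spacing_bp: int = 800) -> int:
    """Rough number of primer sets needed to tile a region at the given spacing."""
    return math.ceil(length_bp / spacing_bp)


for chrom, coords in REGIONS:
    length = region_length(coords)
    print(f"chr{chrom}:{coords} -> {length} bp, ~{approx_amplicons(length)} amplicons")
```

Each of the ten regions comes out at 100,000 bases, which corresponds to roughly 125 primer sets per region at 800-base spacing.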

After low-quality reads were filtered out, the data were analyzed with SNP Detector version 3 for polymorphic site discovery and individual genotype calling. Various QC filters were then applied. Specifically, we filtered out PCR amplicons with too many SNPs, and SNPs with discordant allele calls in multiple amplicons. We also filtered out SNPs with low completeness across samples, or with too many conflicting genotype calls between the two strands.

In the "QC+" data set, we applied the HapMap QC parameters: we filtered out samples with low completeness, and filtered out SNPs with a low call rate in each population (<80%) as well as SNPs not in HWE (P<0.001). In the QC+ data set, the overall false positive rate is ~3.2%, based on a limited number of validation assays.
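
To make the QC+ SNP thresholds concrete, here is a minimal sketch of a Hardy-Weinberg check combined with the call-rate cut-off. The release notes do not say which HWE test was used. The code assumes a one-degree-of-freedom chi-square goodness-of-fit test purely for illustration (exact tests are also common), and the function names are hypothetical.

```python
import math


def hwe_chi_square_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 d.f.) p-value for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no departure can be measured
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc_plus(call_rate: float, hwe_p: float) -> bool:
    """Keep a SNP in a population if call rate >= 80% and HWE P >= 0.001."""
    return call_rate >= 0.80 and hwe_p >= 0.001
```

A population with genotype counts 25/50/25 sits exactly at Hardy-Weinberg proportions and gets a p-value of 1, whereas 50/0/50 (no heterozygotes at all) falls far below the 0.001 threshold.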


Caveats to this release include

A. GENOTYPING

Illumina SNPs that are A/T or C/G are missing from this release due to strandedness issues. Illumina SNPs that are mitochondrial are also missing (as they do not have rsIDs). There may be a few remaining Illumina SNPs in this release that are still on the (-/rev) strand of NCBI build 36, but they are not A/T or C/G SNPs, so they are easy to identify downstream.

B. PCR RESEQUENCING

Not all variant calls have been validated: we estimate that there is currently a false positive rate of ~12% among all calls, with a slightly higher rate (~14%) if considering just the singletons. Additional validation is ongoing. PCR sequencing of additional samples (Maasai) is also ongoing.


How to download this release

FTP Download

The data can be downloaded from the WTSI FTP site.

To access ENCODE III PCR resequencing data:

Please visit the BCM-HGSC public ftp site (ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Encode).
  • coriell-lookup.xls - list of 712 unrelated samples sequenced [60 KB]
  • bcm-encode3-submission.txt.gz - genotypes of 10,076 SNP sites by 712 samples [626 KB]
  • bcm-encode3-QC.txt - QC+ genotypes of 6,223 SNP sites by 692 samples [8700 KB]


Links
  • BCM-HGSC http://www.hgsc.bcm.tmc.edu/
  • BI http://www.broad.mit.edu/
  • WTSI http://www.sanger.ac.uk/
  • International HapMap Project http://www.hapmap.org/

Analysis plans

Below are the analysis plans that the consortium is pursuing:

  • SNP allele frequency estimation
  • Population differentiation
  • Linkage disequilibrium analysis
  • SNP Tagging
  • Imputation efficiency
  • Genomic locations of human CNVs
  • Genotypes for CNVs
  • Population genetic properties of CNVs (allele frequencies, population differentiation, etc.)
  • Mutation rate (frequency of de novo CNV) and potential mutational mechanisms
  • Linkage disequilibrium properties of CNVs
  • Tagging and imputation of CNVs
  • Signals of selection around CNVs
  • Association of SNPs and CNVs with expression phenotypes

Data Release Policy

The release of pre-publication data from large resource-generating scientific projects was the subject of a meeting held in January 2003, the "Fort Lauderdale" meeting. The report from that meeting can be viewed here.

The recommendations of the Fort Lauderdale meeting address the roles and responsibilities of data producers, data users, and funders of "community resource projects", with the aim of establishing and maintaining an appropriate balance between the interests of data users in rapid access to data and the needs of data producers to receive recognition for their work. The conclusion of the attendees at the meeting was that responsible use of the data is necessary to ensure that first-rate data producers will continue to participate in such projects and produce and quickly release valuable large-scale data sets. "Responsible use" was defined as allowing the data producers to have the opportunity to publish the initial global analyses of the data, as articulated at the outset of the project. Doing so also will ensure that the data generated are fully described.

Last Modified Fri Sep 4 23:17:58 2009

Genome Research Limited is a charity registered in England with number 1021457

Data Sharing Policy | Conditions of Use | Copyright