HapMap 3

HapMap 3 is the third phase of the International HapMap project. This phase increases the number of DNA samples covered from 270 in phases I and II to 1,301 samples from a variety of human populations. This is the draft release 3.

The definitive data are available from the HapMap ftp site. The data available from these pages at the Sanger Institute are raw unfiltered data, provided as a resource to the community.

Populations

The following population samples were studied:

ASW: African ancestry in Southwest USA
CEU: Utah residents with Northern and Western European ancestry from the CEPH collection
CHB: Han Chinese in Beijing, China
CHD: Chinese in Metropolitan Denver, Colorado
GIH: Gujarati Indians in Houston, Texas
JPT: Japanese in Tokyo, Japan
LWK: Luhya in Webuye, Kenya
MXL: Mexican ancestry in Los Angeles, California
MKK: Maasai in Kinyawa, Kenya
TSI: Toscani in Italia
YRI: Yoruba in Ibadan, Nigeria

Data released

HapMap 3 Release 3

POP         Num_samples   Num_SNPs_QC   Num_SNPs_QC_poly
ASW         87            1,623,986     1,543,115
CEU         165           1,623,122     1,397,814
CHB         137           1,626,122     1,341,772
CHD         109           1,620,198     1,311,767
GIH         101           1,630,857     1,408,904
JPT         113           1,634,041     1,294,406
LWK         110           1,625,159     1,526,783
MXL         86            1,604,948     1,453,054
MKK         184           1,611,733     1,532,002
TSI         102           1,632,607     1,419,970
YRI         203           1,625,669     1,493,761
Consensus   1,397         1,481,135     1,457,897
  • 99.3% platform concordance
  • 99.7% call rate
  • 1198 founders and 199 non-founders
  • 683 males, 714 females
  • 23238 monomorphic SNPs removed from consensus

i) Population samples for genotyping

Columns: population sample; number of individuals with HapMap 3 genotypes in this release (of the total number of individuals); number of SNPs included in this release (after QC).

Population                         Individuals genotyped (of total)   SNPs after QC
ASW                                71 (of 90)                         1,632,186
CEU                                162 (of 180)                       1,634,020
CHB                                82 (of 92)                         1,637,672
CHD                                70 (of 90)                         1,619,203
GIH                                83 (of 90)                         1,631,060
JPT                                82 (of 89)                         1,637,610
LWK                                83 (of 90)                         1,631,688
MXL                                71 (of 90)                         1,614,892
MKK                                171 (of 180)                       1,621,427
TSI                                77 (of 90)                         1,629,957
YRI                                163 (of 180)                       1,634,666
Total                              1115 (of 1261)                     1,525,445
Consensus (polymorphic) dataset*   1115 (of 1261)                     1,490,422

* Consensus (polymorphic) dataset of this release (35,023 monomorphic SNPs removed)

ii) Population Samples for PCR Resequencing

For each population the number of individuals for whom sequence was generated is shown:

ASW 55
CEU 119
CHB 90
CHD 30
GIH 60
JPT 91
LWK 60
MXL 27
MKK 0
TSI 60
YRI 120
Total 712

Data download

FTP Download

The data can be downloaded from the HapMap ftp site.

To access ENCODE III PCR resequencing data:

Please visit the BCM-HGSC public ftp site (ftp://ftp.hgsc.bcm.tmc.edu/pub/data/HapMap3-ENCODE/ENCODE3/ENCODE3v1/).

Data Release Policy

The release of pre-publication data from large resource-generating scientific projects was the subject of a meeting held in January 2003, the "Fort Lauderdale" meeting. The report from that meeting can be viewed here.

The recommendations of the Fort Lauderdale meeting address the roles and responsibilities of data producers, data users, and funders of "community resource projects", with the aim of establishing and maintaining an appropriate balance between the interests of data users in rapid access to data and the needs of data producers to receive recognition for their work.

The conclusion of the attendees at the meeting was that responsible use of the data is necessary to ensure that first-rate data producers will continue to participate in such projects and produce and quickly release valuable large-scale data sets. "Responsible use" was defined as allowing the data producers to have the opportunity to publish the initial global analyses of the data, as articulated at the outset of the project. Doing so also will ensure that the data generated are fully described.

Production and QC

i) Genotyping

Genotyping concordance between the two platforms was 0.9931 (computed over 249889 overlapping SNPs).
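As a rough illustration of how a concordance figure of this kind can be computed, the sketch below compares the calls of two platforms at overlapping SNPs and reports the fraction of identical calls. The function name, the dictionary layout, the "00" missing code and the choice to exclude missing calls from the denominator are all assumptions made for the example; the consortium's actual procedure is not described here.

```python
def concordance(calls_a, calls_b, missing="00"):
    """Fraction of identical genotype calls among overlapping, non-missing pairs.

    calls_a, calls_b: dicts mapping (sample_id, snp_id) -> genotype string.
    Genotypes are compared as unordered allele pairs, so "AG" equals "GA".
    The "00" missing code is an assumption for this sketch.
    """
    compared = 0
    matching = 0
    # Only (sample, SNP) pairs present on both platforms are considered.
    for key in calls_a.keys() & calls_b.keys():
        a, b = calls_a[key], calls_b[key]
        if a == missing or b == missing:
            continue  # pairs with a missing call are not counted
        compared += 1
        if sorted(a) == sorted(b):
            matching += 1
    if compared == 0:
        raise ValueError("no overlapping non-missing calls")
    return matching / compared
```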

Data from the two platforms were merged using PLINK (--merge-mode 1), which keeps a genotype call only when the non-missing calls from the two platforms agree (that is, the merged genotype is set to missing if the two platforms give different, non-missing calls).
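The merge rule described above can be sketched for a single genotype as follows. This is only an illustration of the stated rule, not PLINK's implementation; the function name and the "00" missing code are assumptions.

```python
def merge_call(call_a, call_b, missing="00"):
    """Merge one genotype call from each platform using the consensus rule.

    - If one call is missing, the other call is used.
    - If both calls are present and agree, that call is kept.
    - If both calls are present and differ, the result is set to missing.
    Genotypes are compared as unordered allele pairs ("AG" equals "GA").
    """
    if call_a == missing:
        return call_b
    if call_b == missing:
        return call_a
    if sorted(call_a) == sorted(call_b):
        return call_a
    return missing
```

For example, under these assumptions `merge_call("AG", "00")` keeps the single available call, while `merge_call("AG", "AA")` yields the missing code because the platforms disagree.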

Quality control at the individual level was performed separately for each platform. Only individuals whose genotype data passed QC on both platforms were kept in this release. The following criteria were used to keep SNPs in the data sets of this release:

  • Hardy-Weinberg p > 0.000001 (per population)
  • missingness < 0.05 (per population)
  • < 3 Mendel errors (per population; only applies to YRI, CEU, ASW, MXL, MKK)
  • SNP must have an rsID and map to a unique genomic location

The "consensus" data set contains data for all individuals (558 males, 557 females; 924 founders and 191 non-founders), keeping only SNPs that passed QC in all populations (overall call rate is 0.998). The "consensus/polymorphic" data set has monomorphic SNPs removed.

In all genotype files, alleles are expressed as being on the (+/fwd) strand of NCBI build 36.
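The SNP-level criteria above, and the idea of a consensus set that passes in every population, can be sketched as below. The sketch assumes that per-population statistics (Hardy-Weinberg p-value, missingness, Mendel error count) have already been computed by some other tool; the `SnpStats` record, its field names and both function names are hypothetical and are not taken from the HapMap pipeline.

```python
from dataclasses import dataclass

# Populations for which the Mendel error criterion applies (from the text).
MENDEL_POPULATIONS = {"YRI", "CEU", "ASW", "MXL", "MKK"}


@dataclass
class SnpStats:
    """Hypothetical per-population summary for one SNP."""
    rs_id: str             # empty string if the SNP has no rsID
    unique_location: bool  # True if the SNP maps to a unique genomic location
    hwe_p: float           # Hardy-Weinberg p-value in this population
    missingness: float     # fraction of missing calls in this population
    mendel_errors: int     # number of Mendel errors in this population


def passes_qc(stats, population):
    """Apply the four SNP-level criteria to one SNP in one population."""
    if not stats.rs_id or not stats.unique_location:
        return False
    if stats.hwe_p <= 0.000001:
        return False
    if stats.missingness >= 0.05:
        return False
    if population in MENDEL_POPULATIONS and stats.mendel_errors >= 3:
        return False
    return True


def consensus_snps(per_population):
    """Return the rsIDs that pass QC in every population.

    per_population: dict mapping population code -> dict of rs_id -> SnpStats.
    """
    passing = [
        {rs for rs, stats in snps.items() if passes_qc(stats, pop)}
        for pop, snps in per_population.items()
    ]
    return set.intersection(*passing) if passing else set()
```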

ii) PCR Resequencing

The sequence based variant calls were generated by tiling with PCR primer sets spaced approximately 800 bases apart across the following regions:

Region   Chromosome   Coordinates                 Status
ENm010   7            27,124,046 - 27,224,045     ENCODE I
ENr321   8            119,082,221 - 119,182,220   ENCODE I
ENr232   9            130,925,123 - 131,025,122   ENCODE I
ENr123   12           38,826,477 - 38,926,476     ENCODE I
ENr213   18           23,919,232 - 24,019,231     ENCODE I
ENr331   2            220,185,590 - 220,285,589   New
ENr221   5            56,071,007 - 56,171,006     New
ENr233   15           41,720,089 - 41,820,088     New
ENr313   16           61,033,950 - 61,133,949     New
ENr133   21           39,444,467 - 39,544,466     New
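To give a sense of scale, the short sketch below works out the length of a few of the regions listed above and an approximate number of primer sets per region at the stated spacing of about 800 bases. It assumes the coordinates are inclusive at both ends and that primer sets are evenly spaced with no overlap, which is a simplification; the function names are hypothetical.

```python
import math

# A few regions copied from the table above: name -> (start, end).
EXAMPLE_REGIONS = {
    "ENm010": (27124046, 27224045),
    "ENr321": (119082221, 119182220),
    "ENr331": (220185590, 220285589),
}


def region_length(start, end):
    """Length in bases, assuming both coordinates are inclusive."""
    return end - start + 1


def approx_primer_sets(length, spacing=800):
    """Approximate number of evenly spaced primer sets needed to tile a region."""
    return math.ceil(length / spacing)


for name, (start, end) in EXAMPLE_REGIONS.items():
    length = region_length(start, end)
    print(name, length, approx_primer_sets(length))
```

Under these assumptions each example region spans 100,000 bases, which corresponds to roughly 125 primer sets per region.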

Following filtration of low-quality reads, the data were analyzed with SNP Detector version 3 for polymorphic site discovery and individual genotype calling. Various QC filters were then applied. Specifically, we filtered out PCR amplicons with too many SNPs, and SNPs with discordant allele calls in multiple amplicons. We also filtered out SNPs with low completeness in samples, or with too many conflicting genotype calls on the two strands.

In the "QC+" data set, we applied the HapMap QC parameters: we filtered out samples with low completeness, SNPs with a low call rate (<80%) in each population, and SNPs not in HWE (P<0.001). In the QC+ data set, the overall false positive rate is ~3.2%, based on a limited number of validation assays.

Caveats

i) Genotyping

Illumina SNPs that are A/T or C/G are missing from this release because of strandedness issues. Illumina SNPs that are mitochondrial are also missing, as they do not have rsIDs. A few Illumina SNPs in this release may still be on the (-/rev) strand of NCBI build 36, but they are not A/T or C/G SNPs, so they are easy to identify downstream.
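The reason A/T and C/G SNPs are problematic, and why any other SNP left on the reverse strand is easy to spot, can be sketched as follows. The sketch assumes that forward-strand reference alleles are available for each SNP from some external annotation; the function names and return labels are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose alleles look identical on both strands."""
    return COMPLEMENT[allele1] == allele2


def strand_status(observed, reference):
    """Classify observed alleles against forward-strand reference alleles.

    observed, reference: iterables of single-letter alleles, e.g. "AG".
    Returns "ambiguous", "forward", "reverse" or "mismatch".
    """
    obs, ref = frozenset(observed), frozenset(reference)
    if len(ref) == 2 and is_strand_ambiguous(*sorted(ref)):
        # Complementing an A/T or C/G SNP gives the same allele pair,
        # so the strand cannot be inferred from the alleles alone.
        return "ambiguous"
    if obs == ref:
        return "forward"
    if frozenset(COMPLEMENT[a] for a in obs) == ref:
        return "reverse"
    return "mismatch"
```

For instance, an A/G SNP reported as T/C is classified as "reverse" and can be flipped, whereas an A/T SNP is always classified as "ambiguous".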

ii) PCR Resequencing

Not all variant calls have been validated: we estimate that there is currently a false positive rate of ~12% among all calls, with a slightly higher rate (~14%) if considering just the singletons. Additional validation is ongoing. PCR sequencing of additional samples (Maasai) is also ongoing.

Analysis plans

Below are the analysis plans that the consortium is pursuing:

  • SNP allele frequency estimation
  • Population differentiation
  • Linkage disequilibrium analysis
  • SNP Tagging
  • Imputation efficiency
  • Genomic locations of human CNVs
  • Genotypes for CNVs
  • Population genetic properties of CNVs (allele frequencies, population differentiation, etc.)
  • Mutation rate (frequency of de novo CNV) and potential mutational mechanisms
  • Linkage disequilibrium properties of CNVs
  • Tagging and imputation of CNVs
  • Signals of selection around CNVs
  • Association of SNPs and CNVs with expression phenotypes

Institutions and funding

The data for HapMap 3 have been produced by the following institutions:

  • Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC)
  • Broad Institute (BI)
  • Wellcome Trust Sanger Institute (WTSI)

Funding for phase 3 of the International HapMap project has been provided by:

  • National Institutes of Health - National Human Genome Research Institute (NHGRI)
  • Wellcome Trust