High resolution CNV discovery (Conrad et al, 2010)

The Genome Structural Variation Consortium has conducted a CNV discovery project to identify common CNVs larger than 500 bp using array comparative genomic hybridisation (array-CGH) at tiling resolution on isothermal oligonucleotide arrays. We analysed 20 females of European ancestry and 20 females of African ancestry against a single male reference sample.

[Matt Hurles, Genome Research Limited]

We analysed 20 CEU HapMap samples, 20 YRI HapMap samples and one Polymorphism Discovery Resource sample for CNVs by array-CGH using a set of NimbleGen arrays that tile across the assayable portion of the genome with approximately 42 million probes spread across twenty 2.1 million probe (HD2) arrays.

Samples used were: NA06985, NA07037, NA07045, NA11894, NA11931, NA11993, NA11995, NA12004, NA12006, NA12044, NA12156, NA12239, NA12287, NA12414, NA12489, NA12749, NA12776, NA12828, NA12878, NA15510, NA18502, NA18505, NA18508, NA18511, NA18517, NA18523, NA18858, NA18861, NA18907, NA18909, NA18916, NA19099, NA19108, NA19114, NA19129, NA19147, NA19190, NA19225, NA19240 and NA19257. Reference sample was NA10851.

Data download

Data access summary:

Normalised intensity data from the CNV discovery array-CGH

Data description

A copy-number variation (CNV) analysis was run on 20 CEU and 20 YRI samples using an array set tiling across the human genome. The array set contained 42 million oligonucleotide probes spread across 20 arrays. The final design provided an average density of one probe per 50 bp across the genome.

Quality control

Three sets of special probes were used for this experiment. First, over 1,400 exons of known dosage-sensitive genes were identified, and a single probe was placed in each exon; this set of control probes was printed on each subarray. Second, each successive chip overlapped the previous one by about 14,000 probes, equivalent to the average number of probes per megabase. Third, X-linked probes were printed on each subarray, which allowed empirical measurement of the experimental dose-response for each of our male-female cohybridizations.

Normalization

The normalization pipeline begins with the q-spline normalized data provided by NimbleGen; q-spline normalization transforms the red and green channel data from a single array experiment to an identical distribution. Log2 ratios are then obtained at each probe position as log2(Cy3/Cy5). In-house, we correct for GC effects by fitting a model with linear and quadratic effects of GC content to the log2 ratios, separately for each subarray. We take the GC percentage in a 300 bp window centered on each probe as our data for this analysis, using NCBI36 as our reference genome sequence. Finally, long-range spatial autocorrelation in log2 ratios (the 'wave effect') is modeled and removed using the method described in Marioni et al. (2007) Genome Biology 8(10):R228.
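The GC-correction step can be illustrated with a minimal sketch. It is not the Consortium's pipeline: it assumes the GC content of each 300 bp window is expressed as a fraction, that the correction is an ordinary least-squares fit, and that the corrected value is the residual (so the fitted intercept is removed as well). The function names are hypothetical, and only the Python standard library is used.

```python
import math


def log2_ratio(cy3, cy5):
    """Log2 ratio of the two channel intensities at one probe."""
    return math.log2(cy3 / cy5)


def solve_linear_system(a, b):
    """Solve a small dense system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def gc_correct(log2_ratios, gc_fractions):
    """Fit log2 ~ b0 + b1*gc + b2*gc^2 for one subarray and return the residuals."""
    # Design matrix columns are [1, gc, gc^2]; build the normal equations X'X b = X'y.
    powers = [[g ** k for k in range(3)] for g in gc_fractions]
    xtx = [[sum(p[i] * p[j] for p in powers) for j in range(3)] for i in range(3)]
    xty = [sum(p[i] * y for p, y in zip(powers, log2_ratios)) for i in range(3)]
    b0, b1, b2 = solve_linear_system(xtx, xty)
    # Assumption: the corrected value is the residual after removing the fitted trend.
    return [y - (b0 + b1 * g + b2 * g * g) for y, g in zip(log2_ratios, gc_fractions)]


# Synthetic check: a pure quadratic GC trend should be removed almost entirely.
gc = [0.30 + 0.01 * i for i in range(41)]
trend = [0.1 + 0.5 * g - 0.8 * g * g for g in gc]
corrected = gc_correct(trend, gc)
```

Because the fit is done separately for each subarray, `gc_correct` would be called once per subarray with only that subarray's probes.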

All of the data in this dataset are generated from female samples cohybridized with a common male reference. For X-linked probes outside of PAR1 and PAR2, we take a slightly different normalization approach to remove the effect of the reference sample. The raw data for non-PAR1/PAR2 X-linked probes are separated from all other probes and normalized as above (q-spline, GC, wave). Following this, the population median log2 ratio at each probe is calculated, and this value is subtracted from the log2 ratio of each sample at that probe.
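The median-centering step for non-PAR X-linked probes can be sketched as follows. This is an illustration only: it assumes the per-sample log2 ratios have already been normalized and are stored in the same probe order, and the function name, sample labels and values are hypothetical.

```python
from statistics import median


def center_by_population_median(log2_by_sample):
    """Subtract the per-probe population median from every sample's log2 ratio.

    log2_by_sample maps a sample label to a list of log2 ratios,
    with all lists in the same probe order.
    """
    samples = list(log2_by_sample)
    n_probes = len(log2_by_sample[samples[0]])
    # One median per probe, taken across all samples.
    probe_medians = [
        median(log2_by_sample[s][i] for s in samples) for i in range(n_probes)
    ]
    return {
        s: [value - m for value, m in zip(log2_by_sample[s], probe_medians)]
        for s in samples
    }


# Illustrative values only: three hypothetical samples, two X-linked probes.
example = {
    "sample_a": [1.0, 0.9],
    "sample_b": [1.1, 1.0],
    "sample_c": [0.2, 1.1],
}
centered = center_by_population_median(example)
```

In this toy input the first probe of `sample_c` sits well below the population median and stays clearly negative after centering, while the shift shared by all samples (attributable to the male reference) is removed.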

Download the data here.

Validated CNVs called from normalised data

Provisional Data Release

The regions made available in this provisional data release are the chromosomal start and end coordinates of 8,599 copy number variation events (CNVEs), which are representative of the underlying sample-level CNV calls. Briefly, clusters of overlapping sample-level CNVs are merged into a CNVE if they share at least 51% reciprocal overlap. This means that CNV calls with similar boundaries are merged, while overlapping CNVs of very different sizes or with different start and end points are kept separate at the CNVE level as well. Overlapping CNVEs will therefore be present in the dataset. Each of the CNVEs in this provisional data release has some level of independent validation, either by an independent platform or by overlap with other published datasets.
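The reciprocal-overlap criterion can be made concrete with a short sketch. It covers only the pairwise test, not the clustering procedure itself, which is not described here. It assumes half-open coordinates (start inclusive, end exclusive) and reads "at least 51%" as a threshold of 0.51 applied to both intervals; the function names are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions for intervals a and b."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0  # the intervals do not overlap
    # Reciprocal overlap: the shared span must cover enough of BOTH intervals.
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def passes_merge_criterion(a, b, threshold=0.51):
    """True if two (start, end) calls meet the reciprocal-overlap threshold."""
    return reciprocal_overlap(a[0], a[1], b[0], b[1]) >= threshold


# Similar boundaries: 80 bp shared out of 100 bp and 90 bp, so the pair qualifies.
similar = passes_merge_criterion((100, 200), (120, 210))
# A small call nested in a large one: 100% of one but only 10% of the other.
nested = passes_merge_criterion((0, 1000), (400, 500))
```

The nested case shows why overlapping CNVEs remain in the dataset: a small CNV lying entirely inside a much larger one fails the test on the larger interval's side.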

Download the data here.

Genotyping of selected CNVs

Table of CNV genotypes

This table contains absolute, integer-valued copy number estimates for 450 HapMap samples. The first worksheet is a map describing the location of all loci in the dataset. The CGH array used for CNV "genotyping" targeted some loci that were not discovered with the 42 million probe array set. Each set of integer-valued copy numbers is indexed by CNV name, and sample labels are in the column headers. "NA" is the missing-data value for genotype calls that could not be assigned. The "Genotype Map" contains the following annotation for each CNV:

  • CNV: CNV ID
  • chr, start, end: chromosome coordinates with respect to NCBI36/hg18. Note that CNVs of novel insert sequences are not mapped but have values of "NA" instead.
  • cn: the set of integer value absolute copy numbers observed at the locus
  • source: source of the CNV location. Either a publication, or "unpublished"
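As a rough guide to working with the genotype table, the sketch below reads copy-number calls and treats "NA" as missing. It assumes the worksheet has been exported as tab-delimited text with one row per CNV and one column per sample; that export format, the function name, and all identifiers and values in the example are hypothetical.

```python
import csv
import io


def parse_genotype_table(text):
    """Parse tab-delimited genotype text into {cnv_id: {sample: copy_number}}.

    Calls recorded as "NA" are stored as None.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column holds the CNV name
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        cnv_id, calls = row[0], row[1:]
        table[cnv_id] = {
            sample: (None if call == "NA" else int(call))
            for sample, call in zip(samples, calls)
        }
    return table


# Illustrative input only; these are not real CNV IDs, samples or genotypes.
example_text = (
    "CNV\tsample_1\tsample_2\n"
    "CNV_example_1\t2\tNA\n"
    "CNV_example_2\t3\t1\n"
)
genotypes = parse_genotype_table(example_text)
```

Keeping missing calls as `None` rather than 0 avoids confusing an unassigned genotype with a homozygous deletion.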

Download the data here.

Data description

These data are being released freely to the scientific community and can be considered a community resource. However, the data generators reserve the right to be the first to publish on the bulk data as indicated by the Fort Lauderdale meeting report (see data release policy below). Our groups are performing various global analyses in this dataset, including:

  • generating a genome-wide map of copy number variation
  • mapping the genome-wide CNV map onto functional annotation of the genome
  • investigating associations with SNP and haplotype variation
  • investigating associations with gene expression variation
  • quantifying population differentiation for copy number variation
  • investigating mechanisms of CNV formation

Authors who use data from this project for presentation and/or publication should acknowledge the project. Below is a sample acknowledgement statement:

This study makes use of data generated by the Genome Structural Variation Consortium (PIs Nigel Carter, Matthew Hurles, Charles Lee and Stephen Scherer) whom we thank for pre-publication access to their CNV discovery [and/or] genotyping data, made available through the websites http://www.sanger.ac.uk/humgen/cnv/42mio/ and http://projects.tcag.ca/variation/ as a resource to the community. Funding for the project was provided by the Wellcome Trust [Grant No. 077006/Z/05/Z], Canada Foundation of Innovation and Ontario Innovation Trust, Canadian Institutes of Health Research, Genome Canada/Ontario Genomics Institute, the McLaughlin Centre for Molecular Medicine, Ontario Ministry of Research and Innovation, the Hospital for Sick Children Foundation, the Department of Pathology at Brigham and Women's Hospital and the National Institutes of Health grants HG004221 and GM081533.

Users should note that the Consortium bears no responsibility for the further analysis or interpretation of these data, over and above that published by the Consortium.

Acknowledgments

Wellcome Trust Sanger Institute: Don Conrad, Richard Redon, Tomas Fitzgerald, Nelo Onyiah, Jan Aerts, Chris Tyler-Smith, Nigel Carter, Matthew Hurles

The Centre for Applied Genomics: Steve Scherer, Lars Feuk, Dalila Pinto

Harvard Medical School, Brigham and Women's Hospital: Charles Lee, Omer Gokcumen

Publication

  • Origins and functional impact of copy number variation in the human genome.

    Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, Macarthur DG, Macdonald JR, Onyiah I, Pang AW, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Wellcome Trust Case Control Consortium, Tyler-Smith C, Carter NP, Lee C, Scherer SW and Hurles ME

    Nature 2010;464;7289;704-12

CNV project pages

Software

  • CNVFinder - an algorithm designed to detect copy number variants (CNVs) in the human population from large-insert clone DNA microarray data
  • CNVTools - a collection of packages useful in the analysis of copy number variants (CNV).