Sequencing

The Wellcome Trust Sanger Institute was founded to exploit DNA sequence in order to understand biology and disease.

That imperative remains as strong today as the Institute capitalises on leading-edge technologies to answer questions that were unanswerable only a few years ago. We have always used and will continue to use the most appropriate technology to provide the data that solve important questions in biomedicine.

3730 sequencing

Biochemistry

Frederick Sanger.

Sanger dideoxy terminator sequencing has provided the backbone technology for DNA sequencing for the last 40 years. The method was originally devised by Fred Sanger; the Institute was founded on this technology and named in his honour. The systems that read the sequence data have improved steadily, the current 3730 XL generation being provided by Applied Biosystems.

Three groups provide samples to the facility: Small sequencing projects, whole genome sequencing, and sequence improvement.

The Small sequencing project group provides a low- to medium-throughput sequencing service to Sanger Institute Faculty groups.

Whole genome sequencing groups work on a range of organisms: the pathogen sequencing team generates high-quality finished sequence for a number of pathogen and model organism genomes, including bacteria, viruses and protozoa that infect species from plants to humans. Current projects include Plasmodium species for comparative studies and a vast array of bacterial genomes for both finishing and comparative genetics.

The Sequence improvement group aims to provide a consistent and reliable quality laboratory sequencing service to aid improvement of genomes and projects. Auto-prefinishing can be carried out on some large genomes and many pathogen genomes before they are passed to the finishing teams, who can subsequently request bespoke DNA sequencing reactions to complete projects.

They also carry out small insert sequencing, automatically sequencing large numbers of small DNA fragments, whether cloned or PCR product, either de novo or for reference confirmation. This involves multiple iterative rounds of automated custom oligonucleotide primer design, either from the reference sequence or from previously generated sequence. The rounds continue until the DNA fragment is contiguous and sequenced in both directions.

Examples of such projects include:

  • Gallus gallus EST and cDNA
  • Xenopus tropicalis EST and cDNA
  • Human Open Reading Frame project
  • Danio rerio EST and cDNA

The group as a whole has reduced in size as newer, faster and cheaper technologies have replaced this traditional sequencing methodology; however, the group remains vital to the goals of the Institute as a whole.

454 sequencing

The pyro-sequencing production group provides high throughput second generation sequencing for the Sanger Institute and its collaborators. The 454 sequencing system is an ultra-high throughput method of producing DNA sequence with upwards of 500 MB of quality-control-checked data being produced in a 9.5 hour run. This data is contained in 1 million to 1.2 million reads with a current mean average read length of 380 bp (modal average 460 bp). The system is ideally suited for de novo sequencing of eukaryotic, prokaryotic and viral genomes as well as amplicon variant analysis.

The process can be split into three parts: library preparation, emulsion PCR (emPCR) and the sequencing run itself.

Library preparation

  • Shotgun: Genomic DNA is fragmented to around 550 bp by nebulisation and size selection. Adapters containing the sequence for both emPCR and sequencing are ligated. One of the strands of the DNA is melted off, leaving a single-stranded (ss) DNA library ready for emPCR. Samples can be multiplexed by use of 12 commercially available 'MID' adapter tags.
  • Paired end: Genomic DNA is sheared to 3, 8 or 20 kb (depending on the insert size required) using a hydroshear, followed by ligation of circularisation adapters. These adapters take advantage of the Cre-lox system, in which Cre recombinase mediates site-specific recombination at loxP sites in the adapters. This produces a circular DNA molecule with ends a known distance apart. This molecule is fragmented by nebulisation, and sequences containing the adapter are selected using a biotin tag present on the adapters. The process is then the same as the shotgun procedure, and results in shotgun (only one end present) and paired end fragments at a 50:50 ratio.
  • Amplicon: This procedure is used for deep sequencing of a targeted region. The customer provides the sample ready for emPCR. Fusion primers, each comprising a 19 bp primer A or primer B sequence (used in emPCR and sequencing) joined to a target-specific sequence of typically 20-25 bp, are incorporated into the amplicons in the customer's own PCR experiment. Samples can be multiplexed by adding a barcode sequence immediately after primer A or B.

Emulsion Polymerase Chain Reaction (emPCR)

The aim of emPCR is to clonally amplify the target sequence to produce a signal detectable by the 454 sequencer. This is achieved by the use of micro-reactors present in an oil-water emulsion. Library DNA fragments are immobilised onto DNA capture beads. The beads are segregated in the emulsion, ideally with one bead (carrying one strand of DNA) per micro-reactor, each of which contains the reagents necessary to perform PCR. Emulsified beads are subjected to PCR to clonally amplify each template DNA molecule. After PCR, the emulsions are broken: the beads are released and the oil is removed. Beads carrying DNA are selected using the biotin label, and a sequencing primer is annealed.

Sequencing run

Sequencing equipment.

DNA-carrying beads are loaded into the wells of a Picotitre sequencing plate (PTP). Sequencing reagents are then flowed sequentially over the plate. The incorporation of a new base is associated with the release of inorganic pyrophosphate. This starts a chemical cascade that ultimately leads to the conversion of luciferin to oxyluciferin and a light signal, whose intensity is proportional to the number of bases of the flowed nucleotide that are incorporated. Depending on the amount of data required, the PTP can be split into 2, 4, 8 or 16 regions.
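The proportionality between light intensity and the number of bases incorporated can be illustrated with a toy decoder. This is only a sketch under stated assumptions: the flow order "TACG", the idealised normalised intensities and the function name decode_flowgram are all illustrative and are not taken from the 454 software.

```python
def decode_flowgram(intensities, flow_order="TACG"):
    """Convert normalised flow intensities into a base sequence.

    Each intensity is rounded to the nearest whole number, which is taken
    as the number of identical bases incorporated during that flow
    (0 means no incorporation). The flow order is an assumption made
    for this illustration.
    """
    bases = []
    for i, signal in enumerate(intensities):
        nucleotide = flow_order[i % len(flow_order)]
        count = max(0, int(signal + 0.5))  # round half up, never negative
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical intensities for two rounds of the four nucleotide flows.
example = [1.02, 0.05, 2.10, 0.97, 0.03, 1.05, 0.00, 3.08]
sequence = decode_flowgram(example)
print(sequence)
```

A signal close to 2 or 3 is read as a run of two or three identical bases, which is why long homopolymer runs are harder to call accurately than single bases.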

When the sequencing run is completed, we contact the customer with data storage information. If required, a de novo assembly can be performed, using the 454 assembler 'newbler', or the data can be mapped against a fasta target sequence file. Amplicon users can take advantage of the amplicon variant analysis software also provided by 454.

References

  • A high-throughput splinkerette-PCR method for the isolation and sequencing of retroviral insertion sites.

    Uren AG, Mikkers H, Kool J, van der Weyden L, Lund AH, Wilson CH, Rance R, Jonkers J, van Lohuizen M, Berns A and Adams DJ

    Nature protocols 2009;4;5;789-98

  • High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi.

    Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill FX, Goodhead I, Rance R, Baker S, Maskell DJ, Wain J, Dolecek C, Achtman M and Dougan G

    Nature genetics 2008;40;8;987-93

Illumina sequencing

Illumina instrumentation.

The Illumina Production Sequencing core facility provides a next-generation large-scale DNA (and RNA) sequencing service to the entire Sanger Institute.

Next-generation sequencing involves the application of glass micro-chip based methods and small-volume liquid handling (microfluidics) to sequence DNA more quickly and more cheaply than ever before: about 100 times less costly than the technology used to sequence the first human genome just a few years ago. These methods rely on reacting millions of molecules simultaneously in a single vessel and analysing those molecules in parallel on a single chip using a state-of-the-art optical detection instrument. A further increase in speed and a decrease in cost are attained by running multiple instruments concurrently, and the Sanger Institute has 37 Solexa/Illumina DNA sequencing instruments available to tackle ambitious research projects in genomic medicine.

The advent of next-generation technologies has fuelled an explosion in the quantity of raw DNA sequence that can be generated by a reasonably sized genomics facility. Compared with about 800 million bases per week generated at the Sanger Institute at the height of the human genome project using conventional capillary electrophoresis methods, the Illumina production facility currently averages about 300 billion bases (300 gigabases, Gb) per week. Given the approximately 3 billion bases in a human genome, this translates into about 5000 human-genome equivalents per year.
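The arithmetic behind these figures can be checked with a short calculation. The sketch below uses only the rounded numbers quoted in the text; the assumption of 52 production weeks per year and the variable names are illustrative.

```python
# Figures quoted in the text (all approximate).
BASES_PER_WEEK_ILLUMINA = 300e9    # about 300 Gb per week
BASES_PER_WEEK_CAPILLARY = 800e6   # about 800 million bases per week
HUMAN_GENOME_BASES = 3e9           # approximately 3 billion bases
WEEKS_PER_YEAR = 52                # assumption: production runs all year

# Raw sequence per year, expressed as multiples of the human genome length.
genome_equivalents_per_year = (
    BASES_PER_WEEK_ILLUMINA * WEEKS_PER_YEAR / HUMAN_GENOME_BASES
)

# Weekly output relative to the capillary-era peak.
fold_increase = BASES_PER_WEEK_ILLUMINA / BASES_PER_WEEK_CAPILLARY

print(f"{genome_equivalents_per_year:.0f} genome equivalents per year")
print(f"{fold_increase:.0f}-fold increase in weekly output")
```

With these inputs the calculation gives roughly 5200 genome equivalents per year, consistent with the rounded figure of 5000, and a weekly output several hundred times that of the capillary era. A genome equivalent here is raw sequence equal to one genome length, not a finished genome.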

Illumina instrumentation.

This enormous capacity is being translated into amazing new scientific endeavours by the Institute faculty, tackling exciting new genomics projects. Researchers here are cataloguing what makes cancer cells dangerous down at the level of individual genetic changes, how and why pathogens like malaria evolve to be more (or less) harmful and how humans adapt to those changes. Metagenomics is the study of the sequences of large populations of different organisms all growing in a common environment - as for example seawater, soil, the human gut - and these studies are made vastly easier by next-generation sequencing. We are looking at how human (and mouse) genomes vary between individuals to help get a handle on how genetics plays a role in the risk, generation, prognosis and treatment of disease.

The Solexa/Illumina technology is based on amplified single-molecule arrays: many millions of single molecules of DNA are placed onto a glass chip and each of those molecules is amplified in situ to form localised colonies or clusters of DNA. Each element of a cluster is virtually identical to its neighbours and thereby the signal from a single molecule is increased linearly to give robust, reliable detection. Sequencing-by-synthesis is carried out on all these clusters simultaneously, using fluorescent reversible terminators, which allow one and only one nucleotide to be added to a growing strand in a single cycle of sequencing.

Close-up of cluster station.

After incorporation of the terminators, the instrument images and distinguishes the four different terminators (A, C, G and T) by their unique attached fluorescent dye using two different lasers (red and green) and four different optical filters. After imaging one small part of the chip, the instrument continues scanning over the 960 imaging tiles. The last step of the cycle is removal of the fluorescent group and reversal of the termination, allowing the next single base to be sequenced in the subsequent cycle. One complete cycle of chemistry and imaging typically takes about 1 hour on the instrument.

Close-up of a chip.

The chips have eight channels or lanes, allowing up to eight sample libraries to be analysed simultaneously. Additional samples can be analysed by using a technique called multiplexing or indexing to mix different samples in a single lane of the chip; these samples can subsequently be separated in software using their unique sequence barcodes. Typically, all eight lanes of a 100-cycle run generate about 30 Gb of sequence in paired-end mode (sequencing, for example, 100 bases sequentially from each end of the molecules).
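The run figures can be broken down per lane with some simple arithmetic. This is a rough sketch that assumes a paired-end run of 100 cycles per end (200 cycles in total) and the cycle time of about 1 hour quoted earlier; the variable names are illustrative and the results are estimates, not instrument specifications.

```python
RUN_YIELD_BASES = 30e9     # about 30 Gb per run (from the text)
LANES = 8                  # eight lanes per chip (from the text)
READ_LENGTH = 100          # bases read from each end
READS_PER_FRAGMENT = 2     # paired-end: one read from each end
HOURS_PER_CYCLE = 1.0      # about 1 hour per cycle (from the text)

bases_per_fragment = READ_LENGTH * READS_PER_FRAGMENT

# Number of sequenced fragments (clusters) implied by the total yield.
fragments_per_run = RUN_YIELD_BASES / bases_per_fragment
fragments_per_lane = fragments_per_run / LANES

# Yield per lane in gigabases and approximate run duration in days.
yield_per_lane_gb = RUN_YIELD_BASES / LANES / 1e9
run_days = bases_per_fragment * HOURS_PER_CYCLE / 24

print(f"{fragments_per_lane / 1e6:.2f} million fragments per lane")
print(f"{yield_per_lane_gb:.2f} Gb per lane")
print(f"about {run_days:.1f} days per run")
```

Under these assumptions each lane yields a little under 4 Gb, which is the amount of sequence available to share between samples when a lane is multiplexed.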

Running a facility of this size requires a massive amount of support. We work closely with the library preparation team, which supplies large numbers of DNA templates in a form ready to be sequenced; the Institute's IT team, which maintains the extensive compute and storage infrastructure necessary; the sequencing informatics team, which develops software tools to process, analyse, store and track all the data, projects and samples for the Illumina pipeline; and the development team, which invents novel and improved protocols to take better advantage of this new technology.

References

  • Human Y chromosome base-substitution mutation rate measured by direct sequencing in a deep-rooting pedigree.

    Xue Y, Wang Q, Long Q, Ng BL, Swerdlow H, Burton J, Skuce C, Taylor R, Abdellah Z, Zhao Y, Asan, MacArthur DG, Quail MA, Carter NP, Yang H and Tyler-Smith C

    Current biology : CB 2009;19;17;1453-7

  • Accurate whole human genome sequencing using reversible terminator chemistry.

    Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IM, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DM, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Chiara E Catenazzi M, Chang S, Neil Cooley R, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fuentes Fajardo KV, Scott Furey W, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, Hoschler K, Hurwitz S, Ivanov DV, Johnson MQ, James T, Huw Jones TA, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, McCauley PG, McNitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ling Ng B, Novo SM, O'Neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Pike AC, Chris Pinkard D, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R and Smith AJ

    Nature 2008;456;7218;53-9

Development

Sequencing technology development group

Next-generation sequencing allows the Sanger Institute to produce immense quantities of DNA sequence data at a rate that would have been unimaginable a few years ago. The aim of the Sequencing technology development group is to develop high-throughput research methods that can exploit this explosion of data in order to answer a broad range of biological questions. The team, led by Dr Daniel J. Turner, uses a wide variety of molecular biological techniques and pre-sequencing technologies to help the Sanger Institute to remain at the forefront of genomic research, and to ensure that the Institute's resources are used effectively. The team undertakes independent research to create new front-end sequencing applications, collaborates with Faculty groups to develop bespoke assays, and provides robust protocols for production teams.

The ability to read the sequence of bases that comprise a polynucleotide has had an impact on biological research that is difficult to overstate. For the majority of the past 30 years, dideoxy 'Sanger' DNA sequencing has been used as the standard sequencing technology in many laboratories, and its acme was the completion of the human genome sequence. However, because Sanger sequencing is performed on single amplicons, its throughput is limited, and large-scale sequencing projects using this technology are expensive and laborious: the human genome sequence took hundreds of sequencing machines several years and cost several hundred million US dollars.

The paradigm of DNA sequencing changed with the advent of 'next-generation' technologies. The great power of next-generation sequencing lies in the ability to process hundreds of thousands to millions of DNA templates in parallel, resulting in a low running cost per base of generated sequence and a throughput on the gigabase (Gb) scale. As a consequence, we can now start to define the characteristics of entire genomes, and delineate differences between them. Ultimately, whole genome sequencing of complex organisms will become routine, which will allow us to gain a deeper understanding of the full spectrum of genetic variation and to define its role in phenotypic variation and the pathogenesis of complex traits.

Each sequence obtained from a next-generation DNA sequencer is derived from a single template strand, which becomes amplified clonally prior to the sequencing reaction. This, in addition to the immense throughput of next-generation sequencers, allows us to apply these technologies to a wide range of research areas, from transcript counting to the identification of insertion sites in transposon mutagenesis, as well as whole genome and whole exome sequencing.

Research

Members of the Sequencing technology development group work closely with many other teams within the Sanger Institute, and have a number of active collaborations with external groups. Brief descriptions of a selection of recent projects are given below.

FRT-seq

Analysis of complementary DNA by next-generation sequencing (RNA-seq) enables us to build an accurate picture of active transcriptional patterns in an organism. The ideal RNA-seq protocol would be accurate, strand-specific and quantitative across a wide dynamic range, compatible with paired-end sequencing, and would be free from inter- and intramolecular priming artifacts, allowing us to detect antisense transcripts unambiguously. However, none of the prior methodologies can meet all of these requirements.

The team has developed a method ('FRT-seq'), in which we use a reverse transcriptase to perform the first stage of the bridge-amplification step on the Illumina Genome Analyzer. In this way, RNA becomes converted to cDNA on the solid surface, and the individual template strands are isolated from one another, preventing priming artifacts. This results in the generation of highly accurate and reproducible strand-specific sequences, with greatly reduced bias compared to PCR-based approaches.

Target enrichment

In spite of the high throughput of next-generation DNA sequencers, it is not yet feasible to sequence large numbers of complex genomes in their entirety, because the cost and time taken are still too great. In addition to the demands such a project would place on laboratory time and funding, the primary analysis, where the image files captured during the sequencing reaction are converted into nucleotide sequences, would place a significant burden on a research centre's informatics infrastructure, as would storage of the resulting sequence information.

Consequently, the group has developed protocols for 'target enrichment', where unwanted genomic regions are selectively depleted from a DNA sample prior to sequencing, as part of the sample preparation. Resequencing the genomic regions that are retained is necessarily more time- and cost-effective and is considerably less cumbersome to analyse. Using the protocols developed in the Sequencing technology development team, the Sanger Institute has been able to set up a 'Pulldown' Production team, who provide a target enrichment service for the whole Institute.

Sample multiplexing

For some sequencing projects, the high throughput of a next-generation sequencing run would represent a massive excess. For the sequencing to be economical, the Sequencing technology development group, in collaboration with the Human evolution group, has developed effective protocols allowing the barcoding and pooling of over a hundred samples, so that each pool can be sequenced as a single sequencing library. During the sequencing reaction, the barcode is read, allowing each sequence to be attributed to the sample from which it was derived.
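Attributing each sequence to its sample of origin can be sketched as a simple lookup. The example below is a minimal illustration, not the group's protocol or pipeline: the barcode sequences, the sample names and the function demultiplex are hypothetical, and it assumes exact barcode matches with no error tolerance.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using the barcode read alongside each one.

    `reads` is an iterable of (barcode, sequence) pairs. Reads whose
    barcode is not in `barcode_to_sample` are placed under "unassigned".
    """
    bins = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "unassigned")
        bins[sample].append(sequence)
    return dict(bins)


# Hypothetical barcodes and reads for illustration only.
barcodes = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}
reads = [
    ("ACGTAC", "GATTACA"),
    ("TGCATG", "CCGGAAT"),
    ("ACGTAC", "TTAGGCA"),
    ("NNNNNN", "ACACACA"),  # unreadable barcode
]
by_sample = demultiplex(reads, barcodes)
print(by_sample)
```

A production implementation would normally tolerate one or more sequencing errors in the barcode, which requires barcodes that differ from one another at several positions.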

Transposon insertion site mapping

Transposon mutagenesis is a powerful method by which the function of genes can be modified, allowing us to evaluate how important each gene is for the viability of an organism. Until now, there has been no satisfactory way to perform such surveys on a genome-wide scale. Together with the Pathogen genomics group, the team developed TraDIS (Transposon Directed Insertion-site Sequencing), which they applied to a pool of over a million Salmonella enterica serovar Typhi mutants. This enabled every gene in the genome to be assessed for its necessity to the cell's survival in a single experiment for the first time, and allowed ~400,000 unique transposon insertion sites in the bacterial chromosome to be identified.
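The underlying idea, that genes tolerating few or no insertions are likely to be required for survival, can be illustrated with a toy calculation. This sketch is not the analysis used by the team: the gene coordinates, insertion positions, threshold and function names are all hypothetical.

```python
def insertion_index(genes, insertion_sites):
    """Return unique insertions per base for each gene.

    `genes` maps a gene name to inclusive (start, end) coordinates;
    `insertion_sites` is a collection of unique insertion positions.
    """
    result = {}
    for name, (start, end) in genes.items():
        length = end - start + 1
        hits = sum(1 for pos in insertion_sites if start <= pos <= end)
        result[name] = hits / length
    return result


def candidate_essential(genes, insertion_sites, threshold):
    """List genes whose insertion index falls below an arbitrary threshold."""
    index = insertion_index(genes, insertion_sites)
    return sorted(name for name, value in index.items() if value < threshold)


# Hypothetical genes and insertion positions for illustration only.
genes = {"geneA": (1, 1000), "geneB": (1001, 2000)}
sites = {50, 120, 300, 480, 640, 810, 990, 1500}
index = insertion_index(genes, sites)
essential = candidate_essential(genes, sites, threshold=0.002)
print(index, essential)
```

Normalising by gene length matters because a short gene may escape insertion by chance even when it is dispensable.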

The Sequencing technology development team is continuing to develop TraDIS, to adapt it to work with a wide range of bacterial transposons, to apply it to transposons used in the mutagenesis of large eukaryotic genomes, and to investigate other applications of the method.

Malaria genome sequencing

Genomes with biased nucleotide compositions present appreciable technical challenges to the currently available sequencing platforms. In several malaria species, including Plasmodium falciparum, the mean exonic AT content is >75 per cent, and in intergenic and intronic regions can be close to 100 per cent. Amplification steps, performed during library preparation, struggle with such extreme templates, so the resulting sequences do not represent the whole genome.
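The AT content referred to here is simply the percentage of A and T bases in a sequence. A minimal sketch follows; the function name at_content and the example sequence are illustrative, and ambiguous bases such as N are ignored by assumption.

```python
def at_content(sequence):
    """Return the percentage of A and T among the unambiguous bases."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    at_bases = sum(1 for base in counted if base in "AT")
    return 100.0 * at_bases / len(counted)


# Hypothetical AT-rich fragment for illustration only.
example = "ATATTAAGCATATTTA"
print(f"{at_content(example):.1f} per cent AT")
```

Applying the same function in a sliding window along a genome is one simple way to locate the most extreme regions before choosing a library preparation method.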

To address this problem, the team developed an alternative method of library preparation, in which amplification is avoided. This allowed them to generate sequence data from Plasmodium falciparum that not only improves SNP detection, compared to the standard method of library preparation, but also facilitates de novo assembly of this genome using short read assemblers, which was previously impossible. The 'no-PCR' method has established itself as a beneficial approach to library-making for all genomes, regardless of base composition, as the sequences obtained are necessarily free from amplification duplicates, and show a more even representation of the sample genome than those derived from standard libraries.

The team continues to work closely with the Sanger Malaria Programme: Kwiatkowski group, to improve malaria genome sequencing further.

References

  • FRT-seq: amplification-free, strand-specific transcriptome sequencing.

    Mamanova L, Andrews RM, James KD, Sheridan EM, Ellis PD, Langford CF, Ost TW, Collins JE and Turner DJ

    Nature methods 2010;7;2;130-2

  • Simultaneous assay of every Salmonella Typhi gene using one million transposon mutants.

    Langridge GC, Phan MD, Turner DJ, Perkins TT, Parts L, Haase J, Charles I, Maskell DJ, Peters SE, Dougan G, Wain J, Parkhill J and Turner AK

    Genome research 2009;19;12;2308-16

  • Improved protocols for the illumina genome analyzer sequencing system.

    Quail MA, Swerdlow H and Turner DJ

    Current protocols in human genetics / editorial board, Jonathan L. Haines ... [et al.] 2009;Chapter 18;Unit 18.2

  • Next-generation sequencing of vertebrate experimental organisms.

    Turner DJ, Keane TM, Sudbery I and Adams DJ

    Mammalian genome : official journal of the International Mammalian Genome Society 2009;20;6;327-38

  • Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes.

    Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M and Turner DJ

    Nature methods 2009;6;4;291-5

  • A large genome center's improvements to the Illumina sequencing system.

    Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Swerdlow H and Turner DJ

    Nature methods 2008;5;12;1005-10

  • A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis.

    Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, Kulesha E, Gräf S, Johnson N, Herrero J, Tomazou EM, Thorne NP, Bäckdahl L, Herberth M, Howe KL, Jackson DK, Miretti MM, Marioni JC, Birney E, Hubbard TJ, Durbin R, Tavaré S and Beck S

    Nature biotechnology 2008;26;7;779-85

Access

All external access is on a collaborative basis, and should have the involvement of a member of Sanger Institute Faculty.

The Sequencing Committee will consider external proposals. It is composed of representatives from Faculty, Informatics, Finance and the Sequencing group, and oversees all large-scale projects, both internal and external.

All projects must comply with Sanger Institute data release policies.
