Sanger Institute - Publications 2003

Number of papers published in 2003: 31

  • Gene annotation: prediction and testing.

    Ashurst JL and Collins JE

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    Fifty years after the publication of DNA structure, the whole human genome sequence will be officially finished. This achievement marks the beginning of the task to catalogue every human gene and identify the function and expression pattern of each. Currently, researchers estimate that there are about 30,000 human genes and approximately 70% of these can be automatically predicted using a combination of ab initio and similarity-based programs. However, to experimentally investigate every gene's function, the research community requires a high-quality annotation of alternative splicing, pseudogenes, and promoter regions that can only be provided by manual intervention. Manual curation of the human genome will be a long-term project as experimental data are continually produced to confirm or refine the predictions, and new features such as noncoding RNAs and enhancers have not been fully identified. Such a highly curated human gene-set made publicly available will be a great asset for the experimental community and for future comparative genome projects.

    Annual review of genomics and human genetics 2003;4;69-88

  • Candidate gene association study in type 2 diabetes indicates a role for genes involved in beta-cell function as well as insulin action.

    Barroso I, Luan J, Middelberg RP, Harding AH, Franks PW, Jakes RW, Clayton D, Schafer AJ, O'Rahilly S and Wareham NJ

    Incyte, Palo Alto, California, USA.

    Type 2 diabetes is an increasingly common, serious metabolic disorder with a substantial inherited component. It is characterised by defects in both insulin secretion and action. Progress in identification of specific genetic variants predisposing to the disease has been limited. To complement ongoing positional cloning efforts, we have undertaken a large-scale candidate gene association study. We examined 152 SNPs in 71 candidate genes for association with diabetes status and related phenotypes in 2,134 Caucasians in a case-control study and an independent quantitative trait (QT) cohort in the United Kingdom. Polymorphisms in five of 15 genes (33%) encoding molecules known to primarily influence pancreatic beta-cell function, namely ABCC8 (sulphonylurea receptor), KCNJ11 (KIR6.2), SLC2A2 (GLUT2), HNF4A (HNF4alpha), and INS (insulin), significantly altered disease risk, and in three genes, the risk allele, haplotype, or both had a biologically consistent effect on a relevant physiological trait in the QT study. We examined 35 genes predicted to have their major influence on insulin action, and three (9%), namely INSR, PIK3R1, and SOS1, showed significant associations with diabetes. These results confirm the genetic complexity of Type 2 diabetes and provide evidence that common variants in genes influencing pancreatic beta-cell function may make a significant contribution to the inherited component of this disease. This study additionally demonstrates that the systematic examination of panels of biological candidate genes in large, well-characterised populations can be an effective complement to positional cloning approaches. The absence of large single-gene effects and the detection of multiple small effects accentuate the need for the study of larger populations in order to reliably identify the size of effect we now expect for complex diseases.

    PLoS biology 2003;1;1;E20

  • Sequencing and analysis of the genome of the Whipple's disease bacterium Tropheryma whipplei.

    Bentley SD, Maiwald M, Murphy LD, Pallen MJ, Yeats CA, Dover LG, Norbertczak HT, Besra GS, Quail MA, Harris DE, von Herbay A, Goble A, Rutter S, Squares R, Squares S, Barrell BG, Parkhill J and Relman DA

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    Background: Whipple's disease is a rare multisystem chronic infection, involving the intestinal tract as well as various other organs. The causative agent, Tropheryma whipplei, is a Gram-positive bacterium about which little is known. Our aim was to investigate the biology of this organism by generating and analysing the complete DNA sequence of its genome.

    Methods: We isolated and propagated T whipplei strain TW08/27 from the cerebrospinal fluid of a patient diagnosed with Whipple's disease. We generated the complete sequence of the genome by the whole genome shotgun method, and analysed it with a combination of automatic and manual bioinformatic techniques.

    Findings: Sequencing revealed a condensed 925938 bp genome with a lack of key biosynthetic pathways and a reduced capacity for energy metabolism. A family of large surface proteins was identified, some associated with large amounts of non-coding repetitive DNA, and an unexpected degree of sequence variation.

    Interpretation: The genome reduction and lack of metabolic capabilities point to a host-restricted lifestyle for the organism. The sequence variation indicates both known and novel mechanisms for the elaboration and variation of surface structures, and suggests that immune evasion and host interaction play an important part in the lifestyle of this persistent bacterial pathogen.

    Funded by: NIDDK NIH HHS: DK56339

    Lancet (London, England) 2003;361;9358;637-44

  • The complete genome sequence and analysis of Corynebacterium diphtheriae NCTC13129.

    Cerdeño-Tárraga AM, Efstratiou A, Dover LG, Holden MT, Pallen M, Bentley SD, Besra GS, Churcher C, James KD, De Zoysa A, Chillingworth T, Cronin A, Dowd L, Feltwell T, Hamlin N, Holroyd S, Jagels K, Moule S, Quail MA, Rabbinowitsch E, Rutherford KM, Thomson NR, Unwin L, Whitehead S, Barrell BG and Parkhill J

    The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Corynebacterium diphtheriae is a Gram-positive, non-spore forming, non-motile, pleomorphic rod belonging to the genus Corynebacterium and the actinomycete group of organisms. The organism produces a potent bacteriophage-encoded protein exotoxin, diphtheria toxin (DT), which causes the symptoms of diphtheria. This potentially fatal infectious disease is controlled in many developed countries by an effective immunisation programme. However, the disease has made a dramatic return in recent years, in particular within the Eastern European region. The largest, and still on-going, outbreak since the advent of mass immunisation started within Russia and the newly independent states of the former Soviet Union in the 1990s. We have sequenced the genome of a UK clinical isolate (biotype gravis strain NCTC13129), representative of the clone responsible for this outbreak. The genome consists of a single circular chromosome of 2 488 635 bp, with no plasmids. It provides evidence that recent acquisition of pathogenicity factors goes beyond the toxin itself, and includes iron-uptake systems, adhesins and fimbrial proteins. This is in contrast to Corynebacterium's nearest sequenced pathogenic relative, Mycobacterium tuberculosis, where there is little evidence of recent horizontal DNA acquisition. The genome itself shows an unusually extreme large-scale compositional bias, being noticeably higher in G+C near the origin than at the terminus.

    Nucleic acids research 2003;31;22;6516-23

  • Ensembl 2002: accommodating comparative genomics.

    Clamp M, Andrews D, Barker D, Bevan P, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Hubbard T, Kasprzyk A, Keefe D, Lehvaslaiho H, Iyer V, Melsopp C, Mongin E, Pettett R, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I and Birney E

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The Ensembl database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of human, mouse and other genome sequences, available as either an interactive web site or as flat files. Ensembl also integrates manually annotated gene structures from external sources where available. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. These range from sequence analysis to data storage and visualisation, and installations exist around the world at both companies and academic sites. With both human and mouse genome sequences available and more vertebrate sequences to follow, many of the recent developments in Ensembl have focused on developing automatic comparative genome analysis and visualisation.

    Nucleic acids research 2003;31;1;38-42

  • Reevaluating human gene annotation: a second-generation analysis of chromosome 22.

    Collins JE, Goward ME, Cole CG, Smink LJ, Huckle EJ, Knowles S, Bye JM, Beare DM and Dunham I

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    We report a second-generation gene annotation of human chromosome 22. Using expressed sequence databases, comparative sequence analysis, and experimental verification, we have extended genes, fused previously fragmented structures, and identified new genes. The total length in exons of annotation was increased by 74% over our previously published annotation and includes 546 protein-coding genes and 234 pseudogenes. Thirty-two potential protein-coding annotations are partial copies of other genes, and may represent duplications on an evolutionary path to change or loss of function. We also identified 31 non-protein-coding transcripts, including 16 possible antisense RNAs. By extrapolation, we estimate the human genome contains 29,000-36,000 protein-coding genes, 21,300 pseudogenes, and 1500 antisense RNAs. We suggest that our revised annotation criteria provide a paradigm for future annotation of the human genome.

    Genome research 2003;13;1;27-36

  • Pathogenomics.

    Crossman L, Cerdeño-Tárraga A, Bentley S and Parkhill J

    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    The genomes described this month reflect the overall historical bias of microbial genomics towards pathogenic bacteria. Although the balance is now being redressed to some extent, especially through the study of extremophiles, it is still the case that the opportunities provided by genomic studies are primarily taken up by those who study bacterial pathogenicity. This part of the field is, however, being broadened by including the study of pathogens of animals, insects and plants alongside those that afflict humans.

    Nature reviews. Microbiology 2003;1;3;176-7

  • A DNA damage checkpoint response in telomere-initiated senescence.

    d'Adda di Fagagna F, Reaper PM, Clay-Farrace L, Fiegler H, Carr P, Von Zglinicki T, Saretzki G, Carter NP and Jackson SP

    [1] The Wellcome Trust/Cancer Research UK Institute of Cancer and Developmental Biology, University of Cambridge, Cambridge CB2 1QR, UK [2] Present address: IFOM-FIRC Institute of Molecular Oncology, via Adamello 16, 20139 Milan, Italy.

    Most human somatic cells can undergo only a limited number of population doublings in vitro. This exhaustion of proliferative potential, called senescence, can be triggered when telomeres (the ends of linear chromosomes) cannot fulfil their normal protective functions. Here we show that senescent human fibroblasts display molecular markers characteristic of cells bearing DNA double-strand breaks. These markers include nuclear foci of phosphorylated histone H2AX and their co-localization with DNA repair and DNA damage checkpoint factors such as 53BP1, MDC1 and NBS1. We also show that senescent cells contain activated forms of the DNA damage checkpoint kinases CHK1 and CHK2. Furthermore, by chromatin immunoprecipitation and whole-genome scanning approaches, we show that the chromosome ends of senescent cells directly contribute to the DNA damage response, and that uncapped telomeres directly associate with many, but not all, DNA damage response proteins. Finally, we show that inactivation of DNA damage checkpoint kinases in senescent cells can restore cell-cycle progression into S phase. Thus, we propose that telomere-initiated senescence reflects a DNA damage checkpoint response that is activated with a direct contribution from dysfunctional telomeres.

    Nature 2003;426;6963;194-8

  • ddbRNA: detection of conserved secondary structures in multiple alignments.

    di Bernardo D, Down T and Hubbard T

    Telethon Institute of Genetics and Medicine, Via P Castellino 111, 80133 Naples, Italy.

    Motivation: Structured non-coding RNAs (ncRNAs) have a very important functional role in the cell. No distinctive general features common to all ncRNA have yet been discovered. This makes it difficult to design computational tools able to detect novel ncRNAs in the genomic sequence.

    Results: We devised an algorithm able to detect conserved secondary structures in both pairwise and multiple DNA sequence alignments with computational time proportional to the square of the sequence length. We implemented the algorithm for the case of pairwise and three-way alignments and tested it on ncRNAs obtained from public databases. On the test sets, the pairwise algorithm has a specificity greater than 97% with a sensitivity varying from 22.26% for Blast alignments to 56.35% for structural alignments. The three-way algorithm behaves similarly. Our algorithm is able to efficiently detect a conserved secondary structure in multiple alignments.

    Funded by: Telethon: TGM03P17, TGM06S01

    Bioinformatics (Oxford, England) 2003;19;13;1606-11

  • DNA microarrays for comparative genomic hybridization based on DOP-PCR amplification of BAC and PAC clones.

    Fiegler H, Carr P, Douglas EJ, Burford DC, Hunt S, Scott CE, Smith J, Vetrie D, Gorman P, Tomlinson IP and Carter NP

    Wellcome Trust Sanger Institute/Cancer Research UK Genomic Microarray Group, Hinxton, Cambridge, CB10 1SA, United Kingdom.

    We have designed DOP-PCR primers specifically for the amplification of large insert clones for use in the construction of DNA microarrays. A bioinformatic approach was used to construct primers that were efficient in the general amplification of human DNA but were poor at amplifying E. coli DNA, a common contaminant of DNA preparations from large insert clones. We chose the three most selective primers for use in printing DNA microarrays. DNA combined from the amplification of large insert clones by use of these three primers and spotted onto glass slides showed more than a sixfold increase in the human to E. coli hybridization ratio when compared to the standard DOP-PCR primer, 6MW. The microarrays reproducibly delineated previously characterized gains and deletions in a cancer cell line and identified a small gain not detected by use of conventional CGH. We also describe a method for the bulk testing of the hybridization characteristics of chromosome-specific clones spotted on microarrays by use of DNA amplified from flow-sorted chromosomes. Finally, we describe a set of clones selected from the publicly available Golden Path of the human genome at 1-Mb intervals and a view in the Ensembl genome browser from which data required for the use of these clones in array CGH and other experiments can be downloaded across the Internet.

    Genes, chromosomes & cancer 2003;36;4;361-74

  • The complete genome sequence of Mycobacterium bovis.

    Garnier T, Eiglmeier K, Camus JC, Medina N, Mansoor H, Pryor M, Duthoy S, Grondin S, Lacroix C, Monsempe C, Simon S, Harris B, Atkin R, Doggett J, Mayes R, Keating L, Wheeler PR, Parkhill J, Barrell BG, Cole ST, Gordon SV and Hewinson RG

    Unité de Génétique Moléculaire Bactérienne and PT4 Annotation, Génopole, Institut Pasteur, 28 Rue du Docteur Roux, 75724 Paris Cedex 15, France.

    Mycobacterium bovis is the causative agent of tuberculosis in a range of animal species and man, with worldwide annual losses to agriculture of $3 billion. The human burden of tuberculosis caused by the bovine tubercle bacillus is still largely unknown. M. bovis was also the progenitor for the M. bovis bacillus Calmette-Guérin vaccine strain, the most widely used human vaccine. Here we describe the 4,345,492-bp genome sequence of M. bovis AF2122/97 and its comparison with the genomes of Mycobacterium tuberculosis and Mycobacterium leprae. Strikingly, the genome sequence of M. bovis is >99.95% identical to that of M. tuberculosis, but deletion of genetic information has led to a reduced genome size. Comparison with M. leprae reveals a number of common gene losses, suggesting the removal of functional redundancy. Cell wall components and secreted proteins show the greatest variation, indicating their potential role in host-bacillus interactions or immune evasion. Furthermore, there are no genes unique to M. bovis, implying that differential gene expression may be the key to the host tropisms of human and bovine bacilli. The genome sequence therefore offers major insight on the evolution, host preference, and pathobiology of M. bovis.

    Proceedings of the National Academy of Sciences of the United States of America 2003;100;13;7877-82

  • Rfam: an RNA family database.

    Griffiths-Jones S, Bateman A, Marshall M, Khanna A and Eddy SR

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Rfam is a collection of multiple sequence alignments and covariance models representing non-coding RNA families. Rfam is available on the web in the UK and in the US. These websites allow the user to search a query sequence against a library of covariance models, and view multiple sequence alignments and family annotation. The database can also be downloaded in flatfile form and searched locally using the INFERNAL package. The first release of Rfam (1.0) contains 25 families, which annotate over 50 000 non-coding RNA genes in the taxonomic divisions of the EMBL nucleotide database.

    Nucleic acids research 2003;31;1;439-41

  • The International HapMap Project.

    International HapMap Consortium

    The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.

    Nature 2003;426;6968;789-96

  • Kaposi's sarcoma-associated herpesvirus-infected primary effusion lymphoma has a plasma cell gene expression profile.

    Jenner RG, Maillard K, Cattini N, Weiss RA, Boshoff C, Wooster R and Kellam P

    Wohl Virion Centre, Department of Immunology and Molecular Pathology, Windeyer Institute, University College London, London W1T 4JF, United Kingdom.

    Kaposi's sarcoma-associated herpesvirus is associated with three human tumors: Kaposi's sarcoma, and the B cell lymphomas, plasmablastic lymphoma associated with multicentric Castleman's disease, and primary effusion lymphoma (PEL). Epstein-Barr virus, the closest human relative of Kaposi's sarcoma-associated herpesvirus, mimics host B cell signaling pathways to direct B cell development toward a memory B cell phenotype. Epstein-Barr virus-associated B cell tumors are presumed to arise as a consequence of this virus-mediated B cell activation. The stage of B cell development represented by PEL, how this stage relates to tumor pathology, and how this information may be used to treat the disease are largely unknown. In this study we used gene expression profiling to order a range of B cell tumors by stage of development. PEL gene expression closely resembles that of malignant plasma cells, including the low expression of mature B cell genes. The unfolded protein response is partially activated in PEL, but is fully activated in plasma cell tumors, linking endoplasmic reticulum stress to plasma cell development through XBP-1. PEL cells can be defined by the overexpression of genes involved in inflammation, cell adhesion, and invasion, which may be responsible for their presentation in body cavities. Similar to malignant plasma cells, all PEL samples tested express the vitamin D receptor and are sensitive to the vitamin D analogue drug EB 1089 (Seocalcitol).

    Proceedings of the National Academy of Sciences of the United States of America 2003;100;18;10399-404

  • Systematic functional analysis of the Caenorhabditis elegans genome using RNAi.

    Kamath RS, Fraser AG, Dong Y, Poulin G, Durbin R, Gotta M, Kanapin A, Le Bot N, Moreno S, Sohrmann M, Welchman DP, Zipperlen P and Ahringer J

    Wellcome Trust/Cancer Research UK Institute and Department of Genetics, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK.

    A principal challenge currently facing biologists is how to connect the complete DNA sequence of an organism to its development and behaviour. Large-scale targeted-deletions have been successful in defining gene functions in the single-celled yeast Saccharomyces cerevisiae, but comparable analyses have yet to be performed in an animal. Here we describe the use of RNA interference to inhibit the function of approximately 86% of the 19,427 predicted genes of C. elegans. We identified mutant phenotypes for 1,722 genes, about two-thirds of which were not previously associated with a phenotype. We find that genes of similar functions are clustered in distinct, multi-megabase regions of individual chromosomes; genes in these regions tend to share transcriptional profiles. Our resulting data set and reusable RNAi library of 16,757 bacterial clones will facilitate systematic analyses of the connections among gene sequence, chromosomal location and gene function in C. elegans.

    Funded by: Wellcome Trust: 054523

    Nature 2003;421;6920;231-7
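
    The figures quoted in the abstract above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that each clone in the RNAi library targets exactly one predicted gene, which the abstract does not state, and the variable names are invented for this example.

```python
# Figures taken from the abstract above.
predicted_genes = 19_427       # predicted C. elegans genes
rnai_clones = 16_757           # bacterial clones in the RNAi library
genes_with_phenotype = 1_722   # genes with a mutant phenotype

# ASSUMPTION (not stated in the abstract): one clone targets one gene,
# so the clone count approximates the number of genes tested.
coverage = rnai_clones / predicted_genes
phenotype_rate = genes_with_phenotype / rnai_clones

print(f"Genes targeted: {coverage:.1%} of predicted genes")
print(f"Targeted genes with a phenotype: {phenotype_rate:.1%}")
```

    Under that assumption the coverage rounds to the "approximately 86%" reported in the abstract, and roughly one in ten targeted genes showed a phenotype.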

  • CASP5 target classification.

    Kinch LN, Qi Y, Hubbard TJ and Grishin NV

    Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas 75390-9050, USA.

    This report summarizes the Critical Assessment of Protein Structure Prediction (CASP5) target proteins, which included 67 experimental models submitted from various structural genomics efforts and independent research groups. Throughout this special issue, CASP5 targets are referred to with the identification numbers T0129-T0195. Several of these targets were excluded from the assessment for various reasons: T0164 and T0166 were cancelled by the organizers; T0131, T0144, T0158, T0163, T0171, T0175, and T0180 were not available in time; T0145 was "natively unfolded"; the T0139 structure became available before the target expired; and T0194 was solved for a different sequence than the one submitted. Table I outlines the sequence and structural information available for CASP5 proteins in the context of existing folds and evolutionary relationships. This information provided the basis for a domain-based classification of the target structures into three assessment categories: comparative modeling (CM), fold recognition (FR), and new fold (NF). The FR category was further subdivided into homologues [FR(H)] and analogs [FR(A)] based on evolutionary considerations, and the overlap between assessment categories was classified as CM/FR(H) and FR(A)/NF. CASP5 domains are illustrated in Figure 1. Examples of nontrivial links between CASP5 target domains and existing structures that support our classifications are provided.

    Proteins 2003;53 Suppl 6;340-51

  • Adult midgut expressed sequence tags from the tsetse fly Glossina morsitans morsitans and expression analysis of putative immune response genes.

    Lehane MJ, Aksoy S, Gibson W, Kerhornou A, Berriman M, Hamilton J, Soares MB, Bonaldo MF, Lehane S and Hall N

    School of Biological Sciences, University of Wales, Bangor, LL57 2UW, UK.

    Background: Tsetse flies transmit African trypanosomiasis leading to half a million cases annually. Trypanosomiasis in animals (nagana) remains a massive brake on African agricultural development. While trypanosome biology is widely studied, knowledge of tsetse flies is very limited, particularly at the molecular level. This is a serious impediment to investigations of tsetse-trypanosome interactions. We have undertaken an expressed sequence tag (EST) project on the adult tsetse midgut, the major organ system for establishment and early development of trypanosomes.

    Results: A total of 21,427 ESTs were produced from the midgut of adult Glossina morsitans morsitans and grouped into 8,876 clusters or singletons potentially representing unique genes. Putative functions were ascribed to 4,035 of these by homology. Of these, a remarkable 3,884 had their most significant matches in the Drosophila protein database. We selected 68 genes with putative immune-related functions, macroarrayed them and determined their expression profiles following bacterial or trypanosome challenge. In both infections many genes are downregulated, suggesting a malaise response in the midgut. Trypanosome and bacterial challenge result in upregulation of different genes, suggesting that different recognition pathways are involved in the two responses. The most notable block of genes upregulated in response to trypanosome challenge are a series of Toll and Imd genes and a series of genes involved in oxidative stress responses.

    Conclusions: The project increases the number of known Glossina genes by two orders of magnitude. Identification of putative immunity genes and their preliminary characterization provides a resource for the experimental dissection of tsetse-trypanosome interactions.

    Genome biology 2003;4;10;R63

  • The phusion assembler.

    Mullikin JC and Ning Z

    Informatics Department, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    The Phusion assembler has assembled the mouse genome from the whole-genome shotgun (WGS) dataset collected by the Mouse Genome Sequencing Consortium, at ~7.5x sequence coverage, producing a high-quality draft assembly 2.6 gigabases in size, with 90% of its bases in 479 scaffolds. For the mouse genome, which is a large and repeat-rich genome, the input dataset was designed to include a high proportion of paired end sequences of various size selected inserts, from 2-200 kbp lengths, into various host vector templates. Phusion uses sequence data, called reads, and information about reads that share common templates, called read pairs, to drive the assembly of this large genome to highly accurate results. The preassembly stage, which clusters the reads into sensible groups, is a key element of the entire assembler, because it permits a simple approach to parallelization of the assembly stage, as each cluster can be treated independently of the others. In addition to the application of Phusion to the mouse genome, we will also present results from the WGS assembly of Caenorhabditis briggsae sequenced to about 11x coverage. The C. briggsae assembly was accessioned through EMBL, using the series CAAC01000001-CAAC01000578; however, the Phusion mouse assembly described here was not accessioned. The mouse data was generated by the Mouse Genome Sequencing Consortium. The C. briggsae sequence was generated at The Wellcome Trust Sanger Institute and the Genome Sequencing Center, Washington University School of Medicine.

    Genome research 2003;13;1;81-90
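
    The abstract above describes a preassembly stage that clusters reads so that each cluster can be assembled independently, but it does not say how the clusters are formed. The sketch below is one hypothetical way to illustrate the idea, not the Phusion implementation: it groups reads that share at least one k-mer using a union-find structure. The function name `cluster_reads`, the parameter `k`, and the shared-k-mer criterion are all assumptions made for this example, and reverse complements and read-pair information are ignored.

```python
from collections import defaultdict


def cluster_reads(reads, k=6):
    """Group reads sharing at least one k-mer (illustrative sketch only)."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Remember the first read in which each k-mer was seen and merge
    # any later read containing the same k-mer into that read's cluster.
    first_seen = {}
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


# Hypothetical toy reads: two overlapping pairs from unrelated regions.
example_reads = ["ACGTACGTAA", "CGTACGTAAT", "TTTTGGGGCC", "TTGGGGCCAA"]
example_clusters = cluster_reads(example_reads, k=6)
print(example_clusters)
```

    Because the resulting clusters share no k-mers with one another, each could in principle be handed to a separate worker, which is the parallelization benefit the abstract attributes to preassembly.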

  • The DNA sequence and analysis of human chromosome 6.

    Mungall AJ, Palmer SA, Sims SK, Edwards CA, Ashurst JL, Wilming L, Jones MC, Horton R, Hunt SE, Scott CE, Gilbert JG, Clamp ME, Bethel G, Milne S, Ainscough R, Almeida JP, Ambrose KD, Andrews TD, Ashwell RI, Babbage AK, Bagguley CL, Bailey J, Banerjee R, Barker DJ, Barlow KF, Bates K, Beare DM, Beasley H, Beasley O, Bird CP, Blakey S, Bray-Allen S, Brook J, Brown AJ, Brown JY, Burford DC, Burrill W, Burton J, Carder C, Carter NP, Chapman JC, Clark SY, Clark G, Clee CM, Clegg S, Cobley V, Collier RE, Collins JE, Colman LK, Corby NR, Coville GJ, Culley KM, Dhami P, Davies J, Dunn M, Earthrowl ME, Ellington AE, Evans KA, Faulkner L, Francis MD, Frankish A, Frankland J, French L, Garner P, Garnett J, Ghori MJ, Gilby LM, Gillson CJ, Glithero RJ, Grafham DV, Grant M, Gribble S, Griffiths C, Griffiths M, Hall R, Halls KS, Hammond S, Harley JL, Hart EA, Heath PD, Heathcott R, Holmes SJ, Howden PJ, Howe KL, Howell GR, Huckle E, Humphray SJ, Humphries MD, Hunt AR, Johnson CM, Joy AA, Kay M, Keenan SJ, Kimberley AM, King A, Laird GK, Langford C, Lawlor S, Leongamornlert DA, Leversha M, Lloyd CR, Lloyd DM, Loveland JE, Lovell J, Martin S, Mashreghi-Mohammadi M, Maslen GL, Matthews L, McCann OT, McLaren SJ, McLay K, McMurray A, Moore MJ, Mullikin JC, Niblett D, Nickerson T, Novik KL, Oliver K, Overton-Larty EK, Parker A, Patel R, Pearce AV, Peck AI, Phillimore B, Phillips S, Plumb RW, Porter KM, Ramsey Y, Ranby SA, Rice CM, Ross MT, Searle SM, Sehra HK, Sheridan E, Skuce CD, Smith S, Smith M, Spraggon L, Squares SL, Steward CA, Sycamore N, Tamlyn-Hall G, Tester J, Theaker AJ, Thomas DW, Thorpe A, Tracey A, Tromans A, Tubby B, Wall M, Wallis JM, West AP, White SS, Whitehead SL, Whittaker H, Wild A, Willey DJ, Wilmer TE, Wood JM, Wray PW, Wyatt JC, Young L, Younger RM, Bentley DR, Coulson A, Durbin R, Hubbard T, Sulston JE, Dunham I, Rogers J and Beck S

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Chromosome 6 is a metacentric chromosome that constitutes about 6% of the human genome. The finished sequence comprises 166,880,988 base pairs, representing the largest chromosome sequenced so far. The entire sequence has been subjected to high-quality manual annotation, resulting in the evidence-supported identification of 1,557 genes and 633 pseudogenes. Here we report that at least 96% of the protein-coding genes have been identified, as assessed by multi-species comparative sequence analysis, and provide evidence for the presence of further, otherwise unsupported exons/genes. Among these are genes directly implicated in cancer, schizophrenia, autoimmunity and many other diseases. Chromosome 6 harbours the largest transfer RNA gene cluster in the genome; we show that this cluster co-localizes with a region of high transcriptional activity. Within the essential immune loci of the major histocompatibility complex, we find HLA-B to be the most polymorphic gene on chromosome 6 and in the human genome.

    Nature 2003;425;6960;805-11

  • Identification of putative noncoding RNAs among the RIKEN mouse full-length cDNA collection.

    Numata K, Kanai A, Saito R, Kondo S, Adachi J, Wilming LG, Hume DA, Hayashizaki Y, Tomita M, RIKEN GER Group and GSL Members

    Graduate School of Media and Governance, Bioinformatics Program, Keio University, Fujisawa, Kanagawa 252-8520, Japan.

    With the sequencing and annotation of genomes and transcriptomes of several eukaryotes, the importance of noncoding RNA (ncRNA), that is, RNA molecules that are not translated to protein products, has become more evident. A subclass of ncRNA transcripts are encoded by highly regulated, multi-exon, transcriptional units, are processed like typical protein-coding mRNAs and are increasingly implicated in regulation of many cellular functions in eukaryotes. This study describes the identification of candidate functional ncRNAs from among the RIKEN mouse full-length cDNA collection, which contains 60,770 sequences, by using a systematic computational filtering approach. We initially searched for previously reported ncRNAs and found nine murine ncRNAs and homologs of several previously described nonmouse ncRNAs. Through our computational approach to filter artifact-free clones that lack protein coding potential, we extracted 4280 transcripts as the largest-candidate set. Many clones in the set had EST hits, potential CpG islands surrounding the transcription start sites, and homologies with the human genome. This implies that many candidates are indeed transcribed in a regulated manner. Our results demonstrate that ncRNAs are a major functional subclass of processed transcripts in mammals.

    Genome research 2003;13;6B;1301-6

  • Genomics: Relative pathogenic values.

    Parkhill J and Berry C

    Nature 2003;423;6935;23-5

  • Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica.

    Parkhill J, Sebaihia M, Preston A, Murphy LD, Thomson N, Harris DE, Holden MT, Churcher CM, Bentley SD, Mungall KL, Cerdeño-Tárraga AM, Temple L, James K, Harris B, Quail MA, Achtman M, Atkin R, Baker S, Basham D, Bason N, Cherevach I, Chillingworth T, Collins M, Cronin A, Davis P, Doggett J, Feltwell T, Goble A, Hamlin N, Hauser H, Holroyd S, Jagels K, Leather S, Moule S, Norberczak H, O'Neil S, Ormond D, Price C, Rabbinowitsch E, Rutter S, Sanders M, Saunders D, Seeger K, Sharp S, Simmonds M, Skelton J, Squares R, Squares S, Stevens K, Unwin L, Whitehead S, Barrell BG and Maskell DJ

    The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica are closely related Gram-negative beta-proteobacteria that colonize the respiratory tracts of mammals. B. pertussis is a strict human pathogen of recent evolutionary origin and is the primary etiologic agent of whooping cough. B. parapertussis can also cause whooping cough, and B. bronchiseptica causes chronic respiratory infections in a wide range of animals. We sequenced the genomes of B. bronchiseptica RB50 (5,338,400 bp; 5,007 predicted genes), B. parapertussis 12822 (4,773,551 bp; 4,404 genes) and B. pertussis Tohama I (4,086,186 bp; 3,816 genes). Our analysis indicates that B. parapertussis and B. pertussis are independent derivatives of B. bronchiseptica-like ancestors. During the evolution of these two host-restricted species there was large-scale gene loss and inactivation; host adaptation seems to be a consequence of loss, not gain, of function, and differences in virulence may be related to loss of regulatory or control functions.

    Nature genetics 2003;35;1;32-40

  • Identification of a structurally distinct CD101 molecule encoded in the 950-kb Idd10 region of NOD mice.

    Penha-Gonçalves C, Moule C, Smink LJ, Howson J, Gregory S, Rogers J, Lyons PA, Suttie JJ, Lord CJ, Peterson LB, Todd JA and Wicker LS

    Juvenile Diabetes Research Foundation/Wellcome Trust (JDRF/WT) Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, University of Cambridge, Addenbrooke's Hospital, Cambridge CB2 2XY, U.K.

    Genes affecting autoimmune type 1 diabetes susceptibility in the nonobese diabetic (NOD) mouse (Idd loci) have been mapped using a congenic strain breeding strategy. In the present study, we used a combination of BAC clone contig construction, polymorphism analysis of DNA from congenic strains, and sequence mining of the human orthologous region to generate an integrated map of the Idd10 region on mouse chromosome 3. We found seven genes and one pseudogene in the 950-kb Idd10 region. Although all seven genes in the interval are Idd10 candidates, we suggest the gene encoding the EWI immunoglobulin subfamily member EWI-101 (Cd101) as the most likely Idd10 candidate because of the previously reported immune-associated properties of the human CD101 molecule. Additional support for the candidacy of Cd101 is the presence of 17 exonic single-nucleotide polymorphisms that differ between the NOD and B6 sequences, 10 causing amino acid substitutions in the predicted CD101 protein. Four of these 10 substitutions are nonconservative, 2 of which could potentially alter N-linked glycosylation. Considering our results together with those previous reports that antibodies recognizing human CD101 modulate human T-cell and dendritic cell function, there is now justification to test whether the alteration of CD101 function affects autoimmune islet destruction.

    Diabetes 2003;52;6;1551-6

  • Automatic inference of protein quaternary structure from crystals.

    Ponstingl H, Kabir T and Thornton JM


  • The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.

    Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, Coulson A, D'Eustachio P, Fitch DH, Fulton LA, Fulton RE, Griffiths-Jones S, Harris TW, Hillier LW, Kamath R, Kuwabara PE, Mardis ER, Marra MA, Miner TL, Minx P, Mullikin JC, Plumb RW, Rogers J, Schein JE, Sohrmann M, Spieth J, Stajich JE, Wei C, Willey D, Wilson RK, Durbin R and Waterston RH

    Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.

    The soil nematodes Caenorhabditis briggsae and Caenorhabditis elegans diverged from a common ancestor roughly 100 million years ago and yet are almost indistinguishable by eye. They have the same chromosome number and genome sizes, and they occupy the same ecological niche. To explore the basis for this striking conservation of structure and function, we have sequenced the C. briggsae genome to a high-quality draft stage and compared it to the finished C. elegans sequence. We predict approximately 19,500 protein-coding genes in the C. briggsae genome, roughly the same as in C. elegans. Of these, 12,200 have clear C. elegans orthologs, a further 6,500 have one or more clearly detectable C. elegans homologs, and approximately 800 C. briggsae genes have no detectable matches in C. elegans. Almost all of the noncoding RNAs (ncRNAs) known are shared between the two species. The two genomes exhibit extensive colinearity, and the rate of divergence appears to be higher in the chromosomal arms than in the centers. Operons, a distinctive feature of C. elegans, are highly conserved in C. briggsae, with the arrangement of genes being preserved in 96% of cases. The difference in size between the C. briggsae (estimated at approximately 104 Mbp) and C. elegans (100.3 Mbp) genomes is almost entirely due to repetitive sequence, which accounts for 22.4% of the C. briggsae genome in contrast to 16.5% of the C. elegans genome. Few, if any, repeat families are shared, suggesting that most were acquired after the two species diverged or are undergoing rapid evolution. Coclustering the C. elegans and C. briggsae proteins reveals 2,169 protein families of two or more members. Most of these are shared between the two species, but some appear to be expanding or contracting, and there seem to be as many as several hundred novel C. briggsae gene families. The C. briggsae draft sequence will greatly improve the annotation of the C. elegans genome. Based on similarity to C. briggsae, we found strong evidence for 1,300 new C. elegans genes. In addition, comparisons of the two genomes will help to understand the evolutionary forces that mold nematode genomes.

    Funded by: NHGRI NIH HHS: 5P01 HG00956, 5U01 HG02042, P41 HG02223; NIGMS NIH HHS: R01 GM42432, T32 GM07754-22

    PLoS biology 2003;1;2;E45

  • Sequence-based cancer genomics: progress, lessons and opportunities.

    Strausberg RL, Simpson AJ and Wooster R

    National Cancer Institute, 31 Center Drive, Room 10A07, Bethesda, Maryland 20892, USA.

    Technologies that provide a genome-wide view offer an unprecedented opportunity to scrutinize the molecular biology of the cancer cell. The information that is derived from these technologies is well suited to the development of public databases of alterations in the cancer genome and its expression. Here, we describe the synergistic efforts of research programmes in Brazil, the United Kingdom and the United States towards building integrated databases that are widely accessible to the research community, to enable basic and applied applications in cancer research.

    Nature reviews. Genetics 2003;4;6;409-18

  • Fitting the niche by genomic adaptation.

    Thomson N, Bentley S, Holden M and Parkhill J

    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Studying microbial genomics has shown that the genomes of bacteria are extremely dynamic in evolutionary terms. Many research groups have linked the adaptation of an organism to a niche to large changes in genome size and content. A number of recent papers have underlined the degree to which the genomes of different organisms are a reflection of the opportunities and constraints imposed by their chosen niche.

    Nature reviews. Microbiology 2003;1;2;92-3

  • The value of comparison.

    Thomson N, Sebaihia M, Cerdeño-Tárraga A, Bentley S, Crossman L and Parkhill J

    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    With the number of published microbial genomes now in excess of 100, any new genome that is sequenced is likely to have a close relative available for comparison. Indeed, it is increasingly difficult to perform any genomic analysis that is not comparative. This should, however, not be seen as a drawback; it is often the case that a large amount of information can be drawn from these comparisons, especially between closely related organisms. Several genome sequences published recently indicate the value of comparisons at the genomic level.

    Nature reviews. Microbiology 2003;1;1;11-2

  • Association of the T-cell regulatory gene CTLA4 with susceptibility to autoimmune disease.

    Ueda H, Howson JM, Esposito L, Heward J, Snook H, Chamberlain G, Rainbow DB, Hunter KM, Smith AN, Di Genova G, Herr MH, Dahlman I, Payne F, Smyth D, Lowe C, Twells RC, Howlett S, Healy B, Nutland S, Rance HE, Everett V, Smink LJ, Lam AC, Cordell HJ, Walker NM, Bordin C, Hulme J, Motzo C, Cucca F, Hess JF, Metzker ML, Rogers J, Gregory S, Allahabadia A, Nithiyananthan R, Tuomilehto-Wolf E, Tuomilehto J, Bingley P, Gillespie KM, Undlien DE, Rønningen KS, Guja C, Ionescu-Tîrgovişte C, Savage DA, Maxwell AP, Carson DJ, Patterson CC, Franklyn JA, Clayton DG, Peterson LB, Wicker LS, Todd JA and Gough SC

    Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, University of Cambridge, Wellcome Trust/MRC Building, Cambridge, CB2 2XY, UK.

    Genes and mechanisms involved in common complex diseases, such as the autoimmune disorders that affect approximately 5% of the population, remain obscure. Here we identify polymorphisms of the cytotoxic T lymphocyte antigen 4 gene (CTLA4)--which encodes a vital negative regulatory molecule of the immune system--as candidates for primary determinants of risk of the common autoimmune disorders Graves' disease, autoimmune hypothyroidism and type 1 diabetes. In humans, disease susceptibility was mapped to a non-coding 6.1 kb 3' region of CTLA4, the common allelic variation of which was correlated with lower messenger RNA levels of the soluble alternative splice form of CTLA4. In the mouse model of type 1 diabetes, susceptibility was also associated with variation in CTLA-4 gene splicing with reduced production of a splice form encoding a molecule lacking the CD80/CD86 ligand-binding domain. Genetic mapping of variants conferring a small disease risk can identify pathways in complex disorders, as exemplified by our discovery of inherited, quantitative alterations of CTLA4 contributing to autoimmune tissue destruction.

    Nature 2003;423;6939;506-11

  • QUAD system offers fair shares to all authors.

    Verhagen JV, Wallace KJ, Collins SC and Scott TR

    Nature 2003;426;6967;602

  • Positional cloning of a quantitative trait locus on chromosome 13q14 that influences immunoglobulin E levels and asthma.

    Zhang Y, Leaves NI, Anderson GG, Ponting CP, Broxholme J, Holt R, Edser P, Bhattacharyya S, Dunham A, Adcock IM, Pulleyn L, Barnes PJ, Harper JI, Abecasis G, Cardon L, White M, Burton J, Matthews L, Mott R, Ross M, Cox R, Moffatt MF and Cookson WO

    Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK.

    Atopic or immunoglobulin E (IgE)-mediated diseases include the common disorders of asthma, atopic dermatitis and allergic rhinitis. Chromosome 13q14 shows consistent linkage to atopy and the total serum IgE concentration. We previously identified association between total serum IgE levels and a novel 13q14 microsatellite (USAT24G1; ref. 7) and have now localized the underlying quantitative-trait locus (QTL) in a comprehensive single-nucleotide polymorphism (SNP) map. We found replicated association to IgE levels that was attributed to several alleles in a single gene, PHF11. We also found association with these variants to severe clinical asthma. The gene product (PHF11) contains two PHD zinc fingers and probably regulates transcription. Distinctive splice variants were expressed in immune tissues and cells.

    Nature genetics 2003;34;2;181-6