Sanger Institute - Publications 1998

Number of papers published in 1998: 14

  • Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships.

    Brenner SE, Chothia C and Hubbard TJ

    MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom. brenner@hyper.stanford.edu

    Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.

    Funded by: Wellcome Trust

    Proceedings of the National Academy of Sciences of the United States of America 1998;95;11;6073-8
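    The abstract above contrasts statistical scores with percentage identity as a match criterion. As a minimal sketch of what "percentage identity" means for one pairwise alignment, the hypothetical helper below counts identical residues over ungapped columns. The function name, the example sequences, and the choice of denominator are assumptions made for illustration; the paper may use a different convention.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are excluded from the denominator
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Made-up aligned fragments, not taken from any database.
example = percent_identity("MKT-AYIAKQR", "MKTEAYLAK-R")
```

    Other conventions divide by the full alignment length or by the shorter sequence length, so values from different tools are not always directly comparable.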

  • Genome sequence of the nematode C. elegans: a platform for investigating biology.

    C. elegans Sequencing Consortium

    The 97-megabase genomic sequence of the nematode Caenorhabditis elegans reveals over 19,000 genes. More than 40 percent of the predicted protein products find significant matches in other organisms. There is a variety of repeated sequences, both local and dispersed. The distinctive distribution of some repeats and highly conserved genes provides evidence for a regional organization of the chromosomes.

    Science (New York, N.Y.) 1998;282;5396;2012-8

  • Host response to EBV infection in X-linked lymphoproliferative disease results from mutations in an SH2-domain encoding gene.

    Coffey AJ, Brooksbank RA, Brandau O, Oohashi T, Howell GR, Bye JM, Cahn AP, Durham J, Heath P, Wray P, Pavitt R, Wilkinson J, Leversha M, Huckle E, Shaw-Smith CJ, Dunham A, Rhodes S, Schuster V, Porta G, Yin L, Serafini P, Sylla B, Zollo M, Franco B, Bolino A, Seri M, Lanyi A, Davis JR, Webster D, Harris A, Lenoir G, de St Basile G, Jones A, Behloradsky BH, Achatz H, Murken J, Fassler R, Sumegi J, Romeo G, Vaudin M, Ross MT, Meindl A and Bentley DR

    The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK. ajc@sanger.ac.uk

    X-linked lymphoproliferative syndrome (XLP or Duncan disease) is characterized by extreme sensitivity to Epstein-Barr virus (EBV), resulting in a complex phenotype manifested by severe or fatal infectious mononucleosis, acquired hypogammaglobulinemia and malignant lymphoma. We have identified a gene, SH2D1A, that is mutated in XLP patients and encodes a novel protein composed of a single SH2 domain. SH2D1A is expressed in many tissues involved in the immune system. The identification of SH2D1A will allow the determination of its mechanism of action as a possible regulator of the EBV-induced immune response.

    Funded by: NIAID NIH HHS: 1 R01 AI33532-OIA3; Telethon: E.0440, TGM06S01; Wellcome Trust

    Nature genetics 1998;20;2;129-35

  • Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence.

    Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV, Eiglmeier K, Gas S, Barry CE, Tekaia F, Badcock K, Basham D, Brown D, Chillingworth T, Connor R, Davies R, Devlin K, Feltwell T, Gentles S, Hamlin N, Holroyd S, Hornsby T, Jagels K, Krogh A, McLean J, Moule S, Murphy L, Oliver K, Osborne J, Quail MA, Rajandream MA, Rogers J, Rutter S, Seeger K, Skelton J, Squares R, Squares S, Sulston JE, Taylor K, Whitehead S and Barrell BG

    Sanger Centre, Wellcome Trust Genome Campus, Hinxton, UK. stcole@pasteur.fr

    Countless millions of people have died from tuberculosis, a chronic infectious disease caused by the tubercle bacillus. The complete genome sequence of the best-characterized strain of Mycobacterium tuberculosis, H37Rv, has been determined and analysed in order to improve our understanding of the biology of this slow-growing pathogen and to help the conception of new prophylactic and therapeutic interventions. The genome comprises 4,411,529 base pairs, contains around 4,000 genes, and has a very high guanine + cytosine content that is reflected in the biased amino-acid content of the proteins. M. tuberculosis differs radically from other bacteria in that a very large portion of its coding capacity is devoted to the production of enzymes involved in lipogenesis and lipolysis, and to two new families of glycine-rich proteins with a repetitive structure that may represent a source of antigenic variation.

    Funded by: NIAID NIH HHS: Z01 AI000783-11; Wellcome Trust

    Nature 1998;393;6685;537-44
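    The guanine + cytosine content mentioned in the abstract is a simple base-composition fraction. The sketch below shows one way to compute it for a DNA string; the function name and the decision to ignore ambiguous bases such as N are assumptions for illustration, not the method used in the paper.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a DNA string."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return gc / counted


# Made-up six-base fragment: four of the six bases are G or C.
fraction = gc_content("GGCCAT")
```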

  • A physical map of 30,000 human genes.

    Deloukas P, Schuler GD, Gyapay G, Beasley EM, Soderlund C, Rodriguez-Tomé P, Hui L, Matise TC, McKusick KB, Beckmann JS, Bentolila S, Bihoreau M, Birren BB, Browne J, Butler A, Castle AB, Chiannilkulchai N, Clee C, Day PJ, Dehejia A, Dibling T, Drouot N, Duprat S, Fizames C, Fox S, Gelling S, Green L, Harrison P, Hocking R, Holloway E, Hunt S, Keil S, Lijnzaad P, Louis-Dit-Sully C, Ma J, Mendis A, Miller J, Morissette J, Muselet D, Nusbaum HC, Peck A, Rozen S, Simon D, Slonim DK, Staples R, Stein LD, Stewart EA, Suchard MA, Thangarajah T, Vega-Czarny N, Webber C, Wu X, Hudson J, Auffray C, Nomura N, Sikela JM, Polymeropoulos MH, James MR, Lander ES, Hudson TJ, Myers RM, Cox DR, Weissenbach J, Boguski MS and Bentley DR

    Sanger Centre, Hinxton Hall, Hinxton, Cambridge CB10 1SA UK.

    A map of 30,181 human gene-based markers was assembled and integrated with the current genetic map by radiation hybrid mapping. The new gene map contains nearly twice as many genes as the previous release, includes most genes that encode proteins of known function, and is twofold to threefold more accurate than the previous version. A redesigned, more informative and functional World Wide Web site (www.ncbi.nlm.nih.gov/genemap) provides the mapping information and associated data and annotations. This resource constitutes an important infrastructure and tool for the study of complex genetic traits, the positional cloning of disease genes, the cross-referencing of mammalian genomes, and validated human transcribed sequences for large-scale studies of gene expression.

    Funded by: Wellcome Trust

    Science (New York, N.Y.) 1998;282;5389;744-6

  • SCOP, Structural Classification of Proteins database: applications to evaluation of the effectiveness of sequence alignment methods and statistics of protein structural data.

    Hubbard TJ, Ailey B, Brenner SE, Murzin AG and Chothia C

    Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, England. th@sanger.ac.uk

    The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of all known protein structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and far evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database, so far. The database can be used as a source of data to calibrate sequence search algorithms and for the generation of population statistics on protein structures. The database and its associated files are freely accessible from a number of WWW sites mirrored from URL http://scop.mrc-lmb.cam.ac.uk/scop/.

    Acta crystallographica. Section D, Biological crystallography 1998;54;Pt 6 Pt 1;1147-54

  • GLASS: a tool to visualize protein structure prediction data in three dimensions and evaluate their consistency.

    Leplae R, Hubbard T and Tramontano A

    Istituto di Ricerche di Biologia Molecolare, P. Angeletti, Pomezia, Italy.

    When a protein sequence does not share any significant sequence similarity with a protein of known structure, homology modeling cannot be applied. However, many novel and interesting methods, such as secondary structure prediction, fold recognition, and prediction of long-range interactions, are being developed and have been shown to be reasonably successful in predicting protein structures from sequence data and evolutionary information. The a priori evaluation of the correctness of a prediction obtained by one of these methods is however often problematic. Consequently, it is important to use all available information provided by as many different methods as possible and all the available experimental data about the protein of interest, since the consistency of the results is indicative of the reliability of the prediction. Hence the need has arisen for suitable tools able to compare results provided by different methods and evaluate their consistency. We have therefore constructed GLASS, a general platform to read, visualize, compare, and evaluate prediction results from many different sources and to project these prediction results into three dimensions. In addition, GLASS allows the comparison of selected parameters calculated for a model with the distribution observed in real protein structures, thus providing an easy way to test new methods for evaluating the likelihood of different structural models. GLASS can be considered as a "workbench" for structural predictions useful to both experimentalists and theoreticians.

    Funded by: Wellcome Trust

    Proteins 1998;30;4;339-51

  • Mutations in a gene encoding a novel protein tyrosine phosphatase cause progressive myoclonus epilepsy.

    Minassian BA, Lee JR, Herbrick JA, Huizenga J, Soder S, Mungall AJ, Dunham I, Gardner R, Fong CY, Carpenter S, Jardim L, Satishchandra P, Andermann E, Snead OC, Lopes-Cendes I, Tsui LC, Delgado-Escueta AV, Rouleau GA and Scherer SW

    Department of Genetics, The Hospital for Sick Children, University of Toronto, Ontario, Canada.

    Lafora's disease (LD; OMIM 254780) is an autosomal recessive form of progressive myoclonus epilepsy characterized by seizures and cumulative neurological deterioration. Onset occurs during late childhood and usually results in death within ten years of the first symptoms. With few exceptions, patients follow a homogeneous clinical course despite the existence of genetic heterogeneity. Biopsy of various tissues, including brain, revealed characteristic polyglucosan inclusions called Lafora bodies, which suggested LD might be a generalized storage disease. Using a positional cloning approach, we have identified at chromosome 6q24 a novel gene, EPM2A, that encodes a protein with consensus amino acid sequence indicative of a protein tyrosine phosphatase (PTP). mRNA transcripts representing alternatively spliced forms of EPM2A were found in every tissue examined, including brain. Six distinct DNA sequence variations in EPM2A in nine families, and one homozygous microdeletion in another family, have been found to cosegregate with LD. These mutations are predicted to cause deleterious effects in the putative protein product, named laforin, resulting in LD.

    Funded by: NINDS NIH HHS: 5P01-NS21908; Wellcome Trust

    Nature genetics 1998;20;2;171-4

  • Inversin, a novel gene in the vertebrate left-right axis pathway, is partially deleted in the inv mouse.

    Morgan D, Turnpenny L, Goodship J, Dai W, Majumder K, Matthews L, Gardner A, Schuster G, Vien L, Harrison W, Elder FF, Penman-Splitt M, Overbeek P and Strachan T

    Department of Human Genetics, University of Newcastle upon Tyne, UK.

    Visceral left-right asymmetry occurs in all vertebrates, but the inversion of embryo turning (inv) mouse, which resulted following a random transgene insertion, is the only model in which these asymmetries are consistently reversed. We report positional cloning of the gene underlying this recessive phenotype. Although transgene insertion was accompanied by neighbouring deletion and duplication events, our YAC phenotype rescue studies indicate that the mutant phenotype results from the deletion. After extensively characterizing the 47-kb deleted region and flanking sequences from the wild-type mouse genome, we found evidence for only one gene sequence in the deleted region. We determined the full-length 5.5-kb cDNA sequence and identified 16 exons, of which exons 3-11 were eliminated by the deletion, causing a frameshift. The novel gene specifies a 1062-aa product with tandem ankyrin-like repeat sequences. Characterization of complementing and non-complementing YAC transgenic families revealed that correction of the inv mutant phenotype was concordant with integration and intact expression of this novel gene, which we have named inversin (Invs).

    Funded by: Wellcome Trust

    Nature genetics 1998;20;2;149-56

  • Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.

    Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T and Chothia C

    MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK.

    The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional features. For nine false positive predictions out of a possible 432,680, i.e. at a false positive rate of about 1/50,000, SAM-T98 found 35% of the true homologous relationships in PDBD40-J, whilst PSI-BLAST found 30% and ISS found 25%. Overall, this is about twice the number of PDBD40-J relations that can be detected by the pairwise comparison procedures FASTA (17%) and GAP-BLAST (15%). For distantly related sequences in PDBD40-J, those pairs whose sequence identity is less than 30%, SAM-T98 and PSI-BLAST detect three times the number of relationships found by the pairwise methods.

    Journal of molecular biology 1998;284;4;1201-10
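    The false positive rate quoted in the abstract follows directly from the two counts it gives. The short check below reproduces that arithmetic; the variable names are illustrative only.

```python
# Counts taken from the abstract above.
false_positives = 9
possible_false_positives = 432_680

# Rate as a fraction, and expressed as "one in N".
rate = false_positives / possible_false_positives
one_in = possible_false_positives / false_positives

print(f"false positive rate = {rate:.2e} (about 1 in {one_in:,.0f})")
```

    The result, roughly 1 in 48,000, is consistent with the "about 1/50,000" stated in the abstract.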

  • SPEM: a parser for EMBL style flat file database entries.

    Pocock MR, Hubbard T and Birney E

    Informatics, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. mrp@sanger.ac.uk

    Summary: We present a set of Perl modules for the flexible and robust parsing and editing of EMBL/SWISS-PROT databases.

    Availability: The Web page at http://www.sanger.ac.uk/Software/PerlModule/ provides information about downloading the SPEM and PrEMBL modules, and provides links to documentation and example code.

    Bioinformatics (Oxford, England) 1998;14;9;823-4
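    To illustrate the kind of input such a parser handles, the sketch below reads EMBL-style flat-file text in Python. It is not the SPEM API, and the function name and sample record are invented for illustration. It relies only on the general layout of the format: a two-letter line code at the start of each line, `XX` spacer lines, sequence lines that begin with blanks, and `//` as the entry terminator.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def parse_embl_entries(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per entry, mapping each line code to its list of line contents."""
    entry: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("//"):
            if entry:
                yield dict(entry)  # terminator closes the current entry
            entry = defaultdict(list)
            continue
        code = line[:2]
        if not code.strip():
            # Sequence data line: keep letters, drop spacing and the trailing counter.
            entry["sequence"].append("".join(ch for ch in line if ch.isalpha()))
            continue
        if code == "XX":
            continue  # spacer line carries no data
        entry[code].append(line[5:].strip())
    if entry:
        yield dict(entry)  # tolerate a missing final terminator


# Invented record for illustration; it is not a real database entry.
SAMPLE = """\
ID   EX000001; SV 1; linear; mRNA; STD; PLN; 20 BP.
XX
AC   EX000001;
XX
DE   Hypothetical example entry
DE   (not a real record)
XX
SQ   Sequence 20 BP;
     aaacaaacca aatatggatt        20
//
"""

entries = list(parse_embl_entries(SAMPLE.splitlines()))
```

    Keeping every line code as a list preserves multi-line fields such as `DE` without the parser needing to know each field's meaning.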

  • Using neural networks for prediction of the subcellular location of proteins.

    Reinhardt A and Hubbard T

    The Sanger Centre, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK. areinha@sanger.ac.uk

    Neural networks have been trained to predict the subcellular location of proteins in prokaryotic or eukaryotic cells from their amino acid composition. For three possible subcellular locations in prokaryotic organisms a prediction accuracy of 81% can be achieved. Assigning a reliability index, 33% of the predictions can be made with an accuracy of 91%. For eukaryotic proteins (excluding plant sequences) an overall prediction accuracy of 66% for four locations was achieved, with 33% of the sequences being predicted with an accuracy of 82% or better. With the subcellular location restricting a protein's possible function, this method should be a useful tool for the systematic analysis of genome data and is available via a server on the world wide web.

    Funded by: Wellcome Trust

    Nucleic acids research 1998;26;9;2230-6
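    The networks described above take amino acid composition as input. As a minimal sketch, assuming the 20 standard residues in a fixed alphabetical order, the hypothetical function below turns a protein sequence into a 20-element composition vector. The network architecture and training procedure are not reproduced here.

```python
from typing import List

# The 20 standard amino acids in one-letter code; the ordering is an arbitrary choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(protein: str) -> List[float]:
    """Return the fraction of each standard amino acid in the sequence."""
    seq = protein.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [count / total for count in counts]


# Made-up six-residue peptide used only to show the output shape.
features = amino_acid_composition("MKKLLA")
```

    A vector of this form has a fixed length regardless of protein length, which is what makes it convenient as input to a classifier.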

  • Toward a complete human genome sequence.

    Sanger Centre and Genome Sequencing Center

    Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    We have begun a joint program as part of a coordinated international effort to determine a complete human genome sequence. Our strategy is to map large-insert bacterial clones and to sequence each clone by a random shotgun approach followed by directed finishing. As of September 1998, we have identified the map positions of bacterial clones covering approximately 860 Mb for sequencing and completed >98 Mb (approximately 3.3%) of the human genome sequence. Our progress and sequencing data can be accessed via the World Wide Web (http://webace.sanger.ac.uk/HGP/ or http://genome.wustl.edu/gsc/).

    Genome research 1998;8;11;1097-108

  • The Human Genome Project: reaching the finish line.

    Waterston R and Sulston JE

    Genome Sequencing Center, Washington University School of Medicine, St. Louis, MO 63108, USA. rw@genetics.wustl.edu

    Science (New York, N.Y.) 1998;282;5386;53-4
