Sanger Institute - Publications 1998
Number of papers published in 1998: 19
Implementation and evaluation of a voice-activated dialling system
Interactive Voice Technology for Telecommunications Applications, IEEE IVTTA '98. Proceedings. 1998
Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships.
MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom. email@example.com
Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.
Funded by: Wellcome Trust
Proceedings of the National Academy of Sciences of the United States of America 1998;95;11;6073-8
Genome sequence of the nematode C. elegans: a platform for investigating biology.
The 97-megabase genomic sequence of the nematode Caenorhabditis elegans reveals over 19,000 genes. More than 40 percent of the predicted protein products find significant matches in other organisms. There is a variety of repeated sequences, both local and dispersed. The distinctive distribution of some repeats and highly conserved genes provides evidence for a regional organization of the chromosomes.
Science (New York, N.Y.) 1998;282;5396;2012-8
Host response to EBV infection in X-linked lymphoproliferative disease results from mutations in an SH2-domain encoding gene.
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK. firstname.lastname@example.org
X-linked lymphoproliferative syndrome (XLP or Duncan disease) is characterized by extreme sensitivity to Epstein-Barr virus (EBV), resulting in a complex phenotype manifested by severe or fatal infectious mononucleosis, acquired hypogammaglobulinemia and malignant lymphoma. We have identified a gene, SH2D1A, that is mutated in XLP patients and encodes a novel protein composed of a single SH2 domain. SH2D1A is expressed in many tissues involved in the immune system. The identification of SH2D1A will allow the determination of its mechanism of action as a possible regulator of the EBV-induced immune response.
Funded by: NIAID NIH HHS: 1 R01 AI33532-OIA3; Telethon: E.0440, TGM06S01; Wellcome Trust
Nature genetics 1998;20;2;129-35
Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence.
Sanger Centre, Wellcome Trust Genome Campus, Hinxton, UK. email@example.com
Countless millions of people have died from tuberculosis, a chronic infectious disease caused by the tubercle bacillus. The complete genome sequence of the best-characterized strain of Mycobacterium tuberculosis, H37Rv, has been determined and analysed in order to improve our understanding of the biology of this slow-growing pathogen and to help the conception of new prophylactic and therapeutic interventions. The genome comprises 4,411,529 base pairs, contains around 4,000 genes, and has a very high guanine + cytosine content that is reflected in the biased amino-acid content of the proteins. M. tuberculosis differs radically from other bacteria in that a very large portion of its coding capacity is devoted to the production of enzymes involved in lipogenesis and lipolysis, and to two new families of glycine-rich proteins with a repetitive structure that may represent a source of antigenic variation.
Funded by: Intramural NIH HHS: Z01 AI000783-11; Wellcome Trust
A theoretical model for the stick/bounce behaviour of adhesive, elastic-plastic spheres
Powder Technology 1998;99;154–162
A physical map of 30,000 human genes.
Sanger Centre, Hinxton Hall, Hinxton, Cambridge CB10 1SA UK.
A map of 30,181 human gene-based markers was assembled and integrated with the current genetic map by radiation hybrid mapping. The new gene map contains nearly twice as many genes as the previous release, includes most genes that encode proteins of known function, and is twofold to threefold more accurate than the previous version. A redesigned, more informative and functional World Wide Web site (www.ncbi.nlm.nih.gov/genemap) provides the mapping information and associated data and annotations. This resource constitutes an important infrastructure and tool for the study of complex genetic traits, the positional cloning of disease genes, the cross-referencing of mammalian genomes, and validated human transcribed sequences for large-scale studies of gene expression.
Funded by: Wellcome Trust
Science (New York, N.Y.) 1998;282;5389;744-6
SCOP, Structural Classification of Proteins database: applications to evaluation of the effectiveness of sequence alignment methods and statistics of protein structural data.
Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, England. firstname.lastname@example.org
The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of all known protein structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and far evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database, so far. The database can be used as a source of data to calibrate sequence search algorithms and for the generation of population statistics on protein structures. The database and its associated files are freely accessible from a number of WWW sites mirrored from URL http://scop.mrc-lmb.cam.ac.uk/scop/.
Acta crystallographica. Section D, Biological crystallography 1998;54;Pt 6 Pt 1;1147-54
GLASS: a tool to visualize protein structure prediction data in three dimensions and evaluate their consistency.
Istituto di Ricerche di Biologia Molecolare, P. Angeletti, Pomezia, Italy.
When a protein sequence does not share any significant sequence similarity with a protein of known structure, homology modeling cannot be applied. However, many novel and interesting methods, such as secondary structure prediction, fold recognition, and prediction of long-range interactions, are being developed and have been shown to be reasonably successful in predicting protein structures from sequence data and evolutionary information. The a priori evaluation of the correctness of a prediction obtained by one of these methods is however often problematic. Consequently, it is important to use all available information provided by as many different methods as possible and all the available experimental data about the protein of interest, since the consistency of the results is indicative of the reliability of the prediction. Hence the need has arisen for suitable tools able to compare results provided by different methods and evaluate their consistency. We have therefore constructed GLASS, a general platform to read, visualize, compare, and evaluate prediction results from many different sources and to project these prediction results into three dimensions. In addition, GLASS allows the comparison of selected parameters calculated for a model with the distribution observed in real protein structures, thus providing an easy way to test new methods for evaluating the likelihood of different structural models. GLASS can be considered as a "workbench" for structural predictions useful to both experimentalists and theoreticians.
Funded by: Wellcome Trust
Mutations in a gene encoding a novel protein tyrosine phosphatase cause progressive myoclonus epilepsy.
Department of Genetics, The Hospital for Sick Children, University of Toronto, Ontario, Canada.
Lafora's disease (LD; OMIM 254780) is an autosomal recessive form of progressive myoclonus epilepsy characterized by seizures and cumulative neurological deterioration. Onset occurs during late childhood and usually results in death within ten years of the first symptoms. With few exceptions, patients follow a homogeneous clinical course despite the existence of genetic heterogeneity. Biopsy of various tissues, including brain, revealed characteristic polyglucosan inclusions called Lafora bodies, which suggested LD might be a generalized storage disease. Using a positional cloning approach, we have identified at chromosome 6q24 a novel gene, EPM2A, that encodes a protein with consensus amino acid sequence indicative of a protein tyrosine phosphatase (PTP). mRNA transcripts representing alternatively spliced forms of EPM2A were found in every tissue examined, including brain. Six distinct DNA sequence variations in EPM2A in nine families, and one homozygous microdeletion in another family, have been found to cosegregate with LD. These mutations are predicted to cause deleterious effects in the putative protein product, named laforin, resulting in LD.
Funded by: NINDS NIH HHS: 5P01-NS21908; Wellcome Trust
Nature genetics 1998;20;2;171-4
Inversin, a novel gene in the vertebrate left-right axis pathway, is partially deleted in the inv mouse.
Department of Human Genetics, University of Newcastle upon Tyne, UK.
Visceral left-right asymmetry occurs in all vertebrates, but the inversion of embryo turning (inv) mouse, which resulted following a random transgene insertion, is the only model in which these asymmetries are consistently reversed. We report positional cloning of the gene underlying this recessive phenotype. Although transgene insertion was accompanied by neighbouring deletion and duplication events, our YAC phenotype rescue studies indicate that the mutant phenotype results from the deletion. After extensively characterizing the 47-kb deleted region and flanking sequences from the wild-type mouse genome, we found evidence for only one gene sequence in the deleted region. We determined the full-length 5.5-kb cDNA sequence and identified 16 exons, of which exons 3-11 were eliminated by the deletion, causing a frameshift. The novel gene specifies a 1062-aa product with tandem ankyrin-like repeat sequences. Characterization of complementing and non-complementing YAC transgenic families revealed that correction of the inv mutant phenotype was concordant with integration and intact expression of this novel gene, which we have named inversin (Invs).
Funded by: Wellcome Trust
Nature genetics 1998;20;2;149-56
Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.
MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK.
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional features. For nine false positive predictions out of a possible 432,680, i.e. at a false positive rate of about 1/50,000, SAM-T98 found 35% of the true homologous relationships in PDBD40-J, whilst PSI-BLAST found 30% and ISS found 25%. Overall, this is about twice the number of PDBD40-J relations that can be detected by the pairwise comparison procedures FASTA (17%) and GAP-BLAST (15%). For distantly related sequences in PDBD40-J, those pairs whose sequence identity is less than 30%, SAM-T98 and PSI-BLAST detect three times the number of relationships found by the pairwise methods.
Journal of molecular biology 1998;284;4;1201-10
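The false positive rate and the coverage ratios quoted in the abstract above can be checked with a few lines of arithmetic. This is an illustrative sketch only: the numbers come from the abstract, while the variable names and the choice of FASTA as the pairwise baseline are assumptions made here.

```python
# Figures taken from the abstract; all names below are illustrative assumptions.
false_positives = 9
possible_false_positives = 432_680

rate = false_positives / possible_false_positives
one_in = possible_false_positives / false_positives  # the "1 in N" form of the rate

# Percentage of true homologous relationships found by each method.
coverage = {"SAM-T98": 35, "PSI-BLAST": 30, "ISS": 25, "FASTA": 17, "GAP-BLAST": 15}

# Ratio of each multiple-sequence method to the better pairwise method (FASTA).
ratio_vs_fasta = {
    method: coverage[method] / coverage["FASTA"]
    for method in ("SAM-T98", "PSI-BLAST", "ISS")
}

print(f"false positive rate = {rate:.2e} (about 1 in {one_in:,.0f})")
for method, ratio in ratio_vs_fasta.items():
    print(f"{method}: {ratio:.2f} x FASTA coverage")
```

The rate works out to roughly 1 in 48,000, which the abstract rounds to about 1/50,000.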
SPEM: a parser for EMBL style flat file database entries.
Informatics, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. email@example.com
Summary: We present a set of Perl modules for the flexible and robust parsing and editing of EMBL/SWISS-PROT databases.
Availability: The Web page at http://www.sanger.ac.uk/Software/PerlModule/ provides information about downloading the SPEM and PrEMBL modules, and provides links to documentation and example code.
Bioinformatics (Oxford, England) 1998;14;9;823-4
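To illustrate the kind of input such a parser handles, the sketch below splits EMBL-style flat file text into entries. It is not the SPEM API and is written in Python rather than Perl. It assumes only the general layout of the format: a two-letter line code in the first two columns, data starting at column 6, and a `//` line ending each entry. The function name `parse_entries` and the sample entries are hypothetical and heavily simplified.

```python
from collections import defaultdict


def parse_entries(text):
    """Yield one dict per entry, mapping each line code to its list of data lines."""
    entry = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):            # entry terminator
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        if not line.strip():                 # skip blank lines
            continue
        # Line code in columns 1-2, data from column 6 onwards.
        # Sequence lines have a blank code and end up under the key "  ".
        code, content = line[:2], line[5:]
        entry[code].append(content.rstrip())
    if entry:                                # tolerate a missing final terminator
        yield dict(entry)


# Hypothetical, simplified sample; real ID lines carry more fields.
SAMPLE = """\
ID   TEST1
DE   hypothetical example entry
//
ID   TEST2
DE   second example,
DE   continued
//
"""

entries = list(parse_entries(SAMPLE))
for e in entries:
    print(e["ID"][0], "-", " ".join(e["DE"]))
```

Grouping lines by code keeps multi-line fields such as DE together, so a caller can join them as needed.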
Disintegration of weak lactose agglomerates for inhalation applications
International Journal of Pharmaceutics 1998;172;199–209
Using neural networks for prediction of the subcellular location of proteins.
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK. firstname.lastname@example.org
Neural networks have been trained to predict the subcellular location of proteins in prokaryotic or eukaryotic cells from their amino acid composition. For three possible subcellular locations in prokaryotic organisms a prediction accuracy of 81% can be achieved. Assigning a reliability index, 33% of the predictions can be made with an accuracy of 91%. For eukaryotic proteins (excluding plant sequences) an overall prediction accuracy of 66% for four locations was achieved, with 33% of the sequences being predicted with an accuracy of 82% or better. With the subcellular location restricting a protein's possible function, this method should be a useful tool for the systematic analysis of genome data and is available via a server on the world wide web.
Funded by: Wellcome Trust
Nucleic acids research 1998;26;9;2230-6
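The abstract above states that the networks take amino acid composition as input but does not spell out the encoding. One common choice, shown here purely as an assumption, is a 20-element vector of residue fractions. The function name `composition` and the example sequence are hypothetical.

```python
# The 20 standard amino acids in one-letter code; the order is an arbitrary choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    Non-standard characters (e.g. X) are ignored, so the fractions
    are relative to the number of standard residues only.
    """
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [count / total for count in counts]


vec = composition("MKKLLA")  # hypothetical six-residue example
for aa, fraction in zip(AMINO_ACIDS, vec):
    if fraction:
        print(f"{aa}: {fraction:.3f}")
```

A vector like this has a fixed length regardless of protein size, which is what makes it usable as the input layer of a simple classifier.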
Toward a complete human genome sequence.
Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
We have begun a joint program as part of a coordinated international effort to determine a complete human genome sequence. Our strategy is to map large-insert bacterial clones and to sequence each clone by a random shotgun approach followed by directed finishing. As of September 1998, we have identified the map positions of bacterial clones covering approximately 860 Mb for sequencing and completed >98 Mb (approximately 3.3%) of the human genome sequence. Our progress and sequencing data can be accessed via the World Wide Web (http://webace.sanger.ac.uk/HGP/ or http://genome.wustl.edu/gsc/).
Genome research 1998;8;11;1097-108
The Human Genome Project: reaching the finish line.
Genome Sequencing Center, Washington University School of Medicine, St. Louis, MO 63108, USA. email@example.com
Science (New York, N.Y.) 1998;282;5386;53-4
An implementation and evaluation of an on-line speaker verification system for field trials
The 5th International Conference on Spoken Language Processing, Sydney Convention Centre, Sydney, Australia, 30th November - 4th December 1998 1998
Simulation Of Flexible Fibers By Discrete Cylindrical Segments
Computational Methods for Smart Structures and Materials 1998