Sanger Institute - Publications 1999

Number of papers published in 1999: 21

  • Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins.

    Bateman A, Birney E, Durbin R, Eddy SR, Finn RD and Sonnhammer EL

    The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Pfam is a collection of multiple alignments and profile hidden Markov models of protein domain families. Release 3.1 is a major update of the Pfam database and contains 1313 families which are available on the World Wide Web in Europe and in the US. Over 54% of proteins in SWISS-PROT-35 and SP-TrEMBL-5 match a Pfam family. The primary changes of Pfam since release 2.1 are that we now use the more advanced version 2 of the HMMER software, which is more sensitive and provides expectation values for matches, and that it now includes proteins from both SP-TrEMBL and SWISS-PROT.

    Funded by: Wellcome Trust

    Nucleic acids research 1999;27;1;260-2

  • From genomics to epigenomics: a loftier view of life.

    Beck S, Olek A and Walter J

    The Sanger Centre, Cambridge, UK.

    Nature biotechnology 1999;17;12;1144

  • Word Confusability Measures for Vocabulary Selection in Speech Recognition

    Beng T. Tan, Yong Gu, and Trevor Thomas

    Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Colorado, USA 1999

  • The complete nucleotide sequence of chromosome 3 of Plasmodium falciparum.

    Bowman S, Lawson D, Basham D, Brown D, Chillingworth T, Churcher CM, Craig A, Davies RM, Devlin K, Feltwell T, Gentles S, Gwilliam R, Hamlin N, Harris D, Holroyd S, Hornsby T, Horrocks P, Jagels K, Jassal B, Kyes S, McLean J, Moule S, Mungall K, Murphy L, Oliver K, Quail MA, Rajandream MA, Rutter S, Skelton J, Squares R, Squares S, Sulston JE, Whitehead S, Woodward JR, Newbold C and Barrell BG

    Pathogen Sequencing Unit, Sanger Centre, Wellcome Trust Genome Campus, Hinxton, UK.

    Analysis of Plasmodium falciparum chromosome 3, and comparison with chromosome 2, highlights novel features of chromosome organization and gene structure. The sub-telomeric regions of chromosome 3 show a conserved order of features, including repetitive DNA sequences, members of multigene families involved in pathogenesis and antigenic variation, a number of conserved pseudogenes, and several genes of unknown function. A putative centromere has been identified that has a core region of about 2 kilobases with an extremely high (adenine + thymidine) composition and arrays of tandem repeats. We have predicted 215 protein-coding genes and two transfer RNA genes in the 1,060,106-base-pair chromosome sequence. The predicted protein-coding genes can be divided into three main classes: 52.6% are not spliced, 45.1% have a large exon with short additional 5' or 3' exons, and 2.3% have a multiple exon structure more typical of higher eukaryotes.

    Funded by: Wellcome Trust

    Nature 1999;400;6744;532-8

  • A new member of the IL-1 receptor family highly expressed in hippocampus and involved in X-linked mental retardation.

    Carrié A, Jun L, Bienvenu T, Vinet MC, McDonell N, Couvert P, Zemni R, Cardona A, Van Buggenhout G, Frints S, Hamel B, Moraine C, Ropers HH, Strom T, Howell GR, Whittaker A, Ross MT, Kahn A, Fryns JP, Beldjord C, Marynen P and Chelly J

    INSERM Unité 129-ICGM, CHU Cochin, 24 Rue du Faubourg Saint Jacques, 75014 Paris, France.

    We demonstrate here the importance of interleukin signalling pathways in cognitive function and the normal physiology of the CNS. Thorough investigation of an MRX critical region in Xp22.1-21.3 enabled us to identify a new gene expressed in brain that is responsible for a non-specific form of X-linked mental retardation. This gene encodes a 696 amino acid protein that has homology to IL-1 receptor accessory proteins. Non-overlapping deletions and a nonsense mutation in this gene were identified in patients with cognitive impairment only. Its high level of expression in post-natal brain structures involved in the hippocampal memory system suggests a specialized role for this new gene in the physiological processes underlying memory and learning abilities.

    Funded by: Wellcome Trust

    Nature genetics 1999;23;1;25-31

  • Breakage of macroporous alumina beads under compressive loading: simulation and experimental validation

    Charlotte Couroyer, Zemin Ning, Mojtaba Ghadiri, Nathalie Brunard, Frédéric Kolenda, Denis Bortzmeyer and Philippe Laval

    Powder Technology 1999;105;57–65

  • Genetic definition and sequence analysis of Arabidopsis centromeres.

    Copenhaver GP, Nickel K, Kuromori T, Benito MI, Kaul S, Lin X, Bevan M, Murphy G, Harris B, Parnell LD, McCombie WR, Martienssen RA, Marra M and Preuss D

    University of Chicago, Department of Molecular Genetics and Cell Biology, 1103 East 57 Street, Chicago, IL 60637, USA.

    High-precision genetic mapping was used to define the regions that contain centromere functions on each natural chromosome in Arabidopsis thaliana. These regions exhibited dramatic recombinational repression and contained complex DNA surrounding large arrays of 180-base pair repeats. Unexpectedly, the DNA within the centromeres was not merely structural but also encoded several expressed genes. The regions flanking the centromeres were densely populated by repetitive elements yet experienced normal levels of recombination. The genetically defined centromeres were well conserved among Arabidopsis ecotypes but displayed limited sequence homology between different chromosomes, excluding repetitive DNA. This investigation provides a platform for dissecting the role of individual sequences in centromeres in higher eukaryotes.

    Science (New York, N.Y.) 1999;286;5449;2468-74

  • The DNA sequence of human chromosome 22.

    Dunham I, Shimizu N, Roe BA, Chissoe S, Hunt AR, Collins JE, Bruskiewich R, Beare DM, Clamp M, Smink LJ, Ainscough R, Almeida JP, Babbage A, Bagguley C, Bailey J, Barlow K, Bates KN, Beasley O, Bird CP, Blakey S, Bridgeman AM, Buck D, Burgess J, Burrill WD, O'Brien KP et al.

    Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    Knowledge of the complete genomic DNA sequence of an organism allows a systematic approach to defining its genetic components. The genomic sequence provides access to the complete structures of all genes, including those without known function, their control elements, and, by inference, the proteins they encode, as well as all other biologically important sequences. Furthermore, the sequence is a rich and permanent source of information for the design of further biological studies of the organism and for the study of evolution through cross-species sequence comparison. The power of this approach has been amply demonstrated by the determination of the sequences of a number of microbial and model organisms. The next step is to obtain the complete sequence of the entire human genome. Here we report the sequence of the euchromatic part of human chromosome 22. The sequence obtained consists of 12 contiguous segments spanning 33.4 megabases, contains at least 545 genes and 134 pseudogenes, and provides the first view of the complex chromosomal landscapes that will be found in the rest of the genome.

    Nature 1999;402;6761;489-95

  • RMS/coverage graphs: a qualitative method for comparing three-dimensional protein structure predictions.

    Hubbard TJ

    Sanger Centre, Hinxton, Cambridgeshire, United Kingdom.

    Evaluating a set of protein structure predictions is difficult as each prediction may omit different residues and different parts of the structure may have different accuracies. A method is described that captures the best results from a large number of alternative sequence-dependent structural superpositions between a prediction and the experimental structure and represents them as a single line on a graph. Applied to CASP2 and CASP3 data the best predictions stand out visually in most cases, as judged by manual inspection. The results from this method applied to CASP data are available from the URLs http:/(/)PredictionCenter. and http:/(/) approximately th/casp/.

    Funded by: Wellcome Trust

    Proteins 1999;Suppl 3;15-21

  • SCOP: a Structural Classification of Proteins database.

    Hubbard TJ, Ailey B, Brenner SE, Murzin AG and Chothia C

    Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of all known protein structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and far evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database, so far. The database can be used as a source of data to calibrate sequence search algorithms and for the generation of population statistics on protein structures. The database and its associated files are freely accessible from a number of WWW sites mirrored from URL http://scop.

    Funded by: Wellcome Trust

    Nucleic acids research 1999;27;1;254-6

  • Effect of interface energy on the impact strength of agglomerates

    J. Subero, Z. Ning, M. Ghadiri and C. Thornton

    Powder Technology 1999;105;66–73

  • The chicken B locus is a minimal essential major histocompatibility complex.

    Kaufman J, Milne S, Göbel TW, Walker BA, Jacob JP, Auffray C, Zoorob R and Beck S

    Institute for Animal Health, Compton, UK.

    Here we report the sequence of the region that determines rapid allograft rejection in chickens, the chicken major histocompatibility complex (MHC). This 92-kilobase region of the B locus contains only 19 genes, making the chicken MHC roughly 20-fold smaller than the human MHC. Virtually all the genes have counterparts in the human MHC, defining a minimal essential set of MHC genes conserved over 200 million years of divergence between birds and mammals. They are organized differently, with the class III region genes located outside the class II and class I region genes. The absence of proteasome genes is unexpected and might explain unusual peptide-binding specificities of chicken class I molecules. The presence of putative natural killer receptor gene(s) is unprecedented and might explain the importance of the B locus in the response to the herpes virus responsible for Marek's disease. The small size and simplicity of the chicken MHC allows co-evolution of genes as haplotypes over considerable periods of time, and makes it possible to study the striking MHC-determined pathogen-specific disease resistance at the molecular level.

    Funded by: Wellcome Trust

    Nature 1999;401;6756;923-5

  • Mutations in SLC19A2 cause thiamine-responsive megaloblastic anaemia associated with diabetes mellitus and deafness.

    Labay V, Raz T, Baron D, Mandel H, Williams H, Barrett T, Szargel R, McDonald L, Shalata A, Nosaka K, Gregory S and Cohen N

    Department of Genetics, Tamkin Human Molecular Genetics Research Facility, Technion-Israel Institute of Technology, Bruce Rappaport Faculty of Medicine, Haifa.

    Thiamine-responsive megaloblastic anaemia (TRMA), also known as Rogers syndrome, is an early onset, autosomal recessive disorder defined by the occurrence of megaloblastic anaemia, diabetes mellitus and sensorineural deafness, responding in varying degrees to thiamine treatment (MIM 249270). We have previously narrowed the TRMA locus from a 16-cM to a 4-cM interval on chromosomal region 1q23.3 (refs 3,4) and this region has been further refined to a 1.4-cM interval. Previous studies have suggested that deficiency in a high-affinity thiamine transporter may cause this disorder. Here we identify the TRMA gene by positional cloning. We assembled a P1-derived artificial chromosome (PAC) contig spanning the TRMA candidate region. This clarified the order of genetic markers across the TRMA locus, provided 9 new polymorphic markers and narrowed the locus to an approximately 400-kb region. Mutations in a new gene, SLC19A2, encoding a putative transmembrane protein homologous to the reduced folate carrier proteins, were found in all affected individuals in six TRMA families, suggesting that a defective thiamine transporter protein (THTR-1) may underlie the TRMA syndrome.

    Funded by: Wellcome Trust

    Nature genetics 1999;22;3;300-4

  • Data mining parasite genomes: haystack searching with a computer.

    Lawson D

    Pathogen Sequencing Unit, Sanger Centre, Hinxton, Cambridge, UK.

    A number of genomes of parasitic organisms are presently being sequenced in the public domain, including Plasmodium falciparum, Leishmania major and Trypanosoma brucei, with the likelihood of at least expressed sequence tag (EST) projects for several filarial and apicomplexan species. The early and timely release of sequence data to the community via the World Wide Web (WWW) and the public databases (EMBL and GenBank) forms an invaluable resource. Data mining, or 'haystack searching', this resource is becoming more fruitful to all members of the scientific community as the volume of data, diversity of genomes sampled, and accessibility increase.

    Funded by: Wellcome Trust

    Parasitology 1999;118 Suppl;S15-8

  • Techview: DNA sequencing. Sequencing the genome, fast.

    Mullikin JC and McMurray AA

    Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs, UK.

    Science (New York, N.Y.) 1999;283;5409;1867-9

  • A YAC-based physical map of the mouse genome.

    Nusbaum C, Slonim DK, Harris KL, Birren BW, Steen RG, Stein LD, Miller J, Dietrich WF, Nahf R, Wang V, Merport O, Castle AB, Husain Z, Farino G, Gray D, Anderson MO, Devine R, Horton LT, Ye W, Wu X, Kouyoumjian V, Zemsteva IS, Wu Y, Collymore AJ, Courtney DF, Tam J, Cadman M, Haynes AR, Heuston C, Marsland T, Southwell A, Trickett P, Strivens MA, Ross MT, Makalowski W, Xu Y, Boguski MS, Carter NP, Denny P, Brown SD, Hudson TJ and Lander ES

    Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA.

    A physical map of the mouse genome is an essential tool for both positional cloning and genomic sequencing in this key model system for biomedical research. Indeed, the construction of a mouse physical map with markers spaced at an average interval of 300 kb is one of the stated goals of the Human Genome Project. Here we report the results of a project at the Whitehead Institute/MIT Center for Genome Research to construct such a physical map of the mouse. We built the map by screening sequence-tagged sites (STSs) against a large-insert yeast artificial chromosome (YAC) library and then integrating the STS-content information with a dense genetic map. The integrated map shows the location of 9,787 loci, providing landmarks with an average spacing of approximately 300 kb and affording YAC coverage of approximately 92% of the mouse genome. We also report the results of a project at the MRC UK Mouse Genome Centre targeted at chromosome X. The project produced a YAC-based map containing 619 loci (with 121 loci in common with the Whitehead map and 498 additional loci), providing especially dense coverage of this sex chromosome. The YAC-based physical map directly facilitates positional cloning of mouse mutations by providing ready access to most of the genome. More generally, use of this map in addition to a newly constructed radiation hybrid (RH) map provides a comprehensive framework for mouse genomic studies.

    Funded by: Wellcome Trust

    Nature genetics 1999;22;4;388-93

  • Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction.

    Orengo CA, Bray JE, Hubbard T, LoConte L and Sillitoe I

    Department of Biochemistry and Molecular Biology, University College, London, United Kingdom.

    CASP3 saw a substantial increase in the volume of ab initio 3D prediction data, with 507 datasets for fifteen selected targets and sixty-one groups participating. As with CASP2, methods ranged from computationally intensive strategies that attempt to recreate the physical and chemical forces involved in protein folding to the more recent knowledge-based approaches. These exploit information from the structure databases, extracting potentially similar fragments and/or distance constraints derived from multiple sequence alignments. The knowledge-based approaches generally gave more consistently successful predictions across the range of targets, particularly that of the Baker group (Bystroff and Baker, J Mol Biol 1998;281:565-577; Simons et al. Proteins Suppl 1999;3:171-176), which used a fragment library. In the secondary structure prediction category, the most successful approaches built on the concepts used in PHD (Rost et al. Comput Appl Biosci 1994;10:53-60), an accepted standard in this field. Like PHD, they exploit neural networks but have different strategies for incorporating multiple sequence data or position-dependent weight matrices for training the networks. Analysis of the contact data, for which only six groups participated, suggested that as yet this data provides a rather weak signal. However, in combination with other types of prediction data it can sometimes be a useful constraint for identifying the correct structure.

    Funded by: Wellcome Trust

    Proteins 1999;Suppl 3;149-70

  • Sequencing. Gels and genomes.

    Rogers J

    The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    Science (New York, N.Y.) 1999;286;5439;429

  • Nonmethylated transposable elements and methylated genes in a chordate genome.

    Simmen MW, Leitgeb S, Charlton J, Jones SJ, Harris BR, Clark VH and Bird A

    Institute of Cell and Molecular Biology, University of Edinburgh, The King's Buildings, Edinburgh EH9 3JR, UK.

    The genome of the invertebrate chordate Ciona intestinalis was found to be a stable mosaic of methylated and nonmethylated domains. Multiple copies of an apparently active long terminal repeat retrotransposon and a long interspersed element are nonmethylated and a large fraction of abundant short interspersed elements are also methylation free. Genes, by contrast, are predominantly methylated. These data are incompatible with the genome defense model, which proposes that DNA methylation in animals is primarily targeted to endogenous transposable elements. Cytosine methylation in this urochordate may be preferentially directed to genes.

    Funded by: Wellcome Trust

    Science (New York, N.Y.) 1999;283;5405;1164-7

  • A Hybrid Score Measurement for HMM-Based Speaker Verification

    Yong Gu and Trevor Thomas

    Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 1999;1;317-320

  • Complete sequence and gene map of a human major histocompatibility complex. The MHC sequencing consortium.

    No authors listed

    Here we report the first complete sequence and gene map of a human major histocompatibility complex (MHC), a region on chromosome 6 which is essential to the immune system. When it was discovered over 50 years ago the region was thought to specify histocompatibility genes, but their nature has been resolved only in the last two decades. Although many of the 224 identified gene loci (128 predicted to be expressed) are still of unknown function, we estimate that about 40% of the expressed genes have immune system function. Over 50% of the MHC has been sequenced twice, in different haplotypes, giving insight into the extraordinary polymorphism and evolution of this region. Several genes, particularly of the MHC class II and III regions, can be traced by sequence similarity and synteny to over 700 million years ago, clearly predating the emergence of the adaptive immune system some 400 million years ago. The sequence is expected to be invaluable for the identification of many common disease loci. In the past, the search for these loci has been hampered by the complexity of high gene density and linkage disequilibrium.

    Funded by: Wellcome Trust

    Nature 1999;401;6756;921-3