Sanger Institute - Publications 2001
Number of papers published in 2001: 27
Word Level Confidence Measures Using N-Best Sub-Hypotheses Likelihood Ratio
7th European Conference on Speech Communication and Technology, Aalborg, Denmark, September 3-7, 2001
The physical maps for sequencing human chromosomes 1, 6, 9, 10, 13, 20 and X.
The Sanger Centre, Hinxton, Cambridge, UK. firstname.lastname@example.org
We constructed maps for eight chromosomes (1, 6, 9, 10, 13, 20, X and (previously) 22), representing one-third of the genome, by building landmark maps, isolating bacterial clones and assembling contigs. By this approach, we could establish the long-range organization of the maps early in the project, and all contig extension, gap closure and problem-solving were simplified by containment within local regions. The maps currently represent more than 94% of the euchromatic (gene-containing) regions of these chromosomes in 176 contigs, and contain 96% of the chromosome-specific markers in the human gene map. By measuring the remaining gaps, we can assess chromosome length and coverage in sequenced clones.
Mining the draft human genome.
The European Bioinformatics Institute, Hinxton, Cambridge, UK. email@example.com
Now that the draft human genome sequence is available, everyone wants to be able to use it. However, we have perhaps become complacent about our ability to turn new genomes into lists of genes. The higher volume of data associated with a larger genome is accompanied by a much greater increase in complexity. We need to appreciate both the scale of the challenge of vertebrate genome analysis and the limitations of current gene prediction methods and understanding.
An SSLP marker-anchored BAC framework map of the mouse genome.
Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, Texas 77030, USA.
We have constructed a BAC framework map of the mouse genome consisting of 2,808 PCR-confirmed BAC clusters, using a previously described method. Fingerprints of BACs from selected clusters confirm the accuracy of the map. Combined with BAC fingerprint data, the framework map covers 37% of the mouse genome.
Nature genetics 2001;29;2;133-4
Integration of cytogenetic landmarks into the draft sequence of the human genome.
Department of Pediatrics, University of Pennsylvania, The Children's Hospital of Philadelphia, 19104, USA.
We have placed 7,600 cytogenetically defined landmarks on the draft sequence of the human genome to help with the characterization of genes altered by gross chromosomal aberrations that cause human disease. The landmarks are large-insert clones mapped to chromosome bands by fluorescence in situ hybridization. Each clone contains a sequence tag that is positioned on the genomic sequence. This genome-wide set of sequence-anchored clones allows structural and functional analyses of the genome. This resource represents the first comprehensive integration of cytogenetic, radiation hybrid, linkage and sequence maps of the human genome; provides an independent validation of the sequence map and framework for contig order and orientation; surveys the genome for large-scale duplications, which are likely to require special attention during sequence assembly; and allows a stringent assessment of sequence differences between the dark and light bands of chromosomes. It also provides insight into large-scale chromatin structure and the evolution of chromosomes and gene families and will accelerate our understanding of the molecular bases of human disease and cancer.
Disruption of an imprinted gene cluster by a targeted chromosomal translocation in mice.
Howard Hughes Medical Institute and Department of Molecular Biology, Princeton University, Princeton, New Jersey, USA.
Genomic imprinting is an epigenetic process in which the activity of a gene is determined by its parent of origin. Mechanisms governing genomic imprinting are just beginning to be understood. However, the tendency of imprinted genes to exist in chromosomal clusters suggests a sharing of regulatory elements. To better understand imprinted gene clustering, we disrupted a cluster of imprinted genes on mouse distal chromosome 7 using the Cre/loxP recombination system. In mice carrying a site-specific translocation separating Cdkn1c and Kcnq1, imprinting of the genes retained on chromosome 7, including Kcnq1, Kcnq1ot1, Ascl2, H19 and Igf2, is unaffected, demonstrating that these genes are not regulated by elements near or telomeric to Cdkn1c. In contrast, expression and imprinting of the translocated Cdkn1c, Slc22a1l and Tssc3 on chromosome 11 are affected, consistent with the hypothesis that elements regulating both expression and imprinting of these genes lie within or proximal to Kcnq1. These data support the proposal that chromosomal abnormalities, including translocations, within KCNQ1 that are associated with the human disease Beckwith-Wiedemann syndrome (BWS) may disrupt CDKN1C expression. These results underscore the importance of gene clustering for the proper regulation of imprinted genes.
Nature genetics 2001;29;1;78-82
Massive gene decay in the leprosy bacillus.
Unité de Génétique Moléculaire Bactérienne, Institut Pasteur, Paris, France. firstname.lastname@example.org
Leprosy, a chronic human neurological disease, results from infection with the obligate intracellular pathogen Mycobacterium leprae, a close relative of the tubercle bacillus. Mycobacterium leprae has the longest doubling time of all known bacteria and has thwarted every effort at culture in the laboratory. Comparing the 3.27-megabase (Mb) genome sequence of an armadillo-derived Indian isolate of the leprosy bacillus with that of Mycobacterium tuberculosis (4.41 Mb) provides clear explanations for these properties and reveals an extreme case of reductive evolution. Less than half of the genome contains functional genes but pseudogenes, with intact counterparts in M. tuberculosis, abound. Genome downsizing and the current mosaic arrangement appear to have resulted from extensive recombination events between dispersed repetitive sequences. Gene deletion and decay have eliminated many important metabolic activities including siderophore production, part of the oxidative and most of the microaerophilic and anaerobic respiratory chains, and numerous catabolic systems and their regulatory circuits.
Contact Mechanics and Coefficients of Restitution
Lecture Notes in Physics 2001;564;184-194
A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the genomic sequence.
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
The recent publication of the complete sequence of human chromosome 22 provides a platform from which to investigate genomic sequence variation. We report the identification and characterization of 12,267 potential variants (SNPs and other small insertions/deletions) of human chromosome 22, discovered in the overlaps of 460 clones used for the chromosome sequencing. We found, on average, 1 potential variant every 1.07 kb and approximately 18% of the potential variants involve insertions/deletions. The SNPs have been positioned both relative to each other, and to genes, predicted genes, repeat sequences, other genetic markers, and the 2730 SNPs previously identified on the chromosome. A subset of the SNPs were verified experimentally using either PCR-RFLP or genomic Invader assays. These experiments confirmed 92% of the potential variants in a panel of 92 individuals. [Details of the SNPs and RFLP assays can be found at http://www.sanger.ac.uk and in dbSNP.]
Funded by: Wellcome Trust
Genome research 2001;11;1;170-8
A superfamily of variant genes encoded in the subtelomeric region of Plasmodium vivax.
Departamento de Parasitologia, Instituto de Ciências Biomédicas, Universidade de São Paulo, Av. Lineu Prestes 1374, São Paulo, SP 05508-900, Brazil. email@example.com
The malarial parasite Plasmodium vivax causes disease in humans, including chronic infections and recurrent relapses, but the course of infection is rarely fatal, unlike that caused by Plasmodium falciparum. To investigate differences in pathogenicity between P. vivax and P. falciparum, we have compared the subtelomeric domains in the DNA of these parasites. In P. falciparum, subtelomeric domains are conserved and contain ordered arrays of members of multigene families, such as var, rif and stevor, encoding virulence determinants of cytoadhesion and antigenic variation. Here we identify, through the analysis of a continuous 155,711-base-pair sequence of a P. vivax chromosome end, a multigene family called vir, which is specific to P. vivax. The vir genes are present at about 600-1,000 copies per haploid genome and encode proteins that are immunovariant in natural infections, indicating that they may have a functional role in establishing chronic infection through antigenic variation.
The DNA sequence and comparative analysis of human chromosome 20.
The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK. firstname.lastname@example.org
The finished sequence of human chromosome 20 comprises 59,187,298 base pairs (bp) and represents 99.4% of the euchromatic DNA. A single contig of 26 megabases (Mb) spans the entire short arm, and five contigs separated by gaps totalling 320 kb span the long arm of this metacentric chromosome. An additional 234,339 bp of sequence has been determined within the pericentromeric region of the long arm. We annotated 727 genes and 168 pseudogenes in the sequence. About 64% of these genes have a 5' and a 3' untranslated region and a complete open reading frame. Comparative analysis of the sequence of chromosome 20 to whole-genome shotgun-sequence data of two other vertebrates, the mouse Mus musculus and the puffer fish Tetraodon nigroviridis, provides an independent measure of the efficiency of gene annotation, and indicates that this analysis may account for more than 95% of all coding exons and almost all genes.
Cancer and genomics.
Cancer Genome Project, Sanger Centre, Cambridge, UK.
Identification of the genes that cause oncogenesis is a central aim of cancer research. We searched the proteins predicted from the draft human genome sequence for paralogues of known tumour suppressor genes, but no novel genes were identified. We then assessed whether it was possible to search directly for oncogenic sequence changes in cancer cells by comparing cancer genome sequences against the draft genome. Apparently chimaeric transcripts (from oncogenic fusion genes generated by chromosomal translocations, the ends of which mapped to different genomic locations) were detected to the same degree in both normal and neoplastic tissues, indicating a significant level of false positives. Our experiment underscores the limited amount and variable quality of DNA sequence from cancer cells that is currently available.
Functional annotation of a full-length mouse cDNA collection.
Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center, Yokohama Institute, Kanagawa, Japan.
The RIKEN Mouse Gene Encyclopaedia Project, a systematic approach to determining the full coding potential of the mouse genome, involves collection and sequencing of full-length complementary DNAs and physical mapping of the corresponding genes to the mouse genome. We organized an international functional annotation meeting (FANTOM) to annotate the first 21,076 cDNAs to be analysed in this project. Here we describe the first RIKEN clone collection, which is one of the largest described for any organism. Analysis of these cDNAs extends known gene families and identifies new ones.
Initial sequencing and analysis of the human genome.
Whitehead Institute for Biomedical Research, Center for Genome Research, Cambridge, MA 02142, USA. email@example.com
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, and interresidue contacts.
Department of Haematology, University of Cambridge Clinical School, Cambridge Institute for Medical Research, Cambridge, United Kingdom. firstname.lastname@example.org
In the Novel Fold category, three types of predictions were assessed: three-dimensional structures, secondary structures, and residue-residue contacts. For predictions of three-dimensional models, CASP4 targets included 5 domains or structures with novel folds, and 13 on the borderline between Novel Fold and Fold Recognition categories. These elicited 1863 predictions of these and other targets by methods more general than comparative modeling or fold recognition techniques. The group of Bonneau, Tsai, Ruczinski, and Baker stood out as performing well with the greatest consistency. In many cases, several groups were able to predict fragments of the target correctly, often at a level somewhat larger than standard supersecondary structures, but were not able to assemble fragments into a correct global topology. The methods of Bonneau, Tsai, Ruczinski, and Baker have been successful in addressing the fragment assembly problem for many but not all the target structures.
Proteins 2001;Suppl 5;98-118
Tbx1 haploinsufficiency in the DiGeorge syndrome region causes aortic arch defects in mice.
Department of Pediatrics, Baylor College of Medicine, Houston, Texas 77030, USA.
DiGeorge syndrome is characterized by cardiovascular, thymus and parathyroid defects and craniofacial anomalies, and is usually caused by a heterozygous deletion of chromosomal region 22q11.2 (del22q11) (ref. 1). A targeted, heterozygous deletion, named Df(16)1, encompassing around 1 megabase of the homologous region in mouse causes cardiovascular abnormalities characteristic of the human disease. Here we have used a combination of chromosome engineering and P1 artificial chromosome transgenesis to localize the haploinsufficient gene in the region, Tbx1. We show that Tbx1, a member of the T-box transcription factor family, is required for normal development of the pharyngeal arch arteries in a gene dosage-dependent manner. Deletion of one copy of Tbx1 affects the development of the fourth pharyngeal arch arteries, whereas homozygous mutation severely disrupts the pharyngeal arch artery system. Our data show that haploinsufficiency of Tbx1 is sufficient to generate at least one important component of the DiGeorge syndrome phenotype in mice, and demonstrate the suitability of the mouse for the genetic dissection of microdeletion syndromes.
Critical assessment of methods of protein structure prediction (CASP): round IV.
Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, Maryland 20850, USA. email@example.com
Funded by: NIGMS NIH HHS: GM/DK61967; NLM NIH HHS: LM07085
Proteins 2001;Suppl 5;2-7
Prediction targets of CASP4.
Centre for Protein Engineering, MRC Centre, Cambridge, UK.
Proteins 2001;Suppl 5;8-12
Breast cancer genetics: what we know and what we need.
Abramson Family Cancer Research Institute, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
Breast cancer results from genetic and environmental factors leading to the accumulation of mutations in essential genes. Genetic predisposition may have a strong, almost singular effect, as with BRCA1 and BRCA2, or may represent the cumulative effects of multiple low-penetrance susceptibility alleles. Here we review high- and low-penetrance breast-cancer-susceptibility alleles and discuss ongoing efforts to identify additional susceptibility genes. Ultimately these discoveries will lead to individualized breast cancer risk assessment and a reduction in breast cancer incidence.
Nature medicine 2001;7;5;552-6
SSAHA: a fast search method for large DNA databases.
Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
We describe an algorithm, SSAHA (Sequence Search and Alignment by Hashing Algorithm), for performing fast searches on databases containing multiple gigabases of DNA. Sequences in the database are preprocessed by breaking them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. Searching for a query sequence in the database is done by obtaining from the hash table the "hits" for each k-tuple in the query sequence and then performing a sort on the results. We discuss the effect of the tuple length k on the search speed, memory usage, and sensitivity of the algorithm and present the results of computational experiments which show that SSAHA can be three to four orders of magnitude faster than BLAST or FASTA, while requiring less memory than suffix tree methods. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism (SNP) detection and very large scale sequence assembly. Also, it provides Web-based sequence search facilities for Ensembl projects.
Genome research 2001;11;10;1725-9
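The hashing and sorting steps described in the SSAHA abstract can be illustrated with a small sketch. This is not the published implementation: the function names, the `min_hits` threshold, and the choice to index only non-overlapping k-tuples of the database while scanning every overlapping k-tuple of the query are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Map each k-tuple to the (sequence index, offset) pairs where it occurs.

    Assumption: the database is cut into non-overlapping k-tuples.
    """
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for offset in range(0, len(seq) - k + 1, k):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def search(index, query, k, min_hits=2):
    """Look up every overlapping k-tuple of the query, sort the hits, and
    report runs of hits that share the same sequence and the same shift.

    Returns a list of (sequence index, shift, number of hits) tuples.
    min_hits is a hypothetical reporting threshold.
    """
    hits = []
    for t in range(len(query) - k + 1):
        for seq_id, offset in index.get(query[t:t + k], ()):
            # shift = database offset minus query offset; hits from one
            # ungapped match all share the same shift
            hits.append((seq_id, offset - t, offset))
    hits.sort()

    matches = []
    run_start = 0
    for i in range(1, len(hits) + 1):
        if i == len(hits) or hits[i][:2] != hits[run_start][:2]:
            if i - run_start >= min_hits:
                seq_id, shift, _ = hits[run_start]
                matches.append((seq_id, shift, i - run_start))
            run_start = i
    return matches


# Toy database, made up for illustration
database = ["ACGTACGTTTGACCA", "GGGGCCCCAAAATTTT"]
index = build_index(database, k=4)
print(search(index, "CCCCAAAATTTT", k=4))
```

A larger k makes the table sparser and lookups faster, but a match shorter than roughly 2k - 1 bases can be missed entirely, which is the speed and sensitivity trade-off the abstract refers to.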
Complete genome sequence of a multiple drug resistant Salmonella enterica serovar Typhi CT18.
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. firstname.lastname@example.org
Salmonella enterica serovar Typhi (S. typhi) is the aetiological agent of typhoid fever, a serious invasive bacterial disease of humans with an annual global burden of approximately 16 million cases, leading to 600,000 fatalities. Many S. enterica serovars actively invade the mucosal surface of the intestine but are normally contained in healthy individuals by the local immune defence mechanisms. However, S. typhi has evolved the ability to spread to the deeper tissues of humans, including liver, spleen and bone marrow. Here we have sequenced the 4,809,037-base pair (bp) genome of a S. typhi (CT18) that is resistant to multiple drugs, revealing the presence of hundreds of insertions and deletions compared with the Escherichia coli genome, ranging in size from single genes to large islands. Notably, the genome sequence identifies over two hundred pseudogenes, several corresponding to genes that are known to contribute to virulence in Salmonella typhimurium. This genetic degradation may contribute to the human-restricted host range for S. typhi. CT18 harbours a 218,150-bp multiple-drug-resistance incH1 plasmid (pHCM1), and a 106,516-bp cryptic plasmid (pHCM2), which shows recent common ancestry with a virulence plasmid of Yersinia pestis.
Genome sequence of Yersinia pestis, the causative agent of plague.
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. email@example.com
The Gram-negative bacterium Yersinia pestis is the causative agent of the systemic invasive infectious disease classically referred to as plague, and has been responsible for three human pandemics: the Justinian plague (sixth to eighth centuries), the Black Death (fourteenth to nineteenth centuries) and modern plague (nineteenth century to the present day). The recent identification of strains resistant to multiple drugs and the potential use of Y. pestis as an agent of biological warfare mean that plague still poses a threat to human health. Here we report the complete genome sequence of Y. pestis strain CO92, consisting of a 4.65-megabase (Mb) chromosome and three plasmids of 96.2 kilobases (kb), 70.3 kb and 9.6 kb. The genome is unusually rich in insertion sequences and displays anomalies in GC base-composition bias, indicating frequent intragenomic recombination. Many genes seem to have been acquired from other bacteria and viruses (including adhesins, secretion systems and insecticidal toxins). The genome contains around 150 pseudogenes, many of which are remnants of a redundant enteropathogenic lifestyle. The evidence of ongoing genome fluidity, expansion and decay suggests Y. pestis is a pathogen that has undergone large-scale genetic flux and provides a unique insight into the ways in which new and highly virulent pathogens evolve.
A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms.
Cold Spring Harbor, New York 11724, USA.
We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exons (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.
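As a rough check on the figures quoted in this abstract, the arithmetic below derives the amount of sequence implied by the stated SNP count and spacing, and the share of SNPs estimated to lie in exons. The derived values are back-of-the-envelope inferences from the abstract's rounded numbers, not results reported by the authors, and the function name is hypothetical.

```python
def snp_map_summary(snp_count, spacing_kb, exonic_snps):
    """Derive implied sequence length (Gb) and exonic SNP fraction."""
    implied_sequence_gb = snp_count * spacing_kb / 1e6  # kb -> Gb
    exonic_fraction = exonic_snps / snp_count
    return implied_sequence_gb, exonic_fraction


# Figures taken from the abstract above
sequence_gb, exonic_fraction = snp_map_summary(1_420_000, 1.9, 60_000)
print(f"Implied available sequence: about {sequence_gb:.1f} Gb")
print(f"Share of SNPs in exons: about {exonic_fraction:.1%}")
```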
Granular powders and solids: Insights from numerical simulations
Powders and Solids : Developments in Handling and Processing Technologies 2001
A Text-Independent Speaker Verification System Using Support Vector Machines Classifier
7th European Conference on Speech Communication and Technology, Aalborg, Denmark, September 3-7, 2001
Comparison of human genetic and sequence-based physical maps.
Center for Medical Genetics, Marshfield Medical Research Foundation, Wisconsin 54449, USA.
Recombination is the exchange of information between two homologous chromosomes during meiosis. The rate of recombination per nucleotide, which profoundly affects the evolution of chromosomal segments, is calculated by comparing genetic and physical maps. Human physical maps have been constructed using cytogenetics, overlapping DNA clones and radiation hybrids; but the ultimate and by far the most accurate physical map is the actual nucleotide sequence. The completion of the draft human genomic sequence provides us with the best opportunity yet to compare the genetic and physical maps. Here we describe our estimates of female, male and sex-average recombination rates for about 60% of the genome. Recombination rates varied greatly along each chromosome, from 0 to at least 9 centiMorgans per megabase (cM/Mb). Among several sequence and marker parameters tested, only relative marker position along the metacentric chromosomes in males correlated strongly with recombination rate. We identified several chromosomal regions up to 6 Mb in length with particularly low (deserts) or high (jungles) recombination rates. Linkage disequilibrium was much more common and extended for greater distances in the deserts than in the jungles.
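The comparison of genetic and physical maps mentioned above reduces, in its simplest form, to dividing genetic distance by physical distance between adjacent markers. The sketch below shows that calculation only; the marker positions are invented for illustration, and the authors' actual estimation procedure is not described in the abstract and may differ.

```python
def recombination_rates(markers):
    """Return the recombination rate (cM/Mb) for each adjacent marker pair.

    markers: list of (genetic position in cM, physical position in Mb),
    ordered along the chromosome with strictly increasing Mb positions.
    """
    rates = []
    for (cm_a, mb_a), (cm_b, mb_b) in zip(markers, markers[1:]):
        rates.append((cm_b - cm_a) / (mb_b - mb_a))
    return rates


# Hypothetical markers: a typical interval, a "desert", then a "jungle"
markers = [(0.0, 0.0), (2.0, 1.0), (2.0, 5.0), (11.0, 6.0)]
print(recombination_rates(markers))
```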
Engineering chromosomal rearrangements in mice.
Program in Developmental Biology, Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, Texas 77030, USA. firstname.lastname@example.org
The combination of gene-targeting techniques in mouse embryonic stem cells and the Cre/loxP site-specific recombination system has resulted in the emergence of chromosomal-engineering technology in mice. This advance has opened up new opportunities for modelling human diseases that are associated with chromosomal rearrangements. It has also led to the generation of visibly marked deletions and balancer chromosomes in mice, which provide essential reagents for maximizing the efficiency of large-scale mutagenesis efforts and which will accelerate the functional annotation of mammalian genomes, including the human genome.
Nature reviews. Genetics 2001;2;10;780-90