Sanger Institute - Publications 2004
Number of papers published in 2004: 73
Mutagenic insertion and chromosome engineering resource (MICER).
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs, CB10 1SA, UK.
Embryonic stem cell technology revolutionized biology by providing a means to assess mammalian gene function in vivo. Although it is now routine to generate mice from embryonic stem cells, one of the principal methods used to create mutations, gene targeting, is a cumbersome process. Here we describe the indexing of 93,960 ready-made insertional targeting vectors from two libraries. 5,925 of these vectors can be used directly to inactivate genes with an average targeting efficiency of 28%. Combinations of vectors from the two libraries can be used to disrupt both alleles of a gene or engineer larger genomic changes such as deletions, duplications, translocations or inversions. These indexed vectors constitute a public resource (Mutagenic Insertion and Chromosome Engineering Resource; MICER) for high-throughput, targeted manipulation of the mouse genome.
Nature genetics 2004;36;8;867-71
SCOP database in 2004: refinements integrate structure and sequence family data.
MRC Centre for Protein Engineering, Hills Road, Cambridge CB2 2QH, UK.
The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. Protein domains in SCOP are hierarchically classified into families, superfamilies, folds and classes. The continual accumulation of sequence and structural data allows more rigorous analysis and provides important information for understanding the protein world and its evolutionary repertoire. SCOP participates in a project that aims to rationalize and integrate the data on proteins held in several sequence and structure databases. As part of this project, starting with release 1.63, we have initiated a refinement of the SCOP classification, which introduces a number of changes mostly at the levels below superfamily. The pending SCOP reclassification will be carried out gradually through a number of future releases. In addition to the expanded set of static links to external resources, available at the level of domain entries, we have started modernization of the interface capabilities of SCOP allowing more dynamic links with other databases. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.
Nucleic acids research 2004;32;Database issue;D226-9
Chromosome 21 and Down syndrome: from genomics to pathophysiology.
Department of Genetic Medicine and Development, University of Geneva Medical School and University Hospitals of Geneva, 1 rue Michel-Servet, 1211 Geneva, Switzerland. Stylianos.Antonarakis@medecine.unige.ch
The sequence of chromosome 21 was a turning point for the understanding of Down syndrome. Comparative genomics is beginning to identify the functional components of the chromosome and that in turn will set the stage for the functional characterization of the sequences. Animal models combined with genome-wide analytical methods have proved indispensable for unravelling the mysteries of gene dosage imbalance.
Nature reviews. Genetics 2004;5;10;725-38
Domain insertions in protein structures.
The Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
Domains are the structural, functional or evolutionary units of proteins. Proteins can comprise a single domain or a combination of domains. In multi-domain proteins, the domains almost always occur end-to-end, i.e., one domain follows the C-terminal end of another domain. However, there are exceptions to this common pattern, where multi-domain proteins are formed by insertion of one domain (insert) into another domain (parent). Here, we provide a quantitative description of known insertions in the Protein Data Bank (PDB). We found that 9% of domain combinations observed in non-redundant PDB are insertions. Although 90% of all insertions involve only one insert, proteins can clearly have multiple (nested, two-domain and three-domain) inserts. We also observed correlations between the structure and function of a domain and its tendency to be found as a parent or an insert. There is a bias in insert position towards the C terminus of parents. We observed that the atomic distance between the N and C terminus of an insert is significantly smaller when compared to the N-to-C distance in a parent context or a single domain context. Insertions are found always to occur in loop regions of parent domains. Our observations regarding the relationship between domain insertions and the structure, function and evolution of proteins have implications for protein engineering.
Journal of molecular biology 2004;338;4;633-41
The knockout mouse project.
National Human Genome Research Institute, National Institutes of Health, Building 31, Room 4B09, 31 Center Drive, Bethesda, Maryland 20892, USA. email@example.com
Mouse knockout technology provides a powerful means of elucidating gene function in vivo, and a publicly available genome-wide collection of mouse knockouts would be significantly enabling for biomedical discovery. To date, published knockouts exist for only about 10% of mouse genes. Furthermore, many of these are limited in utility because they have not been made or phenotyped in standardized ways, and many are not freely available to researchers. It is time to harness new technologies and efficiencies of production to mount a high-throughput international effort to produce and phenotype knockouts for all mouse genes, and place these resources into the public domain.
Funded by: Wellcome Trust: 077188
Nature genetics 2004;36;9;921-4
The European dimension for the mouse genome mutagenesis program.
Mouse Clinical Institute (MCI), Illkirch, Strasbourg, France.
The European Mouse Mutagenesis Consortium is the European initiative contributing to the international effort on functional annotation of the mouse genome. Its objectives are to establish and integrate mutagenesis platforms, gene expression resources, phenotyping units, storage and distribution centers and bioinformatics resources. The combined efforts will accelerate our understanding of gene function and of human health and disease.
Funded by: Medical Research Council: MC_U127527203; Telethon: TGM03S01, TGM06S01; Wellcome Trust: 077188
Nature genetics 2004;36;9;925-7
The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.
The discovery of mutations in cancer genes has advanced our understanding of cancer. These results are dispersed across the scientific literature and, with the availability of the human genome sequence, will continue to accrue. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website have been developed to store somatic mutation data in a single location and display the data and other information related to human cancer. To populate this resource, data have so far been extracted from reports in the scientific literature for somatic mutations in four genes: BRAF, HRAS, KRAS2 and NRAS. At present, the database holds information on 66 634 samples and reports a total of 10 647 mutations. Through the web pages, these data can be queried, displayed as figures or tables and exported in a number of formats. COSMIC is an ongoing project that will continue to curate somatic mutation data and release it through the website.
British journal of cancer 2004;91;2;355-8
Bioinformatics of proteases in the MEROPS database.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK. firstname.lastname@example.org
Proteolytic enzymes represent approximately 2% of the total number of proteins present in all types of organisms. Many of these enzymes are of medical importance, and those that are of potential interest as drug targets can be divided into the endogenous enzymes encoded in the human genome, and the exogenous proteases encoded in the genomes of disease-causing organisms. There are also naturally occurring inhibitors of proteases, some of which have pharmaceutical relevance. The MEROPS database provides a rich source of information on proteases and their inhibitors. Storage and retrieval of this information is facilitated by the use of a hierarchical classification system (which was pioneered by the compilers of the database) in which homologous proteases and their inhibitors are divided into clans and families.
Current opinion in drug discovery & development 2004;7;3;334-41
Genome sequence of the enterobacterial phytopathogen Erwinia carotovora subsp. atroseptica and characterization of virulence factors.
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.
The bacterial family Enterobacteriaceae is notable for its well studied human pathogens, including Salmonella, Yersinia, Shigella, and Escherichia spp. However, it also contains several plant pathogens. We report the genome sequence of a plant pathogenic enterobacterium, Erwinia carotovora subsp. atroseptica (Eca) strain SCRI1043, the causative agent of soft rot and blackleg potato diseases. Approximately 33% of Eca genes are not shared with sequenced enterobacterial human pathogens, including some predicted to facilitate unexpected metabolic traits, such as nitrogen fixation and opine catabolism. This proportion of genes also contains an overrepresentation of pathogenicity determinants, including possible horizontally acquired gene clusters for putative type IV secretion and polyketide phytotoxin synthesis. To investigate whether these gene clusters play a role in the disease process, an arrayed set of insertional mutants was generated, and mutations were identified. Plant bioassays showed that these mutants were significantly reduced in virulence, demonstrating both the presence of novel pathogenicity determinants in Eca, and the impact of functional genomics in expanding our understanding of phytopathogenicity in the Enterobacteriaceae.
Proceedings of the National Academy of Sciences of the United States of America 2004;101;30;11105-10
Genomes for medicine.
The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK. email@example.com
We have the human genome sequence. It is freely available, accurate and nearly complete. But is the genome ready for medicine? The new resource is already changing genetic research strategies to find information of medical value. Now we need high-quality annotation of all the functionally important sequences and the variations within them that contribute to health and disease. To achieve this, we need more genome sequences, systematic experimental analyses, and extensive information on human phenotypes. Flexible and user-friendly access to well-annotated genomes will create an environment for innovation, and the potential for unlimited use of sequencing in biomedical research and practice.
Genomic pot pourri.
Nature reviews. Microbiology 2004;2;12;928-9
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organize biology around the sequences of large genomes. It is a comprehensive and integrated source of annotation of large genome sequences, available via interactive website, web services or flat files. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. The facilities of the system range from sequence analysis to data storage and visualization and installations exist around the world both in companies and at academic sites. With a total of nine genome sequences available from Ensembl and more genomes to follow, recent developments have focused mainly on closer integration between genomes and external data.
Nucleic acids research 2004;32;Database issue;D468-70
An overview of Ensembl.
EMBL European Bioinformatics Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. firstname.lastname@example.org
Ensembl (http://www.ensembl.org/) is a bioinformatics project to organize biological information around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of individual genomes, and of the synteny and orthology relationships between them. It is also a framework for integration of any biological data that can be mapped onto features derived from the genomic sequence. Ensembl is available as an interactive Web site, a set of flat files, and as a complete, portable open source software system for handling genomes. All data are provided without restriction, and code is freely available. Ensembl's aims are to continue to "widen" this biological integration to include other model organisms relevant to understanding human biology as they become available; to "deepen" this integration to provide an ever more seamless linkage between equivalent components in different species; and to provide further classification of functional elements in the genome that have been previously elusive.
Funded by: Wellcome Trust: 062023
Genome research 2004;14;5;925-8
As normal as normal can be?
Two papers report that large-scale copy-number variations, ranging in size from 100 kb to 2 Mb, are distributed widely throughout the human genome, and that a high proportion of them encompass known genes. This unexpected level of genome variation has implications for our view of human genetic diversity and phenotypic variation.
Nature genetics 2004;36;9;931-2
New environments, versatile genomes.
The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. email@example.com
Nature reviews. Microbiology 2004;2;6;446-7
Pathogens in decay.
Nature reviews. Microbiology 2004;2;10;774-5
Improved techniques for the identification of pseudogenes.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. firstname.lastname@example.org
Motivation: Pseudogenes are the remnants of genomic sequences of genes which are no longer functional. They are frequent in most eukaryotic genomes, and an important resource for comparative genomics. However, pseudogenes are often mis-annotated as functional genes in sequence databases. Current methods for identifying pseudogenes include methods which rely on the presence of stop codons and frameshifts, as well as methods based on the ratio of non-silent to silent nucleotide substitution rates (dN/dS). A recent survey concluded that 50% of human pseudogenes have no detectable truncation in their pseudo-coding regions, indicating that the former methods lack sensitivity. The latter methods have been used to find sets of genes enriched for pseudogenes, but are not specific enough to accurately separate pseudogenes from expressed genes.
Results: We introduce a program called pseudogene inference from loss of constraint (PSILC) which incorporates novel methods for separating pseudogenes from functional genes. The methods calculate the log-odds score that evolution along the final branch of the gene tree to the query gene has followed one model rather than another: (i) a neutral nucleotide model compared to a Pfam domain encoding model (PSILC(nuc/dom)); (ii) a protein coding model compared to a Pfam domain encoding model (PSILC(prot/dom)). Using the manual annotation of human chromosome 6, we show that both these methods result in a more accurate classification of pseudogenes than dN/dS when a Pfam domain alignment is available.
Availability: PSILC is available from http://www.sanger.ac.uk/Software/PSILC
Funded by: Wellcome Trust
Bioinformatics (Oxford, England) 2004;20 Suppl 1;i94-100
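The log-odds comparison named in the PSILC abstract above can be illustrated with a minimal Python sketch. This is not the PSILC implementation, which scores evolution along the final branch of a gene tree. It only shows the general form of a log-odds score between two competing models. The function name `log_odds` and the per-site probabilities are hypothetical and chosen purely for illustration.

```python
import math

def log_odds(site_probs_a, site_probs_b):
    """Return the sum over sites of log(P_a / P_b).

    site_probs_a and site_probs_b are hypothetical per-site probabilities
    of the observed data under model A and model B. A positive score
    favours model A, a negative score favours model B.
    """
    if len(site_probs_a) != len(site_probs_b):
        raise ValueError("both models must score the same sites")
    return sum(math.log(pa) - math.log(pb)
               for pa, pb in zip(site_probs_a, site_probs_b))

# Illustrative numbers only (assumption, not data from the paper):
neutral_model = [0.25, 0.25, 0.25]   # e.g. a neutral nucleotide model
domain_model = [0.5, 0.125, 0.5]     # e.g. a domain-encoding model
score = log_odds(neutral_model, domain_model)
print(f"log-odds (neutral vs domain): {score:.3f}")
```

In this toy case the score is negative, so the hypothetical domain-encoding model explains the sites better than the neutral one.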
Differential requirements for COPI transport during vertebrate early development.
Division of Developmental Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, United Kingdom.
The coatomer vesicular coat complex is essential for normal Golgi and secretory activities in eukaryotic cells. Through positional cloning of genes controlling zebrafish notochord development, we found that the sneezy, happy, and dopey loci encode the alpha, beta, and beta' subunits of the coatomer complex. Export from mutant endoplasmic reticulum is blocked, Golgi structure is disrupted, and mutant embryos eventually degenerate due to widespread apoptosis. The early embryonic phenotype, however, demonstrates that despite its "housekeeping" functions, coatomer activity is specifically and cell autonomously required for normal chordamesoderm differentiation, perinotochordal basement membrane formation, and melanophore pigmentation. Hence, differential requirements for coatomer activity among embryonic tissues lead to tissue-specific developmental defects. Moreover, we note that the mRNA encoding alpha coatomer is strikingly upregulated in notochord progenitors, and we present data suggesting that alpha coatomer transcription is tuned to activity- and cell type-specific secretory loads.
Developmental cell 2004;7;4;547-58
Chalk and cheese.
Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. email@example.com
Nature reviews. Microbiology 2004;2;7;528-9
Sequencing the environment.
The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
Nature reviews. Microbiology 2004;2;3;184-5
The Ensembl automatic gene annotation system.
The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
As more genomes are sequenced, there is an increasing need for automated first-pass annotation which allows timely access to important genomic information. The Ensembl gene-building system enables fast automated annotation of eukaryotic genomes. It annotates genes based on evidence derived from known protein, cDNA, and EST sequences. The gene-building system rests on top of the core Ensembl (MySQL) database schema and Perl Application Programming Interface (API), and the data generated are accessible through the Ensembl genome browser (http://www.ensembl.org). To date, the Ensembl predicted gene sets are available for the A. gambiae, C. briggsae, zebrafish, mouse, rat, and human genomes and have been heavily relied upon in the publication of the human, mouse, rat, and A. gambiae genome sequence analysis. Here we describe in detail the gene-building system and the algorithms involved. All code and data are freely available from http://www.ensembl.org.
Genome research 2004;14;5;942-50
The DNA sequence and comparative analysis of human chromosome 10.
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK. firstname.lastname@example.org
The finished sequence of human chromosome 10 comprises a total of 131,666,441 base pairs. It represents 99.4% of the euchromatic DNA and includes one megabase of heterochromatic sequence within the pericentromeric region of the short and long arm of the chromosome. Sequence annotation revealed 1,357 genes, of which 816 are protein coding, and 430 are pseudogenes. We observed widespread occurrence of overlapping coding genes (either strand) and identified 67 antisense transcripts. Our analysis suggests that both inter- and intrachromosomal segmental duplications have impacted on the gene count on chromosome 10. Multispecies comparative analysis indicated that we can readily annotate the protein-coding genes with current resources. We estimate that over 95% of all coding exons were identified in this study. Assessment of single base changes between the human chromosome 10 and chimpanzee sequence revealed nonsense mutations in only 21 coding genes with respect to the human sequence.
The Hotdog fold: wrapping up a superfamily of thioesterases and dehydratases.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK. email@example.com
Background: The Hotdog fold was initially identified in the structure of Escherichia coli FabA and subsequently in 4-hydroxybenzoyl-CoA thioesterase from Pseudomonas sp. strain CBS. Since that time structural determinations have shown a number of other apparently unrelated proteins also share the Hotdog fold.
Results: Using sequence analysis we unify a large superfamily of HotDog domains. Membership includes numerous prokaryotic, archaeal and eukaryotic proteins involved in several related, but distinct, catalytic activities, from metabolic roles such as thioester hydrolysis in fatty acid metabolism, to degradation of phenylacetic acid and the environmental pollutant 4-chlorobenzoate. The superfamily also includes FapR, a non-catalytic bacterial homologue that is involved in transcriptional regulation of fatty acid biosynthesis. We have defined 17 subfamilies, with some characterisation. Operon analysis has revealed numerous HotDog domain-containing proteins to be fusion proteins, where two genes, once separate but adjacent open-reading frames, have been fused into one open-reading frame to give a protein with two functional domains. Finally we have generated a Hidden Markov Model library from our analysis, which can be used as a tool for predicting the occurrence of HotDog domains in any protein sequence.
Conclusions: The HotDog domain is both an ancient and ubiquitous motif, with members found in the three branches of life.
BMC bioinformatics 2004;5;109
What can we learn from noncoding regions of similarity between genomes?
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. firstname.lastname@example.org
Background: In addition to known protein-coding genes, large amounts of apparently non-coding sequence are conserved between the human and mouse genomes. It seems reasonable to assume that these conserved regions are more likely to contain functional elements than less-conserved portions of the genome.
Methods: Here we used a motif-oriented machine learning method based on the Relevance Vector Machine algorithm to extract the strongest signal from a set of non-coding conserved sequences.
Results: We successfully fitted models to reflect the non-coding sequences, and showed that the results were quite consistent for repeated training runs. Using the learned models to scan genomic sequence, we found that they often made predictions close to the start of annotated genes. We compared this method with other published promoter-prediction systems, and showed that the set of promoters which are detected by this method is substantially similar to that detected by existing methods.
Conclusions: The results presented here indicate that the promoter signal is the strongest single motif-based signal in the non-coding functional fraction of the genome. They also lend support to the belief that there exists a substantial subset of promoter regions which share several common features including, but not restricted to, a relative abundance of CpG dinucleotides. This subset is detectable by a variety of distinct computational methods.
BMC bioinformatics 2004;5;131
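The "relative abundance of CpG dinucleotides" mentioned in the conclusions above is commonly quantified with an observed/expected CpG ratio. The Python sketch below computes that widely used ratio. It is not the Relevance Vector Machine method of the paper, and the function name `cpg_obs_exp` is a hypothetical name used only for this illustration.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (count of CG * length) / (count of C * count of G).

    Returns 0.0 when the sequence contains no C or no G, since the
    expected count is then zero and the ratio is undefined.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0
    return cpg * n / (c * g)

# Toy sequences (assumptions for illustration, not genomic data):
print(cpg_obs_exp("CGCGCG"))  # CpG-rich
print(cpg_obs_exp("ATATAT"))  # no C or G at all
```

Ratios well above those of bulk genomic sequence are one of the features often used to flag candidate CpG-rich promoter regions.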
The DNA sequence and analysis of human chromosome 13.
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK. email@example.com
Chromosome 13 is the largest acrocentric human chromosome. It carries genes involved in cancer including the breast cancer type 2 (BRCA2) and retinoblastoma (RB1) genes, is frequently rearranged in B-cell chronic lymphocytic leukaemia, and contains the DAOA locus associated with bipolar disorder and schizophrenia. We describe completion and analysis of 95.5 megabases (Mb) of sequence from chromosome 13, which contains 633 genes and 296 pseudogenes. We estimate that more than 95.4% of the protein-coding genes of this chromosome have been identified, on the basis of comparison with other vertebrate genome sequences. Additionally, 105 putative non-coding RNA genes were found. Chromosome 13 has one of the lowest gene densities (6.5 genes per Mb) among human chromosomes, and contains a central region of 38 Mb where the gene density drops to only 3.1 genes per Mb.
The ENCODE (ENCyclopedia Of DNA Elements) Project.
The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence. The pilot phase of the Project is focused on a specified 30 megabases (approximately 1%) of the human genome sequence and is organized as an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. The results of this pilot phase will guide future efforts to analyze the entire human genome.
Science (New York, N.Y.) 2004;306;5696;636-40
RNA interference: human genes hit the big screen.
A census of human cancer genes.
Cancer Genome Project, Human Genome Analysis Group and Pfam Group, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs, CB10 1SA, UK.
Nature reviews. Cancer 2004;4;3;177-83
A family with severe insulin resistance and diabetes due to a mutation in AKT2.
Department of Clinical Biochemistry, University of Cambridge, Addenbrooke's Hospital, Hills Road, Cambridge CB2 2QQ, UK.
Inherited defects in signaling pathways downstream of the insulin receptor have long been suggested to contribute to human type 2 diabetes mellitus. Here we describe a mutation in the gene encoding the protein kinase AKT2/PKBbeta in a family that shows autosomal dominant inheritance of severe insulin resistance and diabetes mellitus. Expression of the mutant kinase in cultured cells disrupted insulin signaling to metabolic end points and inhibited the function of coexpressed, wild-type AKT. These findings demonstrate the central importance of AKT signaling to insulin sensitivity in humans.
Funded by: Wellcome Trust: 078986
Science (New York, N.Y.) 2004;304;5675;1325-8
Genome sequence of the Brown Norway rat yields insights into mammalian evolution.
Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, MS BCM226, One Baylor Plaza, Houston, Texas 77030, USA <http://www.hgsc.bcm.tmc.edu>.
The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.
Funded by: NHGRI NIH HHS: R01 HG002939, U01 HG002137, U01 HG002137-02S2, U54 HG003273; NHLBI NIH HHS: R01 HL064541
Chromatin architecture of the human genome: gene-rich domains are enriched in open chromatin fibers.
MRC Human Genetics Unit, Edinburgh, EH4 2XU, Scotland.
We present an analysis of chromatin fiber structure across the human genome. Compact and open chromatin fiber structures were separated by sucrose sedimentation and their distributions analyzed by hybridization to metaphase chromosomes and genomic microarrays. We show that compact chromatin fibers originate from some sites of heterochromatin (C-bands), and G-bands (euchromatin). Open chromatin fibers correlate with regions of highest gene density, but not with gene expression since inactive genes can be in domains of open chromatin, and active genes in regions of low gene density can be embedded in compact chromatin fibers. Moreover, we show that chromatin fiber structure impacts on further levels of chromatin condensation. Regions of open chromatin fibers are cytologically decondensed and have a distinctive nuclear organization. We suggest that domains of open chromatin may create an environment that facilitates transcriptional activation and could provide an evolutionary constraint to maintain clusters of genes together along chromosomes.
Mismatch repair genes identified using genetic screens in Blm-deficient embryonic stem cells.
The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.
Phenotype-driven recessive genetic screens in diploid organisms require a strategy to render the mutation homozygous. Although homozygous mutant mice can be generated by breeding, a reliable method to make homozygous mutations in cultured cells has not been available, limiting recessive screens in culture. Cultured embryonic stem (ES) cells provide access to all of the genes required to elaborate the fundamental components and physiological systems of a mammalian cell. Here we have exploited the high rate of mitotic recombination in Bloom's syndrome protein (Blm)-deficient ES cells to generate a genome-wide library of homozygous mutant cells from heterozygous mutations induced with a revertible gene trap retrovirus. We have screened this library for cells with defects in DNA mismatch repair (MMR), a system that detects and repairs base-base mismatches. We demonstrate the recovery of cells with homozygous mutations in known and novel MMR genes. We identified Dnmt1 as a novel MMR gene and confirmed that Dnmt1-deficient ES cells exhibit microsatellite instability, providing a mechanistic explanation for the role of Dnmt1 in cancer. The combination of insertional mutagenesis with Blm-deficient ES cells establishes a new approach for phenotype-based recessive genetic screens in ES cells.
Institute of Medicine, Law and Bioethics, School of Law, University of Manchester, Oxford Road, Manchester M13 9PL, UK.
This paper proposes, elaborates and defends a principle of genetic equity. In doing so it articulates, explains and justifies what might be meant by the concept of 'human dignity' in a way that is clear, defensible and consistent with, but by no means the same as, the plethora of appeals to human dignity found in contemporary bioethics, and more particularly in international instruments on bioethics. We propose the following principle of genetic equity: humans are born equal; they are entitled to freedom from discrimination and equality of opportunity to flourish; genetic information may not be used to limit that equality.
Nature reviews. Genetics 2004;5;10;796-800
Pathogenomics of non-pathogens.
Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
Analysing the genomes of non-pathogenic microorganisms, in addition to its basic and applied scientific interest, can also shed considerable light on the study of pathogenic microorganisms. Two of the three microorganisms described here are rarely pathogenic, but carry genetic determinants that have previously been identified as being important for the pathogenicity of other microorganisms. This underlines the growing understanding that many so-called 'virulence genes' are probably involved in more general interactions between the microorganism and the host or the environment.
Nature reviews. Microbiology 2004;2;2;91
Complete genomes of two clinical Staphylococcus aureus strains: evidence for the rapid evolution of virulence and drug resistance.
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.
Staphylococcus aureus is an important nosocomial and community-acquired pathogen. Its genetic plasticity has facilitated the evolution of many virulent and drug-resistant strains, presenting a major and constantly changing clinical challenge. We sequenced the approximately 2.8-Mbp genomes of two disease-causing S. aureus strains isolated from distinct clinical settings: a recent hospital-acquired representative of the epidemic methicillin-resistant S. aureus EMRSA-16 clone (MRSA252), a clinically important and globally prevalent lineage; and a representative of an invasive community-acquired methicillin-susceptible S. aureus clone (MSSA476). A comparative-genomics approach was used to explore the mechanisms of evolution of clinically important S. aureus genomes and to identify regions affecting virulence and drug resistance. The genome sequences of MRSA252 and MSSA476 have a well conserved core region but differ markedly in their accessory genetic elements. MRSA252 is the most genetically diverse S. aureus strain sequenced to date: approximately 6% of the genome is novel compared with other published genomes, and it contains several unique genetic elements. MSSA476 is methicillin-susceptible, but it contains a novel Staphylococcal chromosomal cassette (SCC) mec-like element (designated SCC(476)), which is integrated at the same site on the chromosome as SCCmec elements in MRSA strains but encodes a putative fusidic acid resistance protein. The crucial role that accessory elements play in the rapid evolution of S. aureus is clearly illustrated by comparing the MSSA476 genome with that of an extremely closely related MRSA community-acquired strain; the differential distribution of large mobile elements carrying virulence and drug-resistance determinants may be responsible for the clinically important phenotypic differences in these strains.
Proceedings of the National Academy of Sciences of the United States of America 2004;101;26;9786-91
Genomic plasticity of the causative agent of melioidosis, Burkholderia pseudomallei.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.
Burkholderia pseudomallei is a recognized biothreat agent and the causative agent of melioidosis. This Gram-negative bacterium exists as a soil saprophyte in melioidosis-endemic areas of the world and accounts for 20% of community-acquired septicaemias in northeastern Thailand where half of those affected die. Here we report the complete genome of B. pseudomallei, which is composed of two chromosomes of 4.07 megabase pairs and 3.17 megabase pairs, showing significant functional partitioning of genes between them. The large chromosome encodes many of the core functions associated with central metabolism and cell growth, whereas the small chromosome carries more accessory functions associated with adaptation and survival in different niches. Genomic comparisons with closely and more distantly related bacteria revealed a greater level of gene order conservation and a greater number of orthologous genes on the large chromosome, suggesting that the two replicons have distinct evolutionary origins. A striking feature of the genome was the presence of 16 genomic islands (GIs) that together made up 6.1% of the genome. Further analysis revealed these islands to be variably present in a collection of invasive and soil isolates but entirely absent from the clonally related organism B. mallei. We propose that variable horizontal gene acquisition by B. pseudomallei is an important feature of recent genetic evolution and that this has resulted in a genetically diverse pathogenic species.
Proceedings of the National Academy of Sciences of the United States of America 2004;101;39;14240-5
Gene map of the extended human MHC.
Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
The major histocompatibility complex (MHC) is the most important region in the vertebrate genome with respect to infection and autoimmunity, and is crucial in adaptive and innate immunity. Decades of biomedical research have revealed many MHC genes that are duplicated, polymorphic and associated with more diseases than any other region of the human genome. The recent completion of several large-scale studies offers the opportunity to assimilate the latest data into an integrated gene map of the extended human MHC. Here, we present this map and review its content in relation to paralogy, polymorphism, immune function and disease.
Funded by: Multiple Sclerosis Society: 588
Nature reviews. Genetics 2004;5;12;889-99
A new trade framework for global healthcare R&D.
Wellcome Trust Sanger Institute in Hinxton, United Kingdom.
PLoS biology 2004;2;2;E52
DNA sequence and analysis of human chromosome 9.
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
Chromosome 9 is highly structurally polymorphic. It contains the largest autosomal block of heterochromatin, which is heteromorphic in 6-8% of humans, whereas pericentric inversions occur in more than 1% of the population. The finished euchromatic sequence of chromosome 9 comprises 109,044,351 base pairs and represents >99.6% of the region. Analysis of the sequence reveals many intra- and interchromosomal duplications, including segmental duplications adjacent to both the centromere and the large heterochromatic block. We have annotated 1,149 genes, including genes implicated in male-to-female sex reversal, cancer and neurodegenerative disease, and 426 pseudogenes. The chromosome contains the largest interferon gene cluster in the human genome. There is also a region of exceptionally high gene and G + C content including genes paralogous to those in the major histocompatibility complex. We have also detected recently duplicated genes that exhibit different rates of sequence divergence, presumably reflecting natural selection.
Origins of chromosomal rearrangement hotspots in the human genome: evidence from the AZFa deletion hotspots.
Molecular Genetics Laboratory, McDonald Institute for Archaeological Research, University of Cambridge, Downing Street, Cambridge, CB2 3ER, UK.
Background: The origins of the recombination hotspots that are a common feature of both allelic and non-allelic homologous recombination in the human genome are poorly understood. We have investigated, by comparative sequencing, the evolution of two hotspots of non-allelic homologous recombination on the Y chromosome that lie within paralogous sequences known to sponsor deletions resulting in male infertility.
Results: These recombination hotspots are characterized by signatures of concerted evolution, which indicate that gene conversion between paralogs has been predominant in shaping their recent evolution. By contrast, the paralogous sequences that surround the hotspots exhibit little evidence of gene conversion. A second feature of these rearrangement hotspots is the extreme interspecific sequence divergence (around 2.5%) that places them among the most divergent orthologous sequences between humans and chimpanzees.
Conclusions: Several hominid-specific gene conversion events have rendered these hotspots better substrates for chromosomal rearrangements in humans than in chimpanzees or gorillas. Monte Carlo simulations of sequence evolution suggest that extreme sequence divergence is a direct consequence of gene conversion between paralogs. We propose that the coincidence of signatures of concerted evolution and recurrent breakpoints of chromosomal rearrangement (mapped at the sequence level) may enable the identification of putative rearrangement hotspots from analysis of comparative sequences from great apes.
Genome biology 2004;5;8;R55
Integrative annotation of 21,037 human genes validated by full-length cDNA clones.
Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan.
The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.
Funded by: NHLBI NIH HHS: R01 HL064541
PLoS biology 2004;2;6;e162
Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution.
Genome Sequencing Center, Washington University School of Medicine, Campus Box 8501, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA.
We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome--composed of approximately one billion base pairs of sequence and an estimated 20,000-23,000 genes--provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.
Funded by: Biotechnology and Biological Sciences Research Council: BBS/B/13462
Finishing the euchromatic sequence of the human genome.
The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers approximately 99% of the euchromatic genome and is accurate to an error rate of approximately 1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.
Funded by: NHGRI NIH HHS: U54 HG003273
The impact of SNP density on fine-scale patterns of linkage disequilibrium.
Wellcome Trust Centre for Human Genetics, University of Oxford, UK.
Linkage disequilibrium (LD) is a measure of the degree of association between alleles in a population. The detection of disease-causing variants by association with neighbouring single nucleotide polymorphisms (SNPs) depends on the existence of strong LD between them. Previous studies have indicated that the extent of LD is highly variable in different chromosome regions and different populations, demonstrating the importance of genome-wide accurate measurement of LD at high resolution throughout the human genome. A uniform feature of these studies has been the inability to detect LD in regions of low marker density. To investigate the dependence of LD patterns on marker selection we performed a high-resolution study in African-American, Asian and UK Caucasian populations. We selected over 5000 SNPs with an average spacing of approximately 1 SNP per 2 kb after validating ca 12 000 SNPs derived from a dense SNP collection (1 SNP per 0.3 kb on average). Applications of different statistical methods of LD assessment highlight similar areas of high and low LD. However, at high resolution, features such as overall sequence coverage in LD blocks and block boundaries vary substantially with respect to marker density. Model-based linkage disequilibrium unit (LDU) maps appear robust to marker density and consistently influenced by marker allele frequency. The results suggest that very dense marker sets will be required to yield stable views of fine-scale LD in the human genome.
Funded by: NEI NIH HHS: EY-126562
Human molecular genetics 2004;13;6;577-88
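The abstract above describes LD as the degree of association between alleles. As a minimal sketch (not the authors' method), the standard pairwise measures D, D' and r² for two biallelic loci can be computed from haplotype and allele frequencies. The function name `ld_stats` and its parameters are illustrative assumptions.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    All names here are hypothetical, chosen for illustration only.
    """
    # D is the deviation of the observed haplotype frequency from independence.
    d = p_ab - p_a * p_b

    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = (d * d) / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Two loci in complete association: allele A always travels with allele B.
    print(ld_stats(0.5, 0.5, 0.5))
```

Marker allele frequency enters both D' (through its maximum) and r² (through the denominator), so the choice of markers can plausibly shift the apparent LD pattern.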
The Wnt co-receptors Lrp5 and Lrp6 are essential for gastrulation in mice.
Department of Molecular and Cell Biology, University of California at Berkeley, Berkeley, CA 94720-3200, USA.
Recent work has identified LDL receptor-related family members, Lrp5 and Lrp6, as co-receptors for the transduction of Wnt signals. Our analysis of mice carrying mutations in both Lrp5 and Lrp6 demonstrates that the functions of these genes are redundant and are essential for gastrulation. Lrp5;Lrp6 double homozygous mutants fail to establish a primitive streak, although the anterior visceral endoderm and anterior epiblast fates are specified. Thus, Lrp5 and Lrp6 are required for posterior patterning of the epiblast, consistent with a role in transducing Wnt signals in the early embryo. Interestingly, Lrp5(+/-);Lrp6(-/-) embryos die shortly after gastrulation and exhibit an accumulation of cells at the primitive streak and a selective loss of paraxial mesoderm. A similar phenotype is observed in Fgf8 and Fgfr1 mutant embryos and provides genetic evidence in support of a molecular link between the Fgf and Wnt signaling pathways in patterning nascent mesoderm. Lrp5(+/-);Lrp6(-/-) embryos also display an expansion of anterior primitive streak derivatives and anterior neurectoderm that correlates with increased Nodal expression in these embryos. The effect of reducing, but not eliminating, Wnt signaling in Lrp5(+/-);Lrp6(-/-) mutant embryos provides important insight into the interplay between Wnt, Fgf and Nodal signals in patterning the early mouse embryo.
Development (Cambridge, England) 2004;131;12;2803-15
5,000 RNAi experiments on a chip.
Nature methods 2004;1;2;103-4
A map of the interactome network of the metazoan C. elegans.
Dana-Farber Cancer Institute and Department of Genetics, Harvard Medical School, 44 Binney Street, Boston, MA 02115, USA.
To initiate studies on how protein-protein interaction (or "interactome") networks relate to multicellular functions, we have mapped a large fraction of the Caenorhabditis elegans interactome network. Starting with a subset of metazoan-specific proteins, more than 4000 interactions were identified from high-throughput, yeast two-hybrid (HT-Y2H) screens. Independent coaffinity purification assays experimentally validated the overall quality of this Y2H data set. Together with already described Y2H interactions and interologs predicted in silico, the current version of the Worm Interactome (WI5) map contains approximately 5500 interactions. Topological and biological features of this interactome network, as well as its integration with phenome and transcriptome data sets, lead to numerous biological hypotheses.
Funded by: NIA NIH HHS: R01 AG011085; NIGMS NIH HHS: R01 GM034059, R01 GM034059-18
Science (New York, N.Y.) 2004;303;5657;540-3
Genomic and genetic analysis of Bordetella bacteriophages encoding reverse transcriptase-mediated tropism-switching cassettes.
Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, Los Angeles, California 90095, USA.
Liu et al. recently described a group of related temperate bacteriophages that infect Bordetella subspecies and undergo a unique template-dependent, reverse transcriptase-mediated tropism switching phenomenon (Liu et al., Science 295: 2091-2094, 2002). Tropism switching results from the introduction of single nucleotide substitutions at defined locations in the VR1 (variable region 1) segment of the mtd (major tropism determinant) gene, which determines specificity for receptors on host bacteria. In this report, we describe the complete nucleotide sequences of the 42.5- to 42.7-kb double-stranded DNA genomes of three related phage isolates and characterize two additional regions of variability. Forty-nine coding sequences were identified. Of these coding sequences, bbp36 contained VR2 (variable region 2), which is highly dynamic and consists of a variable number of identical 19-bp repeats separated by one of three 5-bp spacers, and bpm encodes a DNA adenine methylase with unusual site specificity and a homopolymer tract that functions as a hotspot for frameshift mutations. Morphological and sequence analysis suggests that these Bordetella phage are genetic hybrids of P22 and T7 family genomes, lending further support to the idea that regions encoding protein domains, single genes, or blocks of genes are readily exchanged between bacterial and phage genomes. Bordetella bacteriophages are capable of transducing genetic markers in vitro, and by using animal models, we demonstrated that lysogenic conversion can take place in the mouse respiratory tract during infection.
Funded by: NIAID NIH HHS: 2-T32-AI07323, AI38417, R01 AI038417, T32 AI007323; NIGMS NIH HHS: GM-08042, T32 GM008042
Journal of bacteriology 2004;186;5;1503-17
Organization and evolution of a gene-rich region of the mouse genome: a 12.7-Mb region deleted in the Del(13)Svea36H mouse.
Medical Research Council Mammalian Genetics Unit, Harwell, Oxfordshire, United Kingdom.
Del(13)Svea36H (Del36H) is a deletion of approximately 20% of mouse chromosome 13 showing conserved synteny with human chromosome 6p22.1-6p22.3/6p25. The human region is lost in some deletion syndromes and is the site of several disease loci. Heterozygous Del36H mice show numerous phenotypes and may model aspects of human genetic disease. We describe 12.7 Mb of finished, annotated sequence from Del36H. Del36H has a higher gene density than the draft mouse genome, reflecting high local densities of three gene families (vomeronasal receptors, serpins, and prolactins) which are greatly expanded relative to human. Transposable elements are concentrated near these gene families. We therefore suggest that their neighborhoods are gene factories, regions of frequent recombination in which gene duplication is more frequent. The gene families show different proportions of pseudogenes, likely reflecting different strengths of purifying selection and/or gene conversion. They are also associated with relatively low simple sequence concentrations, which vary across the region with a periodicity of approximately 5 Mb. Del36H contains numerous evolutionarily conserved regions (ECRs). Many lie in noncoding regions, are detectable in species as distant as Ciona intestinalis, and therefore are candidate regulatory sequences. This analysis will facilitate functional genomic analysis of Del36H and provides insights into mouse genome evolution.
Genome research 2004;14;10A;1888-901
The fine-scale structure of recombination rate variation in the human genome.
Department of Statistics, University of Oxford, Oxford OX1 3TG, UK.
The nature and scale of recombination rate variation are largely unknown for most species. In humans, pedigree analysis has documented variation at the chromosomal level, and sperm studies have identified specific hotspots in which crossing-over events cluster. To address whether this picture is representative of the genome as a whole, we have developed and validated a method for estimating recombination rates from patterns of genetic variation. From extensive single-nucleotide polymorphism surveys in European and African populations, we find evidence for extreme local rate variation spanning four orders of magnitude, in which 50% of all recombination events take place in less than 10% of the sequence. We demonstrate that recombination hotspots are a ubiquitous feature of the human genome, occurring on average every 200 kilobases or less, but recombination occurs preferentially outside genes.
Science (New York, N.Y.) 2004;304;5670;581-4
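The statement in the abstract above that half of all recombination falls in a small fraction of the sequence is a concentration summary. Here is a minimal sketch of how one might compute it from a list of intervals with estimated rates. This is an illustration under assumed inputs, not the authors' estimation method, and the function name and arguments are hypothetical.

```python
def fraction_of_sequence_for_share(lengths, rates, share=0.5):
    """Smallest fraction of sequence that accounts for `share` of recombination.

    lengths: interval lengths (e.g. in bp)
    rates:   recombination rate per unit length in each interval
    Intervals are taken in order of decreasing rate; the last interval
    needed is counted only partially. Names are illustrative assumptions.
    """
    total_len = sum(lengths)
    events = [length * rate for length, rate in zip(lengths, rates)]
    total_events = sum(events)
    if total_len <= 0 or total_events <= 0:
        raise ValueError("need positive total length and total recombination")

    target = share * total_events
    order = sorted(range(len(lengths)), key=lambda i: rates[i], reverse=True)

    acc_events = 0.0
    acc_len = 0.0
    for i in order:
        if acc_events + events[i] >= target:
            # Only part of this interval is needed to reach the target.
            acc_len += (target - acc_events) / rates[i]
            return acc_len / total_len
        acc_events += events[i]
        acc_len += lengths[i]
    return 1.0


if __name__ == "__main__":
    # Toy example: a 10-unit hotspot with 9x the rate of a 90-unit background.
    print(fraction_of_sequence_for_share([10, 90], [9.0, 1.0]))
```

In the toy example the hotspot alone carries half of the recombination, so the function returns 10% of the sequence. With a uniform rate it would return 50%.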
Interaction between differentially methylated regions partitions the imprinted genes Igf2 and H19 into parent-specific chromatin loops.
Laboratory of Developmental Genetics and Imprinting, Developmental Genetics Programme, The Babraham Institute, Cambridge CB2 4AT, UK.
Imprinted genes are expressed from only one of the parental alleles and are marked epigenetically by DNA methylation and histone modifications. The paternally expressed gene insulin-like growth-factor 2 (Igf2) is separated by approximately 100 kb from the maternally expressed noncoding gene H19 on mouse distal chromosome 7. Differentially methylated regions in Igf2 and H19 contain chromatin boundaries, silencers and activators and regulate the reciprocal expression of the two genes in a methylation-sensitive manner by allowing them exclusive access to a shared set of enhancers. Various chromatin models have been proposed that separate Igf2 and H19 into active and silent domains. Here we used a GAL4 knock-in approach as well as the chromosome conformation capture technique to show that the differentially methylated regions in the imprinted genes Igf2 and H19 interact in mice. These interactions are epigenetically regulated and partition maternal and paternal chromatin into distinct loops. This generates a simple epigenetic switch for Igf2 through which it moves between an active and a silent chromatin domain.
Nature genetics 2004;36;8;889-93
Eukaryotes: not beyond compare.
Nature reviews. Microbiology 2004;2;11;856-7
Strength in diversity.
Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
Nature reviews. Microbiology 2004;2;5;358-9
A transcriptomic analysis of the phylum Nematoda.
Hospital for Sick Children, 555 University Avenue, Departments of Biochemistry and Medical Genetics and Microbiology, University of Toronto, Toronto, Ontario M5G 1X8, Canada.
The phylum Nematoda occupies a huge range of ecological niches, from free-living microbivores to human parasites. We analyzed the genomic biology of the phylum using 265,494 expressed-sequence tag sequences, corresponding to 93,645 putative genes, from 30 species, including 28 parasites. From 35% to 70% of each species' genes had significant similarity to proteins from the model nematode Caenorhabditis elegans. More than half of the putative genes were unique to the phylum, and 23% were unique to the species from which they were derived. We have not yet come close to exhausting the genomic diversity of the phylum. We identified more than 2,600 different known protein domains, some of which had differential abundances between major taxonomic groups of nematodes. We also defined 4,228 nematode-specific protein families from nematode-restricted genes: this class of genes probably underpins species- and higher-level taxonomic disparity. Nematode-specific families are particularly interesting as drug and vaccine targets.
Nature genetics 2004;36;12;1259-67
The bordetellae: lessons from genomics.
Department of Microbiology, University of Guelph, Guelph, Ontario N1G 2W1, Canada.
Nature reviews. Microbiology 2004;2;5;379-90
DNA methylation profiling of the human major histocompatibility complex: a pilot study for the human epigenome project.
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.
The Human Epigenome Project aims to identify, catalogue, and interpret genome-wide DNA methylation phenomena. Occurring naturally on cytosine bases at cytosine-guanine dinucleotides, DNA methylation is intimately involved in diverse biological processes and the aetiology of many diseases. Differentially methylated cytosines give rise to distinct profiles, thought to be specific for gene activity, tissue type, and disease state. The identification of such methylation variable positions will significantly improve our understanding of genome biology and our ability to diagnose disease. Here, we report the results of the pilot study for the Human Epigenome Project entailing the methylation analysis of the human major histocompatibility complex. This study involved the development of an integrated pipeline for high-throughput methylation analysis using bisulphite DNA sequencing, discovery of methylation variable positions, epigenotyping by matrix-assisted laser desorption/ionisation mass spectrometry, and development of an integrated public database available at http://www.epigenome.org. Our analysis of DNA methylation levels within the major histocompatibility complex, including regulatory exonic and intronic regions associated with 90 genes in multiple tissues and individuals, reveals a bimodal distribution of methylation profiles (i.e., the vast majority of the analysed regions were either hypo- or hypermethylated), tissue specificity, inter-individual variation, and correlation with independent gene expression data.
PLoS biology 2004;2;12;e405
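The bimodal hypo-/hypermethylated pattern reported in the abstract above can be illustrated with a small sketch. In bisulphite sequencing, unmethylated cytosines are read as T while methylated cytosines remain C, so a per-site methylation level is commonly taken as C / (C + T). The thresholds of 0.3 and 0.7 below are arbitrary assumptions for illustration, not values from the study, and the function names are hypothetical.

```python
def methylation_level(c_reads, t_reads):
    """Fraction of reads at a CpG cytosine that remain C after bisulphite treatment."""
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return c_reads / total


def classify(level, low=0.3, high=0.7):
    """Label a methylation level; `low` and `high` are assumed cut-offs."""
    if level <= low:
        return "hypomethylated"
    if level >= high:
        return "hypermethylated"
    return "intermediate"


if __name__ == "__main__":
    # Hypothetical (C, T) read counts for three sites.
    for c, t in [(9, 1), (1, 9), (5, 5)]:
        level = methylation_level(c, t)
        print(c, t, round(level, 2), classify(level))
```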
Identification of mammalian microRNA host genes and transcription units.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.
To derive a global perspective on the transcription of microRNAs (miRNAs) in mammals, we annotated the genomic position and context of this class of noncoding RNAs (ncRNAs) in the human and mouse genomes. Of the 232 known mammalian miRNAs, we found that 161 overlap with 123 defined transcription units (TUs). We identified miRNAs within introns of 90 protein-coding genes with a broad spectrum of molecular functions, and in both introns and exons of 66 mRNA-like noncoding RNAs (mlncRNAs). In addition, novel families of miRNAs based on host gene identity were identified. The transcription patterns of all miRNA host genes were curated from a variety of sources illustrating spatial, temporal, and physiological regulation of miRNA expression. These findings strongly suggest that miRNAs are transcribed in parallel with their host transcripts, and that the two different transcription classes of miRNAs ('exonic' and 'intronic') identified here may require slightly different mechanisms of biogenesis.
Genome research 2004;14;10A;1902-10
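The 'exonic' and 'intronic' classes described in the abstract above depend on where a miRNA sits relative to its host transcription unit. The following is a minimal sketch of that positional classification, assuming inclusive coordinates on a single strand. It ignores strand and alternative transcripts, and the function names are hypothetical rather than taken from the paper.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def classify_mirna(mirna, tu_span, exons):
    """Classify a miRNA relative to one transcription unit (TU).

    mirna, tu_span: (start, end) inclusive coordinates
    exons: list of (start, end) exon coordinates of the TU
    Strand is ignored here for simplicity (an assumption of this sketch).
    """
    if not overlaps(*mirna, *tu_span):
        return "intergenic"
    if any(overlaps(*mirna, *exon) for exon in exons):
        return "exonic"
    return "intronic"


if __name__ == "__main__":
    tu = (100, 500)
    exons = [(100, 140), (300, 500)]
    print(classify_mirna((150, 170), tu, exons))
    print(classify_mirna((120, 135), tu, exons))
    print(classify_mirna((600, 620), tu, exons))
```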
Periodic gene expression program of the fission yeast cell cycle.
The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.
Cell-cycle control of transcription seems to be universal, but little is known about its global conservation and biological significance. We report on the genome-wide transcriptional program of the Schizosaccharomyces pombe cell cycle, identifying 407 periodically expressed genes of which 136 show high-amplitude changes. These genes cluster in four major waves of expression. The forkhead protein Sep1p regulates mitotic genes in the first cluster, including Ace2p, which activates transcription in the second cluster during the M-G1 transition and cytokinesis. Other genes in the second cluster, which are required for G1-S progression, are regulated by the MBF complex independently of Sep1p and Ace2p. The third cluster coincides with S phase and a fourth cluster contains genes weakly regulated during G2 phase. Despite conserved cell-cycle transcription factors, differences in regulatory circuits between fission and budding yeasts are evident, revealing evolutionary plasticity of transcriptional control. Periodic transcription of most genes is not conserved between the two yeasts, except for a core set of approximately 40 genes that seem to be universally regulated during the eukaryotic cell cycle and may have key roles in cell-cycle progression.
Funded by: Cancer Research UK: A6517; Wellcome Trust: 077118
Nature genetics 2004;36;8;809-17
Methylation of histone H4 lysine 20 controls recruitment of Crb2 to sites of DNA damage.
The Wellcome Trust/Cancer Research UK Gurdon Institute and Department of Pathology, Tennis Court Road, Cambridge CB2 1QN, United Kingdom.
Histone lysine methylation is a key regulator of gene expression and heterochromatin function, but little is known as to how this modification impinges on other chromatin activities. Here we demonstrate that a previously uncharacterized SET domain protein, Set9, is responsible for H4-K20 methylation in the fission yeast Schizosaccharomyces pombe. Surprisingly, H4-K20 methylation does not have any apparent role in the regulation of gene expression or heterochromatin function. Rather, we find the modification has a role in DNA damage response. Loss of Set9 activity or mutation of H4-K20 markedly impairs cell survival after genotoxic challenge and compromises the ability of cells to maintain checkpoint mediated cell cycle arrest. Genetic experiments link Set9 to Crb2, a homolog of the mammalian checkpoint protein 53BP1, and the enzyme is required for Crb2 localization to sites of DNA damage. These results argue that H4-K20 methylation functions as a "histone mark" required for the recruitment of the checkpoint protein Crb2.
Funded by: Cancer Research UK: A6517; Wellcome Trust: 077118
The otter annotation system.
The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
With the completion of the human genome sequence and the availability of genome sequences for other vertebrates, the task of manual annotation at the large genome scale has become a priority. Possibly even more important is the requirement to curate and improve this annotation in the light of future data. For this to be possible, there is a need for tools to access and manage the annotation. Ensembl provides an excellent means for storing gene structures, genome features, and sequence, but it does not support the extra textual data necessary for manual annotation. We have extended Ensembl to create the Otter manual annotation system. This comprises a relational database schema for storing the manual annotation data, an application-programming interface (API) to access it, an extensible markup language (XML) format to allow transfer of the data, and a server to allow multiuser/multimachine access to the data. We have also written a data-adaptor plugin for the Apollo Browser/Editor to enable it to utilize an Otter server. The Otter database is currently used by the Vertebrate Genome Annotation (VEGA) site (http://vega.sanger.ac.uk), which provides access to manually curated human chromosomes. Support is also being developed for using the AceDB annotation editor, FMap, via a Perl wrapper called Lace. The Human and Vertebrate Annotation (HAVANA) group annotators at the Sanger center are using this to annotate human chromosomes 1 and 20.
Genome research 2004;14;5;963-70
Microarray based comparative genomic hybridisation (array-CGH) detects submicroscopic chromosomal deletions and duplications in patients with learning disability/mental retardation and dysmorphic features.
University of Cambridge Department of Medical Genetics, Addenbrooke's Hospital, Hills Road, Cambridge, UK.
The underlying causes of learning disability and dysmorphic features in many patients remain unidentified despite extensive investigation. Routine karyotype analysis is not sensitive enough to detect subtle chromosome rearrangements (less than 5 Mb). The presence of subtle DNA copy number changes was investigated by array-CGH in 50 patients with learning disability and dysmorphism, employing a DNA microarray constructed from large insert clones spaced at approximately 1 Mb intervals across the genome. Twelve copy number abnormalities were identified in 12 patients (24% of the total): seven deletions (six apparently de novo and one inherited from a phenotypically normal parent) and five duplications (one de novo and four inherited from phenotypically normal parents). Altered segments ranged in size from those involving a single clone to regions as large as 14 Mb. No recurrent deletion or duplication was identified within this cohort of patients. On the basis of these results, we anticipate that array-CGH will become a routine method of genome-wide screening for imbalanced rearrangements in children with learning disability.
Journal of medical genetics 2004;41;4;241-8
A public gene trap resource for mouse functional genomics.
Funded by: Wellcome Trust: 077188
Nature genetics 2004;36;6;543-4
The Ensembl Web site: mechanics of a genome browser.
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK. JWS@sanger.ac.uk
The Ensembl Web site (http://www.ensembl.org/) is the principal user interface to the data of the Ensembl project, and currently serves >500,000 pages (approximately 2.5 million hits) per week, providing access to >80 GB (gigabyte) of data to users in more than 80 countries. Built atop an open-source platform comprising Apache/mod_perl and the MySQL relational database management system, it is modular, extensible, and freely available. It is being actively reused and extended in several different projects, and has been downloaded and installed in companies and academic institutions worldwide. Here, we describe some of the technical features of the site, with particular reference to its dynamic configuration that enables it to handle disparate data from multiple species.
Genome research 2004;14;5;951-5
TILLING--a high-throughput harvest for functional genomics.
Vertebrate Development and Genetics (Team 31), Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
Nature reviews. Genetics 2004;5;2;145-50
Lung cancer: intragenic ERBB2 kinase mutations in tumours.
Cancer Genome Project, Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK.
The protein-kinase family is the most frequently mutated gene family found in human cancer and faulty kinase enzymes are being investigated as promising targets for the design of antitumour therapies. We have sequenced the gene encoding the transmembrane protein tyrosine kinase ERBB2 (also known as HER2 or Neu) from 120 primary lung tumours and identified 4% that have mutations within the kinase domain; in the adenocarcinoma subtype of lung cancer, 10% of cases had mutations. ERBB2 inhibitors, which have so far proved to be ineffective in treating lung cancer, should now be clinically re-evaluated in the specific subset of patients with lung cancer whose tumours carry ERBB2 mutations.
Complete MHC haplotype sequencing for common disease gene mapping.
Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.
The future systematic mapping of variants that confer susceptibility to common diseases requires the construction of a fully informative polymorphism map. Ideally, every base pair of the genome would be sequenced in many individuals. Here, we report 4.75 Mb of contiguous sequence for each of two common haplotypes of the major histocompatibility complex (MHC), to which susceptibility to >100 diseases has been mapped. The autoimmune disease-associated haplotypes HLA-A3-B7-Cw7-DR15 and HLA-A1-B8-Cw7-DR3 were sequenced in their entirety through a bacterial artificial chromosome (BAC) cloning strategy using the consanguineous cell lines PGF and COX, respectively. The two sequences were annotated to encompass all described splice variants of expressed genes. We defined the complete variation content of the two haplotypes, revealing >18,000 variations between them. Average SNP densities ranged from less than one SNP per kilobase to >60. Acquisition of complete and accurate sequence data over polymorphic regions such as the MHC from large-insert cloned DNA provides a definitive resource for the construction of informative genetic maps, and avoids the limitation of chromosome regions that are refractory to PCR amplification.
Funded by: Multiple Sclerosis Society: 588
Genome research 2004;14;6;1176-87
Cancer: understanding the target.
Autocatalytic RNA cleavage in the human beta-globin pre-mRNA promotes transcription termination.
Sir William Dunn School of Pathology, University of Oxford, South Parks Road, Oxford OX1 3RE, UK.
New evidence indicates that termination of transcription is an important regulatory step, closely related to transcriptional interference and even transcriptional initiation. However, how this occurs is poorly understood. Recently, in vivo analysis of transcriptional termination for the human beta-globin gene revealed a new phenomenon--co-transcriptional cleavage (CoTC). This primary cleavage event within beta-globin pre-messenger RNA, downstream of the poly(A) site, is critical for efficient transcriptional termination by RNA polymerase II. Here we show that the CoTC process in the human beta-globin gene involves an RNA self-cleaving activity. We characterize the autocatalytic core of the CoTC ribozyme and show its functional role in efficient termination in vivo. The identified core CoTC is highly conserved in the 3' flanking regions of other primate beta-globin genes. Functionally, it resembles the 3' processive, self-cleaving ribozymes described for the protein-encoding genes from the myxomycetes Didymium iridis and Physarum polycephalum, indicating evolutionary conservation of this molecular process. We predict that regulated autocatalytic cleavage elements within pre-mRNAs may be a general phenomenon and that functionally it may provide the entry point for exonucleases involved in mRNA maturation, turnover and, in particular, transcriptional termination.
The use of genome annotation data and its impact on biological conclusions.
Nature genetics 2004;36;10;1028-9
Mechanism of activation of the RAF-ERK signaling pathway by oncogenic mutations of B-RAF.
Section of Structural Biology, The Institute of Cancer Research, Chester Beatty Laboratories, 237 Fulham Road, London SW3 6JB, UK.
Over 30 mutations of the B-RAF gene associated with human cancers have been identified, the majority of which are located within the kinase domain. Here we show that of 22 B-RAF mutants analyzed, 18 have elevated kinase activity and signal to ERK in vivo. Surprisingly, three mutants have reduced kinase activity towards MEK in vitro but, by activating C-RAF in vivo, signal to ERK in cells. The structures of wild type and oncogenic V599E B-RAF kinase domains in complex with the RAF inhibitor BAY43-9006 show that the activation segment is held in an inactive conformation by association with the P loop. The clustering of most mutations to these two regions suggests that disruption of this interaction converts B-RAF into its active conformation. The high activity mutants signal to ERK by directly phosphorylating MEK, whereas the impaired activity mutants stimulate MEK by activating endogenous C-RAF, possibly via an allosteric or transphosphorylation mechanism.
Polybromo protein BAF180 functions in mammalian cardiac chamber maturation.
Department of Molecular and Cell Biology, University of California, Berkeley, California 94720-3204, USA.
BAF and PBAF are two related mammalian chromatin remodeling complexes essential for gene expression and development. PBAF, but not BAF, is able to potentiate transcriptional activation in vitro mediated by nuclear receptors, such as RXRalpha, VDR, and PPARgamma. Here we show that the ablation of PBAF-specific subunit BAF180 in mouse embryos results in severe hypoplastic ventricle development and trophoblast placental defects, similar to those found in mice lacking RXRalpha and PPARgamma. Embryonic aggregation analyses reveal that in contrast to PPARgamma-deficient mice, the heart defects are likely a direct result of BAF180 ablation, rather than an indirect consequence of trophoblast placental defects. We identified potential target genes for BAF180 in heart development, such as S100A13 as well as retinoic acid (RA)-induced targets RARbeta2 and CRABPII. Importantly, BAF180 is recruited to the promoter of these target genes and BAF180 deficiency affects the RA response for CRABPII and RARbeta2. These studies reveal unique functions of PBAF in cardiac chamber maturation.
Genes & development 2004;18;24;3106-16
Fine mapping, gene content, comparative sequencing, and expression analyses support Ctla4 and Nramp1 as candidates for Idd5.1 and Idd5.2 in the nonobese diabetic mouse.
Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, Department of Medical Genetics, Cambridge Institute for Medical Research, University of Cambridge, Cambridge CB2 2XY, UK.
At least two loci that determine susceptibility to type 1 diabetes in the NOD mouse have been mapped to chromosome 1, Idd5.1 (insulin-dependent diabetes 5.1) and Idd5.2. In this study, using a series of novel NOD.B10 congenic strains, Idd5.1 has been defined to a 2.1-Mb region containing only four genes, Ctla4, Icos, Als2cr19, and Nrp2 (neuropilin-2), thereby excluding a major candidate gene, Cd28. Genomic sequence comparison of the two functional candidate genes, Ctla4 and Icos, from the B6 (resistant at Idd5.1) and the NOD (susceptible at Idd5.1) strains revealed 62 single nucleotide polymorphisms (SNPs), only two of which were in coding regions. One of these coding SNPs, base 77 of Ctla4 exon 2, is a synonymous SNP and has been correlated previously with type 1 diabetes susceptibility and differential expression of a CTLA-4 isoform. Additional expression studies in this work support the hypothesis that this SNP in exon 2 is the genetic variation causing the biological effects of Idd5.1. Analysis of additional congenic strains has also localized Idd5.2 to a small region (1.52 Mb) of chromosome 1, but in contrast to the Idd5.1 interval, Idd5.2 contains at least 45 genes. Notably, the Idd5.2 region still includes the functionally polymorphic Nramp1 gene. Future experiments to test the identity of Idd5.1 and Idd5.2 as Ctla4 and Nramp1, respectively, can now be justified using approaches to specifically alter or mimic the candidate causative SNPs.
Journal of immunology (Baltimore, Md. : 1950) 2004;173;1;164-73
The SSAHA trace server.
Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE 2004;544-545