Sanger Institute - Publications 2003

Number of papers published in 2003: 113

  • The human TREM gene cluster at 6p21.1 encodes both activating and inhibitory single IgV domain receptors and includes NKp44.

    Allcock RJ, Barrow AD, Forbes S, Beck S and Trowsdale J

    Cambridge Institute for Medical Research, Wellcome Trust/MRC building, Addenbrookes Hospital, Cambridge, GB.

    We have characterized a cluster of single immunoglobulin variable (IgV) domain receptors centromeric of the major histocompatibility complex (MHC) on human chromosome 6. In addition to triggering receptor expressed on myeloid cells (TREM)-1 and TREM2, the cluster contains NKp44, a triggering receptor whose expression is limited to NK cells. We identified three new related genes and two gene fragments within a cluster of approximately 200 kb. Two of the three new genes lack charged residues in their transmembrane domain tails. Further, one of the genes contains two potential immunotyrosine inhibitory motifs in its cytoplasmic tail, suggesting that it delivers inhibitory signals. The human and mouse TREM clusters appear to have diverged such that there are unique sequences in each species. Finally, each gene in the TREM cluster was expressed in a different range of cell types.

    European journal of immunology 2003;33;2;567-77

  • Co-duplication of olfactory receptor and MHC class I genes in the mouse major histocompatibility complex.

    Amadou C, Younger RM, Sims S, Matthews LH, Rogers J, Kumanovics A, Ziegler A, Beck S and Lindahl KF

    Howard Hughes Medical Institute and Center for Immunology, University of Texas Southwestern Medical Center, Dallas, 75390-9050, USA.

    We report the 897 kb sequence of a cluster of olfactory receptor (OR) genes located at the distal end of the major histocompatibility complex (MHC) class I region on mouse chromosome 17 of strain 129/SvJ (H2bc). With additional information from the mouse genome draft sequence, we identified 59 OR loci (approximately 20% pseudogenes) in contrast to only 25 OR loci (approximately 50% pseudogenes) in the corresponding centromeric OR cluster that is part of the 'extended MHC class I region' on human chromosome 6. Comparative analysis leads to three major observations: (i) most of the OR subfamilies have evolved independently in the two species, expanding more in the mouse, and resulting in co-orthologs--subfamilies of highly similar paralogs that keep orthologous relationships with their human counterparts; (ii) three of the mouse OR subfamilies have no orthologs in humans; and (iii) MHC class I loci are interspersed in the OR cluster in mouse but not in human, and were subjected to co-duplication with OR genes. Screening of our sequence against the available sequences of other strains/haplotypes revealed that most of the OR loci are polymorphic and that the number of OR loci may vary among strains/haplotypes. Our findings that MHC-linked OR loci share duplication with MHC class I loci, have duplicated extensively and are polymorphic revives questions about potential reciprocal influences acting on the dynamics and evolution of the H2 region and the H2-linked OR loci.

    Human molecular genetics 2003;12;22;3025-40

  • Gene annotation: prediction and testing.

    Ashurst JL and Collins JE

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    Fifty years after the publication of DNA structure, the whole human genome sequence will be officially finished. This achievement marks the beginning of the task to catalogue every human gene and identify each of their functions and expression patterns. Currently, researchers estimate that there are about 30,000 human genes and approximately 70% of these can be automatically predicted using a combination of ab initio and similarity-based programs. However, to experimentally investigate every gene's function, the research community requires a high-quality annotation of alternative splicing, pseudogenes, and promoter regions that can only be provided by manual intervention. Manual curation of the human genome will be a long-term project as experimental data are continually produced to confirm or refine the predictions, and new features such as noncoding RNAs and enhancers have not been fully identified. Such a highly curated human gene-set made publicly available will be a great asset for the experimental community and for future comparative genome projects.

    Annual review of genomics and human genetics 2003;4;69-88

  • Managing peptidases in the genomic era.

    Barrett AJ, Tolle DP and Rawlings ND

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The enzymes that hydrolyse peptide bonds, called peptidases or proteases, are very important to mankind and are also very numerous. The many scientists working on these enzymes are rapidly acquiring new data, and they need good methods to store it and retrieve it. The storage and retrieval require effective systems of classification and nomenclature, and it is the design and implementation of these that we mean by 'managing' peptidases. Ten years ago Rawlings and Barrett proposed the first comprehensive system for the classification of peptidases, which included a set of simple names for the families. In the present article we describe how the system has developed since then. The peptidase classification has now been adopted for use by many other databases, and provides the structure around which the MEROPS protease database is built.

    Biological chemistry 2003;384;6;873-82

  • Candidate gene association study in type 2 diabetes indicates a role for genes involved in beta-cell function as well as insulin action.

    Barroso I, Luan J, Middelberg RP, Harding AH, Franks PW, Jakes RW, Clayton D, Schafer AJ, O'Rahilly S and Wareham NJ

    Incyte, Palo Alto, California, USA.

    Type 2 diabetes is an increasingly common, serious metabolic disorder with a substantial inherited component. It is characterised by defects in both insulin secretion and action. Progress in identification of specific genetic variants predisposing to the disease has been limited. To complement ongoing positional cloning efforts, we have undertaken a large-scale candidate gene association study. We examined 152 SNPs in 71 candidate genes for association with diabetes status and related phenotypes in 2,134 Caucasians in a case-control study and an independent quantitative trait (QT) cohort in the United Kingdom. Polymorphisms in five of 15 genes (33%) encoding molecules known to primarily influence pancreatic beta-cell function--ABCC8 (sulphonylurea receptor), KCNJ11 (KIR6.2), SLC2A2 (GLUT2), HNF4A (HNF4alpha), and INS (insulin)--significantly altered disease risk, and in three genes, the risk allele, haplotype, or both had a biologically consistent effect on a relevant physiological trait in the QT study. We examined 35 genes predicted to have their major influence on insulin action, and three (9%)--INSR, PIK3R1, and SOS1--showed significant associations with diabetes. These results confirm the genetic complexity of Type 2 diabetes and provide evidence that common variants in genes influencing pancreatic beta-cell function may make a significant contribution to the inherited component of this disease. This study additionally demonstrates that the systematic examination of panels of biological candidate genes in large, well-characterised populations can be an effective complement to positional cloning approaches. The absence of large single-gene effects and the detection of multiple small effects accentuate the need for the study of larger populations in order to reliably identify the size of effect we now expect for complex diseases.

    PLoS biology 2003;1;1;E20

  • The TROVE module: a common element in Telomerase, Ro and Vault ribonucleoproteins.

    Bateman A and Kickhoefer V

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    Background: Ribonucleoproteins carry out a variety of important tasks in the cell. In this study we show that a number of these contain a novel module, that we speculate mediates RNA-binding.

    Results: The TROVE module--Telomerase, Ro and Vault module--is found in TEP1 and Ro60, the protein components of three ribonucleoprotein particles. This novel module, consisting of one or more domains, may be involved in binding the RNA components of the three RNPs, which are telomerase RNA, Y RNA and vault RNA. A second conserved region in these proteins is shown to be a member of the vWA domain family. The vWA domain in TEP1 is closely related to the previously recognised vWA domain in VPARP, a second component of the vault particle. This vWA domain may mediate interactions between these vault components or bind as yet unidentified components of the RNPs.

    Conclusions: This work suggests that a number of ribonucleoprotein components use a common RNA-binding module. The TROVE module is also found in bacterial ribonucleoproteins suggesting an ancient origin for these ribonucleoproteins.

    BMC bioinformatics 2003;4;49

  • The CHAP domain: a large family of amidases including GSP amidase and peptidoglycan hydrolases.

    Bateman A and Rawlings ND

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK CB10 1SA.

    Cleavage of peptidoglycan plays an important role in bacterial cell division, cell growth and cell lysis. Here, we reveal that several known peptidoglycan amidases fall into a family, which includes many proteins of previously unknown function. The family includes two different peptidoglycan cleavage activities: L-muramoyl-L-alanine amidase and D-alanyl-glycyl endopeptidase activity. The family includes the amidase portion of the bifunctional glutathionylspermidine synthase/amidase enzyme from bacteria and pathogenic trypanosomes. The glutathionylspermidine synthase is thought to be a key component of the alternative pathway in trypanosomes for protection from oxygen-radical damage and has been proposed as a potential drug target. The CHAP (cysteine, histidine-dependent amidohydrolases/peptidases) domain is often found in association with other domains that cleave peptidoglycan. The large number of multifunctional hydrolases suggests that they might act in a cooperative manner to cleave specialized substrates.

    Trends in biochemical sciences 2003;28;5;234-7

  • Revisiting the mouse mitochondrial DNA sequence.

    Bayona-Bafaluy MP, Acín-Pérez R, Mullikin JC, Park JS, Moreno-Loshuertos R, Hu P, Pérez-Martos A, Fernández-Silva P, Bai Y and Enríquez JA

    Departamento de Bioquímica y Biología Molecular y Celular, Universidad de Zaragoza, Miguel Servet 177, Zaragoza 50013, Spain.

    The existence of reliable mtDNA reference sequences for each species is of great relevance in a variety of fields, from phylogenetic and population genetics studies to pathogenetic determination of mtDNA variants in humans or in animal models of mtDNA-linked diseases. We present compelling evidence for the existence of sequencing errors on the current mouse mtDNA reference sequence. This includes the deletion of a full codon in two genes, the substitution of one amino acid on five occasions and also the involvement of tRNA and rRNA genes. The conclusions are supported by: (i) the re-sequencing of the original cell line used by Bibb and Clayton, the LA9 cell line, (ii) the sequencing of a second L-derivative clone (L929), and (iii) the comparison with 12 other mtDNA sequences from live mice, 10 of them maternally related with the mouse from which the L cells were generated. Two of the latest sequences are reported for the first time in this study (Balb/cJ and C57BL/6J). In addition, we found that both the LA9 and L929 mtDNAs also contain private clone polymorphic variants that, at least in the case of L929, promote functional impairment of the oxidative phosphorylation system. Consequently, the mtDNA of the strain used for the mouse genome project (C57BL/6J) is proposed as the new standard for the mouse mtDNA sequence.

    Nucleic acids research 2003;31;18;5349-55

  • DNA sequence variation of Homo sapiens.

    Bentley DR

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom.

    Cold Spring Harbor symposia on quantitative biology 2003;68;55-63

  • Sequencing and analysis of the genome of the Whipple's disease bacterium Tropheryma whipplei.

    Bentley SD, Maiwald M, Murphy LD, Pallen MJ, Yeats CA, Dover LG, Norbertczak HT, Besra GS, Quail MA, Harris DE, von Herbay A, Goble A, Rutter S, Squares R, Squares S, Barrell BG, Parkhill J and Relman DA

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    Background: Whipple's disease is a rare multisystem chronic infection, involving the intestinal tract as well as various other organs. The causative agent, Tropheryma whipplei, is a Gram-positive bacterium about which little is known. Our aim was to investigate the biology of this organism by generating and analysing the complete DNA sequence of its genome.

    Methods: We isolated and propagated T whipplei strain TW08/27 from the cerebrospinal fluid of a patient diagnosed with Whipple's disease. We generated the complete sequence of the genome by the whole genome shotgun method, and analysed it with a combination of automatic and manual bioinformatic techniques.

    Findings: Sequencing revealed a condensed 925938 bp genome with a lack of key biosynthetic pathways and a reduced capacity for energy metabolism. A family of large surface proteins was identified, some associated with large amounts of non-coding repetitive DNA, and an unexpected degree of sequence variation.

    Interpretation: The genome reduction and lack of metabolic capabilities point to a host-restricted lifestyle for the organism. The sequence variation indicates both known and novel mechanisms for the elaboration and variation of surface structures, and suggests that immune evasion and host interaction play an important part in the lifestyle of this persistent bacterial pathogen.

    Funded by: NIDDK NIH HHS: DK56339

    Lancet 2003;361;9358;637-44

  • The devil is in the detail.

    Bentley SD, Thomson NR, Sebaihia M, Crossman LC and Parkhill J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    Trends in microbiology 2003;11;6;256-8

  • Ensembl: a genome infrastructure.

    Birney E and Ensembl Team

    EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

    Cold Spring Harbor symposia on quantitative biology 2003;68;213-5

  • The intestinal protozoan parasite Entamoeba histolytica contains 20 cysteine protease genes, of which only a small subset is expressed during in vitro cultivation.

    Bruchhaus I, Loftus BJ, Hall N and Tannich E

    Bernhard Nocht Institute for Tropical Medicine, 20359 Hamburg, Germany.

    Cysteine proteases are known to be important pathogenicity factors of the protozoan parasite Entamoeba histolytica. So far, a total of eight genes coding for cysteine proteases have been identified in E. histolytica, two of which are absent in the closely related nonpathogenic species E. dispar. However, present knowledge is restricted to enzymes expressed during in vitro cultivation of the parasite, which might represent only a subset of the entire repertoire. Taking advantage of the current E. histolytica genome-sequencing efforts, we analyzed databases containing more than 99% of all ameba gene sequences for the presence of cysteine protease genes. A total of 20 full-length genes was identified (including all eight genes previously reported), which show 10 to 86% sequence identity. The various genes obviously originated from two separate ancestors since they form two distinct clades. Despite cathepsin B-like substrate specificities, all of the ameba polypeptides are structurally related to cathepsin L-like enzymes. None of the previously described enzymes but 7 of the 12 newly identified proteins are unique compared to cathepsins of higher eukaryotes in that they are predicted to have transmembrane or glycosylphosphatidylinositol anchor attachment domains. Southern blot analysis revealed that orthologous sequences for all of the newly identified proteases are present in E. dispar. Interestingly, the majority of the various cysteine protease genes are not expressed in E. histolytica or E. dispar trophozoites during in vitro cultivation. Therefore, it is likely that at least some of these enzymes are required for infection of the human host and/or for completion of the parasite life cycle.

    Eukaryotic cell 2003;2;3;501-9

  • Novel consensus DNA-binding sequence for BRCA1 protein complexes.

    Cable PL, Wilson CA, Calzone FJ, Rauscher FJ, Scully R, Livingston DM, Li L, Blackwell CB, Futreal PA and Afshari CA

    Laboratory of Molecular Carcinogenesis, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA.

    Increasing evidence continues to emerge supporting the early hypothesis that BRCA1 might be involved in transcriptional processes. BRCA1 physically associates with more than 15 different proteins involved in transcription and is paradoxically involved in both transcriptional activation and repression. However, the underlying mechanism by which BRCA1 affects the gene expression of various genes remains speculative. In this study, we provide evidence that BRCA1 protein complexes interact with specific DNA sequences. We provide data showing that the upstream stimulatory factor 2 (USF2) physically associates with BRCA1 and is a component of this DNA-binding complex. Interestingly, these DNA-binding complexes are downregulated in breast cancer cell lines containing wild-type BRCA1, providing a critical link between modulations of BRCA1 function in sporadic breast cancers that do not involve germline BRCA1 mutations. The functional specificity of BRCA1 tumor suppression for breast and ovarian tissues is supported by our experiments, which demonstrate that BRCA1 DNA-binding complexes are modulated by serum and estrogen. Finally, functional analysis indicates that missense mutations in BRCA1 that lead to subsequent cancer susceptibility may result in improper gene activation. In summary, these findings establish a role for endogenous BRCA1 protein complexes in transcription via a defined DNA-binding sequence and indicate that one function of BRCA1 is to co-regulate the expression of genes involved in various cellular processes.

    Molecular carcinogenesis 2003;38;2;85-96

  • The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro.

    Camon E, Magrane M, Barrell D, Binns D, Fleischmann W, Kersey P, Mulder N, Oinn T, Maslen J, Cox A and Apweiler R

    EMBL Outstation-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

    Gene Ontology Annotation (GOA) is a project run by the European Bioinformatics Institute (EBI) that aims to provide assignments of terms from the Gene Ontology (GO) resource to gene products in a number of its databases. In the first stage of this project, GO assignments have been applied to a data set representing the complete human proteome by a combination of electronic mappings and manual curation. This vocabulary has also been applied to the nonredundant proteome sets for all other completely sequenced organisms as well as to proteins from a wide range of organisms where the proteome is not yet complete.

    Funded by: NHGRI NIH HHS: 1R01HGO2273-01

    Genome research 2003;13;4;662-72

  • A matter of fitness.

    Cerdeño-Tárraga A, Crossman L and Parkhill J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    Trends in microbiology 2003;11;3;111-2

  • The complete genome sequence and analysis of Corynebacterium diphtheriae NCTC13129.

    Cerdeño-Tárraga AM, Efstratiou A, Dover LG, Holden MT, Pallen M, Bentley SD, Besra GS, Churcher C, James KD, De Zoysa A, Chillingworth T, Cronin A, Dowd L, Feltwell T, Hamlin N, Holroyd S, Jagels K, Moule S, Quail MA, Rabbinowitsch E, Rutherford KM, Thomson NR, Unwin L, Whitehead S, Barrell BG and Parkhill J

    The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Corynebacterium diphtheriae is a Gram-positive, non-spore forming, non-motile, pleomorphic rod belonging to the genus Corynebacterium and the actinomycete group of organisms. The organism produces a potent bacteriophage-encoded protein exotoxin, diphtheria toxin (DT), which causes the symptoms of diphtheria. This potentially fatal infectious disease is controlled in many developed countries by an effective immunisation programme. However, the disease has made a dramatic return in recent years, in particular within the Eastern European region. The largest, and still on-going, outbreak since the advent of mass immunisation started within Russia and the newly independent states of the former Soviet Union in the 1990s. We have sequenced the genome of a UK clinical isolate (biotype gravis strain NCTC13129), representative of the clone responsible for this outbreak. The genome consists of a single circular chromosome of 2 488 635 bp, with no plasmids. It provides evidence that recent acquisition of pathogenicity factors goes beyond the toxin itself, and includes iron-uptake systems, adhesins and fimbrial proteins. This is in contrast to Corynebacterium's nearest sequenced pathogenic relative, Mycobacterium tuberculosis, where there is little evidence of recent horizontal DNA acquisition. The genome itself shows an unusually extreme large-scale compositional bias, being noticeably higher in G+C near the origin than at the terminus.

    Nucleic acids research 2003;31;22;6516-23

  • BRAF and KRAS mutations in colorectal hyperplastic polyps and serrated adenomas.

    Chan TL, Zhao W, Leung SY, Yuen ST and Cancer Genome Project

    Department of Pathology, The University of Hong Kong, Queen Mary Hospital, Hong Kong.

    Colorectal cancer is believed to progress through an adenoma-carcinoma sequence. However, recent evidence increasingly supports the existence of an alternative route for colorectal carcinogenesis through serrated polyps, a group that encompasses a morphological spectrum, including hyperplastic polyp (HP), admixed hyperplastic polyp/adenoma (HP/AD), and serrated adenoma (SA; the latter two manifest epithelial dysplasia). We have studied a large series of serrated polyps for BRAF and KRAS mutations. BRAF mutations were detected in 18 of 50 (36%) HPs, 2 of 10 (20%) HP/ADs, and 9 of 9 (100%) SAs. Twenty-six of 29 mutations caused amino acid substitutions at valine 599, the known hotspot. KRAS mutations were detected in 9 of 50 (18%) HPs, 6 of 10 (60%) HP/ADs, and 0 of 9 (0%) SAs. BRAF and KRAS mutations are mutually exclusive (P = 0.001). The associations of BRAF mutations with SAs (P < 0.001) and KRAS mutations with HP/ADs (P = 0.005) are statistically significant. A majority (90%) of the serrated polyps showing dysplasia had mutations in either BRAF or KRAS, significantly different from those without dysplasia (54%; P = 0.014). Our data highlight the important role of activation of the RAS-RAF-mitogen-activated protein/extracellular signal-regulated kinase kinase-extracellular signal-regulated kinase-mitogen-activated protein kinase pathway in the initiation and progression of serrated neoplasms. Acquisition of a BRAF mutation appears to be associated with the progression of HP to SA, whereas progression to HP/AD is predominantly associated with acquisition of a KRAS mutation. The high incidence of BRAF mutations in HPs and SAs is consistent with the notion that the group of colorectal cancers carrying BRAF mutations may harbor most that have progressed through the HP-SA-carcinoma pathway.

    Cancer research 2003;63;16;4878-81
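
    The percentages quoted in the abstract above can be rechecked from the reported counts. The following is a minimal sketch, not the authors' analysis: the dictionary and function names are illustrative, and the combined figures assume (as the abstract states) that no polyp carries both a BRAF and a KRAS mutation, so the two counts can simply be added.

```python
# Counts transcribed from the abstract: (mutated, total) per polyp type.
BRAF = {"HP": (18, 50), "HP/AD": (2, 10), "SA": (9, 9)}
KRAS = {"HP": (9, 50), "HP/AD": (6, 10), "SA": (0, 9)}

# HP/AD and SA are the two groups described as showing epithelial dysplasia.
DYSPLASTIC = ("HP/AD", "SA")


def percent(mutated, total):
    """Return mutated/total as a percentage (float)."""
    return 100.0 * mutated / total


def either_mutation(groups):
    """Polyps with a BRAF or KRAS mutation across the given groups.

    Assumption: the mutations are mutually exclusive, so no polyp is
    counted twice when the BRAF and KRAS counts are added.
    """
    mutated = sum(BRAF[g][0] + KRAS[g][0] for g in groups)
    total = sum(BRAF[g][1] for g in groups)
    return mutated, total


if __name__ == "__main__":
    for name, table in (("BRAF", BRAF), ("KRAS", KRAS)):
        for group, (m, t) in table.items():
            print(f"{name} {group}: {m}/{t} = {percent(m, t):.0f}%")
    for label, groups in (("dysplasia", DYSPLASTIC), ("no dysplasia", ("HP",))):
        m, t = either_mutation(groups)
        print(f"{label}: {m}/{t} = {percent(m, t):.1f}%")
```

    Under these assumptions the BRAF counts sum to the 29 mutations mentioned in the abstract, the non-dysplastic group gives 27 of 50 (54%), and the dysplastic groups give 17 of 19 (about 89.5%), which is close to the rounded 90% quoted.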

  • Comparative and functional analyses of LYL1 loci establish marsupial sequences as a model for phylogenetic footprinting.

    Chapman MA, Charchar FJ, Kinston S, Bird CP, Grafham D, Rogers J, Grützner F, Graves JA, Green AR and Göttgens B

    Department of Haematology, Cambridge Institute for Medical Research, Cambridge University, Hills Road, Cambridge CB2 2XY, UK.

    Comparative genomic sequence analysis is a powerful technique for identifying regulatory regions in genomic DNA. However, its utility largely depends on the evolutionary distances between the species involved. Here we describe the screening of a genomic BAC library from the stripe-faced dunnart, Sminthopsis macroura, formerly known as the narrow-footed marsupial mouse. We isolated a clone containing the LYL1 locus, completely sequenced the 60.6-kb insert, and compared it with orthologous human and mouse sequences. Noncoding homology was substantially reduced in the human/dunnart analysis compared with human/mouse, yet we could readily identify all promoters and exons. Human/mouse/dunnart alignments of the LYL1 candidate promoter allowed us to identify putative transcription factor binding sites, revealing a pattern highly reminiscent of critical regulatory regions of the LYL1 paralogue, SCL. This newly identified LYL1 promoter showed strong activity in myeloid progenitor cells and was bound in vivo by Fli1, Elf1, and Gata2--transcription factors all previously shown to bind to the SCL stem cell enhancer. This study represents the first large-scale comparative analysis involving marsupial genomic sequence and demonstrates that such comparisons provide a powerful approach to characterizing mammalian regulatory elements.

    Genomics 2003;81;3;249-59

  • Global transcriptional responses of fission yeast to environmental stress.

    Chen D, Toone WM, Mata J, Lyne R, Burns G, Kivinen K, Brazma A, Jones N and Bähler J

    The Wellcome Trust Sanger Institute, Cambridge CB10 1SA, United Kingdom.

    We explored transcriptional responses of the fission yeast Schizosaccharomyces pombe to various environmental stresses. DNA microarrays were used to characterize changes in expression profiles of all known and predicted genes in response to five stress conditions: oxidative stress caused by hydrogen peroxide, heavy metal stress caused by cadmium, heat shock caused by temperature increase to 39 degrees C, osmotic stress caused by sorbitol, and DNA damage caused by the alkylating agent methylmethane sulfonate. We define a core environmental stress response (CESR) common to all, or most, stresses. There was a substantial overlap between CESR genes of fission yeast and the genes of budding yeast that are stereotypically regulated during stress. CESR genes were controlled primarily by the stress-activated mitogen-activated protein kinase Sty1p and the transcription factor Atf1p. S. pombe also activated gene expression programs more specialized for a given stress or a subset of stresses. In general, these "stress-specific" responses were less dependent on the Sty1p mitogen-activated protein kinase pathway and may involve specific regulatory factors. Promoter motifs associated with some of the groups of coregulated genes were identified. We compare and contrast global regulation of stress genes in fission and budding yeasts and discuss evolutionary implications.

    Funded by: Cancer Research UK: A6517; Wellcome Trust: 077118

    Molecular biology of the cell 2003;14;1;214-29

  • Ensembl 2002: accommodating comparative genomics.

    Clamp M, Andrews D, Barker D, Bevan P, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Hubbard T, Kasprzyk A, Keefe D, Lehvaslaiho H, Iyer V, Melsopp C, Mongin E, Pettett R, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I and Birney E

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The Ensembl database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of human, mouse and other genome sequences, available as either an interactive web site or as flat files. Ensembl also integrates manually annotated gene structures from external sources where available. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. These range from sequence analysis to data storage and visualisation and installations exist around the world in both companies and at academic sites. With both human and mouse genome sequences available and more vertebrate sequences to follow, many of the recent developments in Ensembl have focused on developing automatic comparative genome analysis and visualisation.

    Nucleic acids research 2003;31;1;38-42

  • Genome-wide screening for complete genetic loss in prostate cancer by comparative hybridization onto cDNA microarrays.

    Clark J, Edwards S, Feber A, Flohr P, John M, Giddings I, Crossland S, Stratton MR, Wooster R, Campbell C and Cooper CS

    Molecular Carcinogenesis Section, Male Urological Cancer Research Center, Institute of Cancer Research, Sutton, Surrey, UK.

    We demonstrate that comparative genomic hybridization (CGH) onto cDNA microarrays may be used to carry out genome-wide screens for regions of genetic loss, including homozygous (complete) deletions that may represent the possible location of tumour suppressor genes in human cancer. Screening of the prostate cancer cell lines LNCaP, PC3 and DU145 allowed the mapping of specific regions where genome copy number appeared altered and led to the identification of two novel regions of complete loss at 17q21.31 (500 kb spanning STAT3) and at 10q23.1 (50-350 kb spanning SFTPA2) in the PC3 cell line.

    Oncogene 2003;22;8;1247-52

  • Enhanced protein domain discovery by using language modeling techniques from speech recognition.

    Coin L, Bateman A and Durbin R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, United Kingdom.

    Most modern speech recognition uses probabilistic models to interpret a sequence of sounds. Hidden Markov models, in particular, are used to recognize words. The same techniques have been adapted to find domains in protein sequences of amino acids. To increase word accuracy in speech recognition, language models are used to capture the information that certain word combinations are more likely than others, thus improving detection based on context. However, to date, these context techniques have not been applied to protein domain discovery. Here we show that the application of statistical language modeling methods can significantly enhance domain recognition in protein sequences. As an example, we discover an unannotated Tf_Otx Pfam domain on the cone rod homeobox protein, which suggests a possible mechanism for how the V242M mutation on this protein causes cone-rod dystrophy.

    Proceedings of the National Academy of Sciences of the United States of America 2003;100;8;4516-20

  • Reevaluating human gene annotation: a second-generation analysis of chromosome 22.

    Collins JE, Goward ME, Cole CG, Smink LJ, Huckle EJ, Knowles S, Bye JM, Beare DM and Dunham I

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    We report a second-generation gene annotation of human chromosome 22. Using expressed sequence databases, comparative sequence analysis, and experimental verification, we have extended genes, fused previously fragmented structures, and identified new genes. The total length in exons of annotation was increased by 74% over our previously published annotation and includes 546 protein-coding genes and 234 pseudogenes. Thirty-two potential protein-coding annotations are partial copies of other genes, and may represent duplications on an evolutionary path to change or loss of function. We also identified 31 non-protein-coding transcripts, including 16 possible antisense RNAs. By extrapolation, we estimate the human genome contains 29,000-36,000 protein-coding genes, 21,300 pseudogenes, and 1500 antisense RNAs. We suggest that our revised annotation criteria provide a paradigm for future annotation of the human genome.

    Genome research 2003;13;1;27-36

  • Boudicca, a retrovirus-like long terminal repeat retrotransposon from the genome of the human blood fluke Schistosoma mansoni.

    Copeland CS, Brindley PJ, Heyers O, Michael SF, Johnston DA, Williams DL, Ivens AC and Kalinna BH

    Department of Tropical Medicine, School of Public Health and Tropical Medicine, Tulane University Health Sciences Center, New Orleans, Louisiana 70112, USA.

    The genome of Schistosoma mansoni contains a proviral form of a retrovirus-like long terminal repeat (LTR) retrotransposon, designated Boudicca. Sequence and structural characterization of the new mobile genetic element, which was found in bacterial artificial chromosomes prepared from S. mansoni genomic DNA, revealed the presence of three putative open reading frames (ORFs) bounded by direct LTRs of 328 bp in length. ORF1 encoded a retrovirus-like major homology region and a Cys/His box motif, also present in Gag polyproteins of related retrotransposons and retroviruses. ORF2 encoded enzymatic domains and motifs characteristic of a retrovirus-like polyprotein, including aspartic protease, reverse transcriptase, RNase H, and integrase, in that order, a domain order similar to that of the gypsy/Ty3 retrotransposons. An additional ORF at the 3' end of the retrotransposon may encode an envelope protein. Phylogenetic comparison based on the reverse transcriptase domain of ORF2 confirmed that Boudicca was a gypsy-like retrotransposon and showed that it was most closely related to CsRn1 from the Oriental liver fluke Clonorchis sinensis and to kabuki from Bombyx mori. Bioinformatics approaches together with Southern hybridization analysis of genomic DNA of S. mansoni and the screening of a bacterial artificial chromosome library representing approximately 8-fold coverage of the S. mansoni genome revealed that numerous copies of Boudicca were interspersed throughout the schistosome genome. By reverse transcription-PCR, mRNA transcripts were detected in the sporocyst, cercaria, and adult developmental stages of S. mansoni, indicating that Boudicca is actively transcribed in this trematode.

    Journal of virology 2003;77;11;6153-66

  • The not-so-humble worm.

    Crombie C, Junio A and Fraser A

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    Genome biology 2003;5;1;301

  • Pathogenomics.

    Crossman L, Cerdeño-Tárraga A, Bentley S and Parkhill J

    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    The genomes described this month reflect the overall historical bias of microbial genomics towards pathogenic bacteria. Although the balance is now being redressed to some extent, especially through the study of extremophiles, it is still the case that the opportunities provided by genomic studies are primarily taken up by those who study bacterial pathogenicity. This part of the field is, however, being broadened by including the study of pathogens of animals, insects and plants alongside those that afflict humans.

    Nature reviews. Microbiology 2003;1;3;176-7

  • A DNA damage checkpoint response in telomere-initiated senescence.

    d'Adda di Fagagna F, Reaper PM, Clay-Farrace L, Fiegler H, Carr P, Von Zglinicki T, Saretzki G, Carter NP and Jackson SP

    [1] The Wellcome Trust/Cancer Research UK Institute of Cancer and Developmental Biology, University of Cambridge, Cambridge CB2 1QR, UK [2] Present address: IFOM-FIRC Institute of Molecular Oncology, via Adamello 16, 20139 Milan, Italy.

    Most human somatic cells can undergo only a limited number of population doublings in vitro. This exhaustion of proliferative potential, called senescence, can be triggered when telomeres--the ends of linear chromosomes--cannot fulfil their normal protective functions. Here we show that senescent human fibroblasts display molecular markers characteristic of cells bearing DNA double-strand breaks. These markers include nuclear foci of phosphorylated histone H2AX and their co-localization with DNA repair and DNA damage checkpoint factors such as 53BP1, MDC1 and NBS1. We also show that senescent cells contain activated forms of the DNA damage checkpoint kinases CHK1 and CHK2. Furthermore, by chromatin immunoprecipitation and whole-genome scanning approaches, we show that the chromosome ends of senescent cells directly contribute to the DNA damage response, and that uncapped telomeres directly associate with many, but not all, DNA damage response proteins. Finally, we show that inactivation of DNA damage checkpoint kinases in senescent cells can restore cell-cycle progression into S phase. Thus, we propose that telomere-initiated senescence reflects a DNA damage checkpoint response that is activated with a direct contribution from dysfunctional telomeres.

    Nature 2003;426;6963;194-8

  • ddbRNA: detection of conserved secondary structures in multiple alignments.

    di Bernardo D, Down T and Hubbard T

    Telethon Institute of Genetics and Medicine, Via P Castellino 111, 80133 Naples, Italy.

    Motivation: Structured non-coding RNAs (ncRNAs) have a very important functional role in the cell. No distinctive general features common to all ncRNA have yet been discovered. This makes it difficult to design computational tools able to detect novel ncRNAs in the genomic sequence.

    Results: We devised an algorithm able to detect conserved secondary structures in both pairwise and multiple DNA sequence alignments with computational time proportional to the square of the sequence length. We implemented the algorithm for the case of pairwise and three-way alignments and tested it on ncRNAs obtained from public databases. On the test sets, the pairwise algorithm has a specificity greater than 97% with a sensitivity varying from 22.26% for Blast alignments to 56.35% for structural alignments. The three-way algorithm behaves similarly. Our algorithm is able to efficiently detect a conserved secondary structure in multiple alignments.

    Funded by: Telethon: TGM03P17, TGM06S01

    Bioinformatics (Oxford, England) 2003;19;13;1606-11
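
    One widely used signal of a conserved RNA secondary structure is the compensatory mutation: both bases of a proposed pair change between sequences, yet pairing is preserved. The sketch below counts such events for a candidate stem in a pairwise alignment. It is an illustration only, not the published ddbRNA algorithm, whose scoring is not spelled out in the abstract. The function name, the stem representation and the example sequences are assumptions made for this example.

```python
# Watson-Crick pairs plus G-U wobble pairs (RNA alphabet).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def count_compensatory(seq1, seq2, stem):
    """Count stem positions where both sequences can pair but use different pairs.

    seq1, seq2: aligned sequences of equal length (gaps written as '-').
    stem: list of (i, j) alignment-column pairs proposed to base-pair.
    DNA input is accepted; T is read as U.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    seq1 = seq1.upper().replace("T", "U")
    seq2 = seq2.upper().replace("T", "U")
    count = 0
    for i, j in stem:
        pair1 = (seq1[i], seq1[j])
        pair2 = (seq2[i], seq2[j])
        # Gapped or non-pairing columns fail the membership test and are skipped.
        if pair1 in PAIRS and pair2 in PAIRS and pair1 != pair2:
            count += 1
    return count


# A 10-column toy alignment: a 3-pair stem closing a 4-base loop.
stem = [(0, 9), (1, 8), (2, 7)]
print(count_compensatory("GCAUUCGUGC", "GGCUUCGGCC", stem))
```

    A real detector would also have to propose candidate stems and assess significance against a background model. This sketch covers only the counting step.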

  • Recruitment of heterogeneous nuclear ribonucleoprotein A1 in vivo to the LMP/TAP region of the major histocompatibility complex.

    Donev R, Horton R, Beck S, Doneva T, Vatcheva R, Bowen WR and Sheer D

    Human Cytogenetics Laboratory, Cancer Research UK London Research Institute, Lincoln's Inn Fields Laboratories, 44 Lincoln's Inn Fields, London WC2A 3PX, United Kingdom.

    Sequences containing the matrix recognition signature were identified adjacent to the LMP/TAP gene cluster in the human and mouse major histocompatibility complex class II region. These sequences were shown to function as nuclear matrix attachment regions (MARs). Three of the five human MARs and the single mouse MAR recruit heterogeneous nuclear ribonucleoprotein A1 (hnRNP-A1) in vivo during transcriptional up-regulation of the major histocompatibility complex class II genes. The timing of this recruitment correlates with a rise in mature TAP1 mRNA. Two of the human MARs bind hnRNP-A1 in vitro directly within a 35-bp sequence that shows over 90% similarity to certain Alu repeat sequences. This study shows that MARs recruit and bind hnRNP-A1 upon transcriptional up-regulation.

    Funded by: Cancer Research UK: A3585

    The Journal of biological chemistry 2003;278;7;5214-26

  • Highly parallel SNP genotyping.

    Fan JB, Oliphant A, Shen R, Kermani BG, Garcia F, Gunderson KL, Hansen M, Steemers F, Butler SL, Deloukas P, Galver L, Hunt S, McBride C, Bibikova M, Rubano T, Chen J, Wickham E, Doucet D, Chang W, Campbell D, Zhang B, Kruglyak S, Bentley D, Haas J, Rigault P, Zhou L, Stuelpnagel J and Chee MS

    Illumina, Inc., San Diego, California 92121, USA.

    Funded by: NCI NIH HHS: R43 CA-81952; NHGRI NIH HHS: HG-002753, R44 HG-02003

    Cold Spring Harbor symposia on quantitative biology 2003;68;69-78

  • DNA microarrays for comparative genomic hybridization based on DOP-PCR amplification of BAC and PAC clones.

    Fiegler H, Carr P, Douglas EJ, Burford DC, Hunt S, Scott CE, Smith J, Vetrie D, Gorman P, Tomlinson IP and Carter NP

    Wellcome Trust Sanger Institute/Cancer Research UK Genomic Microarray Group, Hinxton, Cambridge, CB10 1SA, United Kingdom.

    We have designed DOP-PCR primers specifically for the amplification of large insert clones for use in the construction of DNA microarrays. A bioinformatic approach was used to construct primers that were efficient in the general amplification of human DNA but were poor at amplifying E. coli DNA, a common contaminant of DNA preparations from large insert clones. We chose the three most selective primers for use in printing DNA microarrays. DNA combined from the amplification of large insert clones by use of these three primers and spotted onto glass slides showed more than a sixfold increase in the human to E. coli hybridization ratio when compared to the standard DOP-PCR primer, 6MW. The microarrays reproducibly delineated previously characterized gains and deletions in a cancer cell line and identified a small gain not detected by use of conventional CGH. We also describe a method for the bulk testing of the hybridization characteristics of chromosome-specific clones spotted on microarrays by use of DNA amplified from flow-sorted chromosomes. Finally, we describe a set of clones selected from the publicly available Golden Path of the human genome at 1-Mb intervals and a view in the Ensembl genome browser from which data required for the use of these clones in array CGH and other experiments can be downloaded across the Internet.

    Genes, chromosomes & cancer 2003;36;4;361-74

  • Array painting: a method for the rapid analysis of aberrant chromosomes using DNA microarrays.

    Fiegler H, Gribble SM, Burford DC, Carr P, Prigmore E, Porter KM, Clegg S, Crolla JA, Dennis NR, Jacobs P and Carter NP

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Objective: The authors describe a method, termed array painting, which allows the rapid, high resolution analysis of the content and breakpoints of aberrant chromosomes.

    Methods: Array painting is similar in concept to reverse chromosome painting and involves the hybridisation of probes generated by PCR of small numbers of flow sorted chromosomes on large insert genomic clone DNA microarrays.

    Results and Conclusions: By analysing patients with cytogenetically balanced chromosome rearrangements, the authors show the effectiveness of array painting as a method to map breakpoints prior to cloning and sequencing chromosome rearrangements.

    Journal of medical genetics 2003;40;9;664-70

  • Identifying protein domains with the Pfam database.

    Finn R, Griffiths-Jones S and Bateman A

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs, United Kingdom.

    Pfam is a database of protein domain families, with each family represented by multiple sequence alignments and profile hidden Markov models (HMMs). In addition, each family has associated annotation, literature references and links to other databases. The entries in Pfam are available via the World Wide Web and in flatfile format. This unit contains detailed information on how to access and utilise the information present in the Pfam database, namely the families, multiple alignments and annotation. Details on running Pfam, both remotely and locally, are presented.

    Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.] 2003;Chapter 2;Unit 2.5

  • Worms in L.A.

    Fraser A

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    Nature genetics 2003;35;1;3-5

  • The complete genome sequence of Mycobacterium bovis.

    Garnier T, Eiglmeier K, Camus JC, Medina N, Mansoor H, Pryor M, Duthoy S, Grondin S, Lacroix C, Monsempe C, Simon S, Harris B, Atkin R, Doggett J, Mayes R, Keating L, Wheeler PR, Parkhill J, Barrell BG, Cole ST, Gordon SV and Hewinson RG

    Unité de Génétique Moléculaire Bactérienne and PT4 Annotation, Génopole, Institut Pasteur, 28 Rue du Docteur Roux, 75724 Paris Cedex 15, France.

    Mycobacterium bovis is the causative agent of tuberculosis in a range of animal species and man, with worldwide annual losses to agriculture of $3 billion. The human burden of tuberculosis caused by the bovine tubercle bacillus is still largely unknown. M. bovis was also the progenitor for the M. bovis bacillus Calmette-Guérin vaccine strain, the most widely used human vaccine. Here we describe the 4,345,492-bp genome sequence of M. bovis AF2122/97 and its comparison with the genomes of Mycobacterium tuberculosis and Mycobacterium leprae. Strikingly, the genome sequence of M. bovis is >99.95% identical to that of M. tuberculosis, but deletion of genetic information has led to a reduced genome size. Comparison with M. leprae reveals a number of common gene losses, suggesting the removal of functional redundancy. Cell wall components and secreted proteins show the greatest variation, indicating their potential role in host-bacillus interactions or immune evasion. Furthermore, there are no genes unique to M. bovis, implying that differential gene expression may be the key to the host tropisms of human and bovine bacilli. The genome sequence therefore offers major insight on the evolution, host preference, and pathobiology of M. bovis.

    Proceedings of the National Academy of Sciences of the United States of America 2003;100;13;7877-82

  • Synapse signalling complexes and networks: machines underlying cognition.

    Grant SG

    Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, CB10 1SA, UK.

    All thoughts and actions are encoded in patterns of neuronal electrical activity. Circuits of nerve cells connected by synapses are dedicated to processing information in these patterns. Information is not only transmitted across the synapse but also monitored by postsynaptic molecular machines. These machines are macromolecular complexes of approximately 100 proteins organised into a network of protein interactions. The network can be mathematically described as a scale-free network. Components of the complexes are necessary for decoding the neural code and converting electrical information into biochemical changes. The network properties of these complexes may explain many of the features of neuronal plasticity and cognitive function in rodents. Importantly, these multiprotein complexes and their network properties shed new light on the basis of human cognitive diseases including schizophrenia, autism, Huntington's disease and mental retardation. Supplementary material for this article can be found on the BioEssays website.

    BioEssays : news and reviews in molecular, cellular and developmental biology 2003;25;12;1229-35

  • Systems biology in neuroscience: bridging genes to cognition.

    Grant SG

    Division of Neuroscience, 1 George Square, Edinburgh EH8 9JZ, UK.

    Systems biology is a new branch of biology aimed at understanding biological complexity. Genomic and proteomic methods integrated with cellular and organismal analyses allow modelling of physiological processes. Progress in understanding synapse composition and new experimental and bioinformatics methods indicate the synapse is an excellent starting point for global systems biology of the brain. A neuroscience systems biology programme, organized as a consortium, is proposed.

    Current opinion in neurobiology 2003;13;5;577-82

  • Molecular cytogenetics of polycythaemia vera: lack of occult rearrangements detectable by 20q LSP screening, CGH, and M-FISH.

    Gribble SM, Reid AG, Bench AJ, Huntly BJ, Grace C, Green AR and Nacheva EP

    Leukemia : official journal of the Leukemia Society of America, Leukemia Research Fund, U.K 2003;17;7;1419-21

  • Rfam: an RNA family database.

    Griffiths-Jones S, Bateman A, Marshall M, Khanna A and Eddy SR

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Rfam is a collection of multiple sequence alignments and covariance models representing non-coding RNA families. Rfam is available on the web from sites in the UK and in the US. These websites allow the user to search a query sequence against a library of covariance models, and view multiple sequence alignments and family annotation. The database can also be downloaded in flatfile form and searched locally using the INFERNAL package. The first release of Rfam (1.0) contains 25 families, which annotate over 50 000 non-coding RNA genes in the taxonomic divisions of the EMBL nucleotide database.

    Nucleic acids research 2003;31;1;439-41

  • Genomic sequence and transcriptional profile of the boundary between pericentromeric satellites and genes on human chromosome arm 10p.

    Guy J, Hearn T, Crosier M, Mudge J, Viggiano L, Koczan D, Thiesen HJ, Bailey JA, Horvath JE, Eichler EE, Earthrowl ME, Deloukas P, French L, Rogers J, Bentley D and Jackson MS

    The Institute of Human Genetics, The International Centre for Life, University of Newcastle upon Tyne, Newcastle upon Tyne NE1 3BZ, UK.

    Contiguous finished sequence from highly duplicated pericentromeric regions of human chromosomes is needed if we are to understand the role of pericentromeric instability in disease, and in gene and karyotype evolution. Here, we have constructed a BAC contig spanning the transition from pericentromeric satellites to genes on the short arm of human chromosome 10, and used this to generate 1.4 Mb of finished genomic sequence. Combining RT-PCR, in silico gene prediction, and paralogy analysis, we can identify two domains within the sequence. The proximal 600 kb consists of satellite-rich pericentromerically duplicated DNA which is transcript poor, containing only three unspliced transcripts. In contrast, the distal 850 kb contains four known genes (ZNF248, ZNF25, ZNF33A, and ZNF37A) and up to 32 additional transcripts of unknown function. This distal region also contains seven out of the eight intrachromosomal duplications within the sequence, including the p arm copy of the approximately 250-kb duplication which gave rise to ZNF33A and ZNF33B. By sequencing orthologs of the duplicated ZNF33 genes we have established that ZNF33A has diverged significantly at residues critical for DNA binding but ZNF33B has not, indicating that ZNF33B has remained constrained by selection for ancestral gene function. These results provide further evidence of gene formation within intrachromosomal duplications, but indicate that recent interchromosomal duplications at this centromere have involved transcriptionally inert, satellite rich DNA, which is likely to be heterochromatic. This suggests that any novel gene structures formed by these interchromosomal events would require relocation to a more open chromatin environment to be expressed.

    Genome research 2003;13;2;159-72

  • Mutations in the gene encoding capillary morphogenesis protein 2 cause juvenile hyaline fibromatosis and infantile systemic hyalinosis.

    Hanks S, Adams S, Douglas J, Arbour L, Atherton DJ, Balci S, Bode H, Campbell ME, Feingold M, Keser G, Kleijer W, Mancini G, McGrath JA, Muntoni F, Nanda A, Teare MD, Warman M, Pope FM, Superti-Furga A, Futreal PA and Rahman N

    Section of Cancer Genetics, Institute of Cancer Research, Sutton, Surrey, United Kingdom.

    Juvenile hyaline fibromatosis (JHF) and infantile systemic hyalinosis (ISH) are autosomal recessive conditions characterized by multiple subcutaneous skin nodules, gingival hypertrophy, joint contractures, and hyaline deposition. We previously mapped the gene for JHF to chromosome 4q21. We now report the identification of 15 different mutations in the gene encoding capillary morphogenesis protein 2 (CMG2) in 17 families with JHF or ISH. CMG2 is a transmembrane protein that is induced during capillary morphogenesis and that binds laminin and collagen IV via a von Willebrand factor type A (vWA) domain. Of interest, CMG2 also functions as a cellular receptor for anthrax toxin. Preliminary genotype-phenotype analyses suggest that abrogation of binding by the vWA domain results in severe disease typical of ISH, whereas in-frame mutations affecting a novel, highly conserved cytoplasmic domain result in a milder phenotype. These data (1) demonstrate that JHF and ISH are allelic conditions and (2) implicate perturbation of basement-membrane matrix assembly as the cause of the characteristic perivascular hyaline deposition seen in these conditions.

    American journal of human genetics 2003;73;4;791-800

  • Refined mapping of the HMSNR critical gene region--construction of a high-density integrated genetic and physical map.

    Hantke J, Rogers T, French L, Tournev I, Guergueltcheva V, Urtizberea JA, Colomer J, Corches A, Lupu C, Merlini L, Thomas PK and Kalaydjieva L

    Western Australian Institute for Medical Research and Centre for Medical Research, University of Western Australia, Perth, Australia.

    Hereditary motor and sensory neuropathy russe, a form of autosomal recessive Charcot-Marie-Tooth disease, is a rare disorder found in several Roma families from Europe. The gene has been mapped to a 1 Mb region on 10q22. Detailed analysis led to the exclusion of 22 candidate genes and the assembly of a high-density genetic map comprising 141 polymorphic markers. Extensive genotyping in an extended sample of affected families resulted in a 10-fold reduction of the critical hereditary motor and sensory neuropathy russe gene region, which is now contained within a single completely sequenced BAC clone. The fact that no sequence variant has been detected in the known genes in the critical region indicates that the hereditary motor and sensory neuropathy russe mutation affects a novel gene that remains to be identified.

    Neuromuscular disorders : NMD 2003;13;9;729-36

  • WormBase: a cross-species database for comparative genomics.

    Harris TW, Lee R, Schwarz E, Bradnam K, Lawson D, Chen W, Blasier D, Kenny E, Cunningham F, Kishore R, Chan J, Muller HM, Petcherski A, Thorisson G, Day A, Bieri T, Rogers A, Chen CK, Spieth J, Sternberg P, Durbin R and Stein LD

    Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA.

    WormBase is a web-accessible central data repository for information about Caenorhabditis elegans and related nematodes. The past two years have seen a significant expansion in the biological scope of WormBase, including the integration of large-scale, genome-wide data sets, the inclusion of genome sequence and gene predictions from related species and active literature curation. This expansion of data has also driven the development and refinement of user interfaces and operability, including a new Genome Browser, new searches and facilities for data access and the inclusion of extensive documentation. These advances have expanded WormBase beyond the obvious target audience of C. elegans researchers, to include researchers wishing to explore problems in functional and comparative genomics within the context of a powerful genetic system.

    Funded by: NHGRI NIH HHS: P41-HG02223

    Nucleic acids research 2003;31;1;133-7

  • BACE1 (beta-secretase) transgenic and knockout mice: identification of neurochemical deficits and behavioral changes.

    Harrison SM, Harper AJ, Hawkins J, Duddy G, Grau E, Pugh PL, Winter PH, Shilliam CS, Hughes ZA, Dawson LA, Gonzalez MI, Upton N, Pangalos MN and Dingwall C

    Department of Comparative Genomics, GlaxoSmithKline, New Frontiers Science Park (North), Third Avenue, Harlow, Essex CM19 5AW, UK.

    BACE1 is a key enzyme in the generation of Abeta, the major component of senile plaques in the brains of Alzheimer's disease patients. We have generated transgenic mice expressing human BACE1 with the Cam Kinase II promoter driving neuronal-specific expression. The transgene contains the full-length coding sequence of human BACE1 preceding an internal ribosome entry site element followed by a LacZ reporter gene. These animals exhibit a bold, exploratory behavior and show elevated 5-hydroxytryptamine turnover. We have also generated a knockout mouse in which LacZ replaces the first exon of murine BACE1. Interestingly these animals show a contrasting behavior, being timid and less exploratory. Despite these clear differences both mouse lines are viable and fertile with no changes in morbidity. These results suggest an unexpected role for BACE1 in neurotransmission, perhaps through changes in amyloid precursor protein processing and Abeta levels.

    Molecular and cellular neurosciences 2003;24;3;646-55

  • Streptomyces coelicolor A3(2) plasmid SCP2*: deductions from the complete sequence.

    Haug I, Weissenborn A, Brolle D, Bentley S, Kieser T and Altenbuchner J

    Institut für Industrielle Genetik, Universität Stuttgart, Allmandring 31, 70569 Stuttgart, Germany.

    Plasmid SCP2* is a 31 kb, circular, low-copy-number plasmid originally identified in Streptomyces coelicolor A3(2) as a fertility factor. The plasmid was completely sequenced. The analysis of the 31 317 bp sequence revealed 34 ORFs encoding putative proteins from 31 to 710 aa long, most of them lacking similarity to known proteins. Three functional regions had been identified previously: the replication region, the transfer and spreading region, and the stability region. Three genes were identified in the stability region which contribute to the stability of SCP2 as shown by plasmid stability testing. The first gene, mrpA, encodes a new member of the lambda integrase family of site-specific recombinases. The two genes downstream of mrpA were called parA and parB. The gene product, ParA, shows similarity to a family of ATPases involved in plasmid partition. An increase of plasmid stability could be seen only when both genes were present. By deletion analysis, the replication region could be narrowed down to a 1.6 kb region, consisting of a 650 bp non-coding region and two genes, repI and repII, encoding proteins of 161 and 131 aa. Only RepI exhibits similarities to DNA binding elements and contains a putative helix-turn-helix motif. The traA gene that is essential for DNA transfer and pock formation was identified previously. Upstream of traA, 10 ORFs were found in the same orientation as traA which might be involved in conjugation and DNA spreading, together with one gene in the opposite orientation with similarities to transcriptional regulators of DNA transfer. Two transposable elements were found on SCP2*. IS1648 belongs to the IS3 family of insertion sequences. The second element, Tn5417, shows the highest similarity to the Tn4811 element located in the terminal inverted repeats of the Streptomyces lividans chromosome.

    Microbiology (Reading, England) 2003;149;Pt 2;505-13

  • The magnificent seven.

    Holden M, Bentley S, Sebaihia M, Thomson N, Cerdeño-Tárraga A and Parkhill J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Trends in microbiology 2003;11;1;12-4

  • Accessing HLA sequencing data through the 6ace database.

    Horton R and Beck S

    Sanger Centre, Hinxton, United Kingdom.

    Methods in molecular biology (Clifton, N.J.) 2003;210;23-42

  • EMSY links the BRCA2 pathway to sporadic breast and ovarian cancer.

    Hughes-Davies L, Huntsman D, Ruas M, Fuks F, Bye J, Chin SF, Milner J, Brown LA, Hsu F, Gilks B, Nielsen T, Schulzer M, Chia S, Ragaz J, Cahn A, Linger L, Ozdag H, Cattaneo E, Jordanova ES, Schuuring E, Yu DS, Venkitaraman A, Ponder B, Doherty A, Aparicio S, Bentley D, Theillet C, Ponting CP, Caldas C and Kouzarides T

    Cancer Research UK/Wellcome Trust Institute and Department of Pathology, Tennis Court Road, Cambridge CB2 1QR, United Kingdom.

    The BRCA2 gene is mutated in familial breast and ovarian cancer, and its product is implicated in DNA repair and transcriptional regulation. Here we identify a protein, EMSY, which binds BRCA2 within a region (exon 3) deleted in cancer. EMSY is capable of silencing the activation potential of BRCA2 exon 3, associates with chromatin regulators HP1beta and BS69, and localizes to sites of repair following DNA damage. EMSY maps to chromosome 11q13.5, a region known to be involved in breast and ovarian cancer. We show that the EMSY gene is amplified almost exclusively in sporadic breast cancer (13%) and higher-grade ovarian cancer (17%). In addition, EMSY amplification is associated with worse survival, particularly in node-negative breast cancer, suggesting that it may be of prognostic value. The remarkable clinical overlap between sporadic EMSY amplification and familial BRCA2 deletion implicates a BRCA2 pathway in sporadic breast and ovarian cancer.

    Cell 2003;115;5;523-35

  • The International HapMap Project.

    International HapMap Consortium

    The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.

    Nature 2003;426;6968;789-96

  • Kaposi's sarcoma-associated herpesvirus-infected primary effusion lymphoma has a plasma cell gene expression profile.

    Jenner RG, Maillard K, Cattini N, Weiss RA, Boshoff C, Wooster R and Kellam P

    Wohl Virion Centre, Department of Immunology and Molecular Pathology, Windeyer Institute, University College London, London W1T 4JF, United Kingdom.

    Kaposi's sarcoma-associated herpesvirus is associated with three human tumors: Kaposi's sarcoma, and the B cell lymphomas, plasmablastic lymphoma associated with multicentric Castleman's disease, and primary effusion lymphoma (PEL). Epstein-Barr virus, the closest human relative of Kaposi's sarcoma-associated herpesvirus, mimics host B cell signaling pathways to direct B cell development toward a memory B cell phenotype. Epstein-Barr virus-associated B cell tumors are presumed to arise as a consequence of this virus-mediated B cell activation. The stage of B cell development represented by PEL, how this stage relates to tumor pathology, and how this information may be used to treat the disease are largely unknown. In this study we used gene expression profiling to order a range of B cell tumors by stage of development. PEL gene expression closely resembles that of malignant plasma cells, including the low expression of mature B cell genes. The unfolded protein response is partially activated in PEL, but is fully activated in plasma cell tumors, linking endoplasmic reticulum stress to plasma cell development through XBP-1. PEL cells can be defined by the overexpression of genes involved in inflammation, cell adhesion, and invasion, which may be responsible for their presentation in body cavities. Similar to malignant plasma cells, all PEL samples tested express the vitamin D receptor and are sensitive to the vitamin D analogue drug EB 1089 (Seocalcitol).

    Proceedings of the National Academy of Sciences of the United States of America 2003;100;18;10399-404

  • Systematic functional analysis of the Caenorhabditis elegans genome using RNAi.

    Kamath RS, Fraser AG, Dong Y, Poulin G, Durbin R, Gotta M, Kanapin A, Le Bot N, Moreno S, Sohrmann M, Welchman DP, Zipperlen P and Ahringer J

    Wellcome Trust/Cancer Research UK Institute and Department of Genetics, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK.

    A principal challenge currently facing biologists is how to connect the complete DNA sequence of an organism to its development and behaviour. Large-scale targeted-deletions have been successful in defining gene functions in the single-celled yeast Saccharomyces cerevisiae, but comparable analyses have yet to be performed in an animal. Here we describe the use of RNA interference to inhibit the function of approximately 86% of the 19,427 predicted genes of C. elegans. We identified mutant phenotypes for 1,722 genes, about two-thirds of which were not previously associated with a phenotype. We find that genes of similar functions are clustered in distinct, multi-megabase regions of individual chromosomes; genes in these regions tend to share transcriptional profiles. Our resulting data set and reusable RNAi library of 16,757 bacterial clones will facilitate systematic analyses of the connections among gene sequence, chromosomal location and gene function in C. elegans.

    Funded by: Wellcome Trust: 054523

    Nature 2003;421;6920;231-7

  • CASP5 target classification.

    Kinch LN, Qi Y, Hubbard TJ and Grishin NV

    Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas 75390-9050, USA.

    This report summarizes the Critical Assessment of Protein Structure Prediction (CASP5) target proteins, which included 67 experimental models submitted from various structural genomics efforts and independent research groups. Throughout this special issue, CASP5 targets are referred to with the identification numbers T0129-T0195. Several of these targets were excluded from the assessment for various reasons: T0164 and T0166 were cancelled by the organizers; T0131, T0144, T0158, T0163, T0171, T0175, and T0180 were not available in time; T0145 was "natively unfolded"; the T0139 structure became available before the target expired; and T0194 was solved for a different sequence than the one submitted. Table I outlines the sequence and structural information available for CASP5 proteins in the context of existing folds and evolutionary relationships. This information provided the basis for a domain-based classification of the target structures into three assessment categories: comparative modeling (CM), fold recognition (FR), and new fold (NF). The FR category was further subdivided into homologues [FR(H)] and analogs [FR(A)] based on evolutionary considerations, and the overlap between assessment categories was classified as CM/FR(H) and FR(A)/NF. CASP5 domains are illustrated in Figure 1. Examples of nontrivial links between CASP5 target domains and existing structures that support our classifications are provided.

    Proteins 2003;53 Suppl 6;340-51

  • Serial BLAST searching.

    Korf I

    The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, CB10 1SA, UK.

    MOTIVATION: The translating BLAST algorithms are powerful tools for finding protein-coding genes because they identify amino acid similarities in nucleotide sequences. Unfortunately, these kinds of searches are computationally intensive and often represent bottlenecks in sequence analysis pipelines. Tuning parameters for speed can make the searches much faster, but one risks losing low-scoring alignments. However, high scoring alignments are relatively resistant to such changes in parameters, and this fact makes it possible to use a serial strategy where a fast, insensitive search is used to pre-screen a database for similar sequences, and a slow, sensitive search is used to produce the sequence alignments. RESULTS: Serial BLAST searches improve both the speed and sensitivity.

    Bioinformatics (Oxford, England) 2003;19;12;1492-6
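
    The serial strategy described in the abstract above can be illustrated with a minimal sketch. It does not use BLAST itself: the fast, insensitive step is stood in for by a shared k-mer count, and the slow, sensitive step by a plain Smith-Waterman score with linear gap costs. The function names, the scoring values and the thresholds (k, min_shared) are assumptions made for illustration only.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fast_screen(query, database, k=4, min_shared=2):
    """Fast, insensitive pre-screen (a stand-in, not BLAST):
    keep database entries sharing at least min_shared k-mers with the query."""
    q = kmers(query, k)
    return [name for name, seq in database.items()
            if len(q & kmers(seq, k)) >= min_shared]


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Slow, sensitive step: Smith-Waterman local alignment score
    with linear gap costs (illustrative scoring values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            v = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(v)
            best = max(best, v)
        prev = cur
    return best


def serial_search(query, database, k=4, min_shared=2):
    """Run the slow search only on entries that survive the fast screen."""
    hits = fast_screen(query, database, k, min_shared)
    scored = [(local_alignment_score(query, database[name]), name) for name in hits]
    return sorted(scored, reverse=True)
```

    Calling serial_search(query, database) with a dict that maps sequence names to sequences returns (score, name) pairs for the pre-screened entries only, so the cost of the sensitive step scales with the number of survivors rather than with the size of the database.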

  • The multifaceted C. elegans major sperm protein: an ephrin signaling antagonist in oocyte maturation.

    Kuwabara PE

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK CB10 1SA.

    Genes & development 2003;17;2;155-61

  • Molecular classification of synovial sarcomas, leiomyosarcomas and malignant fibrous histiocytomas by gene expression profiling.

    Lee YF, John M, Edwards S, Clark J, Flohr P, Maillard K, Edema M, Baker L, Mangham DC, Grimer R, Wooster R, Thomas JM, Fisher C, Judson I and Cooper CS

    The Male Urological Cancer Research Centre, Institute of Cancer Research, Sutton, Surrey, UK.

    In this study, we have used genome-wide expression profiling to categorise synovial sarcomas, leiomyosarcomas and malignant fibrous histiocytomas (MFHs). Following hierarchical clustering analysis of the expression data, the best match between tumour clusters and conventional diagnosis was observed for synovial sarcomas. Eight of nine synovial sarcomas examined formed a cluster that was characterised by higher expression of a set of 48 genes. In contrast, sarcomas conventionally classified as leiomyosarcomas and MFHs did not match the clusters defined by hierarchical clustering analysis. One major cluster contained a mixture of both leiomyosarcomas and MFHs and was defined by the lower expression of a set of 202 genes. A cluster containing a subgroup of MFHs was also detected. These results may have implications for the classification of soft tissue sarcomas, and are consistent with the view that sarcomas conventionally defined as MFHs do not represent a separate diagnostic category.

    British journal of cancer 2003;88;4;510-5

  • Adult midgut expressed sequence tags from the tsetse fly Glossina morsitans morsitans and expression analysis of putative immune response genes.

    Lehane MJ, Aksoy S, Gibson W, Kerhornou A, Berriman M, Hamilton J, Soares MB, Bonaldo MF, Lehane S and Hall N

    School of Biological Sciences, University of Wales, Bangor, LL57 2UW, UK.

    Background: Tsetse flies transmit African trypanosomiasis leading to half a million cases annually. Trypanosomiasis in animals (nagana) remains a massive brake on African agricultural development. While trypanosome biology is widely studied, knowledge of tsetse flies is very limited, particularly at the molecular level. This is a serious impediment to investigations of tsetse-trypanosome interactions. We have undertaken an expressed sequence tag (EST) project on the adult tsetse midgut, the major organ system for establishment and early development of trypanosomes.

    Results: A total of 21,427 ESTs were produced from the midgut of adult Glossina morsitans morsitans and grouped into 8,876 clusters or singletons potentially representing unique genes. Putative functions were ascribed to 4,035 of these by homology. Of these, a remarkable 3,884 had their most significant matches in the Drosophila protein database. We selected 68 genes with putative immune-related functions, macroarrayed them and determined their expression profiles following bacterial or trypanosome challenge. In both infections many genes are downregulated, suggesting a malaise response in the midgut. Trypanosome and bacterial challenge result in upregulation of different genes, suggesting that different recognition pathways are involved in the two responses. The most notable block of genes upregulated in response to trypanosome challenge are a series of Toll and Imd genes and a series of genes involved in oxidative stress responses.

    Conclusions: The project increases the number of known Glossina genes by two orders of magnitude. Identification of putative immunity genes and their preliminary characterization provides a resource for the experimental dissection of tsetse-trypanosome interactions.

    Genome biology 2003;4;10;R63

  • Cdh23 mutations in the mouse are associated with retinal dysfunction but not retinal degeneration.

    Libby RT, Kitamoto J, Holme RH, Williams DS and Steel KP

    MRC Institute of Hearing Research, University Park, Nottingham NG7 2RD, UK.

    Mutations in the cadherin 23 gene (CDH23) cause Usher syndrome type 1D in humans, a disease that results in retinitis pigmentosa and deafness. Cdh23 is also mutated in the waltzer mouse. In order to determine if the retina of the waltzer mouse undergoes retinal degeneration and to gain insight into the function of cadherin 23 in the retina, we have characterized the anatomy and physiology of retinas of waltzer mouse mutants. Three mutant alleles of Cdh23 were examined by histology and electroretinography (ERG). ERGs of the three Cdh23 mutant groups revealed two of them to have abnormal retinal function. One allele had a- and b-waves that were only approximately 80% of Cdh23 heterozygotes. Another allele had a significantly faster implicit time for both the a- and b-waves of the ERG. No anatomical abnormality was detected in any of the Cdh23 mutants by light microscopy. Because the mutant Cdh23 phenotype was found to be similar to the previously reported retinal phenotype of Myo7a mutant mice, the orthologue of another Usher syndrome (type 1B) gene, we generated mice that carried mutations in both genes to test for genetic interaction in the retina. No functional interaction between cadherin 23 and myosin VIIa was detected by either microscopy or ERG.

    Funded by: NEI NIH HHS: EY07042, EY12598, R01 EY007042-18

    Experimental eye research 2003;77;6;731-9

  • Dispersal of NK homeobox gene clusters in amphioxus and humans.

    Luke GN, Castro LF, McLay K, Bird C, Coulson A and Holland PW

    School of Animal and Microbial Sciences, University of Reading, Whiteknights, Reading RG6 6AJ, United Kingdom.

    The Drosophila melanogaster genome has six physically clustered NK-related homeobox genes in just 180 kb. Here we show that the NK homeobox gene cluster was an ancient feature of bilaterian animal genomes, but has been secondarily split in chordate ancestry. The NK homeobox gene clusters of amphioxus and vertebrates are each split and dispersed at two equivalent intergenic positions. From the ancestral NK gene cluster, only the Tlx-Lbx and NK3-NK4 linkages have been retained in chordates. This evolutionary pattern is in marked contrast to the Hox and ParaHox gene clusters, which are compact in amphioxus and vertebrates, but have been disrupted in Drosophila.

    Proceedings of the National Academy of Sciences of the United States of America 2003;100;9;5292-5

  • Whole-genome microarrays of fission yeast: characteristics, accuracy, reproducibility, and processing of array data.

    Lyne R, Burns G, Mata J, Penkett CJ, Rustici G, Chen D, Langford C, Vetrie D and Bähler J

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    Background: The genome of the fission yeast Schizosaccharomyces pombe has recently been sequenced, setting the stage for the post-genomic era of this increasingly popular model organism. We have built fission yeast microarrays, optimised protocols to improve array performance, and carried out experiments to assess various characteristics of microarrays.

    Results: We designed PCR primers to amplify specific probes (180-500 bp) for all known and predicted fission yeast genes, which are printed in duplicate onto separate regions of glass slides together with control elements (approximately 13,000 spots/slide). Fluorescence signal intensities depended on the size and intragenic position of the array elements, whereas the signal ratios were largely independent of element properties. Only the coding strand is covalently linked to the slides, and our array elements can discriminate transcriptional direction. The microarrays can distinguish sequences with up to 70% identity, above which cross-hybridisation contributes to the signal intensity. We tested the accuracy of signal ratios and measured the reproducibility of array data caused by biological and technical factors. Because the technical variability is lower than the biological variability, it is best to use samples prepared from independent biological experiments to obtain repeated measurements with swapping of fluorochromes to prevent dye bias. We also developed a script that discards unreliable data and performs a normalization to correct spatial artefacts.

    Conclusions: This paper provides data for several microarray properties that are rarely measured. The results define critical parameters for microarray design and experiments and provide a framework to optimise and interpret array data. Our arrays give reproducible and accurate expression ratios with high sensitivity. The scripts for primer design and initial data processing as well as primer sequences and detailed protocols are available from our website.

    Funded by: Cancer Research UK: A6517; Wellcome Trust: 077118

    BMC genomics 2003;4;1;27

  • Correlations between gene expression and gene conservation in fission yeast.

    Mata J and Bähler J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Genes can be expressed at a wide range of levels, and they show different degrees of cross-species conservation. We compared gene expression levels to gene conservation by integrating microarray data from fission yeast (Schizosaccharomyces pombe) with lists of "core" genes (present in worm and budding and fission yeasts), "yeast-specific" genes (present in budding and fission yeasts, but not in worm), and "pombe-specific" genes (present in fission yeast only). Whereas a disproportionate number of core genes are highly expressed in vegetatively growing cells, many pombe-specific genes are expressed at lower levels. This bias is less pronounced in cells undergoing sexual development, when many pombe-specific genes become highly expressed. This implies that organism-specific proteins are more likely to function during specialized processes such as cellular differentiation. Accordingly, pombe-specific genes were overrepresented among genes induced during sexual development; they were particularly enriched in a group of genes induced during meiotic prophase, when homologous chromosomes pair and recombine. This raises the possibility that organism-specific genes with functions in meiotic prophase favor speciation by preventing fruitful meiosis between closely related organisms. Finally, the set of genes induced late during sexual differentiation, at the time of spore formation, was enriched in yeast-specific genes, indicating that these genes play specialized roles in ascospore development.

    Funded by: Cancer Research UK: A6517; Wellcome Trust: 077118

    Genome research 2003;13;12;2686-90

  • Pilot survey of expressed sequence tags (ESTs) from the asexual blood stages of Plasmodium vivax in human patients.

    Merino EF, Fernandez-Becerra C, Madeira AM, Machado AL, Durham A, Gruber A, Hall N and del Portillo HA

    Departamento de Parasitologia, ICB, Universidade de São Paulo, São Paulo, Brazil.

    Background: Plasmodium vivax is the most widely distributed human malaria, responsible for 70-80 million clinical cases each year and large socio-economical burdens for countries such as Brazil where it is the most prevalent species. Unfortunately, due to the impossibility of growing this parasite in continuous in vitro culture, research on P. vivax remains largely neglected.

    Methods: A pilot survey of expressed sequence tags (ESTs) from the asexual blood stages of P. vivax was performed. To do so, 1,184 clones from a cDNA library constructed with parasites obtained from 10 different human patients in the Brazilian Amazon were sequenced. Sequences were automatically processed to remove contaminants and low-quality reads. A total of 806 sequences with an average length of 586 bp met these criteria, and their clustering revealed 666 distinct events. The consensus sequence of each cluster and the unique sequences of the singlets were used in similarity searches against different databases that included P. vivax, Plasmodium falciparum, Plasmodium yoelii, Plasmodium knowlesi, Apicomplexa and the GenBank non-redundant database. An E-value of <10^-30 was used to define a significant database match. ESTs were manually assigned a gene ontology (GO) terminology.

    Results: A total of 769 ESTs could be assigned a putative identity based upon sequence similarity to known proteins in GenBank. Moreover, 292 ESTs were annotated and a GO terminology was assigned to 164 of them.

    Conclusion: These are the first ESTs reported for P. vivax and, as such, they represent a valuable resource to assist in the annotation of the P. vivax genome currently being sequenced. Moreover, since the GC-content of the P. vivax genome is strikingly different from that of P. falciparum, these ESTs will help in the validation of gene predictions for P. vivax and to create a gene index of this malaria parasite.

    Malaria journal 2003;2;21

  • The phusion assembler.

    Mullikin JC and Ning Z

    Informatics Department, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    The Phusion assembler has assembled the mouse genome from the whole-genome shotgun (WGS) dataset collected by the Mouse Genome Sequencing Consortium, at ~7.5x sequence coverage, producing a high-quality draft assembly 2.6 gigabases in size, with 90% of its bases in 479 scaffolds. For the mouse genome, which is large and repeat-rich, the input dataset was designed to include a high proportion of paired-end sequences from size-selected inserts of 2-200 kbp in various host vector templates. Phusion uses sequence data, called reads, and information about reads that share common templates, called read pairs, to drive the assembly of this large genome to highly accurate results. The preassembly stage, which clusters the reads into sensible groups, is a key element of the entire assembler, because it permits a simple approach to parallelization of the assembly stage, as each cluster can be treated independently of the others. In addition to the application of Phusion to the mouse genome, we also present results from the WGS assembly of Caenorhabditis briggsae sequenced to about 11x coverage. The C. briggsae assembly was accessioned through EMBL using the series CAAC01000001-CAAC01000578; however, the Phusion mouse assembly described here was not accessioned. The mouse data were generated by the Mouse Genome Sequencing Consortium. The C. briggsae sequence was generated at The Wellcome Trust Sanger Institute and the Genome Sequencing Center, Washington University School of Medicine.

    Genome research 2003;13;1;81-90
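
    The idea of a preassembly stage that partitions reads into independent clusters can be sketched as follows. The abstract does not state the clustering criterion, so grouping reads that share at least one exact k-mer is an assumption made here for illustration, as are the function name cluster_reads and the default k. This is not the Phusion implementation.

```python
from collections import defaultdict


def cluster_reads(reads, k=5):
    """Group reads into clusters using union-find.

    Assumption for illustration: two reads belong to the same cluster
    if they share at least one exact k-mer (directly or transitively).
    Returns a sorted list of clusters, each a list of read indices.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    # Index every k-mer to the reads that contain it.
    index = defaultdict(list)
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            index[read[j:j + k]].append(i)

    # Merge all reads that share a k-mer.
    for ids in index.values():
        for other in ids[1:]:
            union(ids[0], other)

    groups = defaultdict(list)
    for i in range(len(reads)):
        groups[find(i)].append(i)
    return sorted(groups.values())
```

    Because the resulting clusters share no reads, each one could in principle be handed to a separate assembly job, which is the parallelization benefit the abstract describes.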

  • The DNA sequence and analysis of human chromosome 6.

    Mungall AJ, Palmer SA, Sims SK, Edwards CA, Ashurst JL, Wilming L, Jones MC, Horton R, Hunt SE, Scott CE, Gilbert JG, Clamp ME, Bethel G, Milne S, Ainscough R, Almeida JP, Ambrose KD, Andrews TD, Ashwell RI, Babbage AK, Bagguley CL, Bailey J, Banerjee R, Barker DJ, Barlow KF, Bates K, Beare DM, Beasley H, Beasley O, Bird CP, Blakey S, Bray-Allen S, Brook J, Brown AJ, Brown JY, Burford DC, Burrill W, Burton J, Carder C, Carter NP, Chapman JC, Clark SY, Clark G, Clee CM, Clegg S, Cobley V, Collier RE, Collins JE, Colman LK, Corby NR, Coville GJ, Culley KM, Dhami P, Davies J, Dunn M, Earthrowl ME, Ellington AE, Evans KA, Faulkner L, Francis MD, Frankish A, Frankland J, French L, Garner P, Garnett J, Ghori MJ, Gilby LM, Gillson CJ, Glithero RJ, Grafham DV, Grant M, Gribble S, Griffiths C, Griffiths M, Hall R, Halls KS, Hammond S, Harley JL, Hart EA, Heath PD, Heathcott R, Holmes SJ, Howden PJ, Howe KL, Howell GR, Huckle E, Humphray SJ, Humphries MD, Hunt AR, Johnson CM, Joy AA, Kay M, Keenan SJ, Kimberley AM, King A, Laird GK, Langford C, Lawlor S, Leongamornlert DA, Leversha M, Lloyd CR, Lloyd DM, Loveland JE, Lovell J, Martin S, Mashreghi-Mohammadi M, Maslen GL, Matthews L, McCann OT, McLaren SJ, McLay K, McMurray A, Moore MJ, Mullikin JC, Niblett D, Nickerson T, Novik KL, Oliver K, Overton-Larty EK, Parker A, Patel R, Pearce AV, Peck AI, Phillimore B, Phillips S, Plumb RW, Porter KM, Ramsey Y, Ranby SA, Rice CM, Ross MT, Searle SM, Sehra HK, Sheridan E, Skuce CD, Smith S, Smith M, Spraggon L, Squares SL, Steward CA, Sycamore N, Tamlyn-Hall G, Tester J, Theaker AJ, Thomas DW, Thorpe A, Tracey A, Tromans A, Tubby B, Wall M, Wallis JM, West AP, White SS, Whitehead SL, Whittaker H, Wild A, Willey DJ, Wilmer TE, Wood JM, Wray PW, Wyatt JC, Young L, Younger RM, Bentley DR, Coulson A, Durbin R, Hubbard T, Sulston JE, Dunham I, Rogers J and Beck S

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Chromosome 6 is a metacentric chromosome that constitutes about 6% of the human genome. The finished sequence comprises 166,880,988 base pairs, representing the largest chromosome sequenced so far. The entire sequence has been subjected to high-quality manual annotation, resulting in the evidence-supported identification of 1,557 genes and 633 pseudogenes. Here we report that at least 96% of the protein-coding genes have been identified, as assessed by multi-species comparative sequence analysis, and provide evidence for the presence of further, otherwise unsupported exons/genes. Among these are genes directly implicated in cancer, schizophrenia, autoimmunity and many other diseases. Chromosome 6 harbours the largest transfer RNA gene cluster in the genome; we show that this cluster co-localizes with a region of high transcriptional activity. Within the essential immune loci of the major histocompatibility complex, we find HLA-B to be the most polymorphic gene on chromosome 6 and in the human genome.

    Nature 2003;425;6960;805-11

  • A table-driven, full-sensitivity similarity search algorithm.

    Myers G and Durbin R

    Department of Computer Science, University of California, Berkeley, Berkeley, CA 94720-1776, USA.

    Searching a database for a local alignment to a query under a typical scoring scheme, such as PAM120 or BLOSUM62 with affine gap costs, is a computation that has resisted algorithmic improvement due to its basis in dynamic programming and the weak nature of the signals being searched for. In a query preprocessing step, a set of tables can be built that permit one to (a) eliminate a large fraction of the dynamic programming matrix from consideration and (b) to compute several steps of the remainder with a single table lookup. While this result is not an asymptotic improvement over the original Smith-Waterman algorithm, its complexity is characterized in terms of some sparse features of the matrix and it yields the fastest software implementation to date for such searches.

    Journal of computational biology : a journal of computational molecular cell biology 2003;10;2;103-17

  • Identification of putative noncoding RNAs among the RIKEN mouse full-length cDNA collection.

    Numata K, Kanai A, Saito R, Kondo S, Adachi J, Wilming LG, Hume DA, Hayashizaki Y, Tomita M, RIKEN GER Group and GSL Members

    Graduate School of Media and Governance, Bioinformatics Program, Keio University, Fujisawa, Kanagawa 252-8520, Japan.

    With the sequencing and annotation of genomes and transcriptomes of several eukaryotes, the importance of noncoding RNA (ncRNA), that is, RNA molecules that are not translated to protein products, has become more evident. A subclass of ncRNA transcripts are encoded by highly regulated, multi-exon, transcriptional units, are processed like typical protein-coding mRNAs and are increasingly implicated in regulation of many cellular functions in eukaryotes. This study describes the identification of candidate functional ncRNAs from among the RIKEN mouse full-length cDNA collection, which contains 60,770 sequences, by using a systematic computational filtering approach. We initially searched for previously reported ncRNAs and found nine murine ncRNAs and homologs of several previously described nonmouse ncRNAs. Through our computational approach to filter artifact-free clones that lack protein coding potential, we extracted 4280 transcripts as the largest-candidate set. Many clones in the set had EST hits, potential CpG islands surrounding the transcription start sites, and homologies with the human genome. This implies that many candidates are indeed transcribed in a regulated manner. Our results demonstrate that ncRNAs are a major functional subclass of processed transcripts in mammals.

    Genome research 2003;13;6B;1301-6

  • A Y chromosomal influence on prostate cancer risk: the multi-ethnic cohort study.

    Paracchini S, Pearce CL, Kolonel LN, Altshuler D, Henderson BE and Tyler-Smith C

    Department of Biochemistry, University of Oxford, South Parks Road, Oxford OX1 3QU, UK.

    Background: A Y chromosomal role in prostate cancer has previously been suggested by both cytogenetic findings and patterns of Y chromosomal gene expression. We took advantage of the well established and stable phylogeny of the non-recombining segment of the Y chromosome to investigate the association between Y chromosomal DNA variation and prostate cancer risk.

    Methods: We examined the distribution of 116 Y lineages in 930 prostate cancer cases and 1208 controls from four ethnic groups from a cohort study in Hawaii and California.

    Results: One lineage, found only among the Japanese group in our study, was associated with a statistically significant predisposition to prostate cancer (odds ratio (OR) = 1.63; 95% confidence interval (CI) 1.07 to 2.47), and, in particular, to high severity disease in younger individuals (OR = 3.89; 95% CI 1.34 to 11.31).

    Conclusions: This finding suggests that a Y chromosomal factor contributes significantly to the development of prostate cancer in Japanese men.

    Funded by: NCI NIH HHS: R01 CA54281, R01 CA63464

    Journal of medical genetics 2003;40;11;815-9
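
    The odds ratios and confidence intervals quoted in the Results above can be related to each other with a short sketch. The abstract does not say how the intervals were computed; the code assumes the common Wald interval on the log-odds scale, under which the point estimate lies at the geometric mean of the two bounds. The function names are hypothetical, and no case or control counts from the study are used.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: carrier cases, b: non-carrier cases,
    c: carrier controls, d: non-carrier controls.
    Assumes all four counts are positive.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


def ci_geometric_midpoint(lower, upper):
    """Geometric mean of the interval bounds; for a Wald interval on the
    log scale this equals the point estimate."""
    return math.sqrt(lower * upper)


# Consistency check using only the figures reported in the abstract.
print(round(ci_geometric_midpoint(1.07, 2.47), 2))   # compare with reported OR = 1.63
print(round(ci_geometric_midpoint(1.34, 11.31), 2))  # compare with reported OR = 3.89
```

    Both midpoints fall within rounding distance of the reported odds ratios, which is what would be expected if the intervals were symmetric on the log scale.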

  • Genomics: Relative pathogenic values.

    Parkhill J and Berry C

    Nature 2003;423;6935;23-5

  • Evolutionary strategies of human pathogens.

    Parkhill J and Thomson N

    The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom.

    Cold Spring Harbor symposia on quantitative biology 2003;68;151-8

  • Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica.

    Parkhill J, Sebaihia M, Preston A, Murphy LD, Thomson N, Harris DE, Holden MT, Churcher CM, Bentley SD, Mungall KL, Cerdeño-Tárraga AM, Temple L, James K, Harris B, Quail MA, Achtman M, Atkin R, Baker S, Basham D, Bason N, Cherevach I, Chillingworth T, Collins M, Cronin A, Davis P, Doggett J, Feltwell T, Goble A, Hamlin N, Hauser H, Holroyd S, Jagels K, Leather S, Moule S, Norberczak H, O'Neil S, Ormond D, Price C, Rabbinowitsch E, Rutter S, Sanders M, Saunders D, Seeger K, Sharp S, Simmonds M, Skelton J, Squares R, Squares S, Stevens K, Unwin L, Whitehead S, Barrell BG and Maskell DJ

    The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica are closely related Gram-negative beta-proteobacteria that colonize the respiratory tracts of mammals. B. pertussis is a strict human pathogen of recent evolutionary origin and is the primary etiologic agent of whooping cough. B. parapertussis can also cause whooping cough, and B. bronchiseptica causes chronic respiratory infections in a wide range of animals. We sequenced the genomes of B. bronchiseptica RB50 (5,338,400 bp; 5,007 predicted genes), B. parapertussis 12822 (4,773,551 bp; 4,404 genes) and B. pertussis Tohama I (4,086,186 bp; 3,816 genes). Our analysis indicates that B. parapertussis and B. pertussis are independent derivatives of B. bronchiseptica-like ancestors. During the evolution of these two host-restricted species there was large-scale gene loss and inactivation; host adaptation seems to be a consequence of loss, not gain, of function, and differences in virulence may be related to loss of regulatory or control functions.

    Nature genetics 2003;35;1;32-40

  • 400000 nematode ESTs on the Net.

    Parkinson J, Mitreva M, Hall N, Blaxter M and McCarter JP

    Institute of Cell, Animal and Population Biology, Ashworth Laboratories, King's Buildings, West Mains Rd, University of Edinburgh, EH9 3JT, Edinburgh, UK.

    The parasitic nematode expressed sequence tag (EST) project, a collaboration between the University of Edinburgh and the Wellcome Trust Sanger Institute in the UK and the Genome Sequencing Center, St Louis, MO, USA, is currently generating sequence information from >30 different species of nematode. Over 400000 nematode ESTs are now available and at least another 130000 are planned. Here, we provide an update on the status of the project and describe the database tools being developed to disseminate these data.

    Funded by: NIAID NIH HHS: AI 46593

    Trends in parasitology 2003;19;7;283-6

  • Multiple inverted DNA repeats of Bacteroides fragilis that control polysaccharide antigenic variation are similar to the hin region inverted repeats of Salmonella typhimurium.

    Patrick S, Parkhill J, McCoy LJ, Lennard N, Larkin MJ, Collins M, Sczaniecka M and Blakely G

    Microbiology and Immunobiology, The Queen's University of Belfast, Grosvenor Road, Belfast BT12 6BN, UK.

    The important opportunistic pathogen Bacteroides fragilis is a strictly anaerobic Gram-negative bacterium and a member of the normal resident human gastrointestinal microbiota. Our earlier studies indicated that there is considerable within-strain variation in polysaccharide expression, as detected by mAb labelling. Analysis of the genome sequence has revealed multiple invertible DNA regions, designated fragilis invertible (fin) regions, seven of which are upstream of polysaccharide biosynthesis loci and are approximately 226 bp in size. Using orientation-specific PCR primers and sequence analysis with populations enriched for one antigenic type, two of these invertible regions were assigned to heteropolymeric polysaccharides with different sizes of repeating units, as determined by PAGE pattern. The implication of these findings is that inversion of the fin regions switches biosynthesis of these polysaccharides off and on. The invertible regions are bound by inverted repeats of 30 or 32 bp with striking similarity to the Salmonella typhimurium H flagellar antigen inversion cross-over (hix) recombination sites of the invertible hin region. It has been demonstrated that a plasmid-encoded Hin invertase homologue (FinB), present in B. fragilis NCTC 9343, binds specifically to the invertible regions and the recombination sites have been designated as fragilis inversion cross-over (fix) sites.

    Microbiology (Reading, England) 2003;149;Pt 4;915-24

  • Identification of a structurally distinct CD101 molecule encoded in the 950-kb Idd10 region of NOD mice.

    Penha-Gonçalves C, Moule C, Smink LJ, Howson J, Gregory S, Rogers J, Lyons PA, Suttie JJ, Lord CJ, Peterson LB, Todd JA and Wicker LS

    Juvenile Diabetes Research Foundation/Wellcome Trust (JDRF/WT) Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, University of Cambridge, Addenbrooke's Hospital, Cambridge CB2 2XY, U.K.

    Genes affecting autoimmune type 1 diabetes susceptibility in the nonobese diabetic (NOD) mouse (Idd loci) have been mapped using a congenic strain breeding strategy. In the present study, we used a combination of BAC clone contig construction, polymorphism analysis of DNA from congenic strains, and sequence mining of the human orthologous region to generate an integrated map of the Idd10 region on mouse chromosome 3. We found seven genes and one pseudogene in the 950-kb Idd10 region. Although all seven genes in the interval are Idd10 candidates, we suggest the gene encoding the EWI immunoglobulin subfamily member EWI-101 (Cd101) as the most likely Idd10 candidate because of the previously reported immune-associated properties of the human CD101 molecule. Additional support for the candidacy of Cd101 is the presence of 17 exonic single-nucleotide polymorphisms that differ between the NOD and B6 sequences, 10 causing amino acid substitutions in the predicted CD101 protein. Four of these 10 substitutions are nonconservative, 2 of which could potentially alter N-linked glycosylation. Considering our results together with those previous reports that antibodies recognizing human CD101 modulate human T-cell and dendritic cell function, there is now justification to test whether the alteration of CD101 function affects autoimmune islet destruction.

    Diabetes 2003;52;6;1551-6

  • Composition, acquisition, and distribution of the Vi exopolysaccharide-encoding Salmonella enterica pathogenicity island SPI-7.

    Pickard D, Wain J, Baker S, Line A, Chohan S, Fookes M, Barron A, Gaora PO, Chabalgoity JA, Thanky N, Scholes C, Thomson N, Quail M, Parkhill J and Dougan G

    Centre for Molecular Microbiology and Infection, Department of Biological Sciences, Imperial College of Science, Technology and Medicine, Armstrong Road, London SW7 2AZ, UK.

    Vi capsular polysaccharide production is encoded by the viaB locus, which has a limited distribution in Salmonella enterica serovars. In S. enterica serovar Typhi, viaB is encoded on a 134-kb pathogenicity island known as SPI-7 that is located between partially duplicated tRNA(pheU) sites. Functional and bioinformatic analysis suggests that SPI-7 has a mosaic structure and may have evolved as a consequence of several independent insertion events. Analysis of viaB-associated DNA in Vi-positive S. enterica serovar Paratyphi C and S. enterica serovar Dublin isolates revealed the presence of similar SPI-7 islands. In S. enterica serovars Paratyphi C and Dublin, the SopE bacteriophage and a 15-kb fragment adjacent to the intact tRNA(pheU) site were absent. In S. enterica serovar Paratyphi C only, a region encoding a type IV pilus involved in the adherence of S. enterica serovar Typhi to host cells was missing. The remainder of the SPI-7 islands investigated exhibited over 99% DNA sequence identity in the three serovars. Of 30 other Salmonella serovars examined, 24 contained no insertions at the equivalent tRNA(pheU) site, 2 had a 3.7-kb insertion, and 4 showed sequence variation at the tRNA(pheU)-phoN junction, which was not analyzed further. Sequence analysis of the SPI-7 region from S. enterica serovar Typhi strain CT18 revealed significant synteny with clusters of genes from a variety of saprophytic bacteria and phytobacteria, including Pseudomonas aeruginosa and Xanthomonas axonopodis pv. citri. This analysis suggested that SPI-7 may be a mobile element, such as a conjugative transposon or an integrated plasmid remnant.

    Journal of bacteriology 2003;185;17;5055-65

  • Cobalamin synthesis in Yersinia enterocolitica 8081. Functional aspects of a putative metabolic island.

    Prentice MB, Cuccui J, Thomson N, Parkhill J, Deery E and Warren MJ

    Bart's and the London Medical School, London EC1A 7BE, UK.

    Advances in experimental medicine and biology 2003;529;43-6

  • Transgenics at breaking-point.

    Prosser H and Bradley A

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.

    In this issue of Cancer Cell, Forster et al. (2003) have generated mice that recapitulate both the mechanism (sporadic somatic translocation) and the consequences (expression of two translocation fusion genes) leading to an accurate leukemia model.

    Cancer cell 2003;3;5;411-3

  • Manipulation of the mouse genome: a multiple impact resource for drug discovery and development.

    Prosser H and Rastan S

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SA UK.

    Few would deny that the pharmaceutical industry's investment in genomics throughout the 1990s has yet to deliver in terms of drugs on the market. The reasons are complex and beyond the scope of this review. The unique ability to manipulate the mouse genome, however, has already had a positive impact on all stages of the drug discovery process and, increasingly, on the drug development process too. We give an overview of some recent applications of so-called 'transgenic' mouse technology in pharmaceutical research and development. We show how genetic manipulation in the mouse can be employed at multiple points in the drug discovery and development process, providing new solutions to old problems.

    Trends in biotechnology 2003;21;5;224-32

  • OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy.

    Raghava GP, Searle SM, Audley PC, Barber JD and Barton GJ

    School of Life Sciences, University of Dundee, Dow St, Dundee, DD1 5EH, Scotland, UK.

    Background: The alignment of two or more protein sequences provides a powerful guide in the prediction of the protein structure and in identifying key functional residues; however, the utility of any prediction is completely dependent on the accuracy of the alignment. In this paper we describe a suite of reference alignments derived from the comparison of protein three-dimensional structures together with evaluation measures and software that allow automatically generated alignments to be benchmarked. We test the OXBench benchmark suite on alignments generated by the AMPS multiple alignment method, then apply the suite to compare eight different multiple alignment algorithms. The benchmark shows the current state of the art for alignment accuracy and provides a baseline against which new alignment algorithms may be judged.

    Results: The simple hierarchical multiple alignment algorithm, AMPS, performed as well as or better than more modern methods such as CLUSTALW once the PAM250 pair-score matrix was replaced by a BLOSUM series matrix. AMPS gave an accuracy in Structurally Conserved Regions (SCRs) of 89.9% over a set of 672 alignments. The T-COFFEE method on a data set of families with <8 sequences gave 91.4% accuracy, significantly better than CLUSTALW (88.9%) and all other methods considered here. The complete suite is available from

    Conclusions: The OXBench suite of reference alignments, evaluation software and results database provide a convenient method to assess progress in sequence alignment techniques. Evaluation measures that were dependent on comparison to a reference alignment were found to give good discrimination between methods. The STAMP Sc Score, which is independent of a reference alignment, also gave good discrimination. Application of OXBench in this paper shows that, with the exception of T-COFFEE, the majority of the improvement in alignment accuracy seen since 1985 stems from improved pair-score matrices rather than algorithmic refinements. The maximum theoretical alignment accuracy obtained by pooling results over all methods was 94.5% with 52.5% accuracy for alignments in the 0-10 percentage identity range. This suggests that further improvements in accuracy will be possible in the future.

    BMC bioinformatics 2003;4;47

  • Ehlers-Danlos syndrome with severe early-onset periodontal disease (EDS-VIII) is a distinct, heterogeneous disorder with one predisposition gene at chromosome 12p13.

    Rahman N, Dunstan M, Teare MD, Hanks S, Douglas J, Coleman K, Bottomly WE, Campbell ME, Berglund B, Nordenskjöld M, Forssell B, Burrows N, Lunt P, Young I, Williams N, Bignell GR, Futreal PA and Pope FM

    Section of Cancer Genetics, Institute of Cancer Research, Brooks-Lawley Building, 15 Cotswold Road, Sutton, Surrey SM2 5NG, United Kingdom.

    Ehlers-Danlos VIII (EDS-VIII) is an autosomal dominant disorder characterized by severe early-onset periodontal disease in conjunction with the features of Ehlers-Danlos syndrome (EDS). We performed a genomewide linkage search in a large Swedish pedigree with EDS-VIII and established linkage to a 7-cM interval on chromosome 12p13, generating a maximum multipoint LOD score of 5.17. Analysis of four further pedigrees with EDS-VIII revealed two consistent with linkage to 12p13 and two in which linkage could be excluded, indicating that EDS-VIII is a genetically heterogeneous disorder. Chromosome 12p13 has not previously been implicated in either EDS or periodontal disease and contains no known collagen genes or collagen-processing enzymes. Mutational screening of the microfibril-associated glycoprotein-2 gene, a strong candidate within the minimal interval, did not reveal any likely pathogenic mutations.

    American journal of human genetics 2003;73;1;198-204

  • Determination of Escherichia coli RNA polymerase structure by single particle cryoelectron microscopy.

    Ray P, Klaholz BP, Finn RD, Orlova EV, Burrows PC, Gowen B, Buck M and van Heel M

    Department of Biological Sciences, Wolfson Laboratories, Imperial College of London, Rm. 313, London SW7 2AY, United Kingdom.

    Methods in enzymology 2003;370;24-42

  • The finished genome sequence of Homo sapiens.

    Rogers J

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom.

    Cold Spring Harbor symposia on quantitative biology 2003;68;1-11

  • Variants in CHEK2 other than 1100delC do not make a major contribution to breast cancer susceptibility.

    Schutte M, Seal S, Barfoot R, Meijers-Heijboer H, Wasielewski M, Evans DG, Eccles D, Meijers C, Lohman F, Klijn J, van den Ouweland A, Futreal PA, Nathanson KL, Weber BL, Easton DF, Stratton MR, Rahman N and Breast Cancer Linkage Consortium

    Department of Medical Oncology, Erasmus Medical Center, Rotterdam, The Netherlands.

    We recently reported that a sequence variant in the cell-cycle-checkpoint kinase CHEK2 (CHEK2 1100delC) is a low-penetrance breast cancer-susceptibility allele in noncarriers of BRCA1 or BRCA2 mutations. To investigate whether other CHEK2 variants confer susceptibility to breast cancer, we screened the full CHEK2 coding sequence in BRCA1/2-negative breast cancer cases from 89 pedigrees with three or more cases of breast cancer. We identified one novel germline variant, R117G, in two separate families. To evaluate the possible association of R117G and two germline variants reported elsewhere, R145W and I157T, with breast cancer, we screened 737 BRCA1/2-negative familial breast cancer cases from 605 families, 459 BRCA1/2-positive cases from 335 families, and 723 controls from the United Kingdom, the Netherlands, and North America. All three variants were rare in all groups, and none occurred at significantly elevated frequency in familial breast cancer cases compared with controls. These results indicate that 1100delC may be the only CHEK2 allele that makes an appreciable contribution to breast cancer susceptibility.

    American journal of human genetics 2003;72;4;1023-8

  • Evaluation of Fanconi Anemia genes in familial breast cancer predisposition.

    Seal S, Barfoot R, Jayatilake H, Smith P, Renwick A, Bascombe L, McGuffog L, Evans DG, Eccles D, Easton DF, Stratton MR, Rahman N and Breast Cancer Susceptibility Collaboration

    Section of Cancer Genetics, Institute of Cancer Research, Sutton, Surrey, United Kingdom.

    Fanconi Anemia (FA) is an autosomal recessive syndrome characterized by congenital abnormalities, progressive bone marrow failure, and susceptibility to cancer. FA has eight known complementation groups and is caused by mutations in at least seven genes. Biallelic BRCA2 mutations were shown recently to cause FA-D1. Monoallelic (heterozygous) BRCA2 mutations confer a high risk of breast cancer and are a major cause of familial breast cancer. To investigate whether heterozygous variants in other FA genes are high penetrance breast cancer susceptibility alleles, we screened germ-line DNA from 88 BRCA1/2-negative families, each with at least three cases of breast cancer, for mutations in FANCA, FANCC, FANCD2, FANCE, FANCF, and FANCG. Sixty-nine sequence variants were identified of which 25 were exonic. None of the exonic variants resulted in translational frameshifts or nonsense codons and 14 were polymorphisms documented previously. Of the remaining 11 exonic variants, 2 resulted in synonymous changes, and 7 were present in controls. Only 2 conservative missense variants, 1 in FANCA and 1 in FANCE, were each found in a single family and were not present in 300 controls. The results indicate that FA gene mutations, other than in BRCA2, are unlikely to be a frequent cause of highly penetrant breast cancer predisposition.

    Cancer research 2003;63;24;8596-9

  • A bad combination.

    Sebaihia M, Bentley S, Crossman L, Thomson N and Parkhill J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Trends in microbiology 2003;11;7;297-9

  • The good, the bad and the ugly?

    Sebaihia M, Bentley SD, Holden MT and Parkhill J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    Trends in microbiology 2003;11;5;204-5

  • Construction and integration of radiation-hybrid and cytogenetic maps of dog Chromosome X.

    Spriggs HF, Holmes NG, Breen MG, Deloukas PG, Langford CF, Ross MT, Carter NP, Davis ME, Knights CE, Smith AE, Farr CJ, McCarthy LC and Binns MM

    Animal Health Trust, Lanwades Park, Kentford, Newmarket, Suffolk, CB8 7UU, UK.

    Chromosome (Chr) X is under-represented in current maps of the genome of the domestic dog (Canis familiaris). To address this problem, we have constructed a small-insert, genomic DNA library in pBluescript from flow-sorted canine Chr X DNA. Fluorescence in situ hybridization (FISH) studies confirmed that the library was highly enriched for Chr X. Clones containing microsatellites were identified and sequenced. Database searches detected significant sequence identity between four X-derived clones and genes previously characterized in other species. Thirty-seven markers derived from these clones were mapped on Chr X by FISH, and of these, 28 were mapped by using the female-derived T72 whole-genome radiation hybrid (RH) panel (Research Genetics). Four X-linked canine genes from publicly available data were also mapped. Eight RH linkage groups with LOD >4.0 were identified, and FISH data were used to locate the groups on the chromosome; four groups could be unambiguously orientated by FISH data. In each case, the FISH and RH data were mutually consistent. The data suggest strongly conserved synteny between canine and human X Chrs. The pseudoautosomal region has been further characterized, and the putative or actual locations of nine genes of clinical relevance have been suggested.

    Mammalian genome : official journal of the International Mammalian Genome Society 2003;14;3;214-21

  • The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.

    Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, Coulson A, D'Eustachio P, Fitch DH, Fulton LA, Fulton RE, Griffiths-Jones S, Harris TW, Hillier LW, Kamath R, Kuwabara PE, Mardis ER, Marra MA, Miner TL, Minx P, Mullikin JC, Plumb RW, Rogers J, Schein JE, Sohrmann M, Spieth J, Stajich JE, Wei C, Willey D, Wilson RK, Durbin R and Waterston RH

    Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.

    The soil nematodes Caenorhabditis briggsae and Caenorhabditis elegans diverged from a common ancestor roughly 100 million years ago and yet are almost indistinguishable by eye. They have the same chromosome number and genome sizes, and they occupy the same ecological niche. To explore the basis for this striking conservation of structure and function, we have sequenced the C. briggsae genome to a high-quality draft stage and compared it to the finished C. elegans sequence. We predict approximately 19,500 protein-coding genes in the C. briggsae genome, roughly the same as in C. elegans. Of these, 12,200 have clear C. elegans orthologs, a further 6,500 have one or more clearly detectable C. elegans homologs, and approximately 800 C. briggsae genes have no detectable matches in C. elegans. Almost all of the noncoding RNAs (ncRNAs) known are shared between the two species. The two genomes exhibit extensive colinearity, and the rate of divergence appears to be higher in the chromosomal arms than in the centers. Operons, a distinctive feature of C. elegans, are highly conserved in C. briggsae, with the arrangement of genes being preserved in 96% of cases. The difference in size between the C. briggsae (estimated at approximately 104 Mbp) and C. elegans (100.3 Mbp) genomes is almost entirely due to repetitive sequence, which accounts for 22.4% of the C. briggsae genome in contrast to 16.5% of the C. elegans genome. Few, if any, repeat families are shared, suggesting that most were acquired after the two species diverged or are undergoing rapid evolution. Coclustering the C. elegans and C. briggsae proteins reveals 2,169 protein families of two or more members. Most of these are shared between the two species, but some appear to be expanding or contracting, and there seem to be as many as several hundred novel C. briggsae gene families. The C. briggsae draft sequence will greatly improve the annotation of the C. elegans genome. Based on similarity to C. briggsae, we found strong evidence for 1,300 new C. elegans genes. In addition, comparisons of the two genomes will help to understand the evolutionary forces that mold nematode genomes.

    Funded by: NHGRI NIH HHS: 5P01 HG00956, 5U01 HG02042, P41 HG02223; NIGMS NIH HHS: R01 GM42432, T32 GM07754-22

    PLoS biology 2003;1;2;E45

  • Identification of candidate tumor-suppressor genes in 6q27 by combined deletion mapping and electronic expression profiling in lymphoid neoplasms.

    Steinemann D, Gesk S, Zhang Y, Harder L, Pilarsky C, Hinzmann B, Martin-Subero JI, Calasanz MJ, Mungall A, Rosenthal A, Siebert R and Schlegelberger B

    Institute of Cell and Molecular Pathology, Hannover Medical School, Hannover, Germany.

    Deletions in the long arm of chromosome 6 (6q) are among the most frequent chromosome aberrations in lymphoid neoplasms. Recently, the region of minimal deletion (RMD1) in 6q27 was narrowed down to 5-9 Mb. In the present study, we aimed to define the distal border of the commonly lost region in 6q27 more precisely and to identify and investigate tumor-suppressor genes (TSGs) from this region. Twenty-nine cases, in which our previous fluorescence in situ hybridization (FISH) screening that used a set of 36 YAC probes revealed loss in 6q25-27, were further investigated by means of FISH. In all cases, deletions of 6q27 extended from yeast artificial chromosome (YAC) 977e10 spanning the proximal border of RMD1 to the most telomeric YAC 933f7 within the recently established YAC-contig of this region. An interstitial homozygous deletion, flanked by the telomeric probe TelVysion6q and YAC 971g12, was detected, which substantially narrows down the RMD1. To identify candidate TSGs down-regulated in malignant lymphomas from this region of homozygous loss, we performed electronic profiling of expressed sequences mapped to this region. This analysis suggested that the gene PDCD2, originally thought to be involved in programmed cell death, is probably down-regulated in malignant B-cell lymphomas compared to normal B lymphocytes. Nevertheless, mutation analyses failed to identify mutations in the coding region of PDCD2 in nine lymphomas with FISH-proved 6q27 deletions. Furthermore, epigenetic studies in these nine and an additional 48 lymphomas did not show altered methylation of the PDCD2 locus in these tumors. Possibly haploinsufficiency is effectual in accelerating tumor progression.

    Genes, chromosomes & cancer 2003;37;4;421-6

  • Insights into the effects on metal binding of the systematic substitution of five key glutamate ligands in the ferritin of Escherichia coli.

    Stillman TJ, Connolly PP, Latimer CL, Morland AF, Quail MA, Andrews SC, Treffry A, Guest JR, Artymiuk PJ and Harrison PM

    Krebs Institute, Department of Molecular Biology and Biotechnology, University of Sheffield, Sheffield S10 2TN, United Kingdom.

    Ferritins are nearly ubiquitous iron storage proteins playing a fundamental role in iron metabolism. They are composed of 24 subunits forming a spherical protein shell encompassing a central iron storage cavity. The iron storage mechanism involves the initial binding and subsequent O2-dependent oxidation of two Fe2+ ions located at sites A and B within the highly conserved dinuclear "ferroxidase center" in individual subunits. Unlike animal ferritins and the heme-containing bacterioferritins, the Escherichia coli ferritin possesses an additional iron-binding site (site C) located on the inner surface of the protein shell close to the ferroxidase center. We report the structures of five E. coli ferritin variants and their Fe3+ and Zn2+ (a redox-stable alternative for Fe2+) derivatives. Single carboxyl ligand replacements in sites A, B, and C gave unique effects on metal binding, which explain the observed changes in Fe2+ oxidation rates. Binding of Fe2+ at both A and B sites is clearly essential for rapid Fe2+ oxidation, and the linking of FeB2+ to FeC2+ enables the oxidation of three Fe2+ ions. The transient binding of Fe2+ at one of three newly observed Zn2+ sites may allow the oxidation of four Fe2+ by one dioxygen molecule.

    The Journal of biological chemistry 2003;278;28;26275-86

  • Sequence-based cancer genomics: progress, lessons and opportunities.

    Strausberg RL, Simpson AJ and Wooster R

    National Cancer Institute, 31 Center Drive, Room 10A07, Bethesda, Maryland 20892, USA.

    Technologies that provide a genome-wide view offer an unprecedented opportunity to scrutinize the molecular biology of the cancer cell. The information that is derived from these technologies is well suited to the development of public databases of alterations in the cancer genome and its expression. Here, we describe the synergistic efforts of research programmes in Brazil, the United Kingdom and the United States towards building integrated databases that are widely accessible to the research community, to enable basic and applied applications in cancer research.

    Nature reviews. Genetics 2003;4;6;409-18

  • Domain architectures of sigma54-dependent transcriptional activators.

    Studholme DJ and Dixon R

    Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom.

    Journal of bacteriology 2003;185;6;1757-67

  • A DNA element recognised by the molybdenum-responsive transcription factor ModE is conserved in Proteobacteria, green sulphur bacteria and Archaea.

    Studholme DJ and Pau RN

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK.

    Background: The transition metal molybdenum is essential for life. Escherichia coli imports this metal into the cell in the form of molybdate ions, which are taken up via an ABC transport system. In E. coli and other Proteobacteria molybdenum metabolism and homeostasis are regulated by the molybdate-responsive transcription factor ModE.

    Results: Orthologues of ModE are widespread amongst diverse prokaryotes, but not ubiquitous. We identified probable ModE-binding sites upstream of genes implicated in molybdenum metabolism in green sulphur bacteria and methanogenic Archaea as well as in Proteobacteria. We also present evidence of horizontal transfer of nitrogen fixation genes between green sulphur bacteria and methanogenic Archaea.

    Conclusions: Whereas most of the archaeal helix-turn-helix-containing transcription factors belong to families that are Archaea-specific, ModE is unusual in that it is found in both Archaea and Bacteria. Moreover, its cognate upstream DNA recognition sequence is also conserved between Archaea and Bacteria, despite the fundamental differences in their core transcription machinery. ModE is the third example of a transcriptional regulator with a binding signal that is conserved in Bacteria and Archaea.

    BMC microbiology 2003;3;24

  • A comparison of Pfam and MEROPS: two databases, one comprehensive, and one specialised.

    Studholme DJ, Rawlings ND, Barrett AJ and Bateman A

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    Background: We wished to compare two databases based on sequence similarity: one that aims to be comprehensive in its coverage of known sequences, and one that specialises in a relatively small subset of known sequences. One of the motivations behind this study was quality control. Pfam is a comprehensive collection of alignments and hidden Markov models representing families of proteins and domains. MEROPS is a catalogue and classification of enzymes with proteolytic activity (peptidases or proteases). These secondary databases are used by researchers worldwide, yet their contents are not peer reviewed. Therefore, we hoped that a systematic comparison of the contents of Pfam and MEROPS would highlight missing members and false-positives leading to improvements in quality of both databases. An additional reason for carrying out this study was to explore the extent of consensus in the definition of a protein family.

    Results: About half (89 out of 174) of the peptidase families in MEROPS overlapped single Pfam families. A further 32 MEROPS families overlapped multiple Pfam families. Where possible, new Pfam families were built to represent most of the MEROPS families that did not overlap Pfam. When comparing the numbers of sequences found in the overlap between a MEROPS family and its corresponding Pfam family, in most cases the overlap was substantial (52 pairs of MEROPS and Pfam families had an intersection size of greater than 75% of the union) but there were some differences in the sets of sequences included in the MEROPS families versus the overlapping Pfam families.

    Conclusions: A number of the discrepancies between MEROPS families and their corresponding Pfam families arose from differences in the aims and philosophies of the two databases. Examination of some of the discrepancies highlighted additional members of families, which have subsequently been added in both Pfam and MEROPS. This has led to improvements in the quality of both databases. Overall there was a great deal of consensus between the databases in definitions of a protein family.

    Funded by: Wellcome Trust: 087656

    BMC bioinformatics 2003;4;17
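
    The overlap measure quoted in the abstract above (intersection size as a fraction of the union) can be illustrated with a minimal sketch. The function name, the sequence identifiers and the example sets are hypothetical and are not taken from the paper or from either database; only the 75% threshold comes from the abstract.

```python
def overlap_fraction(family_a, family_b):
    """Return |A intersect B| / |A union B| for two collections of sequence IDs."""
    a, b = set(family_a), set(family_b)
    union = a | b
    if not union:
        # Two empty families share nothing; define the overlap as 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical sequence identifiers, for illustration only.
merops_family = {"seq1", "seq2", "seq3", "seq4"}
pfam_family = {"seq2", "seq3", "seq4", "seq5"}

fraction = overlap_fraction(merops_family, pfam_family)
# Three shared sequences out of five distinct ones.
is_substantial = fraction > 0.75  # threshold quoted in the abstract
print(f"overlap = {fraction:.2f}, substantial = {is_substantial}")
```

    In this toy case the two families share three of five distinct sequences, so the pair would fall below the 75% threshold.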

  • Beyond release: the equitable use of genomic information.

    Sulston J

    Wellcome Trust Sanger Institute, Hinxton, CB10 1RQ, Cambridge, UK.

    Lancet 2003;362;9381;400-2

  • A canine cancer-gene microarray for CGH analysis of tumors.

    Thomas R, Fiegler H, Ostrander EA, Galibert F, Carter NP and Breen M

    Oncology Research Group, Centre for Preventive Medicine, Animal Health Trust, Lanwades Park, Kentford, Newmarket, Suffolk, UK.

    As with many human cancers, canine tumors demonstrate recurrent chromosome aberrations. A detailed knowledge of such aberrations may facilitate diagnosis, prognosis and the selection of appropriate therapy. Following recent advances made in human genomics, we are developing a DNA microarray for the domestic dog, to be used in the detection and characterization of copy number changes in canine tumors. As a proof of principle, we have developed a small-scale microarray comprising 87 canine BAC clones. The array is composed of 26 clones selected from a panel of 24 canine cancer genes, representing 18 chromosomes, and an additional set of clones representing dog chromosomes 11, 13, 14 and 31. These chromosomes were shown previously to be commonly aberrant in canine multicentric malignant lymphoma. Clones representing the sex chromosomes were also included. We outline the principles of canine microarray development, and present data obtained from microarray analysis of three canine lymphoma cases previously characterized using conventional cytogenetic techniques.

    Cytogenetic and genome research 2003;102;1-4;254-60

  • Fitting the niche by genomic adaptation.

    Thomson N, Bentley S, Holden M and Parkhill J

    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Studying microbial genomics has shown that the genomes of bacteria are extremely dynamic in evolutionary terms. Many research groups have linked the adaptation of an organism to a niche to large changes in genome size and content. A number of recent papers have underlined the degree to which the genomes of different organisms are a reflection of the opportunities and constraints imposed by their chosen niche.

    Nature reviews. Microbiology 2003;1;2;92-3

  • The value of comparison.

    Thomson N, Sebaihia M, Cerdeño-Tárraga A, Bentley S, Crossman L and Parkhill J

    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    With the number of published microbial genomes now in excess of 100, any new genome that is sequenced is likely to have a close relative available for comparison. Indeed, it is increasingly difficult to perform any genomic analysis that is not comparative. This should, however, not be seen as a drawback; it is often the case that a large amount of information can be drawn from these comparisons, especially between closely related organisms. Several genome sequences published recently indicate the value of comparisons at the genomic level.

    Nature reviews. Microbiology 2003;1;1;11-2

  • The building blocks of pathogenicity.

    Thomson NR and Parkhill J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Trends in microbiology 2003;11;2;66-7

  • All walks of life.

    Thomson NR, Cerdeño-Tárraga A, Crossman L, Sebaihia M and Parkhill J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Trends in microbiology 2003;11;4;159-60

  • Association of the T-cell regulatory gene CTLA4 with susceptibility to autoimmune disease.

    Ueda H, Howson JM, Esposito L, Heward J, Snook H, Chamberlain G, Rainbow DB, Hunter KM, Smith AN, Di Genova G, Herr MH, Dahlman I, Payne F, Smyth D, Lowe C, Twells RC, Howlett S, Healy B, Nutland S, Rance HE, Everett V, Smink LJ, Lam AC, Cordell HJ, Walker NM, Bordin C, Hulme J, Motzo C, Cucca F, Hess JF, Metzker ML, Rogers J, Gregory S, Allahabadia A, Nithiyananthan R, Tuomilehto-Wolf E, Tuomilehto J, Bingley P, Gillespie KM, Undlien DE, Rønningen KS, Guja C, Ionescu-Tîrgovişte C, Savage DA, Maxwell AP, Carson DJ, Patterson CC, Franklyn JA, Clayton DG, Peterson LB, Wicker LS, Todd JA and Gough SC

    Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, University of Cambridge, Wellcome Trust/MRC Building, Cambridge, CB2 2XY, UK.

    Genes and mechanisms involved in common complex diseases, such as the autoimmune disorders that affect approximately 5% of the population, remain obscure. Here we identify polymorphisms of the cytotoxic T lymphocyte antigen 4 gene (CTLA4)--which encodes a vital negative regulatory molecule of the immune system--as candidates for primary determinants of risk of the common autoimmune disorders Graves' disease, autoimmune hypothyroidism and type 1 diabetes. In humans, disease susceptibility was mapped to a non-coding 6.1 kb 3' region of CTLA4, the common allelic variation of which was correlated with lower messenger RNA levels of the soluble alternative splice form of CTLA4. In the mouse model of type 1 diabetes, susceptibility was also associated with variation in CTLA-4 gene splicing with reduced production of a splice form encoding a molecule lacking the CD80/CD86 ligand-binding domain. Genetic mapping of variants conferring a small disease risk can identify pathways in complex disorders, as exemplified by our discovery of inherited, quantitative alterations of CTLA4 contributing to autoimmune tissue destruction.

    Nature 2003;423;6939;506-11

  • QUAD system offers fair shares to all authors.

    Verhagen JV, Wallace KJ, Collins SC and Scott TR

    Nature 2003;426;6967;602

  • Deoxycorticosterone upregulates PDS (Slc26a4) in mouse kidney: role of pendrin in mineralocorticoid-induced hypertension.

    Verlander JW, Hassell KA, Royaux IE, Glapion DM, Wang ME, Everett LA, Green ED and Wall SM

    Department of Medicine, University of Florida College of Medicine, Gainesville, USA.

    Pendrin is an anion exchanger expressed along the apical plasma membrane and apical cytoplasmic vesicles of type B and of non-A, non-B intercalated cells of the distal convoluted tubule, connecting tubule, and cortical collecting duct. Thus, Pds (Slc26a4) is a candidate gene for the putative apical anion-exchange process of the type B intercalated cell. Because apical anion exchange-mediated transport is upregulated with deoxycorticosterone pivalate (DOCP), we tested whether Pds mRNA and protein expression in mouse kidney were upregulated after administration of this aldosterone analogue by using quantitative real-time polymerase chain reaction as well as light and electron microscopic immunolocalization. In kidneys from DOCP-treated mice, Pds mRNA increased 60%, whereas pendrin protein expression in the apical plasma membrane increased 2-fold in non-A, non-B intercalated cells and increased 6-fold in type B cells. Because pendrin transports HCO3- and Cl-, we tested whether DOCP treatment unmasks abnormalities in acid-base or NaCl balance in Pds (-/-) mice. In the absence of DOCP, arterial pH, systolic blood pressure, and body weight were similar in Pds (+/+) and Pds (-/-) mice. After DOCP treatment, weight gain and hypertension were observed in Pds (+/+) but not in Pds (-/-) mice. Moreover, after DOCP administration, metabolic alkalosis was more severe in Pds (-/-) than Pds (+/+) mice. We conclude that pendrin is upregulated with aldosterone analogues and is critical in the pathogenesis of mineralocorticoid-induced hypertension and metabolic alkalosis.

    Funded by: NIDDK NIH HHS: DK 52935

    Hypertension 2003;42;3;356-62

  • Complex transcription and splicing of odorant receptor genes.

    Volz A, Ehlers A, Younger R, Forbes S, Trowsdale J, Schnorr D, Beck S and Ziegler A

    Institut für Immungenetik, Universitätsklinikum Charité, Humboldt-Universität zu Berlin, Spandauer Damm 130, Germany.

    Human major histocompatibility (human leucocyte antigen (HLA)) complex-linked odorant receptor (OR) genes are among the best characterized OR genes in the human genome. In addition to their functions as odorant receptors in olfactory epithelium, they have been suggested to play a role in the fertilization process. Here, we report the first in-depth analysis of their expression and regulation within testicular tissue. Sixteen HLA-linked OR and three non-HLA-linked OR were analyzed. One OR gene (hs6M1-16, in positive transcriptional orientation) exhibited six different transcriptional start sites combined with extensive alternative splicing within the 5'-untranslated region, the coding exon, and the 3'-untranslated region. Long distance splicing, exon sharing, and premature polyadenylation were features of another three OR loci (hs6M1-18, -21, and -27, all upstream of hs6M1-16, but in negative transcriptional orientation). Determination of the transcriptional start sites of these OR genes identified a region of 81 bp with potential bi-directional transcriptional activity. The results demonstrate that HLA-linked OR genes are subject to unusually complex transcriptional regulatory mechanisms.

    The Journal of biological chemistry 2003;278;22;19691-701

  • Modeling del(17)(p11.2p11.2) and dup(17)(p11.2p11.2) contiguous gene syndromes by chromosome engineering in mice: phenotypic consequences of gene dosage imbalance.

    Walz K, Caratini-Rivera S, Bi W, Fonseca P, Mansouri DL, Lynch J, Vogel H, Noebels JL, Bradley A and Lupski JR

    Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA.

    Contiguous gene syndromes (CGS) are a group of disorders associated with chromosomal rearrangements of which the phenotype is thought to result from altered copy numbers of physically linked dosage-sensitive genes. Smith-Magenis syndrome (SMS) is a CGS associated with a deletion within band p11.2 of chromosome 17. Recently, patients harboring the predicted reciprocal duplication product [dup(17)(p11.2p11.2)] have been described as having a relatively mild phenotype. By chromosomal engineering, we created rearranged chromosomes carrying the deletion [Df(11)17] or duplication [Dp(11)17] of the syntenic region on mouse chromosome 11 that spans the genomic interval commonly deleted in SMS patients. Df(11)17/+ mice exhibit craniofacial abnormalities, seizures, marked obesity, and male-specific reduced fertility. Dp(11)17/+ animals are underweight and do not have seizures, craniofacial abnormalities, or reduced fertility. Examination of Df(11)17/Dp(11)17 animals suggests that most of the observed phenotypes result from gene dosage effects. Our murine models represent a powerful tool to analyze the consequences of gene dosage imbalance in this genomic interval and to investigate the molecular genetic bases of both SMS and dup(17)(p11.2p11.2).

    Funded by: NCI NIH HHS: P01 CA 75719

    Molecular and cellular biology 2003;23;10;3646-55

  • Gene discovery in the Entamoeba invadens genome.

    Wang Z, Samuelson J, Clark CG, Eichinger D, Paul J, Van Dellen K, Hall N, Anderson I and Loftus B

    Center for Bio/Molecular Science Naval Research Laboratory, Washington, DC 20375, USA.

    Entamoeba invadens, a parasite of reptiles, is a model for the study of encystation by the human enteric pathogen Entamoeba histolytica, because E. invadens forms cysts in axenic culture. With approximately 0.5-fold sequence coverage of the genome, we were able to get insights into E. invadens gene and genome features. Overall, the E. invadens genome displays many of the features that are emerging from ongoing genome sequencing efforts in E. histolytica. At the nucleotide level the E. invadens genome has on average 60% sequence identity with that of E. histolytica. The presence of introns in E. invadens was predicted with similar consensus (GTTTGT...A/TAG) sequences to those identified in E. histolytica and Entamoeba dispar. Sequences highly repeated in the genome of E. histolytica (rRNAs, tRNAs, CXXC-rich proteins, and Leu-rich repeat proteins) were found to be highly repeated in the E. invadens genome. Numerous proteins homologous to those implicated in amoebic virulence (Gal/GalNAc lectins, amoebapores, and cysteine proteinases) and drug resistance (p-glycoproteins) were identified. Homologs of proteins involved in cell cycle, vesicular trafficking and signal transduction were identified, which may be involved in en/excystation and cell growth of E. invadens. Finally, multiple copies of a number of E. invadens genes coding for predicted enzymes involved in core metabolism and the targets of anti-amoebic drugs were identified.

    Funded by: NIAID NIH HHS: R01 AI46516

    Molecular and biochemical parasitology 2003;129;1;23-31

  • More on the sequencing of the human genome.

    Waterston RH, Lander ES and Sulston JE

    Department of Genome Sciences, University of Washington, Box 357730, Seattle, WA 98195, USA.

    Proceedings of the National Academy of Sciences of the United States of America 2003;100;6;3022-4; author reply 3025-6

  • Breast and ovarian cancer.

    Wooster R and Weber BL

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.

    The New England journal of medicine 2003;348;23;2339-47

  • Leishmania major chromosome 3 contains two long convergent polycistronic gene clusters separated by a tRNA gene.

    Worthey EA, Martinez-Calvillo S, Schnaufer A, Aggarwal G, Cawthra J, Fazelinia G, Fong C, Fu G, Hassebrock M, Hixson G, Ivens AC, Kiser P, Marsolini F, Rickel E, Rickell E, Salavati R, Sisk E, Sunkin SM, Stuart KD and Myler PJ

    Seattle Biomedical Research Institute, 4 Nickerson Street, Seattle, WA 98109-1651, USA.

    Leishmania parasites (order Kinetoplastida, family Trypanosomatidae) cause a spectrum of human diseases ranging from asymptomatic to lethal. The approximately 33.6 Mb genome is distributed among 36 chromosome pairs that range in size from approximately 0.3 to 2.8 Mb. The complete nucleotide sequence of Leishmania major Friedlin chromosome 1 revealed 79 protein-coding genes organized into two divergent polycistronic gene clusters with the mRNAs transcribed towards the telomeres. We report here the complete nucleotide sequence of chromosome 3 (384 518 bp) and an analysis revealing 95 putative protein-coding ORFs. The ORFs are primarily organized into two large convergent polycistronic gene clusters (i.e. transcribed from the telomeres). In addition, a single gene at the left end is transcribed divergently towards the telomere, and a tRNA gene separates the two convergent gene clusters. Numerous genes have been identified, including those for metabolic enzymes, kinases, transporters, ribosomal proteins, spliceosome components, helicases, an RNA-binding protein and a DNA primase subunit.

    Funded by: NIAID NIH HHS: R01 AI053667-01, R01 AI053667-02, R01 AI40599

    Nucleic acids research 2003;31;14;4201-10

  • COP9 signalosome subunit 3 is essential for maintenance of cell proliferation in the mouse embryonic epiblast.

    Yan J, Walz K, Nakamura H, Carattini-Rivera S, Zhao Q, Vogel H, Wei N, Justice MJ, Bradley A and Lupski JR

    Department of Molecular and Human Genetics, Texas Children's Hospital, Houston, Texas 77030, USA.

    Csn3 (Cops3) maps to the mouse chromosome 11 region syntenic to the common deletion interval for the Smith-Magenis syndrome, a contiguous gene deletion syndrome. It encodes the third subunit of an eight-subunit protein complex, the COP9 signalosome (CSN), which controls a wide variety of molecules of different functions. Mutants of this complex caused lethality at early development of both plants and Drosophila melanogaster. CSN function in vivo in mammals is unknown. We disrupted the murine Csn3 gene in three independent ways with insertional vectors, including constructing an approximately 3-Mb inversion chromosome. The heterozygous mice appeared normal, although the protein level was reduced. Csn3(-/-) embryos arrested after 5.5 days postcoitum (dpc) and resorbed by 8.5 dpc. Mutant embryos form an abnormal egg cylinder which does not gastrulate. They have reduced numbers of epiblast cells, mainly due to increased cell death. In the Csn3(-/-) mice, subunit 8 of the COP9 complex was not detected by immunohistochemical techniques, suggesting that the absence of Csn3 may disrupt the entire COP9 complex. Therefore, Csn3 is important for maintaining the integrity of the COP9 signalosome and is crucial to maintain the survival of epiblast cells and thus the development of the postimplantation embryo in mice.

    Funded by: NCI NIH HHS: P01CA75719

    Molecular and cellular biology 2003;23;19;6798-808

  • The BON domain: a putative membrane-binding domain.

    Yeats C and Bateman A

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK CB10 1SA.

    Trends in biochemical sciences 2003;28;7;352-5

  • New knowledge from old: in silico discovery of novel protein domains in Streptomyces coelicolor.

    Yeats C, Bentley S and Bateman A

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SA, UK.

    Background: Streptomyces coelicolor has long been considered a remarkable bacterium with a complex life-cycle, ubiquitous environmental distribution, linear chromosomes and plasmids, and a huge range of pharmaceutically useful secondary metabolites. Completion of the genome sequence demonstrated that this diversity carried through to the genetic level, with over 7000 genes identified. We sought to expand our understanding of this organism at the molecular level through identification and annotation of novel protein domains. Protein domains are the evolutionarily conserved units from which proteins are formed.

    Results: Two automated methods were employed to rapidly generate an optimised set of targets, which were subsequently analysed manually. A final set of 37 domains or structural repeats, represented 204 times in the genome, was developed. Using these families enabled us to correlate items of information from many different resources. Several immediately enhance our understanding both of S. coelicolor and also general bacterial molecular mechanisms, including cell wall biosynthesis regulation and streptomycete telomere maintenance.

    Discussion: Delineation of protein domain families enables detailed analysis of protein function, as well as identification of likely regions or residues of particular interest. Hence this kind of prior approach can increase the rate of discovery in the laboratory. Furthermore we demonstrate that using this type of in silico method it is possible to fairly rapidly generate new biological information from previously uncorrelated data.

    BMC microbiology 2003;3;3

  • Comparison of the A2 gene locus in Leishmania donovani and Leishmania major and its control over cutaneous infection.

    Zhang WW, Mendez S, Ghosh A, Myler P, Ivens A, Clos J, Sacks DL and Matlashewski G

    Department of Microbiology and Immunology, McGill University, Montreal, Quebec H3A 2B4, Canada.

    In Old World Leishmania infections, Leishmania donovani is responsible for fatal visceral leishmaniasis, and L. major is responsible for non-fatal cutaneous leishmaniasis in humans. The genetic differences between these species which govern the pathology or site of infection are not known. We have therefore carried out detailed analysis of the A2 loci in L. major and L. donovani because A2 is expressed in L. donovani but not L. major, and A2 is required for survival in visceral organs by L. donovani. We demonstrate that although L. major contains A2 gene regulatory sequences, the multiple repeats that exist in L. donovani A2 protein coding regions are absent in L. major, and the remaining corresponding A2 sequences appear to represent non-expressed pseudogenes. It was possible to restore amastigote-specific A2 expression to L. major, confirming that A2 regulatory sequences remain functional in L. major. Although L. major is a cutaneous parasite in rodents and humans, restoring A2 expression to L. major inhibited its ability to establish a cutaneous infection in susceptible BALB/c or resistant C57BL6 mice, a phenotype typical of L. donovani. There was no detectable cellular immune response against L. major after cutaneous infection with A2-expressing L. major, suggesting that the lack of growth was not attributable to acquired host resistance but to an A2-mediated suppression of parasite survival in skin macrophages. These observations argue that the lack of A2 expression in L. major contributed to its divergence from L. donovani with respect to the pathology of infection.

    The Journal of biological chemistry 2003;278;37;35508-15

  • Positional cloning of a quantitative trait locus on chromosome 13q14 that influences immunoglobulin E levels and asthma.

    Zhang Y, Leaves NI, Anderson GG, Ponting CP, Broxholme J, Holt R, Edser P, Bhattacharyya S, Dunham A, Adcock IM, Pulleyn L, Barnes PJ, Harper JI, Abecasis G, Cardon L, White M, Burton J, Matthews L, Mott R, Ross M, Cox R, Moffatt MF and Cookson WO

    Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK.

    Atopic or immunoglobulin E (IgE)-mediated diseases include the common disorders of asthma, atopic dermatitis and allergic rhinitis. Chromosome 13q14 shows consistent linkage to atopy and the total serum IgE concentration. We previously identified association between total serum IgE levels and a novel 13q14 microsatellite (USAT24G1; ref. 7) and have now localized the underlying quantitative-trait locus (QTL) in a comprehensive single-nucleotide polymorphism (SNP) map. We found replicated association to IgE levels that was attributed to several alleles in a single gene, PHF11. We also found association with these variants to severe clinical asthma. The gene product (PHF11) contains two PHD zinc fingers and probably regulates transcription. Distinctive splice variants were expressed in immune tissues and cells.

    Nature genetics 2003;34;2;181-6
