Sanger Institute - Publications 2004

Number of papers published in 2004: 224

  • Regions syntenic to human 17q are gained in mouse and rat neuroblastoma.

    Łastowska M, Chung YJ, Cheng Ching N, Haber M, Norris MD, Kees UR, Pearson AD and Jackson MS

    Institute of Human Genetics, International Centre for Life, University of Newcastle upon Tyne, Newcastle upon Tyne, United Kingdom.

    Gain of chromosome arm 17q is the most frequent chromosomal change in human neuroblastoma and is a powerful predictor of adverse outcome of disease. This suggests that the region of gain includes a gene or genes critical for tumor pathogenesis. Analyses of breakpoint positions have revealed that the shortest region of gain (SRG) extends from MPO (17q23.1) to 17qter. Because this encompasses >300 genes, it precludes the identification of candidate genes from human breakpoint data alone. However, mouse chromosome 11, which is syntenic to human chromosome 17, is gained in up to 30% of neuroblastoma tumors developed in a murine MYCN transgenic model of this disease. To confirm that this key genetic change indicates the involvement of a molecular pathway conserved between mouse and man and is not occurring coincidentally in the transgenic model, we used fluorescence in situ hybridization to analyze sporadic cases of both mouse and rat neuroblastoma. Our results confirmed the presence of chromosome 11 gain in all three of the mouse cell lines we analyzed, with the SRG extending from Stat5b (101.6 Mb) to tel. In addition, the rat neuroblastoma cell line harbors an extra copy of distal chromosome 10, extending from 92.8 to 109.3 Mb, which is also syntenic to human 17q. Comparison of the regions gained in all three species has excluded 4.2 Mb from the previously defined region of 17q gain in humans as a likely location of the candidate gene or genes, and strongly suggests that the molecular etiology of neuroblastoma is similar in all three species.

    Genes, chromosomes & cancer 2004;40;2;158-63

  • Microevolution and history of the plague bacillus, Yersinia pestis.

    Achtman M, Morelli G, Zhu P, Wirth T, Diehl I, Kusecek B, Vogler AJ, Wagner DM, Allender CJ, Easterday WR, Chenal-Francisque V, Worsham P, Thomson NR, Parkhill J, Lindler LE, Carniel E and Keim P

    Department of Molecular Biology, Max-Planck Institut für Infektionsbiologie, D-10117 Berlin, Germany.

    The association of historical plague pandemics with Yersinia pestis remains controversial, partly because the evolutionary history of this largely monomorphic bacterium was unknown. The microevolution of Y. pestis was therefore investigated by three different multilocus molecular methods, targeting genomewide synonymous SNPs, variation in number of tandem repeats, and insertion of IS100 insertion elements. Eight populations were recognized by the three methods, and we propose an evolutionary tree for these populations, rooted on Yersinia pseudotuberculosis. The tree invokes microevolution over millennia, during which enzootic pestoides isolates evolved. This initial phase was followed by a binary split 6,500 years ago, which led to populations that are more frequently associated with human disease. These populations do not correspond directly to classical biovars that are based on phenotypic properties. Thus, we recommend that henceforth groupings should be based on molecular signatures. The age of Y. pestis inferred here is compatible with the dates of historical pandemic plague. However, it is premature to infer an association between any modern molecular grouping and a particular pandemic wave that occurred before the 20th century.

    Funded by: NIGMS NIH HHS: R01-GM060795

    Proceedings of the National Academy of Sciences of the United States of America 2004;101;51;17837-42

  • Mutagenic insertion and chromosome engineering resource (MICER).

    Adams DJ, Biggs PJ, Cox T, Davies R, van der Weyden L, Jonkers J, Smith J, Plumb B, Taylor R, Nishijima I, Yu Y, Rogers J and Bradley A

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs, CB10 1SA, UK.

    Embryonic stem cell technology revolutionized biology by providing a means to assess mammalian gene function in vivo. Although it is now routine to generate mice from embryonic stem cells, one of the principal methods used to create mutations, gene targeting, is a cumbersome process. Here we describe the indexing of 93,960 ready-made insertional targeting vectors from two libraries. 5,925 of these vectors can be used directly to inactivate genes with an average targeting efficiency of 28%. Combinations of vectors from the two libraries can be used to disrupt both alleles of a gene or engineer larger genomic changes such as deletions, duplications, translocations or inversions. These indexed vectors constitute a public resource (Mutagenic Insertion and Chromosome Engineering Resource; MICER) for high-throughput, targeted manipulation of the mouse genome.

    Nature genetics 2004;36;8;867-71

  • An ENU-induced mutation in AP-2alpha leads to middle ear and ocular defects in Doarad mice.

    Ahituv N, Erven A, Fuchs H, Guy K, Ashery-Padan R, Williams T, de Angelis MH, Avraham KB and Steel KP

    Department of Human Genetics and Molecular Medicine, Sackler School of Medicine, Tel Aviv University, 69978, Tel Aviv, Israel.

    One of the advantages of N-ethyl-N-nitrosourea (ENU)-induced mutagenesis is that, after randomly causing point mutations, a variety of alleles can be generated in genes leading to diverse phenotypes. For example, transcription factor AP-2alpha (Tcfap2a) null homozygote mice show a large spectrum of developmental defects, among them missing middle ear bones and tympanic ring. This is the usual occurrence, where mutations causing middle ear anomalies usually coincide with other abnormalities. Using ENU-induced mutagenesis, we discovered a new dominant Tcfap2a mutant named Doarad (Dor) that has a missense mutation in the PY motif of its transactivation domain, leading to a misshapen malleus, incus, and stapes without any other observable phenotype. Dor homozygous mice die perinatally, showing prominent abnormal facial structures and ocular defects. In vitro assays suggest that this mutation causes a "gain of function" in the transcriptional activation of AP-2alpha. These mice enable us to address more specifically the developmental role of Tcfap2a in the eye and middle ear and are the first report of a mutation in a gene specifically causing middle ear abnormalities, leading to conductive hearing loss.

    Funded by: NIDCR NIH HHS: DE12728

    Mammalian genome : official journal of the International Mammalian Genome Society 2004;15;6;424-32

  • Expression profiling of the Leishmania life cycle: cDNA arrays identify developmentally regulated genes present but not annotated in the genome.

    Almeida R, Gilmartin BJ, McCann SH, Norrish A, Ivens AC, Lawson D, Levick MP, Smith DF, Dyall SD, Vetrie D, Freeman TC, Coulson RM, Sampaio I, Schneider H and Blackwell JM

    Cambridge Institute for Medical Research, Wellcome Trust/MRC Building, Addenbrooke's Hospital, Hills Road, Cambridge CB2 2XY, UK.

    As genomic sequencing of Leishmania nears completion, functional analyses that provide a global genetic perspective on biological processes are important. Despite polycistronic transcription, RNA transcript abundance can be measured using microarrays. To provide a resource to evaluate cDNA arrays, we undertook 5' expressed sequence tag analysis of 2183 full-length randomly selected cDNAs from Leishmania major promastigote (days 3, 7, 10 of culture in vitro), and lesion-derived amastigote libraries. PCR-amplified inserts from 1830 of these cDNAs representing 1001 unique genes were spotted onto microarrays, and compared internally with PCR-amplified open reading frames (ORFs) from 904 genes representing 842 unique genes annotated in the L. major genome. Microarrays were screened with RNA from procyclic, metacyclic and amastigote populations of L. major. Redundant clones on the array gave highly reproducible results, providing confidence in identification of stage-specific gene expression. Four hundred and thirty unique (i.e. non-redundant) stage-specific genes were identified. A higher percentage of stage-specific gene expression was observed in amastigotes (approximately 35%) compared to metacyclics (approximately 12%) for both cDNAs and ORFs, but cDNAs provided a richer source of regulated genes than currently annotated ORFs from the Leishmania genome. In mapping cDNAs onto the Leishmania genome, we noted that approximately 42% aligned to regions not recognised as genes using current predictive annotation tools. These genes are highly represented in our stage-specific genes, and therefore represent important drug targets and vaccine candidates. Careful annotation of cDNAs onto the Leishmania genome will be important before producing the next generation of oligonucleotide arrays based on annotated genes of the genomic sequencing project.

    Funded by: Wellcome Trust: 061343

    Molecular and biochemical parasitology 2004;136;1;87-100

  • SCOP database in 2004: refinements integrate structure and sequence family data.

    Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C and Murzin AG

    MRC Centre for Protein Engineering, Hills Road, Cambridge CB2 2QH, UK.

    The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. Protein domains in SCOP are hierarchically classified into families, superfamilies, folds and classes. The continual accumulation of sequence and structural data allows more rigorous analysis and provides important information for understanding the protein world and its evolutionary repertoire. SCOP participates in a project that aims to rationalize and integrate the data on proteins held in several sequence and structure databases. As part of this project, starting with release 1.63, we have initiated a refinement of the SCOP classification, which introduces a number of changes mostly at the levels below superfamily. The pending SCOP reclassification will be carried out gradually through a number of future releases. In addition to the expanded set of static links to external resources, available at the level of domain entries, we have started modernization of the interface capabilities of SCOP allowing more dynamic links with other databases. SCOP can be accessed at

    Nucleic acids research 2004;32;Database issue;D226-9

  • Strong positive selection and recombination drive the antigenic variation of the PilE protein of the human pathogen Neisseria meningitidis.

    Andrews TD and Gojobori T

    The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom.

    The PilE protein is the major component of the Neisseria meningitidis pilus, which is encoded by the pilE/pilS locus that includes an expressed gene and eight homologous silent fragments. The silent gene fragments have been shown to recombine through gene conversion with the expressed gene and thereby provide a means by which novel antigenic variants of the PilE protein can be generated. We have analyzed the evolutionary rate of the pilE gene using the nucleotide sequence of two complete pilE/pilS loci. The very high rate of evolution displayed by the PilE protein appears driven by both recombination and positive selection. Within the semivariable region of the pilE and pilS genes, recombination appears to occur within multiple small sequence blocks that lie between conserved sequence elements. Within the hypervariable region, positive selection was identified from comparison of the silent and expressed genes. The unusual gene conversion mechanism that operates at the pilE/pilS locus is a strategy employed by N. meningitidis to enhance mutation of certain regions of the PilE protein. The silent copies of the gene effectively allow "parallelized" evolution of pilE, thus enabling the encoded protein to rapidly explore a large area of sequence space in an effort to find novel antigenic variants.

    Genetics 2004;166;1;25-32

  • Chromosome 21 and Down syndrome: from genomics to pathophysiology.

    Antonarakis SE, Lyle R, Dermitzakis ET, Reymond A and Deutsch S

    Department of Genetic Medicine and Development, University of Geneva Medical School and University Hospitals of Geneva, 1 rue Michel-Servet, 1211 Geneva, Switzerland.

    The sequence of chromosome 21 was a turning point for the understanding of Down syndrome. Comparative genomics is beginning to identify the functional components of the chromosome and that in turn will set the stage for the functional characterization of the sequences. Animal models combined with genome-wide analytical methods have proved indispensable for unravelling the mysteries of gene dosage imbalance.

    Nature reviews. Genetics 2004;5;10;725-38

  • Domain insertions in protein structures.

    Aroul-Selvam R, Hubbard T and Sasidharan R

    The Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Domains are the structural, functional or evolutionary units of proteins. Proteins can comprise a single domain or a combination of domains. In multi-domain proteins, the domains almost always occur end-to-end, i.e., one domain follows the C-terminal end of another domain. However, there are exceptions to this common pattern, where multi-domain proteins are formed by insertion of one domain (insert) into another domain (parent). Here, we provide a quantitative description of known insertions in the Protein Data Bank (PDB). We found that 9% of domain combinations observed in non-redundant PDB are insertions. Although 90% of all insertions involve only one insert, proteins can clearly have multiple (nested, two-domain and three-domain) inserts. We also observed correlations between the structure and function of a domain and its tendency to be found as a parent or an insert. There is a bias in insert position towards the C terminus of parents. We observed that the atomic distance between the N and C terminus of an insert is significantly smaller when compared to the N-to-C distance in a parent context or a single domain context. Insertions are found always to occur in loop regions of parent domains. Our observations regarding the relationship between domain insertions and the structure, function and evolution of proteins have implications for protein engineering.

    Journal of molecular biology 2004;338;4;633-41

  • A predominantly neolithic origin for Y-chromosomal DNA variation in North Africa.

    Arredi B, Poloni ES, Paracchini S, Zerjal T, Fathallah DM, Makrelouf M, Pascali VL, Novelletto A and Tyler-Smith C

    Istituto di Medicina Legale, Università Cattolica del Sacro Cuore di Roma, Rome, Italy.

    We have typed 275 men from five populations in Algeria, Tunisia, and Egypt with a set of 119 binary markers and 15 microsatellites from the Y chromosome, and we have analyzed the results together with published data from Moroccan populations. North African Y-chromosomal diversity is geographically structured and fits the pattern expected under an isolation-by-distance model. Autocorrelation analyses reveal an east-west cline of genetic variation that extends into the Middle East and is compatible with a hypothesis of demic expansion. This expansion must have involved relatively small numbers of Y chromosomes to account for the reduction in gene diversity towards the West that accompanied the frequency increase of Y haplogroup E3b2, but gene flow must have been maintained to explain the observed pattern of isolation-by-distance. Since the estimates of the times to the most recent common ancestor (TMRCAs) of the most common haplogroups are quite recent, we suggest that the North African pattern of Y-chromosomal variation is largely of Neolithic origin. Thus, we propose that the Neolithic transition in this part of the world was accompanied by demic diffusion of Afro-Asiatic-speaking pastoralists from the Middle East.

    American journal of human genetics 2004;75;2;338-45

  • Caenorhabditis elegans functional genomics: omic resonance.

    Astin J, Merry A, Rajan J and Kuwabara PE

    Department of Biochemistry, University of Bristol, The School of Medical Sciences, University Walk, Bristol BS8 1TD, UK.

    The nematode Caenorhabditis elegans is widely used as a model organism for studying many fundamental aspects of development and cell biology, including processes underlying human disease. The genome of C. elegans encodes over 19,000 protein-coding genes and hundreds of non-coding RNAs. The availability of whole genome sequence has facilitated the development of high throughput techniques for elucidating the function of individual genes and gene products. Furthermore, attempts can now be made to integrate these substantial functional genomics data collections and to understand at a global level how the flow of genomic information that is at the core of the central dogma leads to the development of a multicellular organism.

    Briefings in functional genomics & proteomics 2004;3;1;26-34

  • The knockout mouse project.

    Austin CP, Battey JF, Bradley A, Bucan M, Capecchi M, Collins FS, Dove WF, Duyk G, Dymecki S, Eppig JT, Grieder FB, Heintz N, Hicks G, Insel TR, Joyner A, Koller BH, Lloyd KC, Magnuson T, Moore MW, Nagy A, Pollock JD, Roses AD, Sands AT, Seed B, Skarnes WC, Snoddy J, Soriano P, Stewart DJ, Stewart F, Stillman B, Varmus H, Varticovski L, Verma IM, Vogt TF, von Melchner H, Witkowski J, Woychik RP, Wurst W, Yancopoulos GD, Young SG and Zambrowicz B

    National Human Genome Research Institute, National Institutes of Health, Building 31, Room 4B09, 31 Center Drive, Bethesda, Maryland 20892, USA.

    Mouse knockout technology provides a powerful means of elucidating gene function in vivo, and a publicly available genome-wide collection of mouse knockouts would be significantly enabling for biomedical discovery. To date, published knockouts exist for only about 10% of mouse genes. Furthermore, many of these are limited in utility because they have not been made or phenotyped in standardized ways, and many are not freely available to researchers. It is time to harness new technologies and efficiencies of production to mount a high-throughput international effort to produce and phenotype knockouts for all mouse genes, and place these resources into the public domain.

    Funded by: Wellcome Trust: 077188

    Nature genetics 2004;36;9;921-4

  • The European dimension for the mouse genome mutagenesis program.

    Auwerx J, Avner P, Baldock R, Ballabio A, Balling R, Barbacid M, Berns A, Bradley A, Brown S, Carmeliet P, Chambon P, Cox R, Davidson D, Davies K, Duboule D, Forejt J, Granucci F, Hastie N, de Angelis MH, Jackson I, Kioussis D, Kollias G, Lathrop M, Lendahl U, Malumbres M, von Melchner H, Müller W, Partanen J, Ricciardi-Castagnoli P, Rigby P, Rosen B, Rosenthal N, Skarnes B, Stewart AF, Thornton J, Tocchini-Valentini G, Wagner E, Wahli W and Wurst W

    Mouse Clinical Institute (MCI), Illkirch, Strasbourg, France [corrected].

    The European Mouse Mutagenesis Consortium is the European initiative contributing to the international effort on functional annotation of the mouse genome. Its objectives are to establish and integrate mutagenesis platforms, gene expression resources, phenotyping units, storage and distribution centers and bioinformatics resources. The combined efforts will accelerate our understanding of gene function and of human health and disease.

    Funded by: Medical Research Council: MC_U127527203; Telethon: TGM03S01, TGM06S01; Wellcome Trust: 077188

    Nature genetics 2004;36;9;925-7

  • The Burkholderia cepacia epidemic strain marker is part of a novel genomic island encoding both virulence and metabolism-associated genes in Burkholderia cenocepacia.

    Baldwin A, Sokol PA, Parkhill J and Mahenthiralingam E

    Cardiff School of Biosciences, Cardiff University, Cardiff CF10 3TL, Wales, Canada TN2 4N1.

    The Burkholderia cepacia epidemic strain marker (BCESM) is a useful epidemiological marker for virulent B. cenocepacia strains that infect patients with cystic fibrosis. However, there was no evidence that the original marker, identified by random amplified polymorphic DNA fingerprinting, contributed to pathogenicity. Here we demonstrate that the BCESM is part of a novel genomic island encoding genes linked to both virulence and metabolism. The BCESM was present on a 31.7-kb low-GC-content island that encoded 35 predicted coding sequences (CDSs): an N-acyl homoserine lactone (AHL) synthase gene (cciI) and corresponding transcriptional regulator (cciR), representing the first time cell signaling genes have been found on a genomic island; fatty acid biosynthesis genes; an IS66 family transposase; transcriptional regulator CDSs; amino acid metabolism genes; and a group of hypothetical genes. Mutagenesis of the AHL synthase, amidase (amiI), and porin (opcI) genes on the island was carried out. Testing of the isogenic mutants in a rat model of chronic lung infection demonstrated that the amidase played a role in persistence, while the AHL synthase and porin were both involved in virulence. The island, designated the B. cenocepacia island (cci), is the first genomic island to be defined in the B. cepacia complex and its discovery validates the original epidemiological correlation of the BCESM with virulent CF strains. The features of the cci, which overlap both pathogenicity and metabolism, expand the concept of bacterial pathogenicity islands and illustrate the diversity of accessory functions that can be acquired by lateral gene transfer in bacteria.

    Infection and immunity 2004;72;3;1537-47

  • The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website.

    Bamford S, Dawson E, Forbes S, Clements J, Pettett R, Dogan A, Flanagan A, Teague J, Futreal PA, Stratton MR and Wooster R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The discovery of mutations in cancer genes has advanced our understanding of cancer. These results are dispersed across the scientific literature and with the availability of the human genome sequence will continue to accrue. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website have been developed to store somatic mutation data in a single location and display the data and other information related to human cancer. To populate this resource, data has currently been extracted from reports in the scientific literature for somatic mutations in four genes, BRAF, HRAS, KRAS2 and NRAS. At present, the database holds information on 66 634 samples and reports a total of 10 647 mutations. Through the web pages, these data can be queried, displayed as figures or tables and exported in a number of formats. COSMIC is an ongoing project that will continue to curate somatic mutation data and release it through the website.

    British journal of cancer 2004;91;2;355-8

  • Bioinformatics of proteases in the MEROPS database.

    Barrett AJ

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    Proteolytic enzymes represent approximately 2% of the total number of proteins present in all types of organisms. Many of these enzymes are of medical importance, and those that are of potential interest as drug targets can be divided into the endogenous enzymes encoded in the human genome, and the exogenous proteases encoded in the genomes of disease-causing organisms. There are also naturally occurring inhibitors of proteases, some of which have pharmaceutical relevance. The MEROPS database provides a rich source of information on proteases and their inhibitors. Storage and retrieval of this information is facilitated by the use of a hierarchical classification system (which was pioneered by the compilers of the database) in which homologous proteases and their inhibitors are divided into clans and families.

    Current opinion in drug discovery & development 2004;7;3;334-41

  • The Pfam protein families database.

    Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C and Eddy SR

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Pfam is a large collection of protein families and domains. Over the past 2 years the number of families in Pfam has doubled and now stands at 6190 (version 10.0). Methodology improvements for searching the Pfam collection locally as well as via the web are described. Other recent innovations include modelling of discontinuous domains allowing Pfam domain definitions to be closer to those found in structure databases. Pfam is available on the web in the UK, the USA, France and Sweden.

    Nucleic acids research 2004;32;Database issue;D138-41

  • Genome sequence of the enterobacterial phytopathogen Erwinia carotovora subsp. atroseptica and characterization of virulence factors.

    Bell KS, Sebaihia M, Pritchard L, Holden MT, Hyman LJ, Holeva MC, Thomson NR, Bentley SD, Churcher LJ, Mungall K, Atkin R, Bason N, Brooks K, Chillingworth T, Clark K, Doggett J, Fraser A, Hance Z, Hauser H, Jagels K, Moule S, Norbertczak H, Ormond D, Price C, Quail MA, Sanders M, Walker D, Whitehead S, Salmond GP, Birch PR, Parkhill J and Toth IK

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    The bacterial family Enterobacteriaceae is notable for its well studied human pathogens, including Salmonella, Yersinia, Shigella, and Escherichia spp. However, it also contains several plant pathogens. We report the genome sequence of a plant pathogenic enterobacterium, Erwinia carotovora subsp. atroseptica (Eca) strain SCRI1043, the causative agent of soft rot and blackleg potato diseases. Approximately 33% of Eca genes are not shared with sequenced enterobacterial human pathogens, including some predicted to facilitate unexpected metabolic traits, such as nitrogen fixation and opine catabolism. This proportion of genes also contains an overrepresentation of pathogenicity determinants, including possible horizontally acquired gene clusters for putative type IV secretion and polyketide phytotoxin synthesis. To investigate whether these gene clusters play a role in the disease process, an arrayed set of insertional mutants was generated, and mutations were identified. Plant bioassays showed that these mutants were significantly reduced in virulence, demonstrating both the presence of novel pathogenicity determinants in Eca, and the impact of functional genomics in expanding our understanding of phytopathogenicity in the Enterobacteriaceae.

    Proceedings of the National Academy of Sciences of the United States of America 2004;101;30;11105-10

  • Characterization of the imprinted polycomb gene L3MBTL, a candidate 20q tumour suppressor gene, in patients with myeloid malignancies.

    Bench AJ, Li J, Huntly BJ, Delabesse E, Fourouclas N, Hunt AR, Deloukas P and Green AR

    Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge CB2 2XY, UK.

    Chromosome 20q deletion is a recurrent chromosomal abnormality associated with myeloid malignancies. L3MBTL represents a strong candidate tumour suppressor gene since it lies within the common deleted region, is a member of the Polycomb-like family, encodes the human homologue of a Drosophila tumour suppressor and is expressed within haematopoietic progenitor cells. We describe the structure of L3MBTL, identify two putative promoters each associated with two CpG islands and characterize a complex pattern of alternative splicing events. Mutation analysis of the gene in patients with and without a 20q deletion identified several polymorphisms but no acquired mutations. The two CpG islands spanning promoter 2 undergo monoallelic methylation in normal haematopoietic cells consistent with imprinting of L3MBTL. Samples from patients with a 20q deletion retained either the methylated or unmethylated allele but retention of the methylated allele did not correlate with reduction in L3MBTL mRNA levels. The absence of a correlation between L3MBTL methylation and transcription could be shown to reflect loss of imprinting in one patient. In addition, our results demonstrate that inactivation of L3MBTL is not a common occurrence in patients with a 20q deletion or in cytogenetically normal patients with polycythaemia vera.

    British journal of haematology 2004;127;5;509-18

  • Genomes for medicine.

    Bentley DR

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    We have the human genome sequence. It is freely available, accurate and nearly complete. But is the genome ready for medicine? The new resource is already changing genetic research strategies to find information of medical value. Now we need high-quality annotation of all the functionally important sequences and the variations within them that contribute to health and disease. To achieve this, we need more genome sequences, systematic experimental analyses, and extensive information on human phenotypes. Flexible and user-friendly access to well-annotated genomes will create an environment for innovation, and the potential for unlimited use of sequencing in biomedical research and practice.

    Nature 2004;429;6990;440-5

  • Genomic pot pourri.

    Bentley S, Crossman L, Cerdeño-Tárraga A and Parkhill J

    Nature reviews. Microbiology 2004;2;12;928-9

  • Comparative genomic structure of prokaryotes.

    Bentley SD and Parkhill J

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    Recent advances in DNA-sequencing technologies have made available an enormous resource of data for the study of bacterial genomes. The broad sample of complete genomes currently available allows us to look at variation in the gross features and characteristics of genomes while the detail of the sequences reveal some of the mechanisms by which these genomes evolve. This review aims to describe bacterial genome structures according to current knowledge and proposed hypotheses. We also describe examples where mechanisms of genome evolution have acted in the adaptation of bacterial species to particular niches.

    Annual review of genetics 2004;38;771-92

  • SCP1, a 356,023 bp linear plasmid adapted to the ecology and developmental biology of its host, Streptomyces coelicolor A3(2).

    Bentley SD, Brown S, Murphy LD, Harris DE, Quail MA, Parkhill J, Barrell BG, McCormick JR, Santamaria RI, Losick R, Yamasaki M, Kinashi H, Chen CW, Chandra G, Jakimowicz D, Kieser HM, Kieser T and Chater KF

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    The sequencing of the entire genetic complement of Streptomyces coelicolor A3(2) has been completed with the determination of the 356,023 bp sequence of the linear plasmid SCP1. Remarkably, the functional distribution of SCP1 genes somewhat resembles that of the chromosome: predicted gene products/functions include ECF sigma factors, antibiotic biosynthesis, a gamma-butyrolactone signalling system, members of the actinomycete-specific Wbl class of regulatory proteins and 14 secreted proteins. Some of these genes are among the 18 that contain a TTA codon, making them targets for the developmentally important tRNA encoded by the bldA gene. RNA analysis and gene fusions showed that one of the TTA-containing genes is part of a large bldA-dependent operon, the gene products of which include three proteins isolated from the spore surface by detergent washing (SapC, D and E), and several probable metabolic enzymes. SCP1 shows much evidence of recombinational interactions with other replicons and transposable elements during its history. For example, it has two sets of partitioning genes (which may explain why an integrated copy of SCP1 partially suppressed the defective partitioning of a parAB-deleted chromosome during sporulation). SCP1 carries a cluster of probable transfer determinants and genes encoding likely DNA polymerase III subunits, but it lacks an obvious candidate gene for the terminal protein associated with its ends. This may be related to atypical features of its end sequences.

    Funded by: PHS HHS: F32 G12961

    Molecular microbiology 2004;51;6;1615-28

  • Data mining parasite genomes.

    Berriman M

    Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, CB10 1SA, UK.

    The term 'data mining' can be used to describe any process where useful information is extracted from data with a large background of 'noise'. In the context of a genome project, several stages involve data mining. Amongst the sequence data, 'signals' need to be detected that indicate the presence of interesting features. Often this involves differentiating between transcribed and non-transcribed bases to predict coding regions. After detection, defining the roles of these sequences involves sifting through multiple lines of evidence. If these roles are accurately reflected in genome annotation, they can be used by researchers to frame queries and interrogate the data further.

    Parasitology 2004;128 Suppl 1;S23-31

  • Annotation of parasite genomes.

    Berriman M and Harris M

    Pathogen Sequencing Unit, Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK.

    Genome annotation is the application of useful biological descriptions to sequence data. Different levels of time and effort can be invested to produce correspondingly different depths of annotation depending on what methods are employed. Researchers using genome data should, therefore, understand how annotations are generated to assess their validity correctly and to determine what level of inferences can be made accordingly. Thorough annotation requires a large range of procedures, most of which involve manual reviews of all available evidence. First, gene structures often are computed algorithmically and edited based on in-depth analyses of the underlying sequence data. Second, functional predictions draw on data from various sources. Finally, the use of structured and controlled descriptions, such as those provided by gene ontology, can be used so that final descriptions are not only consistent and unambiguous, but capable of being used in further downstream analyses such as cross-species comparisons.

    Methods in molecular biology (Clifton, N.J.) 2004;270;17-44

  • Curation of the Plasmodium falciparum genome.

    Berry AE, Gardner MJ, Caspers GJ, Roos DS and Berriman M

    Pathogen Sequencing Unit, Wellcome Trust Sanger Institute, Hinxton Hall, Cambridge CB10 1SA, UK.

    The malaria genome has proved invaluable to researchers worldwide in the continuing fight against malaria by stimulating and underpinning molecular approaches in gene expression studies, vaccine and drug discovery research, and by providing data to facilitate hypothesis-driven research. The combination of in silico and experimental investigations has already yielded dividends by strengthening our understanding of the many facets of the malaria parasite Plasmodium falciparum. The recently initiated curation of the genome resource is a vital investment for maintaining and enhancing the use of this genomic information in the post-genomic era.

    Trends in parasitology 2004;20;12;548-52

  • High-resolution analysis of DNA copy number using oligonucleotide microarrays.

    Bignell GR, Huang J, Greshock J, Watt S, Butler A, West S, Grigorova M, Jones KW, Wei W, Stratton MR, Futreal PA, Weber B, Shapero MH and Wooster R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.

    Genomic copy number alterations are a feature of many human diseases including cancer. We have evaluated the effectiveness of an oligonucleotide array, originally designed to detect single-nucleotide polymorphisms, to assess DNA copy number. We first showed that fluorescent signal from the oligonucleotide array varies in proportion to both decreases and increases in copy number. Subsequently we applied the system to a series of 20 cancer cell lines. All of the putative homozygous deletions (10) and high-level amplifications (12; putative copy number >4) tested were confirmed by PCR (either qPCR or normal PCR) analysis. Low-level copy number changes for two of the lines under analysis were compared with BAC array CGH; 77% (n = 44) of the autosomal chromosomes used in the comparison showed consistent patterns of LOH (loss of heterozygosity) and low-level amplification. Of the remaining 10 comparisons that were discordant, eight were caused by low SNP densities and failed in both lines. The studies demonstrate that combining the genotype and copy number analyses gives greater insight into the underlying genetic alterations in cancer cells with identification of complex events including loss and reduplication of loci.

    Genome research 2004;14;2;287-95

  • BAC finishing strategies.

    Bird C and Grafham D

    Wellcome Trust Genome Campus, The Sanger Institute, Cambridge, UK.

    Methods in molecular biology (Clifton, N.J.) 2004;255;255-77

  • Biological database design and implementation.

    Birney E and Clamp M

    EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

    We present our experience of building biological databases. Such databases have most aspects in common with other complex databases in other fields. We do not believe that biological data are that different from complex data in other fields. Our experience has led us to emphasise simplicity and conservative technology choices when building these databases. This is a short paper of advice that we hope is useful to people designing their own biological database.

    Briefings in bioinformatics 2004;5;1;31-8

  • Ensembl 2004.

    Birney E, Andrews D, Bevan P, Caccamo M, Cameron G, Chen Y, Clarke L, Coates G, Cox T, Cuff J, Curwen V, Cutts T, Down T, Durbin R, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz H, Iyer V, Kahari A, Jekosch K, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark C, Clamp M and Hubbard T

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    The Ensembl database project provides a bioinformatics framework to organize biology around the sequences of large genomes. It is a comprehensive and integrated source of annotation of large genome sequences, available via interactive website, web services or flat files. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. The facilities of the system range from sequence analysis to data storage and visualization and installations exist around the world both in companies and at academic sites. With a total of nine genome sequences available from Ensembl and more genomes to follow, recent developments have focused mainly on closer integration between genomes and external data.

    Nucleic acids research 2004;32;Database issue;D468-70

  • An overview of Ensembl.

    Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L, Coates G, Cuff J, Curwen V, Cutts T, Down T, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz HR, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark KC, Cameron G, Durbin R, Cox A, Hubbard T and Clamp M

    EMBL European Bioinformatics Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    Ensembl is a bioinformatics project to organize biological information around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of individual genomes, and of the synteny and orthology relationships between them. It is also a framework for integration of any biological data that can be mapped onto features derived from the genomic sequence. Ensembl is available as an interactive Web site, a set of flat files, and as a complete, portable open source software system for handling genomes. All data are provided without restriction, and code is freely available. Ensembl's aims are to continue to "widen" this biological integration to include other model organisms relevant to understanding human biology as they become available; to "deepen" this integration to provide an ever more seamless linkage between equivalent components in different species; and to provide further classification of functional elements in the genome that have been previously elusive.

    Funded by: Wellcome Trust: 062023

    Genome research 2004;14;5;925-8

  • GeneWise and Genomewise.

    Birney E, Clamp M and Durbin R

    The European Bioinformatics Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    We present two algorithms in this paper: GeneWise, which predicts gene structure using similar protein sequences, and Genomewise, which provides a gene structure final parse across cDNA- and EST-defined spliced structure. Both algorithms are heavily used by the Ensembl annotation system. The GeneWise algorithm was developed from a principled combination of hidden Markov models (HMMs). Both algorithms are highly accurate and can provide both accurate and complete gene structures when used with the correct evidence.

    Genome research 2004;14;5;988-95

  • A survey of RNA editing in human brain.

    Blow M, Futreal PA, Wooster R and Stratton MR

    Cancer Genome Project, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    We have conducted a survey of RNA editing in human brain by comparing sequences of clones from a human brain cDNA library to the reference human genome sequence and to genomic DNA from the same individual. In the RNA sample from which the library was constructed, approximately 1:2000 nucleotides were edited out of >3 Mb surveyed. All edits were adenosine to inosine (A-->I) and were predominantly in intronic and in intergenic RNAs. No edits were found in translated exons and few in untranslated exons. Most edits were in high-copy-number repeats, usually Alus. Analysis of the genome in the vicinity of edited sequences strongly supports the idea that formation of intramolecular double-stranded RNA with an inverted copy underlies most A-->I editing. The likelihood of editing is increased by the presence of two inverted copies of a sequence within the same intron, proximity of the two sequences to each other (preferably within 2 kb), and by a high density of inverted copies in the vicinity. Editing exhibits sequence preferences and is less likely at an adenosine 3' to a guanosine and more likely at an adenosine 5' to a guanosine. Simulation by BLAST alignment of the double-stranded RNA molecules that underlie known edits indicates that there is a greater likelihood of A-->I editing at A:C mismatches than editing at other mismatches or at A:U matches. However, because A:U matches in double-stranded RNA are more common than all mismatches, overall the likely effect of editing is to increase the number of mismatches in double-stranded RNA.

    Genome research 2004;14;12;2379-87

  • NIPBL mutations and genetic heterogeneity in Cornelia de Lange syndrome.

    Borck G, Redon R, Sanlaville D, Rio M, Prieur M, Lyonnet S, Vekemans M, Carter NP, Munnich A, Colleaux L and Cormier-Daire V

    INSERM U393 and Département de Génétique Médicale, Hôpital Necker - Enfants Malades, Paris, France.

    Journal of medical genetics 2004;41;12;e128

  • Genome-wide screening using automated fluorescent genotyping to detect cryptic cytogenetic abnormalities in children with idiopathic syndromic mental retardation.

    Borck G, Rio M, Sanlaville D, Redon R, Molinari F, Bacq D, Raoul O, Cormier-Daire V, Lyonnet S, Amiel J, Le Merrer M, de Blois MC, Prieur M, Vekemans M, Carter NP, Munnich A and Colleaux L

    INSERM U393 et Département de Génétique, Hôpital Necker-Enfants Malades, Paris, France.

    Mental retardation (MR) is the most common developmental disability, affecting approximately 2% of the population. The causes of MR are diverse and poorly understood, but chromosomal rearrangements account for 4-28% of cases, and duplications/deletions smaller than 5 Mb are known to cause syndromic MR. We have previously developed a strategy based on automated fluorescent microsatellite genotyping to test for telomere integrity. This strategy detected about 10% of cryptic subtelomeric rearrangements in patients with idiopathic syndromic MR. Because telomere screening is a first step toward the goal of analyzing the entire genome for chromosomal rearrangements in MR, we have extended our strategy to 400 markers evenly distributed along the chromosomes to detect interstitial anomalies. Among 97 individuals tested, three anomalies were found: two deletions (one in three siblings) and one uniparental disomy. These results emphasize the value of a genome-wide microsatellite scan for the detection of interstitial aberrations and demonstrate that automated genotyping is a sensitive method that not only detects small interstitial rearrangements and their parental origin but also provides a unique opportunity to detect uniparental disomies. This study will hopefully contribute to the delineation of new contiguous gene syndromes and the identification of new imprinted regions.

    Clinical genetics 2004;66;2;122-7

  • The ingi and RIME non-LTR retrotransposons are not randomly distributed in the genome of Trypanosoma brucei.

    Bringaud F, Biteau N, Zuiderwijk E, Berriman M, El-Sayed NM, Ghedin E, Melville SE, Hall N and Baltz T

    Laboratoire de Génomique Fonctionnelle des Trypanosomatides, UMR-5162 CNRS, Université Victor Segalen Bordeaux II, Bordeaux, France.

    The ingi (long and autonomous) and RIME (short and nonautonomous) non-long-terminal repeat retrotransposons are the most abundant mobile elements characterized to date in the genome of the African trypanosome Trypanosoma brucei. These retrotransposons were thought to be randomly distributed, but a detailed and comprehensive analysis of their genomic distribution had not been performed until now. To address this question, we analyzed the ingi/RIME sequences and flanking sequences from the ongoing T. brucei genome sequencing project (TREU927/4 strain). Among the 81 ingi/RIME elements analyzed, 60% are complete, and 7% of the ingi elements (approximately 15 copies per haploid genome) appear to encode for their own transposition. The size of the direct repeat flanking the ingi/RIME retrotransposons is conserved (i.e., 12-bp), and a strong 11-bp consensus pattern precedes the 5'-direct repeat. The presence of a consensus pattern upstream of the retroelements was confirmed by the analysis of the base occurrence in 294 GSS containing 5'-adjacent ingi/RIME sequences. The conserved sequence is present upstream of ingis and RIMEs, suggesting that ingi-encoded enzymatic activities are used for retrotransposition of RIMEs, which are short nonautonomous retroelements. In conclusion, the ingi and RIME retroelements are not randomly distributed in the genome of T. brucei and are preceded by a conserved sequence, which may be the recognition site of the ingi-encoded endonuclease.

    Molecular biology and evolution 2004;21;3;520-8

  • Micro-geographical differentiation in Northern Iberia revealed by Y-chromosomal DNA analysis.

    Brion M, Quintans B, Zarrabeitia M, Gonzalez-Neira A, Salas A, Lareu V, Tyler-Smith C and Carracedo A

    Institute of Legal Medicine, University of Santiago de Compostela, San Francisco s/n., 15782 Santiago de Compostela, Spain.

    Y-chromosome diversity has been analyzed at a micro-geographical level, examining 10 binary polymorphisms and 7 short tandem repeats (STRs) in 443 samples belonging to 11 populations from two regions of Northern Spain, Galicia and Cantabria. Both regions, as a whole, cluster with other Iberian populations. However, some individual populations, particularly that from the Pas Valley in Cantabria, depart markedly from this general pattern, with higher genetic distances and reduced diversity. This unusual population is even more distinct than the Basques from their Iberian neighbors. Genetic drift in a small isolated population could explain this special behavior, and in addition to its anthropological interest, this finding has important forensic implications.

    Gene 2004;329;17-25

  • Variation in the effectors of the type III secretion system among Photorhabdus species as revealed by genomic analysis.

    Brugirard-Ricaud K, Givaudan A, Parkhill J, Boemare N, Kunst F, Zumbihl R and Duchaud E

    Laboratoire EMIP Ecologie Microbienne des Insectes et Interaction Hôte-Pathogène, Université de Montpellier II, UMR1133 INRA-UMII, 34095 Montpellier 5, France.

    Entomopathogenic bacteria of the genus Photorhabdus harbor a type III secretion system. This system was probably acquired prior to the separation of the species within this genus. Furthermore, the core components of the secretion machinery are highly conserved but the predicted effectors differ between Photorhabdus luminescens and P. asymbiotica, two highly related species with different hosts.

    Journal of bacteriology 2004;186;13;4376-81

  • Neurospheres: insights into neural stem cell biology.

    Campos LS

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.

    Neural stem cells (NSC) are a tissue-specific subtype of self-renewing and multipotent cells that can give rise to all neural populations. In this review, the importance of maintaining cell-cell contacts in the study of NSC is highlighted, and data obtained from some crucial single-cell studies is compared to results obtained from neurospheres, where aggregates of NSC are grown in suspension. In particular, results that indicate how this culture system may be well suited to analyze NSC plasticity, cell-cell, and cell-extracellular matrix (ECM) interactions are pointed out, and the hypothesis that cell-cell and cell-ECM contacts may be essential for NSC maintenance, survival, and proliferation is highlighted. Finally, it is suggested that neurospheres might play a role in the study of context-dependent behavior of NSC in niches by providing a system where NSC can be challenged chemically or biologically and analyzed in vitro, in a time- and context-dependent manner.

    Journal of neuroscience research 2004;78;6;761-9

  • As normal as normal can be?

    Carter NP

    Two papers report that large-scale copy-number variations, ranging in size from 100 kb to 2 Mb, are distributed widely throughout the human genome, and that a high proportion of them encompass known genes. This unexpected level of genome variation has implications for our view of human genetic diversity and phenotypic variation.

    Nature genetics 2004;36;9;931-2

  • Applications of genomic microarrays to explore human chromosome structure and function.

    Carter NP and Vetrie D

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK.

    The combination of genomic microarrays with comparative genomic hybridization and with chromatin immunoprecipitation is providing an increasingly detailed view of the way in which the human genome is organized and functions and how disorganization and dysfunction can lead to disease. These studies are enhanced by the flexibility of array technology, allowing resolutions from coverage of the whole genome using 200 kb cloned DNA inserts to detailed analysis using PCR products or oligonucleotides of 100 bp or less. In particular, the use of chromatin immunoprecipitation is providing new insights into chromosome structure and gene regulation and control through the analysis of protein-DNA interactions.

    Human molecular genetics 2004;13 Spec No 2;R297-302

  • Molecular characterization and evolution of X and Y-borne ATRX homologues in American marsupials.

    Carvalho-Silva DR, O'Neill RJ, Brown JD, Huynh K, Waters PD, Pask AJ, Delbridge ML and Graves JA

    Research School of Biological Science, Australian National University, ACT 0200, Canberra, Australia.

    In eutherians, the sex-reversing ATRX gene on the X has no homologue on the Y chromosome. However, testis-specific and ubiquitously expressed X-borne genes have been identified in Australian marsupials. We studied nucleotide sequence and chromosomal location of ATRX homologues in two American marsupials, the opossums Didelphis virginiana and Monodelphis domestica. A PCR fragment of M. domestica ATRX was used to probe Southern blots and to screen male genomic libraries. Southern analysis demonstrated ATRX homologues on both X and Y in D. virginiana, and two clones were isolated which hybridized to a single position on the Y chromosome in male-derived cells but to multiple sites of the X in female cells. In M. domestica, there was a single clone that mapped to the X but not to the Y, suggesting that it represents the M. domestica ATRX. However a male-specific band was detected in Southern blots probed with the D. virginiana ATRY and with a mouse ATRX clone, which implies that the Y copy in M. domestica has diverged further from other ATRX homologues. Thus there appears to be a Y-borne copy of ATRY in American, as well as Australian marsupials, although it has diverged in sequence, as have other Y genes that are testis-specific in both eutherian and marsupial lineages.

    Chromosome research : an international journal on the molecular, supramolecular and evolutionary aspects of chromosome biology 2004;12;8;795-804

  • New environments, versatile genomes.

    Cerdeño-Tárraga A, Crossman L and Parkhill J

    The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    Nature reviews. Microbiology 2004;2;6;446-7

  • Pathogens in decay.

    Cerdeño-Tárraga A, Thomson N and Parkhill J

    Nature reviews. Microbiology 2004;2;10;774-5

  • Knock-in human rhodopsin-GFP fusions as mouse models for human disease and targets for gene therapy.

    Chan F, Bradley A, Wensel TG and Wilson JH

    Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, USA.

    The human rhodopsin gene is the locus for numerous alleles linked to the neurodegenerative disease retinitis pigmentosa. To facilitate the study of retinal degeneration and to test reagents designed to alter the structure and function of this gene, we have developed strains of mice whose native rhodopsin gene has been replaced with the corresponding human DNA modified to encode an enhanced GFP fusion at the C terminus of rhodopsin. The human rhodopsin-GFP fusion faithfully mimics the expression and distribution of wild-type rhodopsin in heterozygotes and serves as a sensitive reporter of rod-cell structure and integrity. In homozygotes, however, the gene induces progressive retinal degeneration bearing many of the hallmarks of recessive retinitis pigmentosa. When the gene is flanked by recognition sites for Cre recombinase, protein expression is reduced approximately 5-fold despite undiminished mRNA levels, suggesting translation inhibition. GFP-tagged human rhodopsin provides a sensitive method to monitor the development of normal and diseased retinas in dissected samples, and it offers a noninvasive means to observe the progress of retinal degeneration and the efficacy of gene-based therapies in whole animals.

    Funded by: NEI NIH HHS: EY002520, EY11731

    Proceedings of the National Academy of Sciences of the United States of America 2004;101;24;9109-14

  • Analysis of multiple genomic sequence alignments: a web resource, online tools, and lessons learned from analysis of mammalian SCL loci.

    Chapman MA, Donaldson IJ, Gilbert J, Grafham D, Rogers J, Green AR and Göttgens B

    Cambridge Institute for Medical Research, Cambridge, CB2 2XY, UK.

    Comparative analysis of genomic sequences is becoming a standard technique for studying gene regulation. However, only a limited number of tools are currently available for the analysis of multiple genomic sequences. An extensive data set for the testing and training of such tools is provided by the SCL gene locus. Here we have expanded the data set to eight vertebrate species by sequencing the dog SCL locus and by annotating the dog and rat SCL loci. To provide a resource for the bioinformatics community, all SCL sequences and functional annotations, comprising a collation of the extensive experimental evidence pertaining to SCL regulation, have been made available via a Web server. A Web interface to new tools specifically designed for the display and analysis of multiple sequence alignments was also implemented. The unique SCL data set and new sequence comparison tools allowed us to perform a rigorous examination of the true benefits of multiple sequence comparisons. We demonstrate that multiple sequence alignments are, overall, superior to pairwise alignments for identification of mammalian regulatory regions. In the search for individual transcription factor binding sites, multiple alignments markedly increase the signal-to-noise ratio compared to pairwise alignments.

    Genome research 2004;14;2;313-8

  • WormBase as an integrated platform for the C. elegans ORFeome.

    Chen N, Lawson D, Bradnam K, Harris TW and Stein LD

    Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.

    The ORFeome project has validated and corrected a large number of predicted gene models in the nematode C. elegans, and has provided an enormous resource for proteome-scale studies. To make the resource useful to the research and teaching community, it needs to be integrated with other large-scale data sets, including the C. elegans genome, cell lineage, neurological wiring diagram, transcriptome, and gene expression map. This integration is also critical because the ORFeome data sets, like other 'omics' data sets, have significant false-positive and false-negative rates, and comparison to related data is necessary to make confidence judgments in any given data point. WormBase, the central data repository for information about C. elegans and related nematodes, provides such a platform for integration. In this report, we will describe how C. elegans ORFeome data are deposited in the database, how they are used to correct gene models, how they are integrated and displayed in the context of other data sets at the WormBase Web site, and how WormBase establishes connection with the reagent-based resources at the ORFeome project Web site.

    Funded by: NHGRI NIH HHS: P41-HG02223

    Genome research 2004;14;10B;2155-61

  • Bigenic Cre/loxP, puDeltatk conditional genetic ablation.

    Chen YT, Levasseur R, Vaishnav S, Karsenty G and Bradley A

    Program in Developmental Biology, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.

    Genetic ablation experiments are used to resolve problems regarding cell lineages and the in vivo function of certain groups of cells. We describe a two-component conditional ablation technology using a mouse carrying an X-linked puDeltatk transgene, which is only activated in cells expressing Cre. Ablation of the Cre-expressing cells can be temporally regulated by the time of ganciclovir (GCV) administration. This strategy was demonstrated using a Col2Cre transgenic line. Differentiating chondrocytes in bigenic animals could be ablated at different developmental stages resulting in disorganized growth plates and dwarfism. Macrocephaly, macroglossia and umbilical hernia were also observed in ablated 18.5 dpc embryos. Crosses between the puDeltatk selector transgenic line and existing cre lines will facilitate numerous temporally regulated tissue-specific ablation experiments.

    Nucleic acids research 2004;32;20;e161

  • Inducible gene trapping with drug-selectable markers and Cre/loxP to identify developmentally regulated genes.

    Chen YT, Liu P and Bradley A

    Program in Developmental Biology, Baylor College of Medicine, Houston, Texas, USA.

    Gene trapping in mouse embryonic stem cells is an important genetic approach that allows simultaneous mutation of genes and generation of corresponding mutant mice. We designed a selection scheme with drug selection markers and Cre/loxP technology which allows screening of gene trap events that responded to a signaling molecule in a 96-well format. Nine hundred twenty gene trap clones were assayed, and 258 were classified as gene traps induced by in vitro differentiation. Sixty-five of the in vitro differentiation-inducible gene traps were also responsive to retinoic acid treatment. In vivo analysis revealed that 85% of the retinoic acid-inducible gene traps trapped developmentally regulated genes, consistent with the observation that genes induced by retinoic acid treatment are likely to be developmentally regulated. Our results demonstrate that the inducible gene trapping system described here can be used to enrich in vitro for traps in genes of interest. Furthermore, we demonstrate that the cre reporter is extremely sensitive and can be used to explore chromosomal regions that are not detectable with neo as a selection cassette.

    Molecular and cellular biology 2004;24;22;9930-41

  • The apical caspase dronc governs programmed and unprogrammed cell death in Drosophila.

    Chew SK, Akdemir F, Chen P, Lu WJ, Mills K, Daish T, Kumar S, Rodriguez A and Abrams JM

    Department of Cell Biology, UT Southwestern Medical Center, Dallas, TX 75390, USA.

    Among the seven caspases encoded in the fly genome, only dronc contains a caspase recruitment domain. To assess the function of this gene in development, we produced a null mutation in dronc. Animals lacking zygotic dronc are defective for programmed cell death (PCD) and arrest as early pupae. These mutants present a range of defects, including extensive hyperplasia of hematopoietic tissues, supernumerary neuronal cells, and head involution failure. dronc genetically interacts with the Ced4/Apaf1 counterpart, Dark, and adult structures lacking dronc are disrupted for fine patterning. Furthermore, in diverse models of metabolic injury, dronc- cells are completely insensitive to induction of cell killing. These findings establish dronc as an essential regulator of cell number in development and illustrate broad requirements for this apical caspase in adaptive responses during stress-induced apoptosis.

    Funded by: NIGMS NIH HHS: R01 GM072124-14A1, R01GM072124

    Developmental cell 2004;7;6;897-907

  • Proteomics in postgenomic neuroscience: the end of the beginning.

    Choudhary J and Grant SG

    Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK.

    Proteomics is complementary to genomic approaches anchored in DNA and RNA. Global characterization of proteins is providing new insights into general biological structures as well as synapses, receptor complexes and other neuronal and glial features. Current challenges for proteomics of the nervous system include problems relating to sample preparation, brain complexity, limited databases and informatics tools. The combination of proteomics with other global functional genomic approaches at the levels of genome and transcriptome, together with network biology, will provide important bridges between genes, physiology and pathology.

    Nature neuroscience 2004;7;5;440-5

  • A whole-genome mouse BAC microarray with 1-Mb resolution for analysis of DNA copy number changes by array comparative genomic hybridization.

    Chung YJ, Jonkers J, Kitson H, Fiegler H, Humphray S, Scott C, Hunt S, Yu Y, Nishijima I, Velds A, Holstege H, Carter N and Bradley A

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Microarray-based comparative genomic hybridization (CGH) has become a powerful method for the genome-wide detection of chromosomal imbalances. Although BAC microarrays have been used for mouse CGH studies, the resolving power of these analyses was limited because high-density whole-genome mouse BAC microarrays were not available. We therefore developed a mouse BAC microarray containing 2803 unique BAC clones from mouse genomic libraries at 1-Mb intervals. For the general amplification of BAC clone DNA prior to spotting, we designed a set of three novel degenerate oligonucleotide-primed (DOP) PCR primers that preferentially amplify mouse genomic sequences while minimizing unwanted amplification of contaminating Escherichia coli DNA. The resulting 3K mouse BAC microarrays reproducibly identified DNA copy number alterations in cell lines and primary tumors, such as single-copy deletions, regional amplifications, and aneuploidy.

    Genome research 2004;14;1;188-96
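
    As an illustration of how copy number alterations such as the single-copy deletions mentioned above are typically read from array CGH data, the following minimal sketch computes the expected log2 test-to-reference ratio per clone. It assumes a pure sample hybridized against a diploid reference; the function names and the 0.3 calling threshold are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def expected_log2_ratio(test_copies: int, reference_copies: int = 2) -> float:
    """Expected log2(test/reference) signal for one clone in a pure sample."""
    if test_copies <= 0 or reference_copies <= 0:
        raise ValueError("copy numbers must be positive for a finite log2 ratio")
    return math.log2(test_copies / reference_copies)


def call_copy_number(log2_ratio: float, threshold: float = 0.3) -> str:
    """Classify one clone; the threshold is an illustrative assumption."""
    if log2_ratio >= threshold:
        return "gain"
    if log2_ratio <= -threshold:
        return "loss"
    return "normal"


if __name__ == "__main__":
    # One copy lost, normal diploid, one copy gained
    for copies in (1, 2, 3):
        ratio = expected_log2_ratio(copies)
        print(copies, round(ratio, 3), call_copy_number(ratio))
```

    In real data the observed ratios are compressed toward zero by normal-cell contamination and noise, so thresholds are usually chosen empirically.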

  • The Jalview Java alignment editor.

    Clamp M, Cuff J, Searle SM and Barton GJ

    The Wellcome Trust Sanger Institute, Hinxton, UK.

    Multiple sequence alignment remains a crucial method for understanding the function of groups of related nucleic acid and protein sequences. However, it is known that automatic multiple sequence alignments can often be improved by manual editing. Therefore, tools are needed to view and edit multiple sequence alignments. Due to growth in the sequence databases, multiple sequence alignments can often be large and difficult to view efficiently. The Jalview Java alignment editor is presented here, which enables fast viewing and editing of large multiple sequence alignments.

    Bioinformatics (Oxford, England) 2004;20;3;426-7

  • Binary and microsatellite polymorphisms of the Y-chromosome in the Mbenzele pygmies from the Central African Republic.

    Coia V, Caglià A, Arredi B, Donati F, Santos FR, Pandya A, Taglioli L, Paoli G, Pascali V, Spedini G, Destro-Bisol G and Tyler-Smith C

    Department of Animal and Human Biology, University La Sapienza, Rome, Italy.

    This study analyzes the variation of six binary polymorphisms and six microsatellites in the Mbenzele Pygmies from the Central African Republic. Five different haplogroups (B2b, E(xE3a), E3a, P and BR(xB2b,DE,P)) were observed, with frequencies ranging from 0.022 (haplogroup P) to 0.609 (haplogroup E3a). A comparison of haplogroup frequencies indicates a close genetic affinity between the Mbenzele and the Biaka Pygmies, a finding consistent with the common origin and the geographical proximity of the two populations. The haplogroups P, BR(xB2b,DE,P) and E(xE3a), which are rare in sub-Saharan Africa but common in western Eurasia, were observed with frequencies ranging from 0.022 (haplogroup P) to 0.087 (haplogroup E(xE3a)). Thirty different microsatellite haplotypes were detected, with frequencies ranging from 0.022 to 0.152. The Mbenzele share the highest percentage of microsatellite haplotypes with the Biaka Pygmies. Five out of seven haplotypes which are shared by the Mbenzele and Biaka Pygmies belong to haplogroup E3a, which suggests that they are of Bantu origin. The plot based on F(st) genetic distances calculated using microsatellite data provides a picture of population relationships which is in part congruent and in part complementary to that obtained using haplogroup frequencies. Finally, the Mbenzele and Biaka Pygmies were found to be markedly more genetically similar using Y-chromosomal than autosomal microsatellites. We suggest that this could be due to the higher phylogenetic stability of the Y-chromosome and to the effect of the male-biased gene flow during the Bantu expansion.

    American journal of human biology : the official journal of the Human Biology Council 2004;16;1;57-67

  • Improved techniques for the identification of pseudogenes.

    Coin L and Durbin R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    Motivation: Pseudogenes are the remnants of genomic sequences of genes which are no longer functional. They are frequent in most eukaryotic genomes, and an important resource for comparative genomics. However, pseudogenes are often mis-annotated as functional genes in sequence databases. Current methods for identifying pseudogenes include methods which rely on the presence of stop codons and frameshifts, as well as methods based on the ratio of non-silent to silent nucleotide substitution rates (dN/dS). A recent survey concluded that 50% of human pseudogenes have no detectable truncation in their pseudo-coding regions, indicating that the former methods lack sensitivity. The latter methods have been used to find sets of genes enriched for pseudogenes, but are not specific enough to accurately separate pseudogenes from expressed genes.

    Results: We introduce a program called pseudogene inference from loss of constraint (PSILC) which incorporates novel methods for separating pseudogenes from functional genes. The methods calculate the log-odds score that evolution along the final branch of the gene tree to the query gene has been according to the following constraints: A neutral nucleotide model compared to a Pfam domain encoding model (PSILC(nuc/dom)); A protein coding model compared to a Pfam domain encoding model (PSILC(prot/dom)). Using the manual annotation of human chromosome 6, we show that both these methods result in a more accurate classification of pseudogenes than dN/dS when a Pfam domain alignment is available.

    Availability: PSILC is available from

    Funded by: Wellcome Trust

    Bioinformatics (Oxford, England) 2004;20 Suppl 1;i94-100
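
    The log-odds comparison of two evolutionary models described above can be sketched generically as follows. This is not PSILC's implementation: the per-site probabilities are hypothetical inputs, sites are assumed independent, and the function name is invented for illustration.

```python
import math
from typing import Sequence


def log_odds_score(site_probs_a: Sequence[float],
                   site_probs_b: Sequence[float]) -> float:
    """Sum of per-site log2 ratios P(site | model A) / P(site | model B).

    A positive score favours model A (e.g. a neutral model), a negative
    score favours model B (e.g. a constrained, domain-encoding model).
    """
    if len(site_probs_a) != len(site_probs_b):
        raise ValueError("both models must score the same number of sites")
    return sum(math.log2(pa / pb) for pa, pb in zip(site_probs_a, site_probs_b))


if __name__ == "__main__":
    # Hypothetical probabilities of the observed changes under each model
    neutral = [0.5, 0.5]
    constrained = [0.25, 0.25]
    print(log_odds_score(neutral, constrained))
```

    Working in log space keeps the score additive across sites and avoids numerical underflow when many small probabilities are multiplied.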

  • Enhanced protein domain discovery using taxonomy.

    Coin L, Bateman A and Durbin R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Background: It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids.

    Results: We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques.

    Conclusions: Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions - such as domain co-occurrence and protein localisation.

    BMC bioinformatics 2004;5;56

  • A genome annotation-driven approach to cloning the human ORFeome.

    Collins JE, Wright CL, Edwards CA, Davis MP, Grinham JA, Cole CG, Goward ME, Aguado B, Mallya M, Mokrab Y, Huckle EJ, Beare DM and Dunham I

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    We have developed a systematic approach to generating cDNA clones containing full-length open reading frames (ORFs), exploiting knowledge of gene structure from genomic sequence. Each ORF was amplified by PCR from a pool of primary cDNAs, cloned and confirmed by sequencing. We obtained clones representing 70% of genes on human chromosome 22, whereas searching available cDNA clone collections found at best 48% from a single collection and 60% for all collections combined.

    Genome biology 2004;5;10;R84

  • Comparative genomics of transcriptional control in the human malaria parasite Plasmodium falciparum.

    Coulson RM, Hall N and Ouzounis CA

    Computational Genomics Group, The European Bioinformatics Institute, European Molecular Biology Laboratory Cambridge Outstation, Cambridge CB10 1SD, United Kingdom.

    The life cycle of the parasite Plasmodium falciparum, responsible for the most deadly form of human malaria, requires specialized protein expression for survival in the mammalian host and insect vector. To identify components of processes controlling gene expression during its life cycle, the malarial genome--along with seven crown eukaryote group genomes--was queried with a reference set of transcription-associated proteins (TAPs). Following clustering on the basis of sequence similarity of the TAPs with their homologs, and together with hidden Markov model profile searches, 156 P. falciparum TAPs were identified. This represents about a third of the number of TAPs usually found in the genome of a free-living eukaryote. Furthermore, the P. falciparum genome appears to contain a low number of sequences, which are highly conserved and abundant within the kingdoms of free-living eukaryotes, that contribute to gene-specific transcriptional regulation. However, in comparison with these other eukaryotic genomes, the CCCH-type zinc finger (common in proteins modulating mRNA decay and translation rates) was found to be the most abundant in the P. falciparum genome. This observation, together with the paucity of malarial transcriptional regulators identified, suggests Plasmodium protein levels are primarily determined by posttranscriptional mechanisms.

    Genome research 2004;14;8;1548-54

  • Differential requirements for COPI transport during vertebrate early development.

    Coutinho P, Parsons MJ, Thomas KA, Hirst EM, Saúde L, Campos I, Williams PH and Stemple DL

    Division of Developmental Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, United Kingdom.

    The coatomer vesicular coat complex is essential for normal Golgi and secretory activities in eukaryotic cells. Through positional cloning of genes controlling zebrafish notochord development, we found that the sneezy, happy, and dopey loci encode the alpha, beta, and beta' subunits of the coatomer complex. Export from mutant endoplasmic reticulum is blocked, Golgi structure is disrupted, and mutant embryos eventually degenerate due to widespread apoptosis. The early embryonic phenotype, however, demonstrates that despite its "housekeeping" functions, coatomer activity is specifically and cell autonomously required for normal chordamesoderm differentiation, perinotochordal basement membrane formation, and melanophore pigmentation. Hence, differential requirements for coatomer activity among embryonic tissues lead to tissue-specific developmental defects. Moreover, we note that the mRNA encoding alpha coatomer is strikingly upregulated in notochord progenitors, and we present data suggesting that alpha coatomer transcription is tuned to activity- and cell type-specific secretory loads.

    Developmental cell 2004;7;4;547-58

  • Premature termination codons enhance mRNA decapping in human cells.

    Couttet P and Grange T

    Institut Jacques Monod du CNRS, Universités Paris 6-7, Tour 43, 2 Place Jussieu, 75251 Paris Cedex 05, France.

    Nonsense-mediated mRNA decay (NMD) is a eukaryotic surveillance process that promotes selective degradation of imperfect messages containing premature translation termination codons (PTCs). In yeast, PTCs trigger both deadenylylation-independent mRNA decapping, thereby allowing their rapid degradation by a 5' to 3' exonuclease, and to a smaller extent accelerated deadenylylation. It is not clear to what extent this decay pathway is conserved in higher eukaryotes. We used a transcriptional pulse strategy relying on a tetracycline-regulated promoter to study the decay of a PTC-containing beta-globin mRNA in human cells. We show that a PTC destabilizes the mRNA and decreases its half-life from >16 h to 3 h. The deadenylylation rate is increased, but not sufficiently to account for the decreased half-life on its own. Using a circularization RT-PCR (cRT-PCR) strategy, we could detect decapped degradation intermediates and measure simultaneously their poly(A) tail length. This allowed us to show that a PTC enhances the rate of mRNA decapping and that decapped products have been deadenylylated to a certain extent. Thus the major feature of the NMD pathway, enhanced decapping, is conserved from yeast to man even though the kinetic details might differ between various mRNAs and/or species.

    Nucleic acids research 2004;32;2;488-94
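
    To give a sense of scale for the half-life change reported above, the sketch below compares the fraction of transcript remaining under simple first-order decay. First-order kinetics is an assumption made here for illustration, and 16 h is used as a lower bound because the abstract reports the wild-type half-life only as >16 h; the function name is hypothetical.

```python
def fraction_remaining(hours: float, half_life_hours: float) -> float:
    """Fraction of mRNA left after `hours`, assuming first-order decay."""
    if half_life_hours <= 0:
        raise ValueError("half-life must be positive")
    return 0.5 ** (hours / half_life_hours)


if __name__ == "__main__":
    for t in (3, 6, 12):
        with_ptc = fraction_remaining(t, 3.0)      # PTC-containing mRNA
        without_ptc = fraction_remaining(t, 16.0)  # lower bound for normal mRNA
        print(f"{t:>2} h: PTC {with_ptc:.3f}, no PTC at least {without_ptc:.3f}")
```

    Under these assumptions, half of the PTC-containing transcript is gone after 3 h, while at least about 88% of the normal transcript remains.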

  • Biofilm formation and dispersal in Xanthomonas campestris.

    Crossman L and Dow JM

    Pathogen Sequencing Unit, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Xanthomonas campestris pathovar campestris is the causal agent of black rot disease of cruciferous plants. A cell-cell signalling system encoded by genes within the rpf cluster is required for the full virulence of this plant pathogen. This system has recently been implicated in regulation of the formation and dispersal of Xanthomonas biofilms.

    Microbes and infection / Institut Pasteur 2004;6;6;623-9

  • Chalk and cheese.

    Crossman L, Cerdeño-Tárraga A and Thomson NR

    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    Nature reviews. Microbiology 2004;2;7;528-9

  • Sequencing the environment.

    Crossman L, Sebaihia M, Cerdeño-Tárraga A and Parkhill J

    The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Nature reviews. Microbiology 2004;2;3;184-5

  • The Ensembl computing architecture.

    Cuff JA, Coates GM, Cutts TJ and Rae M

    The Broad Institute, Cambridge, Massachusetts 02141, USA.

    Ensembl is a software project to automatically annotate large eukaryotic genomes and release them freely into the public domain. The project currently automatically annotates 10 complete genomes. This makes very large demands on compute resources, due to the vast number of sequence comparisons that need to be executed. To circumvent the financial outlay often associated with classical supercomputing environments, farms of multiple, lower-cost machines have now become the norm and have been deployed successfully with this project. The architecture and design of farms containing hundreds of compute nodes is complex and nontrivial to implement. This study will define and explain some of the essential elements to consider when designing such systems. Server architecture and network infrastructure are discussed with a particular emphasis on solutions that worked and those that did not (often with fairly spectacular consequences). The aim of the study is to give the reader, who may be implementing a large-scale biocompute project, an insight into some of the pitfalls that may be waiting ahead.

    Genome research 2004;14;5;971-5

  • The Ensembl automatic gene annotation system.

    Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM and Clamp M

    The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    As more genomes are sequenced, there is an increasing need for automated first-pass annotation which allows timely access to important genomic information. The Ensembl gene-building system enables fast automated annotation of eukaryotic genomes. It annotates genes based on evidence derived from known protein, cDNA, and EST sequences. The gene-building system rests on top of the core Ensembl (MySQL) database schema and Perl Application Programming Interface (API), and the data generated are accessible through the Ensembl genome browser. To date, the Ensembl predicted gene sets are available for the A. gambiae, C. briggsae, zebrafish, mouse, rat, and human genomes and have been heavily relied upon in the publication of the human, mouse, rat, and A. gambiae genome sequence analysis. Here we describe in detail the gene-building system and the algorithms involved. All code and data are freely available from

    Genome research 2004;14;5;942-50

  • The HapMap project and its application to genetic studies of drug response.

    Deloukas P and Bentley D

    Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK.

    The pharmacogenomics journal 2004;4;2;88-90

  • The DNA sequence and comparative analysis of human chromosome 10.

    Deloukas P, Earthrowl ME, Grafham DV, Rubenfield M, French L, Steward CA, Sims SK, Jones MC, Searle S, Scott C, Howe K, Hunt SE, Andrews TD, Gilbert JG, Swarbreck D, Ashurst JL, Taylor A, Battles J, Bird CP, Ainscough R, Almeida JP, Ashwell RI, Ambrose KD, Babbage AK, Bagguley CL, Bailey J, Banerjee R, Bates K, Beasley H, Bray-Allen S, Brown AJ, Brown JY, Burford DC, Burrill W, Burton J, Cahill P, Camire D, Carter NP, Chapman JC, Clark SY, Clarke G, Clee CM, Clegg S, Corby N, Coulson A, Dhami P, Dutta I, Dunn M, Faulkner L, Frankish A, Frankland JA, Garner P, Garnett J, Gribble S, Griffiths C, Grocock R, Gustafson E, Hammond S, Harley JL, Hart E, Heath PD, Ho TP, Hopkins B, Horne J, Howden PJ, Huckle E, Hynds C, Johnson C, Johnson D, Kana A, Kay M, Kimberley AM, Kershaw JK, Kokkinaki M, Laird GK, Lawlor S, Lee HM, Leongamornlert DA, Laird G, Lloyd C, Lloyd DM, Loveland J, Lovell J, McLaren S, McLay KE, McMurray A, Mashreghi-Mohammadi M, Matthews L, Milne S, Nickerson T, Nguyen M, Overton-Larty E, Palmer SA, Pearce AV, Peck AI, Pelan S, Phillimore B, Porter K, Rice CM, Rogosin A, Ross MT, Sarafidou T, Sehra HK, Shownkeen R, Skuce CD, Smith M, Standring L, Sycamore N, Tester J, Thorpe A, Torcasso W, Tracey A, Tromans A, Tsolas J, Wall M, Walsh J, Wang H, Weinstock K, West AP, Willey DL, Whitehead SL, Wilming L, Wray PW, Young L, Chen Y, Lovering RC, Moschonas NK, Siebert R, Fechtel K, Bentley D, Durbin R, Hubbard T, Doucette-Stamm L, Beck S, Smith DR and Rogers J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    The finished sequence of human chromosome 10 comprises a total of 131,666,441 base pairs. It represents 99.4% of the euchromatic DNA and includes one megabase of heterochromatic sequence within the pericentromeric region of the short and long arm of the chromosome. Sequence annotation revealed 1,357 genes, of which 816 are protein coding, and 430 are pseudogenes. We observed widespread occurrence of overlapping coding genes (either strand) and identified 67 antisense transcripts. Our analysis suggests that both inter- and intrachromosomal segmental duplications have impacted on the gene count on chromosome 10. Multispecies comparative analysis indicated that we can readily annotate the protein-coding genes with current resources. We estimate that over 95% of all coding exons were identified in this study. Assessment of single base changes between the human chromosome 10 and chimpanzee sequence revealed nonsense mutations in only 21 coding genes with respect to the human sequence.

    Nature 2004;429;6990;375-81

  • The Hotdog fold: wrapping up a superfamily of thioesterases and dehydratases.

    Dillon SC and Bateman A

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.

    Background: The Hotdog fold was initially identified in the structure of Escherichia coli FabA and subsequently in 4-hydroxybenzoyl-CoA thioesterase from Pseudomonas sp. strain CBS. Since that time structural determinations have shown a number of other apparently unrelated proteins also share the Hotdog fold.

    Results: Using sequence analysis we unify a large superfamily of HotDog domains. Membership includes numerous prokaryotic, archaeal and eukaryotic proteins involved in several related, but distinct, catalytic activities, from metabolic roles such as thioester hydrolysis in fatty acid metabolism, to degradation of phenylacetic acid and the environmental pollutant 4-chlorobenzoate. The superfamily also includes FapR, a non-catalytic bacterial homologue that is involved in transcriptional regulation of fatty acid biosynthesis. We have defined 17 subfamilies, with some characterisation. Operon analysis has revealed numerous HotDog domain-containing proteins to be fusion proteins, where two genes, once separate but adjacent open-reading frames, have been fused into one open-reading frame to give a protein with two functional domains. Finally we have generated a Hidden Markov Model library from our analysis, which can be used as a tool for predicting the occurrence of HotDog domains in any protein sequence.

    Conclusions: The HotDog domain is both an ancient and ubiquitous motif, with members found in the three branches of life.

    BMC bioinformatics 2004;5;109

  • Array comparative genomic hybridization analysis of colorectal cancer cell lines and primary carcinomas.

    Douglas EJ, Fiegler H, Rowan A, Halford S, Bicknell DC, Bodmer W, Tomlinson IP and Carter NP

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Array comparative genomic hybridization, with a genome-wide resolution of approximately 1 Mb, has been used to investigate copy number changes in 48 colorectal cancer (CRC) cell lines and 37 primary CRCs. The samples were divided for analysis according to the type of genomic instability that they exhibit, microsatellite instability (MSI) or chromosomal instability (CIN). Consistent copy number changes were identified, including gain of chromosomes 20, 13, and 8q and smaller regions of amplification such as chromosome 17q11.2-q12. Loss of chromosome 18q was a recurrent finding along with deletion of discrete regions such as chromosome 4q34-q35. The overall pattern of copy number change was strikingly similar between cell lines and primary cancers with a few obvious exceptions such as loss of chromosome 6 and gain of chromosomes 15 and 12p in the former. A greater number of aberrations were detected in CIN+ than MSI+ samples as well as differences in the type and extent of change reported. For example, loss of chromosome 8p was a common event in CIN+ cell lines and cancers but was often found to be gained in MSI+ cancers. In addition, the target of amplification on chromosome 8q appeared to differ, with 8q24.21 amplified frequently in CIN+ samples but 8q24.3 amplification a common finding in MSI+ samples. A number of genes of interest are located within the frequently aberrated regions, which are likely to be of importance in the development and progression of CRC.

    Cancer research 2004;64;14;4817-25

  • Comparative cell wall core biosynthesis in the mycolated pathogens, Mycobacterium tuberculosis and Corynebacterium diphtheriae.

    Dover LG, Cerdeño-Tárraga AM, Pallen MJ, Parkhill J and Besra GS

    School of Biosciences, The University of Birmingham, Edgbaston, Birmingham B15 2TT, UK.

    The recent determination of the complete genome sequence of Corynebacterium diphtheriae, the aetiological agent of diphtheria, has allowed a detailed comparison of its physiology with that of its closest sequenced pathogenic relative Mycobacterium tuberculosis. Of major importance to the pathogenicity and resilience of the latter is its particularly complex cell envelope. The corynebacteria share many of the features of this extraordinary structure although to a lesser level of complexity. The cell envelope of M. tuberculosis has provided the molecular targets for several of the major anti-tubercular drugs. Given a backdrop of emerging multi-drug resistant strains of the organism (MDR-TB) and its continuing global threat to human health, the search for novel anti-tubercular agents is of paramount importance. The unique structure of this cell wall and the importance of its integrity to the viability of the organism suggest that the search for novel drug targets within the array of enzymes responsible for its construction may prove fruitful. Although the application of modern bioinformatics techniques to the 'mining' of the M. tuberculosis genome has already increased our knowledge of the biosynthesis and assembly of the mycobacterial cell wall, several issues remain uncertain. Further analysis by comparison with its relatives may bring clarity and aid the early identification of novel cellular targets for new anti-tuberculosis drugs. In order to facilitate this aim, this review intends to illustrate the broad similarities and highlight the structural differences between the two bacterial envelopes and discuss the genetics of their biosynthesis.

    FEMS microbiology reviews 2004;28;2;225-50

  • What can we learn from noncoding regions of similarity between genomes?

    Down TA and Hubbard TJ

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    Background: In addition to known protein-coding genes, large amounts of apparently non-coding sequence are conserved between the human and mouse genomes. It seems reasonable to assume that these conserved regions are more likely to contain functional elements than less-conserved portions of the genome.

    Methods: Here we used a motif-oriented machine learning method based on the Relevance Vector Machine algorithm to extract the strongest signal from a set of non-coding conserved sequences.

    Results: We successfully fitted models to reflect the non-coding sequences, and showed that the results were quite consistent for repeated training runs. Using the learned models to scan genomic sequence, we found that they often made predictions close to the start of annotated genes. We compared this method with other published promoter-prediction systems, and showed that the set of promoters which are detected by this method is substantially similar to that detected by existing methods.

    Conclusions: The results presented here indicate that the promoter signal is the strongest single motif-based signal in the non-coding functional fraction of the genome. They also lend support to the belief that there exists a substantial subset of promoter regions which share several common features including, but not restricted to, a relative abundance of CpG dinucleotides. This subset is detectable by a variety of distinct computational methods.

    BMC bioinformatics 2004;5;131

  • SNP allele frequency estimation in DNA pools and variance components analysis.

    Downes K, Barratt BJ, Akan P, Bumpstead SJ, Taylor SD, Clayton DG and Deloukas P

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK.

    The estimation of single nucleotide polymorphism (SNP) allele frequency in pooled DNA samples has been proposed as a cost-effective approach to whole genome association studies. However, the key issue is the allele frequency window in which a genotyping method operates and provides a statistically reliable answer. We assessed the homogeneous mass extend assay and estimated the variance associated with each experimental stage. We report that a relationship between estimated allele frequency and variance might exist, suggesting that high statistical power can be retained at low, as well as high, allele frequencies. Assuming this relationship, the formation of subpools consisting of 100 samples retains an effective sample size greater than 70% of the true sample size, with a savings of 11-fold the cost of an individual genotyping study, regardless of allele frequency.

    BioTechniques 2004;36;5;840-5

  • The DNA sequence and analysis of human chromosome 13.

    Dunham A, Matthews LH, Burton J, Ashurst JL, Howe KL, Ashcroft KJ, Beare DM, Burford DC, Hunt SE, Griffiths-Jones S, Jones MC, Keenan SJ, Oliver K, Scott CE, Ainscough R, Almeida JP, Ambrose KD, Andrews DT, Ashwell RI, Babbage AK, Bagguley CL, Bailey J, Bannerjee R, Barlow KF, Bates K, Beasley H, Bird CP, Bray-Allen S, Brown AJ, Brown JY, Burrill W, Carder C, Carter NP, Chapman JC, Clamp ME, Clark SY, Clarke G, Clee CM, Clegg SC, Cobley V, Collins JE, Corby N, Coville GJ, Deloukas P, Dhami P, Dunham I, Dunn M, Earthrowl ME, Ellington AG, Faulkner L, Frankish AG, Frankland J, French L, Garner P, Garnett J, Gilbert JG, Gilson CJ, Ghori J, Grafham DV, Gribble SM, Griffiths C, Hall RE, Hammond S, Harley JL, Hart EA, Heath PD, Howden PJ, Huckle EJ, Hunt PJ, Hunt AR, Johnson C, Johnson D, Kay M, Kimberley AM, King A, Laird GK, Langford CJ, Lawlor S, Leongamornlert DA, Lloyd DM, Lloyd C, Loveland JE, Lovell J, Martin S, Mashreghi-Mohammadi M, McLaren SJ, McMurray A, Milne S, Moore MJ, Nickerson T, Palmer SA, Pearce AV, Peck AI, Pelan S, Phillimore B, Porter KM, Rice CM, Searle S, Sehra HK, Shownkeen R, Skuce CD, Smith M, Steward CA, Sycamore N, Tester J, Thomas DW, Tracey A, Tromans A, Tubby B, Wall M, Wallis JM, West AP, Whitehead SL, Willey DL, Wilming L, Wray PW, Wright MW, Young L, Coulson A, Durbin R, Hubbard T, Sulston JE, Beck S, Bentley DR, Rogers J and Ross MT

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.

    Chromosome 13 is the largest acrocentric human chromosome. It carries genes involved in cancer including the breast cancer type 2 (BRCA2) and retinoblastoma (RB1) genes, is frequently rearranged in B-cell chronic lymphocytic leukaemia, and contains the DAOA locus associated with bipolar disorder and schizophrenia. We describe completion and analysis of 95.5 megabases (Mb) of sequence from chromosome 13, which contains 633 genes and 296 pseudogenes. We estimate that more than 95.4% of the protein-coding genes of this chromosome have been identified, on the basis of comparison with other vertebrate genome sequences. Additionally, 105 putative non-coding RNA genes were found. Chromosome 13 has one of the lowest gene densities (6.5 genes per Mb) among human chromosomes, and contains a central region of 38 Mb where the gene density drops to only 3.1 genes per Mb.

    Nature 2004;428;6982;522-8

  • Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes.

    Durrant C, Zondervan KT, Cardon LR, Hunt S, Deloukas P and Morris AP

    Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom.

    We present a novel approach to disease-gene mapping via cladistic analysis of single-nucleotide polymorphism (SNP) haplotypes obtained from large-scale, population-based association studies, applicable to whole-genome screens, candidate-gene studies, or fine-scale mapping. Clades of haplotypes are tested for association with disease, exploiting the expected similarity of chromosomes with recent shared ancestry in the region flanking the disease gene. The method is developed in a logistic-regression framework and can easily incorporate covariates such as environmental risk factors or additional unlinked loci to allow for population structure. To evaluate the power of this approach to detect disease-marker association, we have developed a simulation algorithm to generate high-density SNP data with short-range linkage disequilibrium based on empirical patterns of haplotype diversity. The results of the simulation study highlight substantial gains in power over single-locus tests for a wide range of disease models, despite overcorrection for multiple testing.

    American journal of human genetics 2004;75;1;35-43

  • Production of soluble mammalian proteins in Escherichia coli: identification of protein features that correlate with successful expression.

    Dyson MR, Shadbolt SP, Vincent KJ, Perera RL and McCafferty J

    The Atlas of Gene Expression Project, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    Background: In the search for generic expression strategies for mammalian protein families several bacterial expression vectors were examined for their ability to promote high yields of soluble protein. Proteins studied included cell surface receptors (Ephrins and Eph receptors, CD44), kinases (EGFR-cytoplasmic domain, CDK2 and 4), proteases (MMP1, CASP2), signal transduction proteins (GRB2, RAF1, HRAS) and transcription factors (GATA2, Fli1, Trp53, Mdm2, JUN, FOS, MAD, MAX). Over 400 experiments were performed where expression of 30 full-length proteins and protein domains were evaluated with 6 different N-terminal and 8 C-terminal fusion partners. Expression of an additional set of 95 mammalian proteins was also performed to test the conclusions of this study.

    Results: Several protein features correlated with soluble protein expression yield including molecular weight and the number of contiguous hydrophobic residues and low complexity regions. There was no relationship between successful expression and protein pI, grand average of hydropathicity (GRAVY), or sub-cellular location. Only small globular cytoplasmic proteins with an average molecular weight of 23 kDa did not require a solubility enhancing tag for high level soluble expression. Thioredoxin (Trx) and maltose binding protein (MBP) were the best N-terminal protein fusions to promote soluble expression, but MBP was most effective as a C-terminal fusion. 63 of 95 mammalian proteins expressed at soluble levels of greater than 1 mg/l as N-terminal H10-MBP fusions and those that failed possessed, on average, a higher molecular weight and greater number of contiguous hydrophobic amino acids and low complexity regions.

    Conclusions: By analysis of the protein features identified here, this study will help predict which mammalian proteins and domains can be successfully expressed in E. coli as soluble product and also which are best targeted for a eukaryotic expression system. In some cases proteins may be truncated to minimise molecular weight and the numbers of contiguous hydrophobic amino acids and low complexity regions to aid soluble expression in E. coli.

    BMC biotechnology 2004;4;32
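
    Two of the sequence features named in the abstract above can be computed directly from an amino-acid sequence. The following is a minimal sketch, not the authors' analysis code: GRAVY is taken as the mean Kyte-Doolittle hydropathy value per residue, and "contiguous hydrophobic residues" is assumed here to mean the longest run of residues with a positive Kyte-Doolittle value, since the abstract does not give the exact definition. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence.upper()) / len(sequence)


def longest_hydrophobic_run(sequence):
    """Length of the longest run of residues with a positive hydropathy value.

    Assumption: 'hydrophobic' means Kyte-Doolittle value > 0; the study may
    have used a different residue set.
    """
    longest = current = 0
    for aa in sequence.upper():
        if KYTE_DOOLITTLE[aa] > 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest
```

    For a sequence such as "KKAILVKK", longest_hydrophobic_run counts the four-residue stretch AILV; a threshold on this value would have to be chosen empirically.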

  • Future potential of the Human Epigenome Project.

    Eckhardt F, Beck S, Gut IG and Berlin K

    Epigenomics AG, Kleine Präsidentenstrasse 1, 10178 Berlin, Germany.

    Deciphering the information encoded in the human genome is key for the further understanding of human biology, physiology and evolution. With the draft sequence of the human genome completed, elucidation of the epigenetic information layer of the human genome becomes accessible. Epigenetic mechanisms are mediated by either chemical modifications of the DNA itself or by modifications of proteins that are closely associated with DNA. Defects of the epigenetic regulation involved in processes such as imprinting, X chromosome inactivation, transcriptional control of genes, as well as mutations affecting DNA methylation enzymes, contribute fundamentally to the etiology of many human diseases. Headed by the Human Epigenome Consortium, the Human Epigenome Project is a joint effort by an international collaboration that aims to identify, catalog and interpret genome-wide DNA methylation patterns of all human genes in all major tissues. Methylation variable positions are thought to reflect gene activity, tissue type and disease state, and are useful epigenetic markers revealing the dynamic state of the genome. Like single nucleotide polymorphisms, methylation variable positions will greatly advance our ability to elucidate and diagnose the molecular basis of human diseases.

    Expert review of molecular diagnostics 2004;4;5;609-18

  • Advances in schistosome genomics.

    El-Sayed NM, Bartholomeu D, Ivens A, Johnston DA and LoVerde PT

    Department of Parasite Genomics, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.

    Funded by: NIAID NIH HHS: U01 AI 48828

    Trends in parasitology 2004;20;4;154-7

  • Comparative evolutionary genomics of androgen-binding protein genes.

    Emes RD, Riley MC, Laukaitis CM, Goodstadt L, Karn RC and Ponting CP

    MRC Functional Genetics Unit, Department of Human Anatomy and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom.

    Allelic variation within the mouse androgen-binding protein (ABP) alpha subunit gene (Abpa) has been suggested to promote assortative mating and thus prezygotic isolation. This is consistent with the elevated evolutionary rates observed for the Abpa gene, and the Abpb and Abpg genes whose products (ABPbeta and ABPgamma) form heterodimers with ABPalpha. We have investigated the mouse sequence that contains the three Abpa/b/g genes, and orthologous regions in rat, human, and chimpanzee genomes. Our studies reveal extensive "remodeling" of this region: Duplication rates of Abpa-like and Abpbg-like genes in mouse are >2 orders of magnitude higher than the average rate for all mouse genes; synonymous nucleotide substitution rates are twofold higher; and the Abpabg genomic region has expanded nearly threefold since divergence of the rodents. During this time, one in six amino acid sites in ABPbetagamma-like proteins appear to have been subject to positive selection; these may constitute a site of interaction with receptors or ligands. Greater adaptive variation among Abpbg-like sequences than among Abpa-like sequences suggests that assortative mating preferences are more influenced by variation in Abpbg-like genes. We propose a role for ABPalpha/beta/gamma proteins as pheromones, or in modulating odorant detection. This would account for the extraordinary adaptive evolution of these genes, and surrounding genomic regions, in murid rodents.

    Genome research 2004;14;8;1516-29

  • The ENCODE (ENCyclopedia Of DNA Elements) Project.

    ENCODE Project Consortium

    The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence. The pilot phase of the Project is focused on a specified 30 megabases (approximately 1%) of the human genome sequence and is organized as an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. The results of this pilot phase will guide future efforts to analyze the entire human genome.

    Funded by: NHGRI NIH HHS: R01 HG003143

    Science (New York, N.Y.) 2004;306;5696;636-40

  • ESTGenes: alternative splicing from ESTs in Ensembl.

    Eyras E, Caccamo M, Curwen V and Clamp M

    The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    We describe a novel algorithm for deriving the minimal set of nonredundant transcripts compatible with the splicing structure of a set of ESTs mapped on a genome. Sets of ESTs with compatible splicing are represented by a special type of graph. We describe the algorithms for building the graphs and for deriving the minimal set of transcripts from the graphs that are compatible with the evidence. These algorithms are part of the Ensembl automatic gene annotation system, and its results, using ESTs, are provided as ESTgenes for the mosquito, Caenorhabditis briggsae, C. elegans, zebrafish, human, mouse, and rat genomes. Here we also report on the results of this method applied to the human and mouse genomes.

    Genome research 2004;14;5;976-87
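
    The idea of a nonredundant transcript set can be illustrated with a toy example. This is a deliberately simplified sketch and not the ESTGenes algorithm: it does not build the graph described in the abstract and does not merge overlapping ESTs into longer transcripts. It only discards any intron chain that occurs as a contiguous run inside a longer chain. The representation (each EST as a tuple of (start, end) intron coordinates) and the function names are assumptions made for illustration.

```python
def is_contained(short, long):
    """True if intron chain `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(tuple(long[i:i + n]) == tuple(short)
               for i in range(len(long) - n + 1))


def nonredundant(chains):
    """Keep only intron chains that are not contained in another kept chain.

    Chains are processed longest first, so a shorter chain is dropped when a
    longer chain already explains all of its introns in the same order.
    """
    unique = sorted(set(tuple(c) for c in chains), key=len, reverse=True)
    kept = []
    for chain in unique:
        if not any(is_contained(chain, k) for k in kept):
            kept.append(chain)
    return kept


# Three hypothetical ESTs: the single-intron EST is explained by either of
# the two-intron ESTs, which differ in their second intron (alternative
# splicing) and are therefore both kept.
ests = [
    [(100, 200), (300, 400)],
    [(100, 200)],
    [(100, 200), (300, 450)],
]
transcripts = nonredundant(ests)
```

    In this toy input the two longer chains survive as distinct splice variants, which is the kind of evidence-compatible, nonredundant output the abstract describes.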

  • Amplification and overexpression of E2F3 in human bladder cancer.

    Feber A, Clark J, Goodwin G, Dodson AR, Smith PH, Fletcher A, Edwards S, Flohr P, Falconer A, Roe T, Kovacs G, Dennis N, Fisher C, Wooster R, Huddart R, Foster CS and Cooper CS

    Section of Molecular Carcinogenesis and Male Urological Cancer Research Centre, Institute of Cancer Research, Sutton, Surrey SM2 5NG, UK.

    We demonstrate that, in human bladder cancer, amplification of the E2F3 gene, located at 6p22, is associated with overexpression of its encoded mRNA transcripts and high levels of expression of E2F3 protein. Immunohistochemical analyses of E2F3 protein levels have established that around one-third (33/101) of primary transitional cell carcinomas of the bladder overexpress nuclear E2F3 protein, with the proportion of tumours containing overexpressed nuclear E2F3 increasing with tumour stage and grade. When considered together with the established role of E2F3 in cell cycle progression, these results suggest that the E2F3 gene represents a candidate bladder cancer oncogene that is activated by DNA amplification and overexpression.

    Oncogene 2004;23;8;1627-30

  • A large AZFc deletion removes DAZ3/DAZ4 and nearby genes from men in Y haplogroup N.

    Fernandes S, Paracchini S, Meyer LH, Floridia G, Tyler-Smith C and Vogt PH

    Section of Molecular Genetics & Infertility, Department of Gynecological Endocrinology & Reproductive Medicine, University of Heidelberg, Heidelberg, Germany.

    Deletion of the entire AZFc locus on the human Y chromosome leads to male infertility. The functional roles of the individual gene families mapped to AZFc are, however, still poorly understood, since the analysis of the region is complicated by its repeated structure. We have therefore used single-nucleotide variants (SNVs) across approximately 3 Mb of the AZFc sequence to identify 17 AZFc haplotypes and have examined them for deletion of individual AZFc gene copies. We found five individuals who lacked SNVs from a large segment of DNA containing the DAZ3/DAZ4 and BPY2.2/BPY2.3 gene doublets in distal AZFc. Southern blot analyses showed that the lack of these SNVs was due to deletion of the underlying DNA segment. Typing 118 binary Y markers showed that all five individuals belonged to Y haplogroup N, and 15 of 15 independently ascertained men in haplogroup N carried a similar deletion. Haplogroup N is known to be common and widespread in Europe and Asia, and there is no indication of reduced fertility in men with this Y chromosome. We therefore conclude that a common variant of the human Y chromosome lacks the DAZ3/DAZ4 and BPY2.2/BPY2.3 doublets in distal AZFc and thus that these genes cannot be required for male fertility; the gene content of the AZFc locus is likely to be genetically redundant. Furthermore, the observed deletions cannot be derived from the GenBank reference sequence by a single recombination event; an origin by homologous recombination from such a sequence organization must be preceded by an inversion event. These data confirm the expectation that the human Y chromosome sequence and gene complement may differ substantially between individuals and more variations are to be expected in different Y chromosomal haplogroups.

    American journal of human genetics 2004;74;1;180-7

  • Genomic array technology.

    Fiegler H and Carter NP

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    Methods in cell biology 2004;75;769-85

  • A central role for the notochord in vertebral patterning.

    Fleming A, Keynes R and Tannahill D

    Department of Anatomy, University of Cambridge, Downing Street, Cambridge CB2 3DY, UK.

    The vertebrates are defined by their segmented vertebral column, and vertebral periodicity is thought to originate from embryonic segments, the somites. According to the widely accepted 'resegmentation' model, a single vertebra forms from the recombination of the anterior and posterior halves of two adjacent sclerotomes on both sides of the embryo. Although there is supporting evidence for this model in amniotes, it remains uncertain whether it applies to all vertebrates. To explore this, we have investigated vertebral patterning in the zebrafish. Surprisingly, we find that vertebral bodies (centra) arise by secretion of bone matrix from the notochord rather than somites; centra do not form via a cartilage intermediate stage, nor do they contain osteoblasts. Moreover, isolated, cultured notochords secrete bone matrix in vitro, and ablation of notochord cells at segmentally reiterated positions in vivo prevents the formation of centra. Analysis of fss mutant embryos, in which sclerotome segmentation is disrupted, shows that whereas neural arch segmentation is also disrupted, centrum development proceeds normally. These findings suggest that the notochord plays a key, perhaps ancient, role in the segmental patterning of vertebrae.

    Development (Cambridge, England) 2004;131;4;873-80

  • Nodal/activin signaling establishes oral-aboral polarity in the early sea urchin embryo.

    Flowers VL, Courteau GR, Poustka AJ, Weng W and Venuti JM

    Department of Cell Biology and Anatomy, Louisiana State University Health Sciences Center, New Orleans, Louisiana 70112-1393, USA.

    Components of the Wnt signaling pathway are involved in patterning the sea urchin primary or animal-vegetal (AV) axis, but the molecular cues that pattern the secondary embryonic axis, the aboral/oral (AO) axis, are not known. In an analysis of signaling molecules that influence patterning along the sea urchin embryonic axes, we found that members of the activin subfamily of transforming growth factor-beta (TGF-beta) signaling molecules influence the establishment of AO polarities in the early embryo. Injection of activin mRNAs into fertilized eggs or treatment with exogenously applied recombinant activin altered the allocation of ectodermal fates and ventralized the embryo. The phenotypes observed resemble the ventralized phenotype previously reported for NiCl2, a known disrupter of AO patterning. Sensitivity to exogenous activin occurs between fertilization and the late blastula stage, which is also the time of highest NiCl2 sensitivity. These results argue that specification of fates along the embryonic AO axis involves TGF-beta signaling. To further examine TGF-beta signaling in these embryos, we cloned an endogenous TGF-beta from sea urchin embryos that is a member of the activin subfamily, SpNodal, and show through gain of function analysis that it recapitulates results obtained with exogenous activins and NiCl2. The expression pattern of SpNodal is consistent with a role for nodal signaling in the establishment of fates along the AO axis. Loss of function experiments using SpNodal antisense morpholinos also support a role for SpNodal in the establishment of the AO axis.

    Developmental dynamics : an official publication of the American Association of Anatomists 2004;231;4;727-40

  • RNA interference: human genes hit the big screen.

    Fraser A

    Nature 2004;428;6981;375-8

  • Towards full employment: using RNAi to find roles for the redundant.

    Fraser A

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Cancer is a genetic disease that ultimately results from the failure of cells to respond correctly to diverse signals. Signal transduction and signal integration are highly complex, requiring the combinatorial interaction of multiple genes. Classical genetics in model organisms including Caenorhabditis elegans has been of immense use in identifying nonredundant components of conserved signalling pathways. However, it is likely that there is much functional redundancy in the informational processing machinery of metazoan cells; we therefore need to develop methods for uncovering such redundant functions in model organisms if we are to use them to understand complex gene interactions and oncogene cooperation. RNAi may provide a powerful tool to probe redundancy in informational networks. In this review, I set out some of the progress made so far by classical genetics in understanding redundancy in gene networks, and outline how RNAi may allow us to approach this problem more systematically in C. elegans. In particular, I discuss the use of genome-wide RNAi screens in C. elegans to identify synthetic lethal interactions and compare this with synthetic lethal interaction analysis in Saccharomyces cerevisiae.

    Oncogene 2004;23;51;8346-52

  • A probabilistic view of gene function.

    Fraser AG and Marcotte EM

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Cells are controlled by the complex and dynamic actions of thousands of genes. With the sequencing of many genomes, the key problem has shifted from identifying genes to knowing what the genes do; we need a framework for expressing that knowledge. Even the most rigorous attempts to construct ontological frameworks describing gene function (e.g., the Gene Ontology project) ultimately rely on manual curation and are thus labor-intensive and subjective. But an alternative exists: the field of functional genomics is piecing together networks of gene interactions, and although these data are currently incomplete and error-prone, they provide a glimpse of a new, probabilistic view of gene function. We outline such a framework, which revolves around a statistical description of gene interactions derived from large, systematically compiled data sets. In this probabilistic view, pleiotropy is implicit, all data have errors and the definition of gene function is an iterative process that ultimately converges on the correct functions. The relationships between the genes are defined by the data, not by hand. Even this comprehensive view fails to capture key aspects of gene function, not least their dynamics in time and space, showing that there are limitations to the model that must ultimately be addressed.

    Nature genetics 2004;36;6;559-64

  • Development through the eyes of functional genomics.

    Fraser AG and Marcotte EM

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    In many of the model organisms used to study development, it is becoming relatively routine to carry out global analyses of gene function. These analyses take many forms, from microarray analyses to the construction of physical interaction maps to the systematic analyses of loss-of-function phenotypes. Such large-scale datasets can be integrated to generate complex gene networks, and we explore how these gene networks can contribute to an understanding of developmental pathways. In particular, we examine how combining large-scale expression experiments and gene networks may move us towards a molecular description of the events of development, embodied in a succession of stage-specific subnetworks sampled from an organism's overall gene network.

    Current opinion in genetics & development 2004;14;4;336-42

  • A census of human cancer genes.

    Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N and Stratton MR

    Cancer Genome Project, Human Genome Analysis Group and Pfam Group, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambs, CB10 1SA, UK.

    Nature reviews. Cancer 2004;4;3;177-83

  • Global gene expression of fission yeast in response to cisplatin.

    Gatti L, Chen D, Beretta GL, Rustici G, Carenini N, Corna E, Colangelo D, Zunino F, Bähler J and Perego P

    Istituto Nazionale Tumori, 20133, Milan, Italy.

    The cellular response to the antitumor drug cisplatin is complex, and resistance is widespread. To gain insights into the global transcriptional response and mechanisms of resistance, we used microarrays to examine the fission yeast cell response to cisplatin. In two isogenic strains with differing drug sensitivity, cisplatin activated a stress response involving glutathione-S-transferase, heat shock, and recombinational repair genes. Genes required for proteasome-mediated protein degradation were up-regulated in the sensitive strain, whereas genes for DNA damage recognition/repair and for mitotic progression were induced in the resistant strain. The response to cisplatin overlaps in part with the responses to cadmium and the DNA-damaging agent methylmethane sulfonate. The different gene groups involved in the cellular response to cisplatin help the cells to tolerate and repair DNA damage and to overcome cell cycle blocks. These findings are discussed with respect to known cisplatin response pathways in human cells.

    Funded by: Cancer Research UK: A6517; Wellcome Trust: 077118

    Cellular and molecular life sciences : CMLS 2004;61;17;2253-63

  • A family with severe insulin resistance and diabetes due to a mutation in AKT2.

    George S, Rochford JJ, Wolfrum C, Gray SL, Schinner S, Wilson JC, Soos MA, Murgatroyd PR, Williams RM, Acerini CL, Dunger DB, Barford D, Umpleby AM, Wareham NJ, Davies HA, Schafer AJ, Stoffel M, O'Rahilly S and Barroso I

    Department of Clinical Biochemistry, University of Cambridge, Addenbrooke's Hospital, Hills Road, Cambridge CB2 2QQ, UK.

    Inherited defects in signaling pathways downstream of the insulin receptor have long been suggested to contribute to human type 2 diabetes mellitus. Here we describe a mutation in the gene encoding the protein kinase AKT2/PKBbeta in a family that shows autosomal dominant inheritance of severe insulin resistance and diabetes mellitus. Expression of the mutant kinase in cultured cells disrupted insulin signaling to metabolic end points and inhibited the function of coexpressed, wild-type AKT. These findings demonstrate the central importance of AKT signaling to insulin sensitivity in humans.

    Funded by: Wellcome Trust: 078986

    Science (New York, N.Y.) 2004;304;5675;1325-8

  • Gene synteny and evolution of genome architecture in trypanosomatids.

    Ghedin E, Bringaud F, Peterson J, Myler P, Berriman M, Ivens A, Andersson B, Bontempi E, Eisen J, Angiuoli S, Wanless D, Von Arx A, Murphy L, Lennard N, Salzberg S, Adams MD, White O, Hall N, Stuart K, Fraser CM and El-Sayed NM

    Parasite Genomics, The Institute for Genomic Research, 9712 Medical Center Dr., Rockville, MD 20850, USA.

    The trypanosomatid protozoa Trypanosoma brucei, Trypanosoma cruzi and Leishmania major are related human pathogens that cause markedly distinct diseases. Using information from genome sequencing projects currently underway, we have compared the sequences of large chromosomal fragments from each species. Despite high levels of divergence at the sequence level, these three species exhibit a striking conservation of gene order, suggesting that selection has maintained gene order among the trypanosomatids over hundreds of millions of years of evolution. The few sites of genome rearrangement between these species are marked by the presence of retrotransposon-like elements, suggesting that retrotransposons may have played an important role in shaping trypanosomatid genome organization. A degenerate retroelement was identified in L. major by examining the regions near breakage points of the synteny. This is the first such element found in L. major suggesting that retroelements were found in the common ancestor of all three species.

    Funded by: NIAID NIH HHS: AI43062, AI45038, AI45039, AI45061, AI49599

    Molecular and biochemical parasitology 2004;134;2;183-91

  • Genome sequence of the Brown Norway rat yields insights into mammalian evolution.

    Gibbs RA, Weinstock GM, Metzker ML, Muzny DM, Sodergren EJ, Scherer S, Scott G, Steffen D, Worley KC, Burch PE, Okwuonu G, Hines S, Lewis L, DeRamo C, Delgado O, Dugan-Rocha S, Miner G, Morgan M, Hawes A, Gill R, Celera, Holt RA, Adams MD, Amanatides PG, Baden-Tillson H, Barnstead M, Chin S, Evans CA, Ferriera S, Fosler C, Glodek A, Gu Z, Jennings D, Kraft CL, Nguyen T, Pfannkoch CM, Sitter C, Sutton GG, Venter JC, Woodage T, Smith D, Lee HM, Gustafson E, Cahill P, Kana A, Doucette-Stamm L, Weinstock K, Fechtel K, Weiss RB, Dunn DM, Green ED, Blakesley RW, Bouffard GG, De Jong PJ, Osoegawa K, Zhu B, Marra M, Schein J, Bosdet I, Fjell C, Jones S, Krzywinski M, Mathewson C, Siddiqui A, Wye N, McPherson J, Zhao S, Fraser CM, Shetty J, Shatsman S, Geer K, Chen Y, Abramzon S, Nierman WC, Havlak PH, Chen R, Durbin KJ, Egan A, Ren Y, Song XZ, Li B, Liu Y, Qin X, Cawley S, Worley KC, Cooney AJ, D'Souza LM, Martin K, Wu JQ, Gonzalez-Garay ML, Jackson AR, Kalafus KJ, McLeod MP, Milosavljevic A, Virk D, Volkov A, Wheeler DA, Zhang Z, Bailey JA, Eichler EE, Tuzun E, Birney E, Mongin E, Ureta-Vidal A, Woodwark C, Zdobnov E, Bork P, Suyama M, Torrents D, Alexandersson M, Trask BJ, Young JM, Huang H, Wang H, Xing H, Daniels S, Gietzen D, Schmidt J, Stevens K, Vitt U, Wingrove J, Camara F, Mar Albà M, Abril JF, Guigo R, Smit A, Dubchak I, Rubin EM, Couronne O, Poliakov A, Hübner N, Ganten D, Goesele C, Hummel O, Kreitler T, Lee YA, Monti J, Schulz H, Zimdahl H, Himmelbauer H, Lehrach H, Jacob HJ, Bromberg S, Gullings-Handley J, Jensen-Seaman MI, Kwitek AE, Lazar J, Pasko D, Tonellato PJ, Twigger S, Ponting CP, Duarte JM, Rice S, Goodstadt L, Beatson SA, Emes RD, Winter EE, Webber C, Brandt P, Nyakatura G, Adetobi M, Chiaromonte F, Elnitski L, Eswara P, Hardison RC, Hou M, Kolbe D, Makova K, Miller W, Nekrutenko A, Riemer C, Schwartz S, Taylor J, Yang S, Zhang Y, Lindpaintner K, Andrews TD, Caccamo M, Clamp M, Clarke L, Curwen V, Durbin R, Eyras E, Searle SM, Cooper GM, Batzoglou S, Brudno M, Sidow A, Stone EA, Venter JC, Payseur BA, Bourque G, López-Otín C, Puente XS, Chakrabarti K, Chatterji S, Dewey C, Pachter L, Bray N, Yap VB, Caspi A, Tesler G, Pevzner PA, Haussler D, Roskin KM, Baertsch R, Clawson H, Furey TS, Hinrichs AS, Karolchik D, Kent WJ, Rosenbloom KR, Trumbower H, Weirauch M, Cooper DN, Stenson PD, Ma B, Brent M, Arumugam M, Shteynberg D, Copley RR, Taylor MS, Riethman H, Mudunuri U, Peterson J, Guyer M, Felsenfeld A, Old S, Mockrin S, Collins F and Rat Genome Sequencing Project Consortium

    Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, MS BCM226, One Baylor Plaza, Houston, Texas 77030, USA.

    The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.

    Funded by: NHGRI NIH HHS: U01 HG002137-02S2

    Nature 2004;428;6982;493-521

  • Chromatin architecture of the human genome: gene-rich domains are enriched in open chromatin fibers.

    Gilbert N, Boyle S, Fiegler H, Woodfine K, Carter NP and Bickmore WA

    MRC Human Genetics Unit, Edinburgh, EH4 2XU, Scotland.

    We present an analysis of chromatin fiber structure across the human genome. Compact and open chromatin fiber structures were separated by sucrose sedimentation and their distributions analyzed by hybridization to metaphase chromosomes and genomic microarrays. We show that compact chromatin fibers originate from some sites of heterochromatin (C-bands), and G-bands (euchromatin). Open chromatin fibers correlate with regions of highest gene density, but not with gene expression since inactive genes can be in domains of open chromatin, and active genes in regions of low gene density can be embedded in compact chromatin fibers. Moreover, we show that chromatin fiber structure impacts on further levels of chromatin condensation. Regions of open chromatin fibers are cytologically decondensed and have a distinctive nuclear organization. We suggest that domains of open chromatin may create an environment that facilitates transcriptional activation and could provide an evolutionary constraint to maintain clusters of genes together along chromosomes.

    Cell 2004;118;5;555-66

  • The complete nucleotide sequence of the resistance plasmid R478: defining the backbone components of incompatibility group H conjugative plasmids through comparative genomics.

    Gilmour MW, Thomson NR, Sanders M, Parkhill J and Taylor DE

    Department of Medical Microbiology and Immunology, University of Alberta, Edmonton, Alta., Canada T6G 2R3.

    Horizontal transfer of resistance determinants amongst bacteria can be achieved by conjugative plasmid DNA elements. We have determined the complete 274,762 bp sequence of the incompatibility group H (IncH) plasmid R478, originally isolated from the Gram negative opportunistic pathogen Serratia marcescens. This self-transferable extrachromosomal genetic element contains 295 predicted genes, of which 144 are highly similar to coding sequences of IncH plasmids R27 and pHCM1. The regions of similarity among these three IncH plasmids principally encode core plasmid determinants (i.e., replication, partitioning and stability, and conjugative transfer) and we conducted a comparative analysis to define the minimal IncHI plasmid backbone determinants. No resistance determinants are included in the backbone and most of the sequences unique to R478 were contained in a large contiguous region between the two transfer regions. These findings indicate that plasmid evolution occurs through gene acquisition/loss predominantly in regions outside of the core determinants. Furthermore, a modular evolution for R478 was signified by the presence of gene neighbors or operons that were highly related to sequences from a wide range of chromosomal, transposon, and plasmid elements. The conjugative transfer regions are most similar to sequences encoded on SXT, Rts1, pCAR1, R391, and pRS241d. The dual partitioning modules encoded on R478 resemble numerous sequences, including pMT1, pCTX-M3, pCP301, P1, P7, and pB171. R478 also codes for resistance to tetracycline (Tn10), chloramphenicol (cat), kanamycin (aphA), mercury (similar to Tn21), silver (similar to pMG101), copper (similar to pRJ1004), arsenic (similar to pYV), and tellurite (two separate regions similar to IncHI2 ter determinants and IncP kla determinants). Other R478-encoded sequences are related to Tn7, IS26, tus, mucAB, and hok, where the latter is surrounded by insLKJ, and could potentially be involved in post-segregation killing. The similarity to a diverse set of bacterial sequences highlights the ability of horizontally transferable DNA elements to acquire and disseminate genetic traits through the bacterial gene pool.

    Plasmid 2004;52;3;182-202

  • Coping with cold: An integrative, multitissue analysis of the transcriptome of a poikilothermic vertebrate.

    Gracey AY, Fraser EJ, Li W, Fang Y, Taylor RR, Rogers J, Brass A and Cossins AR

    School of Biological Sciences, University of Liverpool, Biosciences Building, Crown Street, Liverpool L39 7ZB, United Kingdom.

    How do organisms respond adaptively to environmental stress? Although some gene-specific responses have been explored, others remain to be identified, and there is a very poor understanding of the system-wide integration of response, particularly in complex, multitissue animals. Here, we adopt a transcript screening approach to explore the mechanisms underpinning a major, whole-body phenotypic transition in a vertebrate animal that naturally experiences extreme environmental stress. Carp were exposed to increasing levels of cold, and responses across seven tissues were assessed by using a microarray composed of 13,440 cDNA probes. A large set of unique cDNAs (approximately 3,400) were affected by cold. These cDNAs included an expression signature common to all tissues of 252 up-regulated genes involved in RNA processing, translation initiation, mitochondrial metabolism, proteasomal function, and modification of higher-order structures of lipid membranes and chromosomes. Also identified were large numbers of transcripts with highly tissue-specific patterns of regulation. By unbiased profiling of gene ontologies, we have identified the distinctive functional features of each tissue's response and integrate them into a comprehensive view of the whole-body transition from one strongly adaptive phenotype to another. This approach revealed an expression signature suggestive of atrophy in cooled skeletal muscle. This environmental genomics approach by using a well studied but nongenomic species has identified a range of candidate genes endowing thermotolerance and reveals a previously unrecognized scale and complexity of responses that impacts at the level of cellular and tissue function.

    Proceedings of the National Academy of Sciences of the United States of America 2004;101;48;16970-5

  • 1-Mb resolution array-based comparative genomic hybridization using a BAC clone set optimized for cancer gene analysis.

    Greshock J, Naylor TL, Margolin A, Diskin S, Cleaver SH, Futreal PA, deJong PJ, Zhao S, Liebman M and Weber BL

    Abramson Family Cancer Research Institute, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.

    Array-based comparative genomic hybridization (aCGH) is a recently developed tool for genome-wide determination of DNA copy number alterations. This technology has tremendous potential for disease-gene discovery in cancer and developmental disorders as well as numerous other applications. However, widespread utilization of aCGH has been limited by the lack of well characterized, high-resolution clone sets optimized for consistent performance in aCGH assays and specifically designed analytic software. We have assembled a set of approximately 4100 publicly available human bacterial artificial chromosome (BAC) clones evenly spaced at approximately 1-Mb resolution across the genome, which includes direct coverage of approximately 400 known cancer genes. This aCGH-optimized clone set was compiled from five existing sets, experimentally refined, and supplemented for higher resolution and enhancing mapping capabilities. This clone set is associated with a public online resource containing detailed clone mapping data, protocols for the construction and use of arrays, and a suite of analytical software tools designed specifically for aCGH analysis. These resources should greatly facilitate the use of aCGH in gene discovery.

    Genome research 2004;14;1;179-87

  • Chromosome paints from single copies of chromosomes.

    Gribble S, Ng BL, Prigmore E, Burford DC and Carter NP

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    We have used OmniPlex library technology to construct chromosome painting probes from single copies of flow sorted chromosomes. We show that this whole genome amplification technology is particularly efficient at amplifying single copies of chromosomes for the production of paints and that single aberrant chromosomes can be analysed in this way using reverse chromosome painting. The efficient generation of painting probes from single copies of sorted chromosomes has the advantage that the probe must be specific for the chromosome sorted and will not suffer from contamination from other chromosomes particularly in situations where flow karyotype peaks are poorly resolved. These initial results suggest that OmniPlex whole genome amplification will be equally effective in other cytogenetic applications where only small amounts of DNA are available, i.e. from single cells or from small pieces of microdissected tissue.

    Chromosome research : an international journal on the molecular, supramolecular and evolutionary aspects of chromosome biology 2004;12;2;143-51

  • Applications of combined DNA microarray and chromosome sorting technologies.

    Gribble SM, Fiegler H, Burford DC, Prigmore E, Yang F, Carr P, Ng BL, Sun T, Kamberov ES, Makarov VL, Langmore JP and Carter NP

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    The sequencing of the human genome has led to the availability of an extensive mapped clone resource that is ideal for the construction of DNA microarrays. These genomic clone microarrays have largely been used for comparative genomic hybridisation studies of tumours to enable accurate measurement of copy number changes (array-CGH) at increased resolution. We have utilised these microarrays as the target for chromosome painting and reverse chromosome painting to provide a similar improvement in analysis resolution for these studies in a process we have termed array painting. In array painting, chromosomes are flow sorted, fluorescently labelled and hybridised to the microarray. The complete composition and the breakpoints of aberrant chromosomes can be analysed at high resolution in this way with a considerable reduction in time, effort and cytogenetic expertise required for conventional analysis using fluorescence in situ hybridisation. In a similar way, the resolution of cross-species chromosome painting can be improved and we present preliminary observations of the organisation of homologous DNA blocks between the white cheeked gibbon chromosome 14 and human chromosomes 2 and 17.

    Chromosome research : an international journal on the molecular, supramolecular and evolutionary aspects of chromosome biology 2004;12;1;35-43

  • The microRNA Registry.

    Griffiths-Jones S

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 9SA, UK.

    The miRNA Registry provides a service for the assignment of miRNA gene names prior to publication. A comprehensive and searchable database of published miRNA sequences is accessible via a web interface, and all sequence and annotation data are freely available for download. Release 2.0 of the database contains 506 miRNA entries from six organisms.

    Nucleic acids research 2004;32;Database issue;D109-11

  • Mismatch repair genes identified using genetic screens in Blm-deficient embryonic stem cells.

    Guo G, Wang W and Bradley A

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    Phenotype-driven recessive genetic screens in diploid organisms require a strategy to render the mutation homozygous. Although homozygous mutant mice can be generated by breeding, a reliable method to make homozygous mutations in cultured cells has not been available, limiting recessive screens in culture. Cultured embryonic stem (ES) cells provide access to all of the genes required to elaborate the fundamental components and physiological systems of a mammalian cell. Here we have exploited the high rate of mitotic recombination in Bloom's syndrome protein (Blm)-deficient ES cells to generate a genome-wide library of homozygous mutant cells from heterozygous mutations induced with a revertible gene trap retrovirus. We have screened this library for cells with defects in DNA mismatch repair (MMR), a system that detects and repairs base-base mismatches. We demonstrate the recovery of cells with homozygous mutations in known and novel MMR genes. We identified Dnmt1(ref. 5) as a novel MMR gene and confirmed that Dnmt1-deficient ES cells exhibit microsatellite instability, providing a mechanistic explanation for the role of Dnmt1 in cancer. The combination of insertional mutagenesis in Blm-deficient ES cells establishes a new approach for phenotype-based recessive genetic screens in ES cells.

    Nature 2004;429;6994;891-5

  • MANSC: a seven-cysteine-containing domain present in animal membrane and extracellular proteins.

    Guo J, Chen S, Huang C, Chen L, Studholme DJ, Zhao S and Yu L

    State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Handan Road 220, Shanghai 200-433, China.

    Trends in biochemical sciences 2004;29;4;172-4

  • A probabilistic model of 3' end formation in Caenorhabditis elegans.

    Hajarnavis A, Korf I and Durbin R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The 3' ends of mRNAs terminate with a poly(A) tail. This post-transcriptional modification is directed by sequence features present in the 3'-untranslated region (3'-UTR). We have undertaken a computational analysis of 3' end formation in Caenorhabditis elegans. By aligning cDNAs that diverge from genomic sequence at the poly(A) tract, we accurately identified a large set of true cleavage sites. When there are many transcripts aligned to a particular locus, local variation of the cleavage site over a span of a few bases is frequently observed. We find that in addition to the well-known AAUAAA motif there are several regions with distinct nucleotide compositional biases. We propose a generalized hidden Markov model that describes sequence features in C.elegans 3'-UTRs. We find that a computer program employing this model accurately predicts experimentally observed 3' ends even when there are multiple AAUAAA motifs and multiple cleavage sites. We have made available a complete set of polyadenylation site predictions for the C.elegans genome, including a subset of 6570 supported by aligned transcripts.

    Nucleic acids research 2004;32;11;3392-9

  • Accelerated screening of phage-display output with alkaline phosphatase fusions.

    Han Z, Karatan E, Scholle MD, McCafferty J and Kay BK

    Combinatorial Biology Unit, Biosciences Division, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA.

    When using multiple targets and libraries, selection of affinity reagents from phage-displayed libraries is a relatively time-consuming process. Herein, we describe an automation-amenable approach to accelerate the process by using alkaline phosphatase (AP) fusion proteins in place of the phage ELISA screening and subsequent confirmation steps with purified protein. After two or three rounds of affinity selection, the open reading frames that encode the affinity selected molecules (i.e., antibody fragments, engineered scaffold proteins, combinatorial peptides) are amplified from the phage or phagemid DNA molecules by PCR and cloned en masse by a Ligation Independent Cloning (LIC) method into a plasmid encoding a highly active variant of E. coli AP. This time-saving process identifies affinity reagents that work out of context of the phage and that can be used in various downstream enzyme linked binding assays. The utility of this approach was demonstrated by analyzing single-chain antibodies (scFvs), engineered fibronectin type III domains (FN3), and combinatorial peptides that were selected for binding to the Epsin N-terminal Homology (ENTH) domain of epsin 1, the c-Src SH3 domain, and the appendage domain of the gamma subunit of the clathrin adaptor complex, AP-1, respectively.

    Combinatorial chemistry & high throughput screening 2004;7;1;55-62

  • Genetic equity.

    Harris J and Sulston J

    Institute of Medicine, Law and Bioethics, School of Law, University of Manchester, Oxford Road, Manchester M13 9PL, UK.

    This paper proposes, elaborates and defends a principle of genetic equity. In doing so it articulates, explains and justifies what might be meant by the concept of 'human dignity' in a way that is clear, defensible and consistent with, but by no means the same as, the plethora of appeals to human dignity found in contemporary bioethics, and more particularly in international instruments on bioethics. We propose the following principle of genetic equity: humans are born equal; they are entitled to freedom from discrimination and equality of opportunity to flourish; genetic information may not be used to limit that equality.

    Nature reviews. Genetics 2004;5;10;796-800

  • The Gene Ontology (GO) database and informatics resource.

    Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, Richter J, Rubin GM, Blake JA, Bult C, Dolan M, Drabkin H, Eppig JT, Hill DP, Ni L, Ringwald M, Balakrishnan R, Cherry JM, Christie KR, Costanzo MC, Dwight SS, Engel S, Fisk DG, Hirschman JE, Hong EL, Nash RS, Sethuraman A, Theesfeld CL, Botstein D, Dolinski K, Feierbach B, Berardini T, Mundodi S, Rhee SY, Apweiler R, Barrell D, Camon E, Dimmer E, Lee V, Chisholm R, Gaudet P, Kibbe W, Kishore R, Schwarz EM, Sternberg P, Gwinn M, Hannick L, Wortman J, Berriman M, Wood V, de la Cruz N, Tonellato P, Jaiswal P, Seigfried T, White R and Gene Ontology Consortium

    GO-EBI, Hinxton, UK.

    The Gene Ontology (GO) project provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences. Many model organism databases and genome annotation groups use the GO and contribute their annotation sets to the GO resource. The GO database integrates the vocabularies and contributed annotations and provides full access to this information in several formats. Members of the GO Consortium continually work collectively, involving outside experts as needed, to expand and update the GO vocabularies. The GO Web resource also provides access to extensive documentation about the GO project and links to applications that use GO data for functional analyses.

    Funded by: NHGRI NIH HHS: HG02273

    Nucleic acids research 2004;32;Database issue;D258-61

  • WormBase: a multi-species resource for nematode biology and genomics.

    Harris TW, Chen N, Cunningham F, Tello-Ruiz M, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Bradnam K, Chan J, Chen CK, Chen WJ, Davis P, Kenny E, Kishore R, Lawson D, Lee R, Muller HM, Nakamura C, Ozersky P, Petcherski A, Rogers A, Sabo A, Schwarz EM, Van Auken K, Wang Q, Durbin R, Spieth J, Sternberg PW and Stein LD

    Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA.

    WormBase is the central data repository for information about Caenorhabditis elegans and related nematodes. As a model organism database, WormBase extends beyond the genomic sequence, integrating experimental results with extensively annotated views of the genome. The WormBase Consortium continues to expand the biological scope and utility of WormBase with the inclusion of large-scale genomic analyses, through active data and literature curation, through new analysis and visualization tools, and through refinement of the user interface. Over the past year, the nearly complete genomic sequence and comparative analyses of the closely related species Caenorhabditis briggsae have been integrated into WormBase, including gene predictions, ortholog assignments and a new synteny viewer to display the relationships between the two species. Extensive site-wide refinement of the user interface now provides quick access to the most frequently accessed resources and a consistent browsing experience across the site. Unified single-page views now provide complete summaries of commonly accessed entries like genes. These advances continue to increase the utility of WormBase for C.elegans researchers, as well as for those researchers exploring problems in functional and comparative genomics in the context of a powerful genetic system.

    Funded by: NHGRI NIH HHS: P41-HG02223

    Nucleic acids research 2004;32;Database issue;D411-7

  • Continuing tsetse and Trypanosoma genome sequencing projects.

    Hertz-Fowler C and Berriman M

    Trends in parasitology 2004;20;7;308-9; author reply 309-10

  • Parasite genome databases and web-based resources.

    Hertz-Fowler C and Hall N

    Pathogen Sequencing Unit, Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK.

    In the last decade, high-throughput genome sequencing and complementary techniques such as microarray and proteomics have generated, and will continue to generate, ever-increasing amounts of data. These technologies of gene discovery, expression, and functional analysis have been applied to a vast array of organisms, including parasites. In most instances, the data are freely available via the Internet, and researchers are becoming increasingly reliant on up-to-date, centralized data repositories to complement wet bench science. This chapter presents an overview of resources relevant to researchers with an interest in parasite genomics and biology. After briefly touching on some of the publicly available nucleotide and protein sequence as well as domain databases, the focus turns to parasite genome projects and associated Web-based resources. A list of parasite sequencing projects current at the time of writing, including relevant Web site addresses, is provided. The available resources range from network sites and project pages at sequencing institutes to databases that integrate and curate sequence data and associated annotation with diverse biological datasets. Particular attention is given to three databases, GeneDB, PlasmoDB (http://plasmodb.org/), and tigr db, detailing the scope of each database and the tools available for data querying and retrieval.

    Methods in molecular biology (Clifton, N.J.) 2004;270;45-74

  • GeneDB: a resource for prokaryotic and eukaryotic organisms.

    Hertz-Fowler C, Peacock CS, Wood V, Aslett M, Kerhornou A, Mooney P, Tivey A, Berriman M, Hall N, Rutherford K, Parkhill J, Ivens AC, Rajandream MA and Barrell B

    The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    GeneDB is a genome database for prokaryotic and eukaryotic organisms. The resource provides a portal through which data generated by the Pathogen Sequencing Unit at the Wellcome Trust Sanger Institute and other collaborating sequencing centres can be made publicly available. It combines data from finished and ongoing genome and expressed sequence tag (EST) projects with curated annotation, that can be searched, sorted and downloaded, using a single web based resource. The current release stores 11 datasets of which six are curated and maintained by biologists, who review and incorporate information from the scientific literature, public databases and the respective research communities.

    Nucleic acids research 2004;32;Database issue;D339-43

  • Pathogenomics of non-pathogens.

    Holden M, Crossman L, Cerdeño-Tárraga A and Parkhill J

    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Analysing the genomes of non-pathogenic microorganisms, in addition to its basic and applied scientific interest, can also shed considerable light on the study of pathogenic microorganisms. Two of the three microorganisms described here are rarely pathogenic, but carry genetic determinants that have previously been identified as being important for the pathogenicity of other microorganisms. This underlines the growing understanding that many so-called 'virulence genes' are probably involved in more general interactions between the microorganism and the host or the environment.

    Nature reviews. Microbiology 2004;2;2;91

  • Complete genomes of two clinical Staphylococcus aureus strains: evidence for the rapid evolution of virulence and drug resistance.

    Holden MT, Feil EJ, Lindsay JA, Peacock SJ, Day NP, Enright MC, Foster TJ, Moore CE, Hurst L, Atkin R, Barron A, Bason N, Bentley SD, Chillingworth C, Chillingworth T, Churcher C, Clark L, Corton C, Cronin A, Doggett J, Dowd L, Feltwell T, Hance Z, Harris B, Hauser H, Holroyd S, Jagels K, James KD, Lennard N, Line A, Mayes R, Moule S, Mungall K, Ormond D, Quail MA, Rabbinowitsch E, Rutherford K, Sanders M, Sharp S, Simmonds M, Stevens K, Whitehead S, Barrell BG, Spratt BG and Parkhill J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    Staphylococcus aureus is an important nosocomial and community-acquired pathogen. Its genetic plasticity has facilitated the evolution of many virulent and drug-resistant strains, presenting a major and constantly changing clinical challenge. We sequenced the approximately 2.8-Mbp genomes of two disease-causing S. aureus strains isolated from distinct clinical settings: a recent hospital-acquired representative of the epidemic methicillin-resistant S. aureus EMRSA-16 clone (MRSA252), a clinically important and globally prevalent lineage; and a representative of an invasive community-acquired methicillin-susceptible S. aureus clone (MSSA476). A comparative-genomics approach was used to explore the mechanisms of evolution of clinically important S. aureus genomes and to identify regions affecting virulence and drug resistance. The genome sequences of MRSA252 and MSSA476 have a well conserved core region but differ markedly in their accessory genetic elements. MRSA252 is the most genetically diverse S. aureus strain sequenced to date: approximately 6% of the genome is novel compared with other published genomes, and it contains several unique genetic elements. MSSA476 is methicillin-susceptible, but it contains a novel Staphylococcal chromosomal cassette (SCC) mec-like element (designated SCC(476)), which is integrated at the same site on the chromosome as SCCmec elements in MRSA strains but encodes a putative fusidic acid resistance protein. The crucial role that accessory elements play in the rapid evolution of S. aureus is clearly illustrated by comparing the MSSA476 genome with that of an extremely closely related MRSA community-acquired strain; the differential distribution of large mobile elements carrying virulence and drug-resistance determinants may be responsible for the clinically important phenotypic differences in these strains.

    Proceedings of the National Academy of Sciences of the United States of America 2004;101;26;9786-91

  • Genomic plasticity of the causative agent of melioidosis, Burkholderia pseudomallei.

    Holden MT, Titball RW, Peacock SJ, Cerdeño-Tárraga AM, Atkins T, Crossman LC, Pitt T, Churcher C, Mungall K, Bentley SD, Sebaihia M, Thomson NR, Bason N, Beacham IR, Brooks K, Brown KA, Brown NF, Challis GL, Cherevach I, Chillingworth T, Cronin A, Crossett B, Davis P, DeShazer D, Feltwell T, Fraser A, Hance Z, Hauser H, Holroyd S, Jagels K, Keith KE, Maddison M, Moule S, Price C, Quail MA, Rabbinowitsch E, Rutherford K, Sanders M, Simmonds M, Songsivilai S, Stevens K, Tumapa S, Vesaratchavest M, Whitehead S, Yeats C, Barrell BG, Oyston PC and Parkhill J

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    Burkholderia pseudomallei is a recognized biothreat agent and the causative agent of melioidosis. This Gram-negative bacterium exists as a soil saprophyte in melioidosis-endemic areas of the world and accounts for 20% of community-acquired septicaemias in northeastern Thailand where half of those affected die. Here we report the complete genome of B. pseudomallei, which is composed of two chromosomes of 4.07 megabase pairs and 3.17 megabase pairs, showing significant functional partitioning of genes between them. The large chromosome encodes many of the core functions associated with central metabolism and cell growth, whereas the small chromosome carries more accessory functions associated with adaptation and survival in different niches. Genomic comparisons with closely and more distantly related bacteria revealed a greater level of gene order conservation and a greater number of orthologous genes on the large chromosome, suggesting that the two replicons have distinct evolutionary origins. A striking feature of the genome was the presence of 16 genomic islands (GIs) that together made up 6.1% of the genome. Further analysis revealed these islands to be variably present in a collection of invasive and soil isolates but entirely absent from the clonally related organism B. mallei. We propose that variable horizontal gene acquisition by B. pseudomallei is an important feature of recent genetic evolution and that this has resulted in a genetically diverse pathogenic species.

    Proceedings of the National Academy of Sciences of the United States of America 2004;101;39;14240-5

  • Gene map of the extended human MHC.

    Horton R, Wilming L, Rand V, Lovering RC, Bruford EA, Khodiyar VK, Lush MJ, Povey S, Talbot CC, Wright MW, Wain HM, Trowsdale J, Ziegler A and Beck S

    Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    The major histocompatibility complex (MHC) is the most important region in the vertebrate genome with respect to infection and autoimmunity, and is crucial in adaptive and innate immunity. Decades of biomedical research have revealed many MHC genes that are duplicated, polymorphic and associated with more diseases than any other region of the human genome. The recent completion of several large-scale studies offers the opportunity to assimilate the latest data into an integrated gene map of the extended human MHC. Here, we present this map and review its content in relation to paralogy, polymorphism, immune function and disease.

    Funded by: Multiple Sclerosis Society: 588

    Nature reviews. Genetics 2004;5;12;889-99

  • Whole genome DNA copy number changes identified by high density oligonucleotide arrays.

    Huang J, Wei W, Zhang J, Liu G, Bignell GR, Stratton MR, Futreal PA, Wooster R, Jones KW and Shapero MH

    Affymetrix, Inc., 3380 Central Expressway, Santa Clara, CA 95051, USA.

    Changes in DNA copy number are one of the hallmarks of the genetic instability common to most human cancers. Previous microarray-based methods have been used to identify chromosomal gains and losses; however, they are unable to genotype alleles at the level of single nucleotide polymorphisms (SNPs). Here we describe a novel algorithm that uses a recently developed high-density oligonucleotide array-based SNP genotyping method, whole genome sampling analysis (WGSA), to identify genome-wide chromosomal gains and losses at high resolution. WGSA simultaneously genotypes over 10,000 SNPs by allele-specific hybridisation to perfect match (PM) and mismatch (MM) probes synthesised on a single array. The copy number algorithm jointly uses PM intensity and discrimination ratios between paired PM and MM intensity values to identify and estimate genetic copy number changes. Values from an experimental sample are compared with SNP-specific distributions derived from a reference set containing over 100 normal individuals to gain statistical power. Genomic regions with statistically significant copy number changes can be identified using both single point analysis and contiguous point analysis of SNP intensities. We identified multiple regions of amplification and deletion using a panel of human breast cancer cell lines. We verified these results using an independent method based on quantitative polymerase chain reaction and found that our approach is both sensitive and specific and can tolerate samples which contain a mixture of both tumour and normal DNA. In addition, by using known allele frequencies from the reference set, statistically significant genomic intervals can be identified containing contiguous stretches of homozygous markers, potentially allowing the detection of regions undergoing loss of heterozygosity (LOH) without the need for a matched normal control sample. The coupling of LOH analysis, via SNP genotyping, with copy number estimations using a single array provides additional insight into the structure of genomic alterations. With mean and median inter-SNP euchromatin distances of 244 kilobases (kb) and 119 kb, respectively, this method affords a resolution that is not easily achievable with non-oligonucleotide-based experimental approaches.

    Human genomics 2004;1;4;287-99

  • A new trade framework for global healthcare R&D.

    Hubbard T and Love J

    Wellcome Trust Sanger Institute in Hinxton, United Kingdom.

    PLoS biology 2004;2;2;E52

  • DNA sequence and analysis of human chromosome 9.

    Humphray SJ, Oliver K, Hunt AR, Plumb RW, Loveland JE, Howe KL, Andrews TD, Searle S, Hunt SE, Scott CE, Jones MC, Ainscough R, Almeida JP, Ambrose KD, Ashwell RI, Babbage AK, Babbage S, Bagguley CL, Bailey J, Banerjee R, Barker DJ, Barlow KF, Bates K, Beasley H, Beasley O, Bird CP, Bray-Allen S, Brown AJ, Brown JY, Burford D, Burrill W, Burton J, Carder C, Carter NP, Chapman JC, Chen Y, Clarke G, Clark SY, Clee CM, Clegg S, Collier RE, Corby N, Crosier M, Cummings AT, Davies J, Dhami P, Dunn M, Dutta I, Dyer LW, Earthrowl ME, Faulkner L, Fleming CJ, Frankish A, Frankland JA, French L, Fricker DG, Garner P, Garnett J, Ghori J, Gilbert JG, Glison C, Grafham DV, Gribble S, Griffiths C, Griffiths-Jones S, Grocock R, Guy J, Hall RE, Hammond S, Harley JL, Harrison ES, Hart EA, Heath PD, Henderson CD, Hopkins BL, Howard PJ, Howden PJ, Huckle E, Johnson C, Johnson D, Joy AA, Kay M, Keenan S, Kershaw JK, Kimberley AM, King A, Knights A, Laird GK, Langford C, Lawlor S, Leongamornlert DA, Leversha M, Lloyd C, Lloyd DM, Lovell J, Martin S, Mashreghi-Mohammadi M, Matthews L, McLaren S, McLay KE, McMurray A, Milne S, Nickerson T, Nisbett J, Nordsiek G, Pearce AV, Peck AI, Porter KM, Pandian R, Pelan S, Phillimore B, Povey S, Ramsey Y, Rand V, Scharfe M, Sehra HK, Shownkeen R, Sims SK, Skuce CD, Smith M, Steward CA, Swarbreck D, Sycamore N, Tester J, Thorpe A, Tracey A, Tromans A, Thomas DW, Wall M, Wallis JM, West AP, Whitehead SL, Willey DL, Williams SA, Wilming L, Wray PW, Young L, Ashurst JL, Coulson A, Blöcker H, Durbin R, Sulston JE, Hubbard T, Jackson MJ, Bentley DR, Beck S, Rogers J and Dunham I

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Chromosome 9 is highly structurally polymorphic. It contains the largest autosomal block of heterochromatin, which is heteromorphic in 6-8% of humans, whereas pericentric inversions occur in more than 1% of the population. The finished euchromatic sequence of chromosome 9 comprises 109,044,351 base pairs and represents >99.6% of the region. Analysis of the sequence reveals many intra- and interchromosomal duplications, including segmental duplications adjacent to both the centromere and the large heterochromatic block. We have annotated 1,149 genes, including genes implicated in male-to-female sex reversal, cancer and neurodegenerative disease, and 426 pseudogenes. The chromosome contains the largest interferon gene cluster in the human genome. There is also a region of exceptionally high gene and G + C content including genes paralogous to those in the major histocompatibility complex. We have also detected recently duplicated genes that exhibit different rates of sequence divergence, presumably reflecting natural selection.

    Nature 2004;429;6990;369-74

  • Gene duplication: the genomic trade in spare parts.

    Hurles M

    Wellcome Trust Sanger Institute near Cambridge in the United Kingdom.

    PLoS biology 2004;2;7;E206

  • Origins of chromosomal rearrangement hotspots in the human genome: evidence from the AZFa deletion hotspots.

    Hurles ME, Willey D, Matthews L and Hussain SS

    Molecular Genetics Laboratory, McDonald Institute for Archaeological Research, University of Cambridge, Downing Street, Cambridge, CB2 3ER, UK.

    Background: The origins of the recombination hotspots that are a common feature of both allelic and non-allelic homologous recombination in the human genome are poorly understood. We have investigated, by comparative sequencing, the evolution of two hotspots of non-allelic homologous recombination on the Y chromosome that lie within paralogous sequences known to sponsor deletions resulting in male infertility.

    Results: These recombination hotspots are characterized by signatures of concerted evolution, which indicate that gene conversion between paralogs has been predominant in shaping their recent evolution. By contrast, the paralogous sequences that surround the hotspots exhibit little evidence of gene conversion. A second feature of these rearrangement hotspots is the extreme interspecific sequence divergence (around 2.5%) that places them among the most divergent orthologous sequences between humans and chimpanzees.

    Conclusions: Several hominid-specific gene conversion events have rendered these hotspots better substrates for chromosomal rearrangements in humans than in chimpanzees or gorillas. Monte Carlo simulations of sequence evolution suggest that extreme sequence divergence is a direct consequence of gene conversion between paralogs. We propose that the coincidence of signatures of concerted evolution and recurrent breakpoints of chromosomal rearrangement (mapped at the sequence level) may enable the identification of putative rearrangement hotspots from analysis of comparative sequences from great apes.

    Genome biology 2004;5;8;R55

  • High-resolution analysis of genomic copy number alterations in bladder cancer by microarray-based comparative genomic hybridization.

    Hurst CD, Fiegler H, Carr P, Williams S, Carter NP and Knowles MA

    Cancer Research UK Clinical Centre, St James's University Hospital, Beckett St, Leeds LS9 7TF, UK.

    We have screened 22 bladder tumour-derived cell lines and one normal urothelium-derived cell line for genome-wide copy number changes using array comparative genomic hybridization (CGH). Comparison of array CGH with existing multiplex-fluorescence in situ hybridization (M-FISH) results revealed excellent concordance. Regions of gain and loss were defined more accurately by array CGH, and several small regions of deletion were detected that were not identified by M-FISH. Numerous genetic changes were identified, many of which were compatible with previous results from conventional CGH and loss of heterozygosity analyses on bladder tumours. The most frequent changes involved complete or partial loss of 4q (83%) and gain of 20q (78%). Other frequent losses were of 18q (65%), 8p (65%), 2q (61%), 6q (61%), 3p (56%), 13q (56%), 4p (52%), 6p (52%), 10p (52%), 10q (52%) and 5p (43%). We have refined the localization of a region of deletion at 8p21.2-p21.3 to an interval of approximately 1 Mb. Five homozygous deletions of tumour suppressor genes were confirmed, and several potentially novel homozygous deletions were identified. In all, 15 high-level amplifications were detected, with a previously reported amplification at 6p22.3 being the most frequent. Real-time PCR analysis revealed a novel candidate gene with consistent overexpression in all cell lines with the 6p22.3 amplicon.

    Oncogene 2004;23;12;2250-63

  • Integrative annotation of 21,037 human genes validated by full-length cDNA clones.

    Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata Y, Tanino M, Yura K, Miyazaki S, Ikeo K, Homma K, Kasprzyk A, Nishikawa T, Hirakawa M, Thierry-Mieg J, Thierry-Mieg D, Ashurst J, Jia L, Nakao M, Thomas MA, Mulder N, Karavidopoulou Y, Jin L, Kim S, Yasuda T, Lenhard B, Eveno E, Suzuki Y, Yamasaki C, Takeda J, Gough C, Hilton P, Fujii Y, Sakai H, Tanaka S, Amid C, Bellgard M, Bonaldo Mde F, Bono H, Bromberg SK, Brookes AJ, Bruford E, Carninci P, Chelala C, Couillault C, de Souza SJ, Debily MA, Devignes MD, Dubchak I, Endo T, Estreicher A, Eyras E, Fukami-Kobayashi K, Gopinath GR, Graudens E, Hahn Y, Han M, Han ZG, Hanada K, Hanaoka H, Harada E, Hashimoto K, Hinz U, Hirai M, Hishiki T, Hopkinson I, Imbeaud S, Inoko H, Kanapin A, Kaneko Y, Kasukawa T, Kelso J, Kersey P, Kikuno R, Kimura K, Korn B, Kuryshev V, Makalowska I, Makino T, Mano S, Mariage-Samson R, Mashima J, Matsuda H, Mewes HW, Minoshima S, Nagai K, Nagasaki H, Nagata N, Nigam R, Ogasawara O, Ohara O, Ohtsubo M, Okada N, Okido T, Oota S, Ota M, Ota T, Otsuki T, Piatier-Tonneau D, Poustka A, Ren SX, Saitou N, Sakai K, Sakamoto S, Sakate R, Schupp I, Servant F, Sherry S, Shiba R, Shimizu N, Shimoyama M, Simpson AJ, Soares B, Steward C, Suwa M, Suzuki M, Takahashi A, Tamiya G, Tanaka H, Taylor T, Terwilliger JD, Unneberg P, Veeramachaneni V, Watanabe S, Wilming L, Yasuda N, Yoo HS, Stodolsky M, Makalowski W, Go M, Nakai K, Takagi T, Kanehisa M, Sakaki Y, Quackenbush J, Okazaki Y, Hayashizaki Y, Hide W, Chakraborty R, Nishikawa K, Sugawara H, Tateno Y, Chen Z, Oishi M, Tonellato P, Apweiler R, Okubo K, Wagner L, Wiemann S, Strausberg RL, Isogai T, Auffray C, Nomura N, Gojobori T and Sugano S

    Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan.

    The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.

    Funded by: NHLBI NIH HHS: R01 HL064541

    PLoS biology 2004;2;6;e162

  • Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution.

    International Chicken Genome Sequencing Consortium

    Genome Sequencing Center, Washington University School of Medicine, Campus Box 8501, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA.

    We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome--composed of approximately one billion base pairs of sequence and an estimated 20,000-23,000 genes--provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.

    Nature 2004;432;7018;695-716

  • Integrating ethics and science in the International HapMap Project.

    International HapMap Consortium

    Funded by: NHGRI NIH HHS: R01 HG002189-01, R01 HG002189-02, R01 HG002189-03

    Nature reviews. Genetics 2004;5;6;467-75

  • Finishing the euchromatic sequence of the human genome.

    International Human Genome Sequencing Consortium

    The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers approximately 99% of the euchromatic genome and is accurate to an error rate of approximately 1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

    Nature 2004;431;7011;931-45

  • The zebrafish genome project: sequence analysis and annotation.

    Jekosch K

    Wellcome Trust Sanger Institute, Cambridge CB10 1SA, United Kingdom.

    Funded by: NHGRI NIH HHS: P41 HG002659

    Methods in cell biology 2004;77;225-39

  • Human MicroRNA targets.

    John B, Enright AJ, Aravin A, Tuschl T, Sander C and Marks DS

    Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, New York, USA.

    MicroRNAs (miRNAs) interact with target mRNAs at specific sites to induce cleavage of the message or inhibit translation. The specific function of most mammalian miRNAs is unknown. We have predicted target sites on the 3' untranslated regions of human gene transcripts for all currently known 218 mammalian miRNAs to facilitate focused experiments. We report about 2,000 human genes with miRNA target sites conserved in mammals and about 250 human genes conserved as targets between mammals and fish. The prediction algorithm optimizes sequence complementarity using position-specific rules and relies on strict requirements of interspecies conservation. Experimental support for the validity of the method comes from known targets and from strong enrichment of predicted targets in mRNAs associated with the fragile X mental retardation protein in mammals. This is consistent with the hypothesis that miRNAs act as sequence-specific adaptors in the interaction of ribonuclear particles with translationally regulated messages. Overrepresented groups of targets include mRNAs coding for transcription factors, components of the miRNA machinery, and other proteins involved in translational regulation, as well as components of the ubiquitin machinery, representing novel feedback loops in gene regulation. Detailed information about target genes, target processes, and open-source software for target prediction (miRanda) is available online. Our analysis suggests that miRNA genes, which are about 1% of all human genes, regulate protein production for 10% or more of all human genes.

    PLoS biology 2004;2;11;e363

  • Gene array analysis of Yersinia enterocolitica FlhD and FlhC: regulation of enzymes affecting synthesis and degradation of carbamoylphosphate.

    Kapatral V, Campbell JW, Minnich SA, Thomson NR, Matsumura P and Prüss BM

    Integrated Genomics, Inc., 2201 West Campbell Park Dr., Chicago, IL 60612, USA.

    This paper focuses on global gene regulation by FlhD/FlhC in enteric bacteria. Even though Yersinia enterocolitica FlhD/FlhC can complement an Escherichia coli flhDC mutant for motility, it is not known if the Y. enterocolitica FlhD/FlhC complex has an effect on metabolism similar to E. coli. To study metabolic gene regulation, a partial Yersinia enterocolitica 8081c microarray was constructed and the expression patterns of wild-type cells were compared to an flhDC mutant strain at 25 and 37 °C. The overlap between the E. coli and Y. enterocolitica FlhD/FlhC regulated genes was 25%. Genes that were regulated at least fivefold by FlhD/FlhC in Y. enterocolitica are genes encoding urocanate hydratase (hutU), imidazolone propionase (hutI), carbamoylphosphate synthetase (carAB) and aspartate carbamoyltransferase (pyrBI). These enzymes are part of a pathway that is involved in the degradation of L-histidine to L-glutamate and eventually leads into purine/pyrimidine biosynthesis via carbamoylphosphate and carbamoylaspartate. A number of other genes were regulated at a lower rate. In two additional experiments, the expression of wild-type cells grown at 4 or 25 °C was compared to the same strain grown at 37 °C. The expression of the flagella master operon flhD was not affected by temperature, whereas the flagella-specific sigma factor fliA was highly expressed at 25 °C and reduced at 4 and 37 °C. Several other flagella genes, all of which are under the control of FliA, exhibited a similar temperature profile. These data are consistent with the hypothesis that temperature regulation of flagella genes might be mediated by the flagella-specific sigma factor FliA and not the flagella master regulator FlhD/FlhC.

    Funded by: NCRR NIH HHS: P20 RR16454; NIGMS NIH HHS: GM59484

    Microbiology (Reading, England) 2004;150;Pt 7;2289-300

  • EnsMart: a generic system for fast and flexible access to biological data.

    Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T and Birney E

    European Bioinformatics Institute (EBI), Hinxton, Cambridge CB10 1SH, UK.

    The EnsMart system provides a generic data warehousing solution for fast and flexible querying of large biological data sets and integration with third-party data and tools. The system consists of a query-optimized database and interactive, user-friendly interfaces. EnsMart has been applied to Ensembl, where it extends its genomic browser capabilities, facilitating rapid retrieval of customized data sets. A wide variety of complex queries, on various types of annotations, for numerous species are supported. These can be applied to many research problems, ranging from SNP selection for candidate gene screening, through cross-species evolutionary comparisons, to microarray annotation. Users can group and refine biological data according to many criteria, including cross-species analyses, disease links, sequence variations, and expression patterns. Both tabulated list data and biological sequence output can be generated dynamically, in HTML, text, Microsoft Excel, and compressed formats. A wide range of sequence types, such as cDNA, peptides, coding regions, UTRs, and exons, with additional upstream and downstream regions, can be retrieved. The EnsMart database can be accessed via a public Web site, or through a Java application suite. Both implementations and the database are freely available for local installation, and can be extended or adapted to 'non-Ensembl' data sets.

    Genome research 2004;14;1;160-9

  • A comprehensive survey of human Y-chromosomal microsatellites.

    Kayser M, Kittler R, Erler A, Hedman M, Lee AC, Mohyuddin A, Mehdi SQ, Rosser Z, Stoneking M, Jobling MA, Sajantila A and Tyler-Smith C

    Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany.

    We have screened the nearly complete DNA sequence of the human Y chromosome for microsatellites (short tandem repeats) that meet the criteria of having a repeat-unit size of ≥3 and a repeat count of ≥8 and thus are likely to be easy to genotype accurately and to be polymorphic. Candidate loci were tested in silico for novelty and for probable Y specificity, and then they were tested experimentally to identify Y-specific loci and to assess their polymorphism. This yielded 166 useful new Y-chromosomal microsatellites, 139 of which were polymorphic, in a sample of eight diverse Y chromosomes representing eight Y-SNP haplogroups. This large sample of microsatellites, together with 28 previously known markers analyzed here--all sharing a common evolutionary history--allowed us to investigate the factors influencing their variation. For simple microsatellites, the average repeat count accounted for the highest proportion of repeat variance (approximately 34%). For complex microsatellites, the largest proportion of the variance (again, approximately 34%) was explained by the average repeat count of the longest homogeneous array, which normally is variable. In these complex microsatellites, the additional repeats outside the longest homogeneous array significantly increased the variance, but this was lower than the variance of a simple microsatellite with the same total repeat count. As a result of this work, a large number of new, highly polymorphic Y-chromosomal microsatellites are now available for population-genetic, evolutionary, genealogical, and forensic investigations.

    Funded by: Wellcome Trust: 057559

    American journal of human genetics 2004;74;6;1183-97

  • Efficiency and consistency of haplotype tagging of dense SNP maps in multiple samples.

    Ke X, Durrant C, Morris AP, Hunt S, Bentley DR, Deloukas P and Cardon LR

    Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, UK.

    Haplotype tagging is a means of retaining most of the information in high density marker maps, while reducing genotyping requirements. Estimates of the numbers of tagging SNPs required to cover the human genome have varied widely, ranging from 100,000 to 1,000,000. Tagging has been applied to a number of gene-based datasets but has not been evaluated in contexts reflecting those of genome-wide association studies--large chromosome regions and multiple samples drawn from the same population. We analysed 5000 common markers across a 10 Mb segment of human chromosome 20 in three samples (UK Caucasian, CEPH Caucasian, African American) to evaluate tagging efficiency and consistency. Overall, the results indicate a high degree of efficiency, yielding 3-5-fold savings in Caucasians and 2-3-fold savings in African Americans. These levels varied according to linkage disequilibrium (LD) levels, tagging thresholds and allele frequencies, but in high LD regions they did not vary markedly due to marker density. However, a strong positive relationship between marker density and tagging was observed, relating to the fact that increasing marker density yields greater sequence coverage in high LD, thus requiring more tag SNPs to cover a greater fraction of the genome. Encouragingly, whatever the density employed, a high level of robustness was observed between UK and CEPH samples, as most of the htSNPs selected in one sample were also appropriate as tags in the other.

    Funded by: PHS HHS: NEI-12562

    Human molecular genetics 2004;13;21;2557-65

  • The impact of SNP density on fine-scale patterns of linkage disequilibrium.

    Ke X, Hunt S, Tapper W, Lawrence R, Stavrides G, Ghori J, Whittaker P, Collins A, Morris AP, Bentley D, Cardon LR and Deloukas P

    Wellcome Trust Centre for Human Genetics, University of Oxford, UK.

    Linkage disequilibrium (LD) is a measure of the degree of association between alleles in a population. The detection of disease-causing variants by association with neighbouring single nucleotide polymorphisms (SNPs) depends on the existence of strong LD between them. Previous studies have indicated that the extent of LD is highly variable in different chromosome regions and different populations, demonstrating the importance of genome-wide accurate measurement of LD at high resolution throughout the human genome. A uniform feature of these studies has been the inability to detect LD in regions of low marker density. To investigate the dependence of LD patterns on marker selection we performed a high-resolution study in African-American, Asian and UK Caucasian populations. We selected over 5000 SNPs with an average spacing of approximately 1 SNP per 2 kb after validating ca 12 000 SNPs derived from a dense SNP collection (1 SNP per 0.3 kb on average). Applications of different statistical methods of LD assessment highlight similar areas of high and low LD. However, at high resolution, features such as overall sequence coverage in LD blocks and block boundaries vary substantially with respect to marker density. Model-based linkage disequilibrium unit (LDU) maps appear robust to marker density and consistently influenced by marker allele frequency. The results suggest that very dense marker sets will be required to yield stable views of fine-scale LD in the human genome.

    Funded by: NEI NIH HHS: EY-126562

    Human molecular genetics 2004;13;6;577-88

  • The Wnt co-receptors Lrp5 and Lrp6 are essential for gastrulation in mice.

    Kelly OG, Pinson KI and Skarnes WC

    Department of Molecular and Cell Biology, University of California at Berkeley, Berkeley, CA 94720-3200, USA.

    Recent work has identified LDL receptor-related family members, Lrp5 and Lrp6, as co-receptors for the transduction of Wnt signals. Our analysis of mice carrying mutations in both Lrp5 and Lrp6 demonstrates that the functions of these genes are redundant and are essential for gastrulation. Lrp5;Lrp6 double homozygous mutants fail to establish a primitive streak, although the anterior visceral endoderm and anterior epiblast fates are specified. Thus, Lrp5 and Lrp6 are required for posterior patterning of the epiblast, consistent with a role in transducing Wnt signals in the early embryo. Interestingly, Lrp5(+/-);Lrp6(-/-) embryos die shortly after gastrulation and exhibit an accumulation of cells at the primitive streak and a selective loss of paraxial mesoderm. A similar phenotype is observed in Fgf8 and Fgfr1 mutant embryos and provides genetic evidence in support of a molecular link between the Fgf and Wnt signaling pathways in patterning nascent mesoderm. Lrp5(+/-);Lrp6(-/-) embryos also display an expansion of anterior primitive streak derivatives and anterior neurectoderm that correlates with increased Nodal expression in these embryos. The effect of reducing, but not eliminating, Wnt signaling in Lrp5(+/-);Lrp6(-/-) mutant embryos provides important insight into the interplay between Wnt, Fgf and Nodal signals in patterning the early mouse embryo.

    Development (Cambridge, England) 2004;131;12;2803-15

  • Fibronectin binding to the Salmonella enterica serotype Typhimurium ShdA autotransporter protein is inhibited by a monoclonal antibody recognizing the A3 repeat.

    Kingsley RA, Abi Ghanem D, Puebla-Osorio N, Keestra AM, Berghman L and Bäumler AJ

    Department of Medical Microbiology and Immunology, College of Medicine, College Station, TX 77843, USA.

    ShdA is a large outer membrane protein of the autotransporter family whose passenger domain binds the extracellular matrix proteins fibronectin and collagen I, possibly by mimicking the host ligand heparin. The ShdA passenger domain consists of approximately 1,500 amino acid residues that can be divided into two regions based on features of the primary amino acid sequence: an N-terminal nonrepeat region followed by a repeat region composed of two types of imperfect direct amino acid repeats, called type A and type B. The repeat region bound bovine fibronectin with an affinity similar to that for the complete ShdA passenger domain, while the nonrepeat region exhibited comparatively low fibronectin-binding activity. A number of fusion proteins containing truncated fragments of the repeat region did not bind bovine fibronectin. However, binding of the passenger domain to fibronectin was inhibited in the presence of immune serum raised to one truncated fragment of the repeat region that contained repeats A2, B8, A3, and B9. Furthermore, a monoclonal antibody that specifically recognized an epitope in a recombinant protein containing the A3 repeat inhibited binding of ShdA to fibronectin.

    Funded by: NIAID NIH HHS: AI40124, AI44170

    Journal of bacteriology 2004;186;15;4931-9

  • Two new mouse chromosome 11 balancers.

    Klysik J, Dinh C and Bradley A

    Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.

    Segmental inversions causing recombination suppression are an essential feature of balancer chromosomes. Meiotic crossing over between homologous chromosomes within an inversion interval will lead to nonviable gametes, while gametes generated from recombination events elsewhere on the chromosome will be unaffected. This apparent recombination suppression has been widely exploited in genetic studies in Drosophila to maintain and analyze stocks carrying recessive lethal mutations. Balancers are particularly useful in mutagenesis screens since they help to establish the approximate genomic location of alleles of genes causing phenotypes. Using the Cre-loxP recombination system, we have constructed two mouse balancer chromosomes carrying 8- and 30-cM inversions between Wnt3 and D11Mit69 and between Trp53 and EgfR loci, respectively. The Wnt3-D11Mit69 inversion mutates the Wnt3 locus and is therefore homozygous lethal. The Trp53-EgfR inversion is homozygous viable, since the EgfR locus is intact and mutations in p53 are homozygous viable. A dominantly acting K14-agouti minigene tags both rearrangements, which enables these balancer chromosomes to be visibly tracked in mouse stocks. With the addition of these balancers to the previously reported Trp53-Wnt3 balancer, most of mouse chromosome 11 is now available in balancer stocks.

    Genomics 2004;83;2;303-10

  • Bacterial artificial chromosome (BAC) clones and the current clone map of the zebrafish genome.

    Koch R, Rauch GJ, Humphray S, Geisler TR and Plasterk R

    Hubrecht Laboratory, Uppsalalaan 8, 3584 CT Utrecht, The Netherlands.

    Methods in cell biology 2004;77;295-304

  • Gene finding in novel genomes.

    Korf I

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.

    Background: Computational gene prediction continues to be an important problem, especially for genomes with little experimental data.

    Results: I introduce the SNAP gene finder which has been designed to be easily adaptable to a variety of genomes. In novel genomes without an appropriate gene finder, I demonstrate that employing a foreign gene finder can produce highly inaccurate results, and that the most compatible parameters may not come from the nearest phylogenetic neighbor. I find that foreign gene finders are more usefully employed to bootstrap parameter estimation and that the resulting parameters can be highly accurate.

    Conclusion: Since gene prediction is sensitive to species-specific parameters, every genome needs a dedicated gene finder.

    Funded by: PHS HHS: K22-00064-01

    BMC bioinformatics 2004;5;59

  • Somite polarity and segmental patterning of the peripheral nervous system.

    Kuan CY, Tannahill D, Cook GM and Keynes RJ

    Department of Anatomy, University of Cambridge, Downing Street, Cambridge CB2 3DY, UK.

    The analysis of the outgrowth pattern of spinal axons in the chick embryo has shown that somites are polarized into anterior and posterior halves. This polarity dictates the segmental development of the peripheral nervous system: migrating neural crest cells and outgrowing spinal axons traverse exclusively the anterior halves of the somite-derived sclerotomes, ensuring a proper register between spinal axons, their ganglia and the segmented vertebral column. Much progress has been made recently in understanding the molecular basis for somite polarization, and its linkage with Notch/Delta, Wnt and Fgf signalling. Contact-repulsive molecules expressed by posterior half-sclerotome cells provide critical guidance cues for axons and neural crest cells along the anterior-posterior axis. Diffusible repellents from surrounding tissues, particularly the dermomyotome and notochord, orient outgrowing spinal axons in the dorso-ventral axis ('surround repulsion'). Repulsive forces therefore guide axons in three dimensions. Although several molecular systems have been identified that may guide neural crest cells and axons in the sclerotome, it remains unclear whether these operate together with considerable overall redundancy, or whether any one system predominates in vivo.

    Mechanisms of development 2004;121;9;1055-68

  • Novel microsatellite markers and single nucleotide polymorphisms refine the tylosis with oesophageal cancer (TOC) minimal region on 17q25 to 42.5 kb: sequencing does not identify the causative gene.

    Langan JE, Cole CG, Huckle EJ, Byrne S, McRonald FE, Rowbottom L, Ellis A, Shaw JM, Leigh IM, Kelsell DP, Dunham I, Field JK and Risk JM

    Molecular Genetics and Oncology Group, Department of Clinical Dental Sciences, University of Liverpool, Edward's Building, Daulby Street, L69 3GN, Liverpool, UK.

    Tylosis (focal non-epidermolytic palmoplantar keratoderma) is associated with the early onset of squamous cell oesophageal cancer in three families. Linkage and haplotype analyses have previously mapped the tylosis with oesophageal cancer (TOC) locus to a 500-kb region on chromosome 17q25 that has also been implicated in sporadically occurring squamous cell oesophageal cancer. In the current study, 17 additional putative microsatellite markers were identified within this 500-kb region by using sequence data and seven of these were shown to be polymorphic in the UK and US families. In addition, our complete sequence analysis of the non-repetitive parts of the TOC minimal region identified 53 novel and six known single nucleotide polymorphisms (SNPs) in one or both of these families. Further fine mapping of the TOC disease locus by haplotype analysis of the seven polymorphic markers and 21 of the 59 SNPs allowed the reduction of the minimal region to 42.5 kb. One known and two putative genes are located within this region but none of these genes shows tylosis-specific mutations within their protein-coding regions. Alternative mechanisms of disease gene action must therefore be considered.

    Human genetics 2004;114;6;534-40

  • Separation, digestion, and cloning of intact parasite chromosomes embedded in agarose.

    Leech V, Quail MA and Melville SE

    Department of Pathology, University of Cambridge, UK.

    The chromosomes of most protozoan parasites cannot be visualized using conventional microscopy because they are too small and do not condense sufficiently at metaphase. Therefore, the development of pulsed field gel electrophoresis allowed the resolution of many parasite karyotypes for the first time. The ability to prepare intact chromosomes in agarose plugs and to isolate individual homologs by electrophoresis has led to many new applications in parasite genomic analysis. This chapter describes the preparation of chromosome plugs from single-celled protozoan parasites, providing numerous tips on how to achieve the highest-quality preparations that will last for years. We also provide detailed protocols for the manipulation of individual excised chromosomes, including restriction mapping and preparation of chromosome shotgun libraries as used in many of the genomic sequencing projects. The protocols provided here underpin several of the advanced methods of genomic analysis and manipulation described in this volume of parasite genomics protocols.

    Methods in molecular biology (Clifton, N.J.) 2004;270;335-52

  • A first-draft human protein-interaction map.

    Lehner B and Fraser AG

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    Background: Protein-interaction maps are powerful tools for suggesting the cellular functions of genes. Although large-scale protein-interaction maps have been generated for several invertebrate species, projects of a similar scale have not yet been described for any mammal. Because many physical interactions are conserved between species, it should be possible to infer information about human protein interactions (and hence protein function) using model organism protein-interaction datasets.

    Results: Here we describe a network of over 70,000 predicted physical interactions between around 6,200 human proteins generated using the data from lower eukaryotic protein-interaction maps. The physiological relevance of this network is supported by its ability to preferentially connect human proteins that share the same functional annotations, and we show how the network can be used to successfully predict the functions of human proteins. We find that combining interaction datasets from a single organism (but generated using independent assays) and combining interaction datasets from two organisms (but generated using the same assay) are both very effective ways of further improving the accuracy of protein-interaction maps.

    Conclusions: The complete network predicts interactions for a third of human genes, including 448 human disease genes and 1,482 genes of unknown function, and so provides a rich framework for biomedical research.

    Genome biology 2004;5;9;R63

  • 5,000 RNAi experiments on a chip.

    Lehner B and Fraser AG

    Nature methods 2004;1;2;103-4

  • Protein domains enriched in mammalian tissue-specific or widely expressed genes.

    Lehner B and Fraser AG

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    In multicellular organisms some genes are expressed in essentially all tissues, whereas others are expressed predominantly in only one or a few tissues. In this study, we investigate the relationship between the tissue-specificity of gene expression and the type of protein encoded. We find that many protein domains are enriched in either tissue-specific or widely expressed genes. Domains enriched in tissue-specific genes tend to be metazoan-specific; these same domains are also enriched in genes that are not essential for cell viability. These findings identify families of proteins that are probably used in the development or terminal differentiation of many different tissue types.

    Trends in genetics : TIG 2004;20;10;468-72

  • Technique review: how to use RNA interference.

    Lehner B, Fraser AG and Sanderson CM

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK.

    RNA interference (RNAi) has been rapidly adopted as a general method for inhibiting gene expression in most laboratory organisms. This paper discusses how libraries of RNAi reagents are being used to perform genome-wide reverse genetic screens in both model organisms and mammalian cells. Guidelines for designing effective small interfering RNAs and appropriate controls for mammalian RNAi experiments will also be discussed.

    Briefings in functional genomics & proteomics 2004;3;1;68-83

  • Genome-wide RNAi identifies p53-dependent and -independent regulators of germ cell apoptosis in C. elegans.

    Lettre G, Kritikou EA, Jaeggi M, Calixto A, Fraser AG, Kamath RS, Ahringer J and Hengartner MO

    Institute for Molecular Biology, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland.

    We used genome-wide RNA interference (RNAi) to identify genes that affect apoptosis in the C. elegans germ line. RNAi-mediated knockdown of 21 genes caused a moderate to strong increase in germ cell death. Genetic epistasis studies with these RNAi candidates showed that a large subset (16/21) requires p53 to activate germ cell apoptosis. Apoptosis following knockdown of the genes in the p53-dependent class also depended on a functional DNA damage response pathway, suggesting that these genes might function in DNA repair or to maintain genome integrity. As apoptotic pathways are conserved, orthologues of the worm germline apoptosis genes presented here could be involved in the maintenance of genomic stability, p53 activation, and fertility in mammals.

    Funded by: NIGMS NIH HHS: GM52240; Wellcome Trust: 054523

    Cell death and differentiation 2004;11;11;1198-203

  • A map of the interactome network of the metazoan C. elegans.

    Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE and Vidal M

    Dana-Farber Cancer Institute and Department of Genetics, Harvard Medical School, 44 Binney Street, Boston, MA 02115, USA.

    To initiate studies on how protein-protein interaction (or "interactome") networks relate to multicellular functions, we have mapped a large fraction of the Caenorhabditis elegans interactome network. Starting with a subset of metazoan-specific proteins, more than 4000 interactions were identified from high-throughput, yeast two-hybrid (HT-Y2H) screens. Independent coaffinity purification assays experimentally validated the overall quality of this Y2H data set. Together with already described Y2H interactions and interologs predicted in silico, the current version of the Worm Interactome (WI5) map contains approximately 5500 interactions. Topological and biological features of this interactome network, as well as its integration with phenome and transcriptome data sets, lead to numerous biological hypotheses.

    Funded by: NIA NIH HHS: R01 AG011085; NIGMS NIH HHS: R01 GM034059, R01 GM034059-18

    Science (New York, N.Y.) 2004;303;5657;540-3

  • Genetics of the DST-mediated mRNA decay pathway using a transgene-based selection.

    Lidder P, Johnson MA, Sullivan ML, Thompson DM, Pérez-Amador MA, Howard CJ and Green PJ

    Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA.

    mRNA sequences that control abundance, localization and translation initiation have been identified, yet the factors that recognize these sequences are largely unknown. In this report, a transgene-based strategy designed to isolate mutants of Arabidopsis thaliana that fail to recognize these sequences is described. In this strategy, a selectable gene and a screenable marker gene are put under the control of the sequence element being analysed and mutants are selected with altered abundance of the corresponding marker RNAs. The selection of mutants deficient in recognition of the DST (downstream) mRNA degradation signal is used as a test-case to illustrate some of the technical aspects that have facilitated success. Using this strategy, we report the isolation of a new mutant, dst3, deficient in the DST-mediated mRNA decay pathway. The targeted genetic strategy described circumvents certain technical limitations of biochemical approaches. Hence, it provides a means to investigate a variety of other mechanisms responsible for post-transcriptional regulation.

    Biochemical Society transactions 2004;32;Pt 4;575-7

  • Staphylococcus aureus: superbug, super genome?

    Lindsay JA and Holden MT

    Department of Cellular & Molecular Medicine, St George's Hospital Medical School, Cranmer Terrace, London, UK.

    Staphylococcus aureus is a common cause of infection in both hospitals and the community, and it is becoming increasingly virulent and resistant to antibiotics. The recent sequencing of seven strains of S. aureus provides unprecedented information about its genome diversity. Subtle differences in core (stable) regions of the genome have been exploited by multi-locus sequence typing (MLST) to understand S. aureus population structure. Dramatic differences in the carriage and spread of accessory genes, including those involved in virulence and resistance, contribute to the emergence of new strains with healthcare implications. Understanding the differences between S. aureus genomes and the controls that govern these changes is helping to improve our knowledge of S. aureus pathogenicity and to predict the evolution of super-superbugs.

    Trends in microbiology 2004;12;8;378-85

  • Genomic and genetic analysis of Bordetella bacteriophages encoding reverse transcriptase-mediated tropism-switching cassettes.

    Liu M, Gingery M, Doulatov SR, Liu Y, Hodes A, Baker S, Davis P, Simmonds M, Churcher C, Mungall K, Quail MA, Preston A, Harvill ET, Maskell DJ, Eiserling FA, Parkhill J and Miller JF

    Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, Los Angeles, California 90095, USA.

    Liu et al. recently described a group of related temperate bacteriophages that infect Bordetella subspecies and undergo a unique template-dependent, reverse transcriptase-mediated tropism switching phenomenon (Liu et al., Science 295: 2091-2094, 2002). Tropism switching results from the introduction of single nucleotide substitutions at defined locations in the VR1 (variable region 1) segment of the mtd (major tropism determinant) gene, which determines specificity for receptors on host bacteria. In this report, we describe the complete nucleotide sequences of the 42.5- to 42.7-kb double-stranded DNA genomes of three related phage isolates and characterize two additional regions of variability. Forty-nine coding sequences were identified. Of these coding sequences, bbp36 contained VR2 (variable region 2), which is highly dynamic and consists of a variable number of identical 19-bp repeats separated by one of three 5-bp spacers, and bpm encodes a DNA adenine methylase with unusual site specificity and a homopolymer tract that functions as a hotspot for frameshift mutations. Morphological and sequence analysis suggests that these Bordetella phage are genetic hybrids of P22 and T7 family genomes, lending further support to the idea that regions encoding protein domains, single genes, or blocks of genes are readily exchanged between bacterial and phage genomes. Bordetella bacteriophages are capable of transducing genetic markers in vitro, and by using animal models, we demonstrated that lysogenic conversion can take place in the mouse respiratory tract during infection.

    Funded by: NIAID NIH HHS: 2-T32-AI07323, AI38417; NIGMS NIH HHS: GM-08042

    Journal of bacteriology 2004;186;5;1503-17

  • NCD3G: a novel nine-cysteine domain in family 3 GPCRs.

    Liu X, He Q, Studholme DJ, Wu Q, Liang S and Yu L

    State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Handan Road 220, Shanghai 200433, P.R. China.

    The NCD3G [for nine-cysteine domain of family 3 G-protein-coupled receptors (GPCRs)] domain is a novel protein domain that is conserved in family 3 GPCRs, including metabotropic glutamate receptors, calcium-sensing receptors, pheromone receptors and taste receptors, with the exception of GABA(B) receptors. The NCD3G domain contains nine highly conserved cysteine residues. Structural predictions suggest that NCD3G might possess four beta strands and three disulfide bridges. The structural model of NCD3G highlights the conserved residues co-segregated with certain familial diseases.

    Trends in biochemical sciences 2004;29;9;458-61

  • Hotspots of homologous recombination in the human genome: not all homologous sequences are equal.

    Lupski JR

    Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.

    Homologous recombination between alleles or non-allelic paralogous sequences does not occur uniformly but is concentrated in 'hotspots' with high recombination rates. Recent studies of these hotspots show that they do not share common sequence motifs, but they do have other features in common.

    Genome biology 2004;5;10;242

  • Organization and evolution of a gene-rich region of the mouse genome: a 12.7-Mb region deleted in the Del(13)Svea36H mouse.

    Mallon AM, Wilming L, Weekes J, Gilbert JG, Ashurst J, Peyrefitte S, Matthews L, Cadman M, McKeone R, Sellick CA, Arkell R, Botcherby MR, Strivens MA, Campbell RD, Gregory S, Denny P, Hancock JM, Rogers J and Brown SD

    Medical Research Council Mammalian Genetics Unit, Harwell, Oxfordshire, United Kingdom.

    Del(13)Svea36H (Del36H) is a deletion of approximately 20% of mouse chromosome 13 showing conserved synteny with human chromosome 6p22.1-6p22.3/6p25. The human region is lost in some deletion syndromes and is the site of several disease loci. Heterozygous Del36H mice show numerous phenotypes and may model aspects of human genetic disease. We describe 12.7 Mb of finished, annotated sequence from Del36H. Del36H has a higher gene density than the draft mouse genome, reflecting high local densities of three gene families (vomeronasal receptors, serpins, and prolactins) which are greatly expanded relative to human. Transposable elements are concentrated near these gene families. We therefore suggest that their neighborhoods are gene factories, regions of frequent recombination in which gene duplication is more frequent. The gene families show different proportions of pseudogenes, likely reflecting different strengths of purifying selection and/or gene conversion. They are also associated with relatively low simple sequence concentrations, which vary across the region with a periodicity of approximately 5 Mb. Del36H contains numerous evolutionarily conserved regions (ECRs). Many lie in noncoding regions, are detectable in species as distant as Ciona intestinalis, and therefore are candidate regulatory sequences. This analysis will facilitate functional genomic analysis of Del36H and provides insights into mouse genome evolution.

    Genome research 2004;14;10A;1888-901

  • GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes.

    Martin DM, Berriman M and Barton GJ

    Post-Genomics and Molecular Interactions Centre, School of Life Sciences, University of Dundee, Dow Street, Dundee DD1 5EH, UK.

    Background: The function of a novel gene product is typically predicted by transitive assignment of annotation from similar sequences. We describe a novel method, GOtcha, for predicting gene product function by annotation with Gene Ontology (GO) terms. GOtcha predicts GO term associations with term-specific probability (P-score) measures of confidence. Term-specific probabilities are a novel feature of GOtcha and allow the identification of conflicts or uncertainty in annotation.

    Results: The GOtcha method was applied to the recently sequenced genome for Plasmodium falciparum and six other genomes. GOtcha was compared quantitatively for retrieval of assigned GO terms against direct transitive assignment from the highest scoring annotated BLAST search hit (TOPBLAST). GOtcha exploits information deep into the 'twilight zone' of similarity search matches, making use of much information that is otherwise discarded by more simplistic approaches. At a P-score cutoff of 50%, GOtcha provided 60% better recovery of annotation terms and 20% higher selectivity than annotation with TOPBLAST at an E-value cutoff of 10^-4.

    Conclusions: The GOtcha method is a useful tool for genome annotators. It has identified both errors and omissions in the original Plasmodium falciparum annotation and is being adopted by many other genome sequencing projects.

    BMC bioinformatics 2004;5;178

  • Automated comparative sequence analysis identifies mutations in 89% of NF1 patients and confirms a mutation cluster in exons 11-17 distinct from the GAP related domain.

    Mattocks C, Baralle D, Tarpey P, ffrench-Constant C, Bobrow M and Whittaker J

    Department of Medical Genetics, Box 134, Addenbrooke's Hospital, Cambridge, UK.

    Journal of medical genetics 2004;41;4;e48

  • Specific deletion of focal adhesion kinase suppresses tumor formation and blocks malignant progression.

    McLean GW, Komiyama NH, Serrels B, Asano H, Reynolds L, Conti F, Hodivala-Dilke K, Metzger D, Chambon P, Grant SG and Frame MC

    The Beatson Institute for Cancer Research, Garscube Estate, Bearsden, Glasgow, G61 1BD, United Kingdom.

    We have generated mice with a floxed fak allele under the control of keratin-14-driven Cre fused to a modified estrogen receptor (CreER(T2)). 4-Hydroxy-tamoxifen treatment induced fak deletion in the epidermis, and suppressed chemically induced skin tumor formation. Loss of fak induced once benign tumors had formed inhibited malignant progression. Although fak deletion was associated with reduced migration of keratinocytes in vitro, we found no effect on wound re-epithelialization in vivo. However, increased keratinocyte cell death was observed after fak deletion in vitro and in vivo. Our work provides the first experimental proof implicating FAK in tumorigenesis, and this is associated with enhanced apoptosis.

    Genes & development 2004;18;24;2998-3003

  • The fine-scale structure of recombination rate variation in the human genome.

    McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR and Donnelly P

    Department of Statistics, University of Oxford, Oxford OX1 3TG, UK.

    The nature and scale of recombination rate variation are largely unknown for most species. In humans, pedigree analysis has documented variation at the chromosomal level, and sperm studies have identified specific hotspots in which crossing-over events cluster. To address whether this picture is representative of the genome as a whole, we have developed and validated a method for estimating recombination rates from patterns of genetic variation. From extensive single-nucleotide polymorphism surveys in European and African populations, we find evidence for extreme local rate variation spanning four orders of magnitude, in which 50% of all recombination events take place in less than 10% of the sequence. We demonstrate that recombination hotspots are a ubiquitous feature of the human genome, occurring on average every 200 kilobases or less, but recombination occurs preferentially outside genes.

    Science (New York, N.Y.) 2004;304;5670;581-4

  • Gene structure conservation aids similarity based gene prediction.

    Meyer IM and Durbin R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    One of the primary tasks in deciphering the functional contents of a newly sequenced genome is the identification of its protein coding genes. Existing computational methods for gene prediction include ab initio methods which use the DNA sequence itself as the only source of information, comparative methods using multiple genomic sequences, and similarity based methods which employ the cDNA or protein sequences of related genes to aid the gene prediction. We present here an algorithm implemented in a computer program called Projector which combines comparative and similarity approaches. Projector employs similarity information at the genomic DNA level by directly using known genes annotated on one DNA sequence to predict the corresponding related genes on another DNA sequence. It therefore makes explicit use of the conservation of the exon-intron structure between two related genes in addition to the similarity of their encoded amino acid sequences. We evaluate the performance of Projector by comparing it with the program Genewise on a test set of 491 pairs of independently confirmed mouse and human genes. It is more accurate than Genewise for genes whose proteins are <80% identical, and is suitable for use in a combined gene prediction system where other methods identify well conserved and non-conserved genes, and pseudogenes.

    Nucleic acids research 2004;32;2;776-83

  • Transferable antibiotic resistance elements in Haemophilus influenzae share a common evolutionary origin with a diverse family of syntenic genomic islands.

    Mohd-Zain Z, Turner SL, Cerdeño-Tárraga AM, Lilley AK, Inzana TJ, Duncan AJ, Harding RM, Hood DW, Peto TE and Crook DW

    Infectious Diseases and Clinical Microbiology, John Radcliffe Hospital, University of Oxford, Headington, Oxford, OX3 9DU, UK.

    Transferable antibiotic resistance in Haemophilus influenzae was first detected in the early 1970s. After this, resistance spread rapidly worldwide and was shown to be transferred by a large 40- to 60-kb conjugative element. Bioinformatics analysis of the complete sequence of a typical H. influenzae conjugative resistance element, ICEHin1056, revealed the shared evolutionary origin of this element. ICEHin1056 has homology to 20 contiguous sequences in the National Center for Biotechnology Information database. Systematic comparison of these homologous sequences resulted in identification of a conserved syntenic genomic island consisting of up to 33 core genes in 16 beta- and gamma-Proteobacteria. These diverse genomic islands shared a common evolutionary origin, insert into tRNA genes, and have diverged widely, with G+C contents ranging from 40 to 70% and amino acid homologies as low as 20 to 25% for shared core genes. These core genes are likely to account for the conjugative transfer of the genomic islands and may even encode autonomous replication. Accessory gene clusters were nestled among the core genes and encode the following diverse major attributes: antibiotic, metal, and antiseptic resistance; degradation of chemicals; type IV secretion systems; two-component signaling systems; Vi antigen capsule synthesis; toxin production; and a wide range of metabolic functions. These related genomic islands include the following well-characterized structures: SPI-7, found in Salmonella enterica serovar Typhi; PAP1 or pKLC102, found in Pseudomonas aeruginosa; and the clc element, found in Pseudomonas sp. strain B13. This is the first report of a diverse family of related syntenic genomic islands with a deep evolutionary origin, and our findings challenge the view that genomic islands consist only of independently evolving modules.

    Funded by: NIAID NIH HHS: R01-AI45091

    Journal of bacteriology 2004;186;23;8114-22

  • Position effect on PLP1 may cause a subset of Pelizaeus-Merzbacher disease symptoms.

    Muncke N, Wogatzky BS, Breuning M, Sistermans EA, Endris V, Ross M, Vetrie D, Catsman-Berrevoets CE and Rappold G

    Journal of medical genetics 2004;41;12;e121

  • Interaction between differentially methylated regions partitions the imprinted genes Igf2 and H19 into parent-specific chromatin loops.

    Murrell A, Heeson S and Reik W

    Laboratory of Developmental Genetics and Imprinting, Developmental Genetics Programme, The Babraham Institute, Cambridge CB2 4AT, UK.

    Imprinted genes are expressed from only one of the parental alleles and are marked epigenetically by DNA methylation and histone modifications. The paternally expressed gene insulin-like growth-factor 2 (Igf2) is separated by approximately 100 kb from the maternally expressed noncoding gene H19 on mouse distal chromosome 7. Differentially methylated regions in Igf2 and H19 contain chromatin boundaries, silencers and activators and regulate the reciprocal expression of the two genes in a methylation-sensitive manner by allowing them exclusive access to a shared set of enhancers. Various chromatin models have been proposed that separate Igf2 and H19 into active and silent domains. Here we used a GAL4 knock-in approach as well as the chromosome conformation capture technique to show that the differentially methylated regions in the imprinted genes Igf2 and H19 interact in mice. These interactions are epigenetically regulated and partition maternal and paternal chromatin into distinct loops. This generates a simple epigenetic switch for Igf2 through which it moves between an active and a silent chromatin domain.

    Nature genetics 2004;36;8;889-93

  • A clinical, microbiological, and pathological study of intestinal perforation associated with typhoid fever.

    Nguyen QC, Everest P, Tran TK, House D, Murch S, Parry C, Connerton P, Phan VB, To SD, Mastroeni P, White NJ, Tran TH, Vo VH, Dougan G, Farrar JJ and Wain J

    Dong Thap Provincial Hospital, Dong Thap, Ho Chi Minh City, Vietnam.

    One of the most serious complications of typhoid fever is intestinal perforation. Of 27 patients admitted to a provincial hospital in the Mekong Delta region of Vietnam who had gastrointestinal perforation secondary to suspected typhoid fever, 67% were male, with a median age of 23 years and a median duration of illness of 10 days. Salmonella enterica subspecies enterica serotype Typhi (S. Typhi) was isolated from 11 (41%) of 27 patients; of 27 patients, only 4 (15%) had positive cultures from gut biopsies. S. Typhi DNA was detected by polymerase chain reaction for all perforation biopsy samples. Detailed histological examination of the gastrointestinal mucosa at the site of perforation in all cases showed a combination of discrete acute and chronic inflammation. Acute inflammation at the serosal surface indicated additional tissue damage after perforation. Immunohistochemical results showed that the predominant infiltrating cell types at the site of perforation were CD68+ leukocytes (macrophages) or CD3+ leukocytes (T lymphocytes).

    Clinical infectious diseases : an official publication of the Infectious Diseases Society of America 2004;39;1;61-7

  • Eukaryotes: not beyond compare.

    Pain A, Bentley S and Parkhill J

    Nature reviews. Microbiology 2004;2;11;856-7

  • Strength in diversity.

    Pain A, Crossman L, Sebaihia M, Cerdeño-Tárraga A and Parkhill J

    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    Nature reviews. Microbiology 2004;2;5;358-9

  • Insight into the genome of Aspergillus fumigatus: analysis of a 922 kb region encompassing the nitrate assimilation gene cluster.

    Pain A, Woodward J, Quail MA, Anderson MJ, Clark R, Collins M, Fosker N, Fraser A, Harris D, Larke N, Murphy L, Humphray S, O'Neil S, Pertea M, Price C, Rabbinowitsch E, Rajandream MA, Salzberg S, Saunders D, Seeger K, Sharp S, Warren T, Denning DW, Barrell B and Hall N

    The Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Aspergillus fumigatus is the most ubiquitous opportunistic filamentous fungal pathogen of humans. As an initial step toward sequencing the entire genome of A. fumigatus, which is estimated to be approximately 30 Mb in size, we have sequenced a 922 kb region, contained within 16 overlapping bacterial artificial chromosome (BAC) clones. Fifty-four percent of the DNA is predicted to be coding with 341 putative protein coding genes. Functional classification of the proteins showed the presence of a higher proportion of enzymes and membrane transporters when compared to those of Saccharomyces cerevisiae. In addition to the nitrate assimilation gene cluster, the quinate utilisation gene cluster is also present on this 922 kb genomic sequence. We observed large scale synteny between A. fumigatus and Aspergillus nidulans by comparing this sequence to the A. nidulans genetic map of linkage group VIII.

    Fungal genetics and biology : FG & B 2004;41;4;443-53

  • A transcriptomic analysis of the phylum Nematoda.

    Parkinson J, Mitreva M, Whitton C, Thomson M, Daub J, Martin J, Schmid R, Hall N, Barrell B, Waterston RH, McCarter JP and Blaxter ML

    Hospital for Sick Children, 555 University Avenue, Departments of Biochemistry and Medical Genetics and Microbiology, University of Toronto, Toronto, Ontario M5G 1X8, Canada.

    The phylum Nematoda occupies a huge range of ecological niches, from free-living microbivores to human parasites. We analyzed the genomic biology of the phylum using 265,494 expressed-sequence tag sequences, corresponding to 93,645 putative genes, from 30 species, including 28 parasites. From 35% to 70% of each species' genes had significant similarity to proteins from the model nematode Caenorhabditis elegans. More than half of the putative genes were unique to the phylum, and 23% were unique to the species from which they were derived. We have not yet come close to exhausting the genomic diversity of the phylum. We identified more than 2,600 different known protein domains, some of which had differential abundances between major taxonomic groups of nematodes. We also defined 4,228 nematode-specific protein families from nematode-restricted genes: this class of genes probably underpins species- and higher-level taxonomic disparity. Nematode-specific families are particularly interesting as drug and vaccine targets.

    Nature genetics 2004;36;12;1259-67

  • Gene expression profiling in the myelodysplastic syndromes using cDNA microarray technology.

    Pellagatti A, Esoof N, Watkins F, Langford CF, Vetrie D, Campbell LJ, Fidler C, Cavenagh JD, Eagleton H, Gordon P, Woodcock B, Pushkaran B, Kwan M, Wainscoat JS and Boultwood J

    Leukaemia Research Fund Molecular Haematology Unit, Nuffield Department of Clinical Laboratory Sciences, John Radcliffe Hospital, Oxford, UK.

    The myelodysplastic syndromes (MDS) comprise a heterogeneous group of clonal disorders of the haematopoietic stem cell and primarily involve cells of the myeloid lineage. Using cDNA microarrays comprising 6000 human genes, we studied the gene expression profiles in the neutrophils of 21 MDS patients, seven of whom had the 5q- syndrome, and two acute myeloid leukaemia (AML) patients when compared with the neutrophils from pooled healthy controls. Data analysis showed a high level of heterogeneity of gene expression between MDS patients, most probably reflecting the underlying karyotypic and genetic heterogeneity. Nevertheless, several genes were commonly up or down-regulated in MDS. The most up-regulated genes included RAB20, ARG1, ZNF183 and ACPL. The RAB20 gene is a member of the Ras gene superfamily and ARG1 promotes cellular proliferation. The most down-regulated genes included COX2, CD18, FOS and IL7R. COX2 is anti-apoptotic and promotes cell survival. Many genes were identified that are differentially expressed in the different MDS subtypes and AML. A subset of genes was able to discriminate patients with the 5q- syndrome from patients with refractory anaemia and a normal karyotype. The microarray expression results for several genes were confirmed by real-time quantitative polymerase chain reaction. The MDS-specific expression changes identified are likely to be biologically important in the pathophysiology of this disorder.

    British journal of haematology 2004;125;5;576-83

  • The Ensembl analysis pipeline.

    Potter SC, Clarke L, Curwen V, Keenan S, Mongin E, Searle SM, Stabenau A, Storey R and Clamp M

    The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

    The Ensembl pipeline is an extension to the Ensembl system which allows automated annotation of genomic sequence. The software comprises two parts. First, there is a set of Perl modules ("Runnables" and "RunnableDBs") which are 'wrappers' for a variety of commonly used analysis tools. These retrieve sequence data from a relational database, run the analysis, and write the results back to the database. They inherit from a common interface, which simplifies the writing of new wrapper modules. On top of this sits a job submission system (the "RuleManager") which allows efficient and reliable submission of large numbers of jobs to a compute farm. Here we describe the fundamental software components of the pipeline, and we also highlight some features of the Sanger installation which were necessary to enable the pipeline to scale to whole-genome analysis.

    Genome research 2004;14;5;934-41

  • The bordetellae: lessons from genomics.

    Preston A, Parkhill J and Maskell DJ

    Department of Microbiology, University of Guelph, Guelph, Ontario N1G 2W1, Canada.

    Nature reviews. Microbiology 2004;2;5;379-90

  • Where west meets east: the complex mtDNA landscape of the southwest and Central Asian corridor.

    Quintana-Murci L, Chaix R, Wells RS, Behar DM, Sayar H, Scozzari R, Rengo C, Al-Zahery N, Semino O, Santachiara-Benerecetti AS, Coppa A, Ayub Q, Mohyuddin A, Tyler-Smith C, Qasim Mehdi S, Torroni A and McElreavey K

    Centre National de la Recherche Scientifique (CNRS) URA 1961, Institut Pasteur, 75724 Paris Cedex 15, France.

    The southwestern and Central Asian corridor has played a pivotal role in the history of humankind, witnessing numerous waves of migration of different peoples at different times. To evaluate the effects of these population movements on the current genetic landscape of the Iranian plateau, the Indus Valley, and Central Asia, we have analyzed 910 mitochondrial DNAs (mtDNAs) from 23 populations of the region. This study has allowed a refinement of the phylogenetic relationships of some lineages and the identification of new haplogroups in the southwestern and Central Asian mtDNA tree. Both lineage geographical distribution and spatial analysis of molecular variance showed that populations located west of the Indus Valley mainly harbor mtDNAs of western Eurasian origin, whereas those inhabiting the Indo-Gangetic region and Central Asia present substantial proportions of lineages that can be allocated to three different genetic components of western Eurasian, eastern Eurasian, and south Asian origin. In addition to the overall composite picture of lineage clusters of different origin, we observed a number of deep-rooting lineages, whose relative clustering and coalescent ages suggest an autochthonous origin in the southwestern Asian corridor during the Pleistocene. The comparison with Y-chromosome data revealed a highly complex genetic and demographic history of the region, which includes sexually asymmetrical mating patterns, founder effects, and female-specific traces of the East African slave trade.

    Funded by: Telethon: E.0890

    American journal of human genetics 2004;74;5;827-45

  • DNA methylation profiling of the human major histocompatibility complex: a pilot study for the human epigenome project.

    Rakyan VK, Hildmann T, Novik KL, Lewin J, Tost J, Cox AV, Andrews TD, Howe KL, Otto T, Olek A, Fischer J, Gut IG, Berlin K and Beck S

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.

    The Human Epigenome Project aims to identify, catalogue, and interpret genome-wide DNA methylation phenomena. Occurring naturally on cytosine bases at cytosine-guanine dinucleotides, DNA methylation is intimately involved in diverse biological processes and the aetiology of many diseases. Differentially methylated cytosines give rise to distinct profiles, thought to be specific for gene activity, tissue type, and disease state. The identification of such methylation variable positions will significantly improve our understanding of genome biology and our ability to diagnose disease. Here, we report the results of the pilot study for the Human Epigenome Project entailing the methylation analysis of the human major histocompatibility complex. This study involved the development of an integrated pipeline for high-throughput methylation analysis using bisulphite DNA sequencing, discovery of methylation variable positions, epigenotyping by matrix-assisted laser desorption/ionisation mass spectrometry, and development of an integrated public database. Our analysis of DNA methylation levels within the major histocompatibility complex, including regulatory exonic and intronic regions associated with 90 genes in multiple tissues and individuals, reveals a bimodal distribution of methylation profiles (i.e., the vast majority of the analysed regions were either hypo- or hypermethylated), tissue specificity, inter-individual variation, and correlation with independent gene expression data.

    PLoS biology 2004;2;12;e405

  • Evolutionary pressures on apicoplast transit peptides.

    Ralph SA, Foth BJ, Hall N and McFadden GI

    Plant Cell Biology Research Centre, School of Botany, University of Melbourne, Parkville, Victoria, Australia.

    Malaria parasites (species of the genus Plasmodium) harbor a relict chloroplast (the apicoplast) that is the target of novel antimalarials. Numerous nuclear-encoded proteins are translocated into the apicoplast courtesy of a bipartite N-terminal extension. The first component of the bipartite leader resembles a standard signal peptide present at the N-terminus of secreted proteins that enter the endomembrane system. Analysis of the second portion of the bipartite leaders of P. falciparum, the so-called transit peptide, indicates similarities to plant transit peptides, although the amino acid composition of P. falciparum transit peptides shows a strong bias, which we rationalize by the extraordinarily high AT content of P. falciparum DNA. 786 plastid transit peptides were also examined from several other apicomplexan parasites, as well as from angiosperm plants. In each case, amino acid biases were correlated with nucleotide AT content. A comparison of a spectrum of organisms containing primary and secondary plastids also revealed features unique to secondary plastid transit peptides. These unusual features are explained in the context of secondary plastid trafficking via the endomembrane system.

    Funded by: NIAID NIH HHS: AI05093

    Molecular biology and evolution 2004;21;12;2183-94

  • Evolutionary families of peptidase inhibitors.

    Rawlings ND, Tolle DP and Barrett AJ

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The proteins that inhibit peptidases are of great importance in medicine and biotechnology, but there has never been a comprehensive system of classification for them. Some of the terminology currently in use is potentially confusing. In the hope of facilitating the exchange, storage and retrieval of information about this important group of proteins, we now describe a system wherein the inhibitor units of the peptidase inhibitors are assigned to 48 families on the basis of similarities detectable at the level of amino acid sequence. Then, on the basis of three-dimensional structures, 31 of the families are assigned to 26 clans. A simple system of nomenclature is introduced for reference to each clan, family and inhibitor. We briefly discuss the specificities and mechanisms of the interactions of the inhibitors in the various families with their target enzymes. The system of families and clans of inhibitors described has been implemented in the MEROPS peptidase database, and this will provide a mechanism for updating it as new information becomes available.

    The Biochemical journal 2004;378;Pt 3;705-16

  • MEROPS: the peptidase database.

    Rawlings ND, Tolle DP and Barrett AJ

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Peptidases (proteolytic enzymes) are of great relevance to biology, medicine and biotechnology. This practical importance creates a need for an integrated source of information about them, and also about their natural inhibitors. The MEROPS database aims to fill this need. The organizational principle of the database is a hierarchical classification in which homologous sets of the proteins of interest are grouped in families and the homologous families are grouped in clans. Each peptidase, family and clan has a unique identifier. The database has recently been expanded to include the protein inhibitors of peptidases, and these are classified in much the same way as the peptidases. Forms of information recently added include new links to other databases, summary alignments for peptidase clans, displays to show the distribution of peptidases and inhibitors among organisms, substrate cleavage sites and indexes for expressed sequence tag libraries containing peptidases. A new way of making hyperlinks to the database has been devised and a BlastP search of our library of peptidase and inhibitor sequences has been added.

    Nucleic acids research 2004;32;Database issue;D160-4

  • Chromosome loops, insulators, and histone methylation: new insights into regulation of imprinting in clusters.

    Reik W, Murrell A, Lewis A, Mitsuya K, Umlauf D, Dean W, Higgins M and Feil R

    Laboratory of Developmental Genetics and Imprinting, The Babraham Institute, Cambridge CB2 4AT, United Kingdom.

    Cold Spring Harbor symposia on quantitative biology 2004;69;29-37

  • A Myo7a mutation cosegregates with stereocilia defects and low-frequency hearing impairment.

    Rhodes CR, Hertzano R, Fuchs H, Bell RE, de Angelis MH, Steel KP and Avraham KB

    MRC Institute of Hearing Research, University Park, NG7 2RD, Nottingham, UK.

    A phenotype-driven approach was adopted in the mouse to identify molecules involved in ear development and function. Mutant mice were obtained using N-ethyl-N-nitrosourea (ENU) mutagenesis and were screened for dominant mutations that affect hearing and/or balance. Heterozygote headbanger (Hdb/+) mutants display classic behavior indicative of vestibular dysfunction including hyperactivity and head bobbing, and they show a Preyer reflex in response to sound but have raised cochlear thresholds especially at low frequencies. Scanning electron microscopy of the surface of the organ of Corti revealed abnormal stereocilia bundle development from an early age that was more severe in the apex than the base. Utricular stereocilia were long, thin, and wispy. Homozygotes showed a similar but more severe phenotype. The headbanger mutation has been mapped to a 1.5-cM region on mouse Chromosome 7 in the region of the unconventional myosin gene Myo7a, and mutation screening revealed an A>T transversion that is predicted to cause an isoleucine-to-phenylalanine amino acid substitution (I178F) in a conserved region in the motor-encoding domain of the gene. Protein analysis revealed reduced levels of myosin VIIa expression in inner ears of headbanger mice. Headbanger represents a novel inner ear phenotype and provides a potential model for low-frequency-type human hearing loss.

    Mammalian genome : official journal of the International Mammalian Genome Society 2004;15;9;686-97

  • Complete nucleotide sequence of the conjugative tetracycline resistance plasmid pFBAOT6, a member of a group of IncU plasmids with global ubiquity.

    Rhodes G, Parkhill J, Bird C, Ambrose K, Jones MC, Huys G, Swings J and Pickup RW

    Centre for Ecology and Hydrology, Lancaster, United Kingdom.

    This study presents the first complete sequence of an IncU plasmid, pFBAOT6. This plasmid was originally isolated from a strain of Aeromonas caviae from hospital effluent (Westmorland General Hospital, Kendal, United Kingdom) in September 1997 (G. Rhodes, G. Huys, J. Swings, P. McGann, M. Hiney, P. Smith, and R. W. Pickup, Appl. Environ. Microbiol. 66:3883-3890, 2000) and belongs to a group of related plasmids with global ubiquity. pFBAOT6 is 84,748 bp long and has 94 predicted coding sequences, only 12 of which do not have a possible function that has been attributed. Putative replication, maintenance, and transfer functions have been identified and are located in a region in the first 31 kb of the plasmid. The replication region is poorly understood but exhibits some identity at the protein level with replication proteins from the gram-positive bacteria Bacillus and Clostridium. The mating pair formation system is a virB homologue, type IV secretory pathway that is similar in its structural organization to the mating pair formation systems of the related broad-host-range (BHR) environmental plasmids pIPO2, pXF51, and pSB102 from plant-associated bacteria. Partitioning and maintenance genes are homologues of genes in IncP plasmids. The DNA transfer genes and the putative oriT site also exhibit high levels of similarity with those of plasmids pIPO2, pXF51, and pSB102. The genetic load region encompasses 54 kb, comprises the resistance genes, and includes a class I integron, an IS630 relative, and other transposable elements in a 43-kb region that may be a novel Tn1721-flanked composite transposon. This region also contains 24 genes that exhibit the highest levels of identity to chromosomal genes of several plant-associated bacteria. The features of the backbone of pFBAOT6 that are shared with this newly defined group of environmental BHR plasmids suggest that pFBAOT6 may be a relative of this group, but a relative that was isolated from a clinical bacterial environment rather than a plant-associated bacterial environment.

    Applied and environmental microbiology 2004;70;12;7497-510

  • Cloning of a new familial t(3;8) translocation associated with conventional renal cell carcinoma reveals a 5 kb microdeletion and no gene involved in the rearrangement.

    Rodríguez-Perales S, Meléndez B, Gribble SM, Valle L, Carter NP, Santamaría I, Conde L, Urioste M, Benítez J and Cigudosa JC

    Cytogenetics Unit, Biotechnology Programme, Centro Nacional de Investigaciones Oncológicas, Madrid, 28029, Spain.

    This study describes the molecular cloning of a familial translocation, t(3;8)(p14.2;q24.2), that segregates with the conventional renal cell carcinoma (conventional RCC). We had previously reported the family history and, through loss of heterozygosity and comparative genomic hybridization, detected the loss of the 3p chromosome arm and somatic mutation in the retained von Hippel-Lindau gene in some members of the family. With the help of array painting and sequence tagged site-PCR on flow-sorted derivative chromosomes, we have cloned the breakpoints of the translocation. We have studied the junctions on both derivative chromosomes at the genomic and expression levels. The analysis of the sequence revealed a 5 kb microdeletion at the chromosome 3 breakpoint together with a high density of repetitive motifs (Alu, short interspersed nuclear element) and an AT-rich region. Both chromosome 3 and 8 rearranged regions were very poor in gene content. We tested an expressed sequence tag, two predicted genes, one novel gene and LRIG1, a gene located more than 200 kb apart from the breakpoint on chromosome 3. None of these genes, except LRIG1, showed expression in any of the tested tissues (including normal adult and fetal kidney, sporadic kidney tumours and tumour samples from the proband's family). Taken together, all these data suggest that, rather than deregulation of specific genes that may be rearranged by the translocation, the proposed three-step model of tumour development (translocation, loss of the 3p chromosome, and mutation in a tumour suppressor gene located within that region) could be the biological mechanism that takes place in this familial form of conventional RCC.

    Human molecular genetics 2004;13;9;983-90

  • Identification of mammalian microRNA host genes and transcription units.

    Rodriguez A, Griffiths-Jones S, Ashurst JL and Bradley A

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    To derive a global perspective on the transcription of microRNAs (miRNAs) in mammals, we annotated the genomic position and context of this class of noncoding RNAs (ncRNAs) in the human and mouse genomes. Of the 232 known mammalian miRNAs, we found that 161 overlap with 123 defined transcription units (TUs). We identified miRNAs within introns of 90 protein-coding genes with a broad spectrum of molecular functions, and in both introns and exons of 66 mRNA-like noncoding RNAs (mlncRNAs). In addition, novel families of miRNAs based on host gene identity were identified. The transcription patterns of all miRNA host genes were curated from a variety of sources illustrating spatial, temporal, and physiological regulation of miRNA expression. These findings strongly suggest that miRNAs are transcribed in parallel with their host transcripts, and that the two different transcription classes of miRNAs ('exonic' and 'intronic') identified here may require slightly different mechanisms of biogenesis.

    Genome research 2004;14;10A;1902-10

  • Tetrasomy 21pter-->q21.2 in a male infant without typical Down's syndrome dysmorphic features but moderate mental retardation.

    Rost I, Fiegler H, Fauth C, Carr P, Bettecken T, Kraus J, Meyer C, Enders A, Wirtz A, Meitinger T, Carter NP and Speicher MR

    Journal of medical genetics 2004;41;3;e26

  • Periodic gene expression program of the fission yeast cell cycle.

    Rustici G, Mata J, Kivinen K, Lió P, Penkett CJ, Burns G, Hayles J, Brazma A, Nurse P and Bähler J

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    Cell-cycle control of transcription seems to be universal, but little is known about its global conservation and biological significance. We report on the genome-wide transcriptional program of the Schizosaccharomyces pombe cell cycle, identifying 407 periodically expressed genes of which 136 show high-amplitude changes. These genes cluster in four major waves of expression. The forkhead protein Sep1p regulates mitotic genes in the first cluster, including Ace2p, which activates transcription in the second cluster during the M-G1 transition and cytokinesis. Other genes in the second cluster, which are required for G1-S progression, are regulated by the MBF complex independently of Sep1p and Ace2p. The third cluster coincides with S phase and a fourth cluster contains genes weakly regulated during G2 phase. Despite conserved cell-cycle transcription factors, differences in regulatory circuits between fission and budding yeasts are evident, revealing evolutionary plasticity of transcriptional control. Periodic transcription of most genes is not conserved between the two yeasts, except for a core set of approximately 40 genes that seem to be universally regulated during the eukaryotic cell cycle and may have key roles in cell-cycle progression.

    Funded by: Cancer Research UK: A6517; Wellcome Trust: 077118

    Nature genetics 2004;36;8;809-17

  • Methylation of histone H4 lysine 20 controls recruitment of Crb2 to sites of DNA damage.

    Sanders SL, Portoso M, Mata J, Bähler J, Allshire RC and Kouzarides T

    The Wellcome Trust/Cancer Research UK Gurdon Institute and Department of Pathology, Tennis Court Road, Cambridge CB2 1QN, United Kingdom.

    Histone lysine methylation is a key regulator of gene expression and heterochromatin function, but little is known as to how this modification impinges on other chromatin activities. Here we demonstrate that a previously uncharacterized SET domain protein, Set9, is responsible for H4-K20 methylation in the fission yeast Schizosaccharomyces pombe. Surprisingly, H4-K20 methylation does not have any apparent role in the regulation of gene expression or heterochromatin function. Rather, we find the modification has a role in DNA damage response. Loss of Set9 activity or mutation of H4-K20 markedly impairs cell survival after genotoxic challenge and compromises the ability of cells to maintain checkpoint mediated cell cycle arrest. Genetic experiments link Set9 to Crb2, a homolog of the mammalian checkpoint protein 53BP1, and the enzyme is required for Crb2 localization to sites of DNA damage. These results argue that H4-K20 methylation functions as a "histone mark" required for the recruitment of the checkpoint protein Crb2.

    Funded by: Cancer Research UK: A6517; Wellcome Trust: 077118

    Cell 2004;119;5;603-14

  • Folate-sensitive fragile site FRA10A is due to an expansion of a CGG repeat in a novel gene, FRA10AC1, encoding a nuclear protein.

    Sarafidou T, Kahl C, Martinez-Garay I, Mangelsdorf M, Gesk S, Baker E, Kokkinaki M, Talley P, Maltby EL, French L, Harder L, Hinzmann B, Nobile C, Richkind K, Finnis M, Deloukas P, Sutherland GR, Kutsche K, Moschonas NK, Siebert R, Gécz J and European Collaborative Consortium for the Study of ADLTE

    Department of Biology, University of Crete, and Institute of Molecular Biology and Biotechnology(IMBB), Foundation of Research and Technology (FORTH-GR), P.O. Box 2208, 714 09 Heraklion, Crete, Greece.

    Fragile sites appear visually as nonstaining gaps on chromosomes that are inducible by specific cell culture conditions. Expansion of CGG/CCG repeats has been shown to be the molecular basis of all five folate-sensitive fragile sites characterized molecularly so far, i.e., FRAXA, FRAXE, FRAXF, FRA11B, and FRA16A. In the present study we have refined the localization of the FRA10A folate-sensitive fragile site by fluorescence in situ hybridization. Sequence analysis of a BAC clone spanning FRA10A identified a single, imperfect, but polymorphic CGG repeat that is part of a CpG island in the 5'UTR of a novel gene named FRA10AC1. The number of CGG repeats varied in the population from 8 to 13. Expansions exceeding 200 repeat units were methylated in all FRA10A fragile site carriers tested. The FRA10AC1 gene consists of 19 exons and is transcribed in the centromeric direction from the FRA10A repeat. The major transcript of approximately 1450 nt is ubiquitously expressed and codes for a highly conserved protein, FRA10AC1, of unknown function. Several splice variants leading to alternative 3' ends were identified (particularly in testis). These give rise to FRA10AC1 proteins with altered COOH-termini. Immunofluorescence analysis of full-length, recombinant EGFP-tagged FRA10AC1 protein showed that it was present exclusively in the nucleoplasm. We show that the expression of FRA10A, in parallel to the other cloned folate-sensitive fragile sites, is caused by an expansion and subsequent methylation of an unstable CGG trinucleotide repeat. Taking advantage of three cSNPs within the FRA10AC1 gene we demonstrate that one allele of the gene is not transcribed in a FRA10A carrier. Our data also suggest that in the heterozygous state FRA10A is likely a benign folate-sensitive fragile site.

    Genomics 2004;84;1;69-81

  • The otter annotation system.

    Searle SM, Gilbert J, Iyer V and Clamp M

    The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    With the completion of the human genome sequence and genome sequence available for other vertebrate genomes, the task of manual annotation at the large genome scale has become a priority. Possibly even more important, is the requirement to curate and improve this annotation in the light of future data. For this to be possible, there is a need for tools to access and manage the annotation. Ensembl provides an excellent means for storing gene structures, genome features, and sequence, but it does not support the extra textual data necessary for manual annotation. We have extended Ensembl to create the Otter manual annotation system. This comprises a relational database schema for storing the manual annotation data, an application-programming interface (API) to access it, an extensible markup language (XML) format to allow transfer of the data, and a server to allow multiuser/multimachine access to the data. We have also written a data-adaptor plugin for the Apollo Browser/Editor to enable it to utilize an Otter server. The otter database is currently used by the Vertebrate Genome Annotation (VEGA) site, which provides access to manually curated human chromosomes. Support is also being developed for using the AceDB annotation editor, FMap, via a perl wrapper called Lace. The Human and Vertebrate Annotation (HAVANA) group annotators at the Sanger center are using this to annotate human chromosomes 1 and 20.

    Genome research 2004;14;5;963-70

  • DomIns: a web resource for domain insertions in known protein structures.

    Selvam RA and Sasidharan R

    The Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Proteins can be formed by single or multiple domains. The process of recombination at the molecular level has generated a wide variety of multi-domain proteins with specific domain organization to cater to the functional requirements of an organism. The functional and structural costs of inserting a domain into another means that multi-domain proteins are usually formed by covalently linking the N-terminus of one domain to the C-terminus of the preceding domain. While this is true in a large proportion of multi-domain proteins, we find a significant fraction of proteins that are the result of domain insertion. The inserted domain breaks the sequence contiguity of the domain into which it is inserted leading to a novel domain organization. This web resource aims to document domain insertions in known protein structures that are classified in the SCOP database. The resource is available through a web server.

    Nucleic acids research 2004;32;Database issue;D193-5

  • Microarray based comparative genomic hybridisation (array-CGH) detects submicroscopic chromosomal deletions and duplications in patients with learning disability/mental retardation and dysmorphic features.

    Shaw-Smith C, Redon R, Rickman L, Rio M, Willatt L, Fiegler H, Firth H, Sanlaville D, Winter R, Colleaux L, Bobrow M and Carter NP

    University of Cambridge Department of Medical Genetics, Addenbrooke's Hospital, Hills Road, Cambridge, UK.

    The underlying causes of learning disability and dysmorphic features in many patients remain unidentified despite extensive investigation. Routine karyotype analysis is not sensitive enough to detect subtle chromosome rearrangements (less than 5 Mb). The presence of subtle DNA copy number changes was investigated by array-CGH in 50 patients with learning disability and dysmorphism, employing a DNA microarray constructed from large insert clones spaced at approximately 1 Mb intervals across the genome. Twelve copy number abnormalities were identified in 12 patients (24% of the total): seven deletions (six apparently de novo and one inherited from a phenotypically normal parent) and five duplications (one de novo and four inherited from phenotypically normal parents). Altered segments ranged in size from those involving a single clone to regions as large as 14 Mb. No recurrent deletion or duplication was identified within this cohort of patients. On the basis of these results, we anticipate that array-CGH will become a routine method of genome-wide screening for imbalanced rearrangements in children with learning disability.

    Journal of medical genetics 2004;41;4;241-8

  • Comparative genomic analysis of two avian (quail and chicken) MHC regions.

    Shiina T, Shimizu S, Hosomichi K, Kohara S, Watanabe S, Hanzawa K, Beck S, Kulski JK and Inoko H

    Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Bohseidai, Isehara, Kanagawa, Japan.

    We mapped two different quail Mhc haplotypes and sequenced one of them (haplotype A) for comparative genomic analysis with a previously sequenced haplotype of the chicken Mhc. The quail haplotype A spans 180 kb of genomic sequence, encoding a total of 41 genes compared with only 19 genes within the 92-kb chicken Mhc. Except for two gene families (B30 and tRNA), both species have the same basic set of gene family members that were previously described in the chicken "minimal essential" Mhc. The two Mhc regions have a similar overall organization but differ markedly in that the quail has an expanded number of duplicated genes with 7 class I, 10 class IIB, 4 NK, 6 lectin, and 8 B-G genes. Comparisons between the quail and chicken Mhc class I and class II gene sequences by phylogenetic analysis showed that they were more closely related within species than between species, suggesting that the quail Mhc genes were duplicated after the separation of these two species from their common ancestor. The proteins encoded by the NK and class I genes are known to interact as ligands and receptors, but unlike in the quail and the chicken, the genes encoding these proteins in mammals are found on different chromosomes. The finding of NK-like genes in the quail Mhc strongly suggests an evolutionary connection between the NK C-type lectin-like superfamily and the Mhc, providing support for future studies on the NK, lectin, class I, and class II interaction in birds.

    Journal of immunology (Baltimore, Md. : 1950) 2004;172;11;6751-63

  • The Eimeria genome projects: a sequence of events.

    Shirley MW, Ivens A, Gruber A, Madeira AM, Wan KL, Dear PH and Tomley FM

    Institute for Animal Health, Compton Laboratory, Compton, Nr Newbury, Berkshire RG20 7NN, UK.

    Trends in parasitology 2004;20;5;199-201

  • A public gene trap resource for mouse functional genomics.

    Skarnes WC, von Melchner H, Wurst W, Hicks G, Nord AS, Cox T, Young SG, Ruiz P, Soriano P, Tessier-Lavigne M, Conklin BR, Stanford WL, Rossant J and International Gene Trap Consortium

    Funded by: Wellcome Trust: 077188

    Nature genetics 2004;36;6;543-4

  • The Ensembl Web site: mechanics of a genome browser.

    Stalker J, Gibbins B, Meidl P, Smith J, Spooner W, Hotz HR and Cox AV

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK.

    The Ensembl Web site is the principal user interface to the data of the Ensembl project, and currently serves >500,000 pages (approximately 2.5 million hits) per week, providing access to >80 GB (gigabyte) of data to users in more than 80 countries. Built atop an open-source platform comprising Apache/mod_perl and the MySQL relational database management system, it is modular, extensible, and freely available. It is being actively reused and extended in several different projects, and has been downloaded and installed in companies and academic institutions worldwide. Here, we describe some of the technical features of the site, with particular reference to its dynamic configuration that enables it to handle disparate data from multiple species.

    Genome research 2004;14;5;951-5

  • The notochord.

    Stemple DL

    Vertebrate Development and Genetics (Team 31), Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Current biology : CB 2004;14;20;R873-4

  • TILLING--a high-throughput harvest for functional genomics.

    Stemple DL

    Vertebrate Development and Genetics (Team 31), Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Nature reviews. Genetics 2004;5;2;145-50

  • Pattern formation and developmental mechanisms: super-models of development.

    Stemple DL and Vincent JP

    Current opinion in genetics & development 2004;14;4;325-7

  • Lung cancer: intragenic ERBB2 kinase mutations in tumours.

    Stephens P, Hunter C, Bignell G, Edkins S, Davies H, Teague J, Stevens C, O'Meara S, Smith R, Parker A, Barthorpe A, Blow M, Brackenbury L, Butler A, Clarke O, Cole J, Dicks E, Dike A, Drozd A, Edwards K, Forbes S, Foster R, Gray K, Greenman C, Halliday K, Hills K, Kosmidou V, Lugg R, Menzies A, Perry J, Petty R, Raine K, Ratford L, Shepherd R, Small A, Stephens Y, Tofts C, Varian J, West S, Widaa S, Yates A, Brasseur F, Cooper CS, Flanagan AM, Knowles M, Leung SY, Louis DN, Looijenga LH, Malkowicz B, Pierotti MA, Teh B, Chenevix-Trench G, Weber BL, Yuen ST, Harris G, Goldstraw P, Nicholson AG, Futreal PA, Wooster R and Stratton MR

    Cancer Genome Project, Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK.

    The protein-kinase family is the most frequently mutated gene family found in human cancer and faulty kinase enzymes are being investigated as promising targets for the design of antitumour therapies. We have sequenced the gene encoding the transmembrane protein tyrosine kinase ERBB2 (also known as HER2 or Neu) from 120 primary lung tumours and identified 4% that have mutations within the kinase domain; in the adenocarcinoma subtype of lung cancer, 10% of cases had mutations. ERBB2 inhibitors, which have so far proved to be ineffective in treating lung cancer, should now be clinically re-evaluated in the specific subset of patients with lung cancer whose tumours carry ERBB2 mutations.

    Nature 2004;431;7008;525-6

  • Complete MHC haplotype sequencing for common disease gene mapping.

    Stewart CA, Horton R, Allcock RJ, Ashurst JL, Atrazhev AM, Coggill P, Dunham I, Forbes S, Halls K, Howson JM, Humphray SJ, Hunt S, Mungall AJ, Osoegawa K, Palmer S, Roberts AN, Rogers J, Sims S, Wang Y, Wilming LG, Elliott JF, de Jong PJ, Sawcer S, Todd JA, Trowsdale J and Beck S

    Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    The future systematic mapping of variants that confer susceptibility to common diseases requires the construction of a fully informative polymorphism map. Ideally, every base pair of the genome would be sequenced in many individuals. Here, we report 4.75 Mb of contiguous sequence for each of two common haplotypes of the major histocompatibility complex (MHC), to which susceptibility to >100 diseases has been mapped. The autoimmune disease-associated-haplotypes HLA-A3-B7-Cw7-DR15 and HLA-A1-B8-Cw7-DR3 were sequenced in their entirety through a bacterial artificial chromosome (BAC) cloning strategy using the consanguineous cell lines PGF and COX, respectively. The two sequences were annotated to encompass all described splice variants of expressed genes. We defined the complete variation content of the two haplotypes, revealing >18,000 variations between them. Average SNP densities ranged from less than one SNP per kilobase to >60. Acquisition of complete and accurate sequence data over polymorphic regions such as the MHC from large-insert cloned DNA provides a definitive resource for the construction of informative genetic maps, and avoids the limitation of chromosome regions that are refractory to PCR amplification.

    Funded by: Multiple Sclerosis Society: 588

    Genome research 2004;14;6;1176-87

  • Cancer: understanding the target.

    Stratton MR and Futreal PA

    Nature 2004;430;6995;30

  • The BRAF gene is frequently mutated in malignant melanoma.

    Stratton MR, Wooster RW and Futreal PA

    The Cancer Genome Project, Wellcome Trust Sanger Institute, Hinxton, UK.

    Journal of drugs in dermatology : JDD 2004;3;5;573-5

  • In silico analysis of the sigma54-dependent enhancer-binding proteins in Pirellula species strain 1.

    Studholme DJ and Dixon R

    Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    The planctomycetes are a phylogenetically distinct group of bacteria, widespread in aquatic and terrestrial environments. Their cell walls lack peptidoglycan and their compartmentalised cells undergo a yeast-like budding cell division process. Many bacteria regulate a subset of their genes by an enhancer-dependent mechanism involving the alternative sigma factor sigma54 (RpoN, sigmaN) in association with sigma54-dependent transcriptional activators known as enhancer-binding proteins (EBPs). The sigma54-dependent regulon has previously been studied in several groups of bacteria, but not in the planctomycetes. We wished to exploit the recently published complete genome sequence of Pirellula species strain 1 to predict and analyse the sigma54-dependent regulon in this interesting group of bacteria. The genome of Pirellula species strain 1 encodes one homologue of sigma54, and 16 sigma54-dependent EBPs, including 10 two-component response regulators and a homologue of Escherichia coli RtcR. Two EBPs contain forkhead-associated domains, representing a novel protein domain combination not previously observed in bacterial EBPs and suggesting a novel link between the enhancer-dependent regulon and 'eukaryotic-like' protein phosphorylation in bacterial signal transduction. We identified several potential sigma54-dependent promoters upstream of genes and operons including two homologues of csrA, which encodes the global regulator CsrA, and rtcBA, encoding an RNA 3'-terminal phosphate cyclase. Phylogenetic analysis of EBP sequences from a wide range of bacterial taxa suggested that planctomycete EBPs fall into several distinct clades. Also, the phylogeny of the sigma54 factors is broadly consistent with that of the host organisms. These results are consistent with a very ancient origin of sigma54 within the bacterial lineage. The repertoire of functions predicted to be under the control of the sigma54-dependent regulon in Pirellula shares some similarities (e.g. rtcBA) as well as exhibiting differences with that in other taxonomic groups of bacteria, reinforcing the evolutionarily dynamic nature of this regulon.

    FEMS microbiology letters 2004;230;2;215-25

  • Bioinformatic identification of novel regulatory DNA sequence motifs in Streptomyces coelicolor.

    Studholme DJ, Bentley SD and Kormanec J

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK.

    Background: Streptomyces coelicolor is a bacterium with a vast repertoire of metabolic functions and complex systems of cellular development. Its genome sequence is rich in genes that encode regulatory proteins to control these processes in response to its changing environment. We wished to apply a recently published bioinformatic method for identifying novel regulatory sequence signals to gain new insights into regulation in S. coelicolor.

    Results: The method involved production of position-specific weight matrices from alignments of over-represented words of DNA sequence. We generated 2497 weight matrices, each representing a candidate regulatory DNA sequence motif. We scanned the genome sequence of S. coelicolor against each of these matrices. A DNA sequence motif represented by one of the matrices was found preferentially in non-coding sequences immediately upstream of genes involved in polysaccharide degradation, including several that encode chitinases. This motif (TGGTCTAGACCA) was also found upstream of genes encoding components of the phosphoenolpyruvate phosphotransfer system (PTS). We hypothesise that this DNA sequence motif represents a regulatory element that is responsive to availability of carbon-sources. Other motifs of potential biological significance were found upstream of genes implicated in secondary metabolism (TTAGGTtAGgCTaACCTAA), sigma factors (TGACN19TGAC), DNA replication and repair (ttgtCAGTGN13TGGA), nucleotide conversions (CTACgcNCGTAG), and ArsR (TCAGN12TCAG). A motif found upstream of genes involved in chromosome replication (TGTCagtgcN7Tagg) was similar to a previously described motif found in UV-responsive promoters.

    Conclusions: We successfully applied a recently published in silico method to identify conserved sequence motifs in S. coelicolor that may be biologically significant as regulatory elements. Our data are broadly consistent with and further extend data from previously published studies. We invite experimental testing of our hypotheses in vitro and in vivo.

    BMC microbiology 2004;4;14
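
    The scanning step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' published method: the function names, the pseudocount, the uniform background frequency (a simplification for a GC-rich genome such as S. coelicolor) and the three aligned example words are all assumptions made for illustration. Only the motif TGGTCTAGACCA is taken from the abstract.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned words.

    The pseudocount and the uniform background are illustrative assumptions.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned words must have the same length")
    pwm = []
    for i in range(width):
        column = [site[i].upper() for site in sites]
        total = len(column) + 4 * pseudocount
        # Log2 ratio of the smoothed base frequency to the background frequency.
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical aligned words; only the first two match the reported motif exactly.
example_sites = ["TGGTCTAGACCA", "TGGTCTAGACCA", "TGGACTAGTCCA"]
example_pwm = build_pwm(example_sites)
# Hypothetical upstream sequence with the motif embedded at offset 4.
example_hits = scan("AAAATGGTCTAGACCAGGGG", example_pwm, threshold=10.0)
```

    In a genome-wide scan of this kind, the threshold and the background model largely determine how many candidate sites are reported, so both would need to be chosen with the base composition of the target genome in mind.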

  • Novel protein domains and motifs in the marine planctomycete Rhodopirellula baltica.

    Studholme DJ, Fuerst JA and Bateman A

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    The planctomycetes are a phylum of bacteria that have a unique cell compartmentalisation and yeast-like budding cell division and peptidoglycan-less proteinaceous cell walls. We wished to further our understanding of these unique organisms at the molecular level by searching for conserved amino acid sequence motifs and domains in the proteins encoded by Rhodopirellula baltica. Using BLAST and single-linkage clustering, we have discovered several new protein domains and sequence motifs in this planctomycete. R. baltica has multiple members of the newly discovered GEFGR protein family and the ASPIC C-terminal domain family, whilst most other organisms for which whole genome sequence is available have no more than one. Many of the domains and motifs appear to be restricted to the planctomycetes. It is possible that these protein domains and motifs may have been lost or replaced in other phyla, or they may have undergone multiple duplication events in the planctomycete lineage. One of the novel motifs probably represents a novel N-terminal export signal peptide. With their unique cell biology, it may be that the planctomycete cell compartmentalisation plan in particular needs special membrane transport mechanisms. The discovery of these new domains and motifs, many of which are associated with secretion and cell-surface functions, will help to stimulate experimental work and thus enhance further understanding of this fascinating group of organisms.

    FEMS microbiology letters 2004;236;2;333-40

  • Mutations in the DLG3 gene cause nonsyndromic X-linked mental retardation.

    Tarpey P, Parnau J, Blow M, Woffendin H, Bignell G, Cox C, Cox J, Davies H, Edkins S, Holden S, Korny A, Mallya U, Moon J, O'Meara S, Parker A, Stephens P, Stevens C, Teague J, Donnelly A, Mangelsdorf M, Mulley J, Partington M, Turner G, Stevenson R, Schwartz C, Young I, Easton D, Bobrow M, Futreal PA, Stratton MR, Gecz J, Wooster R and Raymond FL

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom.

    We have identified truncating mutations in the human DLG3 (neuroendocrine dlg) gene in 4 of 329 families with moderate to severe X-linked mental retardation. DLG3 encodes synapse-associated protein 102 (SAP102), a member of the membrane-associated guanylate kinase protein family. Neuronal SAP102 is expressed during early brain development and is localized to the postsynaptic density of excitatory synapses. It is composed of three amino-terminal PDZ domains, an src homology domain, and a carboxyl-terminal guanylate kinase domain. The PDZ domains interact directly with the NR2 subunits of the NMDA glutamate receptor and with other proteins responsible for NMDA receptor localization, immobilization, and signaling. The mutations identified in this study all introduce premature stop codons within or before the third PDZ domain, and it is likely that this impairs the ability of SAP102 to interact with the NMDA receptor and/or other proteins involved in downstream NMDA receptor signaling pathways. NMDA receptors have been implicated in the induction of certain forms of synaptic plasticity, such as long-term potentiation and long-term depression, and these changes in synaptic efficacy have been proposed as neural mechanisms underlying memory and learning. The disruption of NMDA receptor targeting or signaling, as a result of the loss of SAP102, may lead to altered synaptic plasticity and may explain the intellectual impairment observed in individuals with DLG3 mutations.

    Funded by: NICHD NIH HHS: HD 26202

    American journal of human genetics 2004;75;2;318-24

  • Autocatalytic RNA cleavage in the human beta-globin pre-mRNA promotes transcription termination.

    Teixeira A, Tahiri-Alaoui A, West S, Thomas B, Ramadass A, Martianov I, Dye M, James W, Proudfoot NJ and Akoulitchev A

    Sir William Dunn School of Pathology, University of Oxford, South Parks Road, Oxford OX1 3RE, UK.

    New evidence indicates that termination of transcription is an important regulatory step, closely related to transcriptional interference and even transcriptional initiation. However, how this occurs is poorly understood. Recently, in vivo analysis of transcriptional termination for the human beta-globin gene revealed a new phenomenon--co-transcriptional cleavage (CoTC). This primary cleavage event within beta-globin pre-messenger RNA, downstream of the poly(A) site, is critical for efficient transcriptional termination by RNA polymerase II. Here we show that the CoTC process in the human beta-globin gene involves an RNA self-cleaving activity. We characterize the autocatalytic core of the CoTC ribozyme and show its functional role in efficient termination in vivo. The identified core CoTC is highly conserved in the 3' flanking regions of other primate beta-globin genes. Functionally, it resembles the 3' processive, self-cleaving ribozymes described for the protein-encoding genes from the myxomycetes Didymium iridis and Physarum polycephalum, indicating evolutionary conservation of this molecular process. We predict that regulated autocatalytic cleavage elements within pre-mRNAs may be a general phenomenon and that functionally it may provide the entry point for exonucleases involved in mRNA maturation, turnover and, in particular, transcriptional termination.

    Nature 2004;432;7016;526-30

  • The use of genome annotation data and its impact on biological conclusions.

    Tettelin H and Parkhill J

    Nature genetics 2004;36;10;1028-9

  • The role of prophage-like elements in the diversity of Salmonella enterica serovars.

    Thomson N, Baker S, Pickard D, Fookes M, Anjum M, Hamlin N, Wain J, House D, Bhutta Z, Chan K, Falkow S, Parkhill J, Woodward M, Ivens A and Dougan G

    The Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    The Salmonella enterica serovar Typhi CT18 (S.Typhi) chromosome harbours seven distinct prophage-like elements, some of which may encode functional bacteriophages. In silico analyses were used to investigate these regions in S.Typhi CT18, and ultimately compare these integrated bacteriophages against 40 other Salmonella isolates using DNA microarray technology. S.Typhi CT18 contains prophages that show similarity to the lambda, Mu, P2 and P4 bacteriophage families. When compared to other S.Typhi isolates, these elements were generally conserved, supporting a clonal origin of this serovar. However, distinct variation was detected within a broad range of Salmonella serovars; many of the prophage regions are predicted to be specific to S.Typhi. Some of the P2 family prophage analysed have the potential to carry non-essential "cargo" genes within the hyper-variable tail region, an observation that suggests that these bacteriophage may confer a level of specialisation on their host. Lysogenic bacteriophages therefore play a crucial role in the generation of genetic diversity within S.enterica.

    Journal of molecular biology 2004;339;2;279-300

  • Shrinking genomics.

    Thomson NR, Sebaihia M, Cerdeño-Tárraga AM, Holden MT and Parkhill J

    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Two bacteria are featured this month, and both are at the lower end of the genome size scale. The first, Mycoplasma gallisepticum, belongs to a group of bacteria that have been studied both as important human and animal pathogens and in the pursuit of understanding the essential functions of a self-replicating minimal cell. The second, Nanoarchaeum equitans, is an obligate symbiont that only grows in co-culture with another archaeon. N. equitans seems to be the coelacanth of the microbial world--it has been assigned to a new phylum and represents a primitive form of prokaryotic life.

    Nature reviews. Microbiology 2004;2;1;11

  • A giant novel gene undergoing extensive alternative splicing is severed by a Cornelia de Lange-associated translocation breakpoint at 3q26.3.

    Tonkin ET, Smith M, Eichhorn P, Jones S, Imamwerdi B, Lindsay S, Jackson M, Wang TJ, Ireland M, Burn J, Krantz ID, Carr P and Strachan T

    Institute of Human Genetics, International Centre for Life, University of Newcastle, Central Parkway, Newcastle upon Tyne NE1 3BZ, UK.

    Cornelia de Lange syndrome (CdLS) is a rare developmental malformation syndrome characterised by mental handicap, growth retardation, distinctive facial features and limb reduction defects. The vast majority of CdLS cases are sporadic. We carried out a high density bacterial artificial chromosome (BAC) microarray comparative genome hybridisation screen but no evidence was found for a consistent pattern of microdeletion/microduplication. As an alternative, we focused on identifying chromosomal regions spanning associated translocation breakpoints. We prioritised the distal 3q region because of the occurrence, in a classical CdLS patient, of a de novo balanced translocation with a breakpoint at 3q26.3 and of reports of phenotypic overlap between cases of mild CdLS and individuals trisomic for the 3q26-q27 region. We show that the 3q26.3 breakpoint severs a previously uncharacterised giant gene, NAALADL2, containing at least 32 exons spanning 1.37 Mb. Northern blot analysis identified up to six different transcripts in the 1-10 kb range with strongest expression in kidney and placenta; embryonic expression was largely confined to duodenal and stomach endoderm, mesonephros, metanephros and pancreas. Transcript analysis identified extensive alternative splicing leading to multiple 5' and 3' untranslated regions and variable coding sequences. Multiple protein isoforms were defined by different N-terminal regions (with at least four alternative initiating methionine codons), and by differential protein truncation/use of alternative C-terminal sequences attributable to alternative splicing/polyadenylation. Outside the N-terminal regions, the predicted proteins showed significant homology to N-acetylated alpha-linked acidic dipeptidase and transferrin receptors. Mutation screening of NAALADL2 in a panel of CdLS patient DNA samples failed to identify patient-specific mutations. We discuss the possibility that the 3q26.3 translocation could nevertheless contribute to pathogenesis.

    Human genetics 2004;115;2;139-48

  • DNA binding activity of the Escherichia coli nitric oxide sensor NorR suggests a conserved target sequence in diverse proteobacteria.

    Tucker NP, D'Autréaux B, Studholme DJ, Spiro S and Dixon R

    John Innes Centre, Colney, Norwich, United Kingdom.

    The Escherichia coli nitric oxide sensor NorR was shown to bind to the promoter region of the norVW transcription unit, forming at least two distinct complexes detectable by gel retardation. Three binding sites for NorR and two integration host factor binding sites were identified in the norR-norV intergenic region. The derived consensus sequence for NorR binding sites was used to search for novel members of the E. coli NorR regulon and to show that NorR binding sites are partially conserved in other members of the proteobacteria.

    Journal of bacteriology 2004;186;19;6656-60

  • NaCl restriction upregulates renal Slc26a4 through subcellular redistribution: role in Cl- conservation.

    Wall SM, Kim YH, Stanley L, Glapion DM, Everett LA, Green ED and Verlander JW

    Department of Medicine, Emory University, Atlanta, Ga 30322, USA.

    Slc26a4 (Pds, pendrin) is an anion transporter expressed in the apical region of type B and non-A, non-B intercalated cells of the distal nephron. It is upregulated by aldosterone analogues and is critical in the development of mineralocorticoid-induced hypertension. Thus, Slc26a4 expression and its role in blood pressure and fluid and electrolyte homeostasis were explored during NaCl restriction, a treatment model in which aldosterone is appropriately increased. Ultrastructural immunolocalization, balance studies, and cortical collecting ducts (CCDs) perfused in vitro were used. With moderate physiological NaCl restriction, Slc26a4 expression in the apical plasma membrane increased 2- to 3-fold in type B intercalated cells. Because Slc26a4 transports Cl-, we tested whether NaCl balance differs in Slc26a4(+/+) and Slc26a4(-/-) mice during NaCl restriction. Cl- absorption was observed in CCDs from Slc26a4(+/+) but not from Slc26a4(-/-) mice. After moderate NaCl restriction, urinary volume and Cl- excretion were increased in Slc26a4(-/-) relative to Slc26a4(+/+) mice. Moreover, Slc26a4(-/-) mice had evidence of relative vascular volume depletion because they had a higher arterial pH, hematocrit, and blood urea nitrogen than wild-type mice. With moderate NaCl restriction, blood pressure was similar in Slc26a4(+/+) and Slc26a4(-/-) mice. However, on a severely restricted intake of NaCl, Slc26a4(-/-) mice were hypotensive relative to wild-type mice. We conclude that Slc26a4 is upregulated with NaCl restriction and is critical in the maintenance of acid-base balance and in the renal conservation of Cl- and water during NaCl restriction.

    Funded by: NIDDK NIH HHS: DK 52935

    Hypertension 2004;44;6;982-7

  • Mechanism of activation of the RAF-ERK signaling pathway by oncogenic mutations of B-RAF.

    Wan PT, Garnett MJ, Roe SM, Lee S, Niculescu-Duvaz D, Good VM, Jones CM, Marshall CJ, Springer CJ, Barford D, Marais R and Cancer Genome Project

    Section of Structural Biology, The Institute of Cancer Research, Chester Beatty Laboratories, 237 Fulham Road, London SW3 6JB, UK.

    Over 30 mutations of the B-RAF gene associated with human cancers have been identified, the majority of which are located within the kinase domain. Here we show that of 22 B-RAF mutants analyzed, 18 have elevated kinase activity and signal to ERK in vivo. Surprisingly, three mutants have reduced kinase activity towards MEK in vitro but, by activating C-RAF in vivo, signal to ERK in cells. The structures of wild type and oncogenic V599EB-RAF kinase domains in complex with the RAF inhibitor BAY43-9006 show that the activation segment is held in an inactive conformation by association with the P loop. The clustering of most mutations to these two regions suggests that disruption of this interaction converts B-RAF into its active conformation. The high activity mutants signal to ERK by directly phosphorylating MEK, whereas the impaired activity mutants stimulate MEK by activating endogenous C-RAF, possibly via an allosteric or transphosphorylation mechanism.

    Cell 2004;116;6;855-67

  • Polybromo protein BAF180 functions in mammalian cardiac chamber maturation.

    Wang Z, Zhai W, Richardson JA, Olson EN, Meneses JJ, Firpo MT, Kang C, Skarnes WC and Tjian R

    Department of Molecular and Cell Biology, University of California, Berkeley, California 94720-3204, USA.

    BAF and PBAF are two related mammalian chromatin remodeling complexes essential for gene expression and development. PBAF, but not BAF, is able to potentiate transcriptional activation in vitro mediated by nuclear receptors, such as RXRalpha, VDR, and PPARgamma. Here we show that the ablation of PBAF-specific subunit BAF180 in mouse embryos results in severe hypoplastic ventricle development and trophoblast placental defects, similar to those found in mice lacking RXRalpha and PPARgamma. Embryonic aggregation analyses reveal that in contrast to PPARgamma-deficient mice, the heart defects are likely a direct result of BAF180 ablation, rather than an indirect consequence of trophoblast placental defects. We identified potential target genes for BAF180 in heart development, such as S100A13 as well as retinoic acid (RA)-induced targets RARbeta2 and CRABPII. Importantly, BAF180 is recruited to the promoter of these target genes and BAF180 deficiency affects the RA response for CRABPII and RARbeta2. These studies reveal unique functions of PBAF in cardiac chamber maturation.

    Genes & development 2004;18;24;3106-16

  • Tissue microarrays: fast-tracking protein expression at the cellular level.

    Warford A

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK.

    Tissue microarrays maximize returns in cellular pathology whilst minimizing the use of cells and tissues. They are made by arraying cores of tissue taken from multiple donor blocks into a single recipient block. Accordingly, the histology and pathology of several hundred tissues can be represented in one tissue microarray that, when stained by immunohistochemistry, provides comprehensive topographic information on protein expression. Used with complementary techniques, such as complementary DNA microarray analysis, tissue microarrays are providing valuable data for the identification of new markers of disease and assisting in the discovery of therapeutic targets. They are also leading a revolution in cellular pathology as high-throughput technology is introduced to maximize the information provided.

    Expert review of proteomics 2004;1;3;283-92

  • Global gene expression responses of fission yeast to ionizing radiation.

    Watson A, Mata J, Bähler J, Carr A and Humphrey T

    Genome Damage and Stability Centre, University of Sussex, Brighton BN1 9RQ, United Kingdom.

    A coordinated transcriptional response to DNA-damaging agents is required to maintain genome stability. We have examined the global gene expression responses of the fission yeast Schizosaccharomyces pombe to ionizing radiation (IR) by using DNA microarrays. We identified approximately 200 genes whose transcript levels were significantly altered at least twofold in response to 500 Gy of gamma IR in a temporally defined manner. The majority of induced genes were core environmental stress response genes, whereas the remaining genes define a transcriptional response to DNA damage in fission yeast. Surprisingly, few DNA repair and checkpoint genes were transcriptionally modulated in response to IR. We define a role for the stress-activated mitogen-activated protein kinase Sty1/Spc1 and the DNA damage checkpoint kinase Rad3 in regulating core environmental stress response genes and IR-specific response genes, both independently and in concert. These findings suggest that a complex network of regulatory pathways coordinates gene expression responses to IR in eukaryotes.

    Funded by: Cancer Research UK: A6517; Wellcome Trust: 077118

    Molecular biology of the cell 2004;15;2;851-60

  • Tyrosine site-specific recombinases mediate DNA inversions affecting the expression of outer surface proteins of Bacteroides fragilis.

    Weinacht KG, Roche H, Krinos CM, Coyne MJ, Parkhill J and Comstock LE

    Channing Laboratory, Brigham and Women's Hospital, Harvard Medical School, 181 Longwood Avenue, Boston, MA 02115, USA.

    The chromosome of Bacteroides fragilis has been shown to undergo 13 distinct DNA inversions affecting the expression of capsular polysaccharides and mediated by a serine site-specific recombinase designated Mpi. In this study, we demonstrate that members of the tyrosine site-specific recombinase family, conserved in B. fragilis, mediate additional DNA inversions of the B. fragilis genome. These DNA invertases flip promoter regions in their immediate downstream region. The genetic organization of the genes regulated by these invertible promoter regions suggests that they are operons and many of the products are predicted to be outer membrane proteins. Phenotypic analysis of a deletion mutant of one of these DNA invertases, tsr15 (aapI), which resulted in the promoter region for the downstream genes being locked ON, confirmed the synthesis of multiple surface proteins by this operon. In addition, this deletion mutant demonstrated an autoaggregative phenotype and showed significantly greater adherence than wild-type organisms in a biofilm assay, suggesting a possible functional role for these phase-variable outer surface proteins. This study demonstrates that DNA inversion is a universal mechanism used by this commensal microorganism to phase vary expression of its surface molecules and involves at least three conserved DNA invertases from two evolutionarily distinct families.

    Funded by: NIAID NIH HHS: AI44193, R01 AI044193-05, R01 AI044193-06

    Molecular microbiology 2004;53;5;1319-30

  • A genome sequence survey of the filarial nematode Brugia malayi: repeats, gene discovery, and comparative genomics.

    Whitton C, Daub J, Quail M, Hall N, Foster J, Ware J, Ganatra M, Slatko B, Barrell B and Blaxter M

    Ashworth Laboratories, Institute of Cell, Animal and Population Biology, School of Biological Sciences, University of Edinburgh, King's Buildings, Edinburgh EH9 3JT, UK.

    Comparative nematode genomics has thus far been largely constrained to the genus Caenorhabditis, but a huge diversity of other nematode species, and genomes, exist. The Brugia malayi genome is approximately 100 Mb in size, and distributed across five chromosome pairs. Previous genomic investigations have included definition of major repeat classes and sequencing of selected genes. We have generated over 18,000 sequences from the ends of large-insert clones from bacterial artificial chromosome libraries. These end sequences, totalling over 10 Mb of sequence, contain just under 8 Mb of unique sequence. We identified the known Mbo I and Hha I repeat families in the sequence data, and also identified several new repeats based on their abundance. Genomic copies of 17% of B. malayi genes defined by expressed sequence tags have been identified. Nearly one quarter of end sequences can encode peptides with significant similarity to protein sequences in the public databases, and we estimate that we have identified more than 2700 new B. malayi genes. Importantly, 459 end sequences had homologues in other organisms, but lacked a match in the completely sequenced genomes of Caenorhabditis briggsae and Caenorhabditis elegans, emphasising the role of gene loss in genome evolution. B. malayi is estimated to have over 18,500 protein-coding genes.

    Molecular and biochemical parasitology 2004;137;2;215-27

  • Fine mapping, gene content, comparative sequencing, and expression analyses support Ctla4 and Nramp1 as candidates for Idd5.1 and Idd5.2 in the nonobese diabetic mouse.

    Wicker LS, Chamberlain G, Hunter K, Rainbow D, Howlett S, Tiffen P, Clark J, Gonzalez-Munoz A, Cumiskey AM, Rosa RL, Howson JM, Smink LJ, Kingsnorth A, Lyons PA, Gregory S, Rogers J, Todd JA and Peterson LB

    Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, Department of Medical Genetics, Cambridge Institute for Medical Research, University of Cambridge, Cambridge CB2 2XY, UK.

    At least two loci that determine susceptibility to type 1 diabetes in the NOD mouse have been mapped to chromosome 1, Idd5.1 (insulin-dependent diabetes 5.1) and Idd5.2. In this study, using a series of novel NOD.B10 congenic strains, Idd5.1 has been defined to a 2.1-Mb region containing only four genes, Ctla4, Icos, Als2cr19, and Nrp2 (neuropilin-2), thereby excluding a major candidate gene, Cd28. Genomic sequence comparison of the two functional candidate genes, Ctla4 and Icos, from the B6 (resistant at Idd5.1) and the NOD (susceptible at Idd5.1) strains revealed 62 single nucleotide polymorphisms (SNPs), only two of which were in coding regions. One of these coding SNPs, base 77 of Ctla4 exon 2, is a synonymous SNP and has been correlated previously with type 1 diabetes susceptibility and differential expression of a CTLA-4 isoform. Additional expression studies in this work support the hypothesis that this SNP in exon 2 is the genetic variation causing the biological effects of Idd5.1. Analysis of additional congenic strains has also localized Idd5.2 to a small region (1.52 Mb) of chromosome 1, but in contrast to the Idd5.1 interval, Idd5.2 contains at least 45 genes. Notably, the Idd5.2 region still includes the functionally polymorphic Nramp1 gene. Future experiments to test the identity of Idd5.1 and Idd5.2 as Ctla4 and Nramp1, respectively, can now be justified using approaches to specifically alter or mimic the candidate causative SNPs.

    Journal of immunology (Baltimore, Md. : 1950) 2004;173;1;164-73

  • Replication timing of the human genome.

    Woodfine K, Fiegler H, Beare DM, Collins JE, McCann OT, Young BD, Debernardi S, Mott R, Dunham I and Carter NP

    The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge, UK.

    We have developed a directly quantitative method utilizing genomic clone DNA microarrays to assess the replication timing of sequences during the S phase of the cell cycle. The genomic resolution of the replication timing measurements is limited only by the genomic clone size and density. We demonstrate the power of this approach by constructing a genome-wide map of replication timing in human lymphoblastoid cells using an array with clones spaced at 1 Mb intervals and a high-resolution replication timing map of 22q with an array utilizing overlapping sequencing tile path clones. We show a positive correlation, both genome-wide and at a high resolution, between replication timing and a range of genome parameters including GC content, gene density and transcriptional activity.

    Human molecular genetics 2004;13;2;191-202

  • Delta proteins and MAGI proteins: an interaction of Notch ligands with intracellular scaffolding molecules and its significance for zebrafish development.

    Wright GJ, Leslie JD, Ariza-McNaughton L and Lewis J

    Vertebrate Development Laboratory, Cancer Research UK London Research Institute, 44 Lincoln's Inn Fields, London WC2A 3PX, UK.

    Delta proteins activate Notch through a binding reaction that depends on their extracellular domains; but the intracellular (C-terminal) domains of the Deltas also have significant functions. All classes of vertebrates possess a subset of Delta proteins with a conserved ATEV* motif at their C termini. These ATEV Deltas include Delta1 and Delta4 in mammals and DeltaD and DeltaC in the zebrafish. We show that these Deltas associate with the membrane-associated scaffolding proteins MAGI1, MAGI2 and MAGI3, through a direct interaction between the C termini of the Deltas and a specific PDZ domain (PDZ4) of the MAGIs. In cultured cells and in subsets of cells in the intact zebrafish embryo, DeltaD and MAGI1 are co-localized at the plasma membrane. The interaction and the co-localization can be abolished by injection of a morpholino that blocks the mRNA splicing reaction that gives DeltaD its terminal valine, on which the interaction depends. Embryos treated in this way appear normal with respect to some known functions of DeltaD as a Notch ligand, including the control of somite segmentation, neurogenesis, and hypochord formation. They do, however, show an anomalous distribution of Rohon-Beard neurons in the dorsal neural tube, suggesting that the Delta-MAGI interaction may play some part in the control of neuron migration.

    Development (Cambridge, England) 2004;131;22;5659-69

  • Reduced penetrance of craniofacial anomalies as a function of deletion size and genetic background in a chromosome engineered partial mouse model for Smith-Magenis syndrome.

    Yan J, Keener VW, Bi W, Walz K, Bradley A, Justice MJ and Lupski JR

    Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA.

    Smith-Magenis syndrome (SMS) is a multiple congenital anomaly/mental retardation syndrome associated with del(17)(p11.2p11.2). The phenotype is variable even in patients with deletions of the same size. RAI1 has been recently suggested as a major gene for the majority of the SMS phenotypes, but its role in the full spectrum of the phenotype remains unclear. Df(11)17/+ mice contain a heterozygous deletion in the mouse region syntenic to the SMS common deletion, and exhibit craniofacial abnormalities, seizures and marked obesity, partially reproducing the SMS phenotype. To further study the genetic basis for the phenotype, we constructed three lines of mice with smaller deletions [Df(11)17-1, Df(11)17-2 and Df(11)17-3] using retrovirus-mediated chromosome engineering to create nested deletions. Both craniofacial abnormalities and obesity have been observed, but the penetrance of the craniofacial phenotype was markedly reduced when compared with Df(11)17/+ mice. Overt seizures were not observed. Phenotypic variation has been observed in mice with the same deletion size in the same and in different genetic backgrounds, which may reflect the variation documented in the patients. These results indicate that the smaller deletions contain the gene(s), most likely Rai1, causing craniofacial abnormalities and obesity. However, genes or regulatory elements in the larger deletion, which are not located in the smaller deletions, as well as genes located elsewhere, also influence penetrance and expressivity of the phenotype. Our mouse models refined the genomic region important for a portion of the SMS phenotype and provided a basis for further molecular analysis of genes associated with SMS.

    Funded by: NCI NIH HHS: P01 CA75719; NIDCR NIH HHS: R01 DE015210

    Human molecular genetics 2004;13;21;2613-24

  • Identification of PSD-95 as a regulator of dopamine-mediated synaptic and behavioral plasticity.

    Yao WD, Gainetdinov RR, Arbuckle MI, Sotnikova TD, Cyr M, Beaulieu JM, Torres GE, Grant SG and Caron MG

    Howard Hughes Medical Institute Laboratories, Department of Cell Biology, Duke University Medical Center, Durham, NC 27710, USA.

    To identify the molecular mechanisms underlying psychostimulant-elicited plasticity in the brain reward system, we undertook a phenotype-driven approach using genome-wide microarray profiling of striatal transcripts from three genetic and one pharmacological mouse models of psychostimulant or dopamine supersensitivity. A small set of co-affected genes was identified. One of these genes, encoding the synaptic scaffolding protein PSD-95, is downregulated in the striatum of all three mutants and in chronically, but not acutely, cocaine-treated mice. At the synaptic level, enhanced long-term potentiation (LTP) of the frontocortico-accumbal glutamatergic synapses correlates with PSD-95 reduction in every case. Finally, targeted deletion of PSD-95 in an independent line of mice enhances LTP and augments the acute locomotor-stimulating effects of cocaine, but leads to no further behavioral plasticity in response to chronic cocaine. Our findings uncover a previously unappreciated role of PSD-95 in psychostimulant action and identify a molecular and cellular mechanism shared between drug-related plasticity and learning.

    Funded by: NIDA NIH HHS: DA13511

    Neuron 2004;41;4;625-38

  • The PepSY domain: a regulator of peptidase activity in the microbial environment?

    Yeats C, Rawlings ND and Bateman A

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    Trends in biochemical sciences 2004;29;4;169-72

  • Resolution of the novel immune-type receptor gene cluster in zebrafish.

    Yoder JA, Litman RT, Mueller MG, Desai S, Dobrinski KP, Montgomery JS, Buzzeo MP, Ota T, Amemiya CT, Trede NS, Wei S, Djeu JY, Humphray S, Jekosch K, Hernandez Prada JA, Ostrov DA and Litman GW

    Department of Molecular Biomedical Sciences, College of Veterinary Medicine, North Carolina State University, 4700 Hillsborough Street, Raleigh, NC 27606, USA.

    The novel immune-type receptor (NITR) genes encode a unique multigene family of leukocyte regulatory receptors, which possess an extracellular Ig variable (V) domain and may function in innate immunity. Artificial chromosomes that encode zebrafish NITRs have been assembled into a contig spanning approximately 350 kb. Resolution of the complete NITR gene cluster has led to the identification of eight previously undescribed families of NITRs and has revealed the presence of C-type lectins within the locus. A maximum haplotype of 36 NITR genes (138 gene sequences in total) can be grouped into 12 distinct families, including inhibitory and activating receptors. An extreme level of interindividual heterozygosity is reflected in allelic polymorphisms, haplotype variation, and family-specific isoform complexity. In addition, the exceptional diversity of NITR sequences among species suggests divergent evolution of this multigene family with a birth-and-death process of member genes. High-confidence modeling of Nitr V-domain structures reveals a significant shift in the spatial orientation of the Ig fold, in the region of highest interfamily variation, compared with Ig V domains. These studies resolve a complete immune gene cluster in zebrafish and indicate that the NITRs represent the most complex family of activating/inhibitory surface receptors thus far described.

    Proceedings of the National Academy of Sciences of the United States of America 2004;101;44;15706-11

  • Repression of nodal expression by maternal B1-type SOXs regulates germ layer formation in Xenopus and zebrafish.

    Zhang C, Basta T, Hernandez-Lagunas L, Simpson P, Stemple DL, Artinger KB and Klymkowsky MW

    Molecular, Cellular and Developmental Biology, University of Colorado, Boulder, 80309-0347, USA.

    B1-type SOXs (SOXs 1, 2, and 3) are the most evolutionarily conserved subgroup of the SOX transcription factor family. To study their maternal functions, we used the affinity-purified antibody antiSOX3c, which inhibits the binding of Xenopus SOX3 to target DNA sequences [Development. 130(2003)5609]. The antibody also cross-reacts with zebrafish embryos. When injected into fertilized Xenopus or zebrafish eggs, antiSOX3c caused a profound gastrulation defect; this defect could be rescued by the injection of RNA encoding SOX3DeltaC-EnR, a SOX3-engrailed repression domain chimera. In antiSOX3c-injected Xenopus embryos, normal animal-vegetal patterning of mesodermal and endodermal markers was disrupted, expression domains were shifted toward the animal pole, and the levels of the endodermal markers SOX17 and endodermin increased. In Xenopus, SOX3 acts as a negative regulator of Xnr5, which encodes a nodal-related TGFbeta-family protein. Two nodal-related proteins are expressed in the early zebrafish embryo, squint and cyclops; antiSOX3c-injection leads to an increase in the level of cyclops expression. In both Xenopus and zebrafish, the antiSOX3c phenotype was rescued by the injection of RNA encoding the nodal inhibitor Cerberus-short (CerS). In Xenopus, antiSOX3c's effects on endodermin expression were suppressed by injection of RNA encoding a dominant negative version of Mixer or a morpholino against SOX17alpha2, both of which act downstream of nodal signaling in the endoderm specification pathway. Based on these data, it appears that maternal B1-type SOX functions together with the VegT/beta-catenin system to regulate nodal expression and to establish the normal pattern of germ layer formation in Xenopus. A mechanistically conserved system appears to act in a similar manner in the zebrafish.

    Funded by: NIDCR NIH HHS: K22DE14200; NIGMS NIH HHS: GM54001

    Developmental biology 2004;273;1;23-37

  • Impact of population structure, effective bottleneck time, and allele frequency on linkage disequilibrium maps.

    Zhang W, Collins A, Gibson J, Tapper WJ, Hunt S, Deloukas P, Bentley DR and Morton NE

    Human Genetics Division, Duthie Building (Mailpoint 808), Southampton General Hospital, Tremona Road, Southampton SO16 6YD, United Kingdom.

    Genetic maps in linkage disequilibrium (LD) units play the same role for association mapping as maps in centimorgans play at much lower resolution for linkage mapping. Association mapping of genes determining disease susceptibility and other phenotypes is based on the theory of LD, here applied to relations with three phenomena. To test the theory, markers at high density along a 10-Mb continuous segment of chromosome 20q were studied in African-American, Asian, and Caucasian samples. Population structure, whether created by pooling samples from divergent populations or by the mating pattern in a mixed population, is accurately bioassayed from genotype frequencies. The effective bottleneck time for Eurasians is substantially less than for migration out of Africa, reflecting later bottlenecks. The classical dependence of allele frequency on mutation age does not hold for the generally shorter time span of inbreeding and LD. Limitation of the classical theory to mutation age justifies the assumption of constant time in an LD map, except for alleles that were rare at the effective bottleneck time or have arisen since. This assumption is derived from the Malecot model and verified in all samples. Tested measures of relative efficiency, support intervals, and localization error determine the operating characteristics of LD maps that are applicable to every sexually reproducing species, with implications for association mapping, high-resolution linkage maps, evolutionary inference, and identification of recombinogenic sequences.

    Funded by: NIGMS NIH HHS: GM42947

    Proceedings of the National Academy of Sciences of the United States of America 2004;101;52;18075-80

  • Mutations of BRAF and KRAS in gastric cancer and their association with microsatellite instability.

    Zhao W, Chan TL, Chu KM, Chan AS, Stratton MR, Yuen ST and Leung SY

    International journal of cancer. Journal international du cancer 2004;108;1;167-9

  • Shotgun optical mapping of the entire Leishmania major Friedlin genome.

    Zhou S, Kile A, Kvikstad E, Bechner M, Severin J, Forrest D, Runnheim R, Churas C, Anantharaman TS, Myler P, Vogt C, Ivens A, Stuart K and Schwartz DC

    Laboratory for Molecular and Computational Genomics, UW Biotechnology Center, University of Wisconsin-Madison, 425 Henry Mall, Madison, WI 53706, USA.

    Leishmania is a group of protozoan parasites that cause a broad spectrum of diseases resulting in widespread human suffering and death, as well as economic loss from the infection of some domestic animals and wildlife. To further understand the fundamental genomic architecture of this parasite, and to accelerate the on-going sequencing project, a whole-genome XbaI restriction map was constructed using the optical mapping system. This map supplemented traditional physical maps that were generated by fingerprinting and hybridization of cosmid and P1 clone libraries. Thirty-six optical map contigs were constructed for the corresponding known 36 chromosomes of the Leishmania major Friedlin genome. The chromosome sizes ranged from 326.9 to 2821.3 kb, with a total genome size of 34.7 Mb; the average XbaI restriction fragment was 25.3 kb, and ranged from 15.7 to 77.8 kb on a per-chromosome basis. Comparison between the optical maps and the in silico maps of sequence drawn from completed, nearly finished, or large sequence contigs showed that optical maps served several useful functions on the path to finished sequence: guiding aspects of the sequence assembly, identifying misassemblies, detecting cosmid or PAC clone misplacements to chromosomes, and validating sequence stemming from varying degrees of finishing. Our results also showed the potential use of optical maps as a means to detect and characterize map segmental duplication within genomes.

    Molecular and biochemical parasitology 2004;138;1;97-106
