New database of 660,000 assembled bacterial genomes sheds light on the evolution of bacteria  

A vast, curated collection of bacterial genomes has been created that allows the community unprecedented access to data. 

Ninety per cent of the bacterial genomes sequenced belong to a restricted set of only 20 bacterial species, out of an estimated 45,000*, highlighting the knowledge gaps in available genomic data and showing how this distorts our view of bacterial diversity, new research has suggested.

In a new study, from the Wellcome Sanger Institute and EMBL’s European Bioinformatics Institute (EMBL-EBI), researchers standardised all bacterial genome data held in the European Nucleotide Archive (ENA) before 2019, creating a searchable and accessible database of genomic assemblies.

In the research, published today (9 November 2021) in PLOS Biology, researchers reviewed all of the bacterial data available as of November 2018 and assembled it into over 660,000 genomes. This has been released as a new open access database designed to help scientists all around the world answer basic questions on bacterial evolution, by considering all data in a standardised and comprehensive manner.

In addition, over 300,000 of these genomes had never been fully assembled before. The study highlights the composition of current genomic data resources, showing biases in the data submitted to these archives and therefore in our window into bacterial diversity.

Genomic data exist in public archives as unprocessed raw sequences, or as assembled data that have been processed with multiple different techniques. When these are assembled in a standardised and comprehensive way, people can search and analyse all existing data, i.e. the whole genetic picture. Processing the whole database in this way allows data to be seen in this wider context, rather than limiting researchers to snapshots of genomic data archives in isolation.

While analysing the data contained in the public archives, the researchers were surprised to find that the majority of data come from the same 20 species of bacteria. Notably, almost one third of the total data came from Salmonella enterica, a bacterium well known to cause foodborne illness.

Whilst Salmonella infections can lead to hospitalisations and are an important cause of death worldwide, there are many other important pathogens that are not well represented in this data archive. There is also a lack of data on the bacteria known to keep us healthy, such as those making up the gut microbiome.

By highlighting the gaps in the data, researchers hope to ensure that others are aware of how the data are skewed and how this might affect our interpretation of the data, and to encourage discussion around these issues in research. The dataset is now live and available for free access across the globe.

“The exercise gave us a detailed overview of the bacteria sequenced over the last 30 years. It confirms that researchers have been focusing on a small number of pathogens from a restricted number of sources. Such a narrow focus restricts our ability to truly understand key questions in bacterial evolution and public health, including the sources of antimicrobial resistance. We know that the genes that confer antimicrobial resistance exist in a much wider range of species than just those few pathogens that are the focus of attention for funders. By expanding and standardising the archive data, we can get a clearer picture of what is going on. This study highlights the need to widen the range of bacterial species we sequence, and to create better mechanisms for sharing the data with the community, to help answer priority questions for researchers and public health authorities alike.”

Dr Zamin Iqbal, co-senior author and Group Leader at EMBL-EBI

“I study genomic elements that are able to move freely between different bacteria, many of which can contribute to the spread of antimicrobial resistance genes. To do this, I need to search and analyse as many bacterial genomes as possible in a simple and fast way. Public data can be quite messy and need to be processed uniformly, including quality control, before they can be used for this type of analysis. So along with a few colleagues, we decided to ‘tidy up’ the data and make it easier for everyone to ask essential research questions.”

Dr Grace Blackwell, first author and joint EMBL-EBI and Sanger Institute Postdoctoral (ESPOD) Fellow

“We rely on the genomic archives to provide the context to our research on public health questions and for basic science. It is against these data we identify new species, view the emergence of new pathogens or antimicrobial resistance genes, or see the pathways through which bacteria move across the globe. It is our intellectual point of reference. By processing it uniformly we have tried to show the huge opportunities and wealth of biological data that are hidden in these genomes as well as making people aware of any possible limitations. This database will enable new opportunities for science and we want to ensure people are able to access it fully through this collaborative study.”

Professor Nicholas Thomson, co-senior author and Head of the Parasites and Microbes Programme at the Wellcome Sanger Institute

More information

*Genome Taxonomy Database. Available here: [Accessed November 2021]

The curated ENA database of 661K bacterial genomes is available here:


Blackwell, G. A., et al. (2021). Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLOS Biology. DOI: 10.1371/journal.pbio.3001421


This work was supported by Wellcome. Martin Hunt was funded by a Wellcome Trust/Newton Fund-MRC Collaborative Award and an award from the Bill & Melinda Gates Foundation Trust.