The fast-growing field of metagenomics enables rapid identification and unbiased characterization of the microbial communities associated with different biological specimens, based on high-throughput DNA sequencing coupled with robust bioinformatics. While all steps of this multistep process (sample collection to data analysis) are important in achieving high identification accuracy and maintaining the proper biological representativeness of the community, bioinformatic analysis of complex metagenomic datasets represents one of the most critical and challenging steps.
Tools can produce either very aggressive or highly conservative predictions of community composition. To be used reliably in multi-disciplinary microbiome applications, a tool must combine high overall classification accuracy and detection resolution with low rates of false positives and false negatives. A recent independent study, published in Genome Biology and led by a prestigious team of researchers from Weill Cornell Medicine, HudsonAlpha, IBM, University of VT, University of CA, and NY Medical College, evaluated several metagenomic classifiers and their performance on both in-silico and lab-created benchmarking datasets. Together, the datasets form one of the largest shotgun metagenomics collections used in a benchmarking study. They comprise both synthetic datasets (created by computer from known reference genome sequences and error rates) and laboratory-constructed datasets (known DNA or laboratory strains spiked in to simulate a metagenomic sample). The microbial composition of every dataset is known.
The study employed the following key evaluation criteria for the tools:
- Recall / Sensitivity – number of organisms correctly identified out of the number actually present in the sample (the share of relevant results that were returned)
- Selectivity / Precision – number of organisms correctly identified out of all organisms the tool reported (the share of returned results that were relevant)
A trade-off is often made between precision and recall. A tool with high recall but low precision will yield many identifications, but a large number of them will prove inaccurate (high false positives). A tool with high precision but low recall will yield fewer identifications, which will be accurate, but some organisms will be missed (high false negatives).
- F1 – the harmonic mean of recall and precision, used as the overall score to judge performance
- AUPR – Area Under the Precision-Recall curve – the area under the curve obtained when precision is plotted against recall
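The four criteria above can be illustrated with a minimal Python sketch. It is not the benchmarking study's code or CosmosID's: the function names, the toy taxa and the abundances are all hypothetical. It also assumes that the precision-recall curve is traced by progressively lowering a minimum relative-abundance cutoff, and that the curve starts at recall 0 with precision 1. Other conventions exist and would give somewhat different AUPR values.

```python
def precision_recall_f1(predicted, truth):
    """Return (precision, recall, F1) for reported taxa versus the known composition."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def aupr(abundances, truth):
    """Area under the precision-recall curve (trapezoidal rule).

    Assumption: the curve is traced by lowering a relative-abundance cutoff,
    so each step admits more taxa and recall never decreases.
    """
    points = [(0.0, 1.0)]  # assumed starting point: nothing reported yet
    for cutoff in sorted(set(abundances.values()), reverse=True):
        kept = {taxon for taxon, value in abundances.items() if value >= cutoff}
        precision, recall, _ = precision_recall_f1(kept, truth)
        points.append((recall, precision))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Hypothetical sample: four organisms are truly present, the tool reports five.
truth = {"E. coli", "S. aureus", "B. longum", "K. pneumoniae"}
reported = {
    "E. coli": 0.50,
    "S. aureus": 0.30,
    "B. longum": 0.15,
    "L. casei": 0.04,   # false positive
    "S. mutans": 0.01,  # false positive
}

precision, recall, f1 = precision_recall_f1(reported, truth)
print(precision, recall, round(f1, 3))  # 0.6 0.75 0.667
print(aupr(reported, truth))            # 0.75
```

In this toy case the two low-abundance false positives pull precision down to 0.6, while the one missed organism caps recall at 0.75. Raising the abundance cutoff removes the false positives without losing any true positives, which is the trade-off the precision-recall curve summarizes.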
The same evaluation criteria were applied to CosmosID cloud metagenomics on the 35 datasets. The results show CosmosID to be the best-in-class bioinformatics tool, offering the highest identification accuracy and unrivaled detection resolution.
As shown in the figure above, CosmosID offers the best identification accuracy across the entire benchmarking dataset based on F1, precision, recall, and AUPR (16, 17). Most importantly, unlike other tools, it maintains identification accuracy at all taxonomic levels. Most of the tools fall well short when classifying organisms at sub-species (and strain) level resolution, whereas CosmosID provided unrivaled accuracy at sub-species level. According to Dirk Gevers, a world-renowned microbiome expert, “The unit of microbial action is a strain, not a species. Being able to differentiate between different strains matters greatly when scientists work on ways to intervene with a microbial community by introducing new strains” (15).
As metagenomics increasingly becomes a method of choice across multi-disciplinary applications, the importance of sub-species and strain level variation is becoming ever more apparent (1-9). For example, specific strains of Streptococcus mutans produce hemorrhagic damage in the murine brain and other tissues (6, 9), whereas other strains are risk factors for ulcerative colitis (4). Likewise, different strains of the protozoan parasite Toxoplasma gondii manifest diverse pathologies and elicit altered host responses (7). Particular variants of Staphylococcus epidermidis (2) and Staphylococcus aureus (3) affect virulence and biofilm formation. Certain strains of Bifidobacterium longum, but not others, protect against pathogens like Escherichia coli (1), and still others show differential immunomodulatory properties (5). Similarly, strain-specific immunomodulatory effects are seen for Propionibacterium freudenreichii (10). For another probiotic agent, Lactobacillus casei, variants derived from different ecological niches vary in their ability to bind foodborne carcinogens (11). The importance of strain resolution is even more apparent when tracing the source of an outbreak, as exemplified by nosocomial infections caused by Legionella pneumophila (12, 13) and Klebsiella pneumoniae (14). These examples underscore why sub-species and strain level identification is so crucial to our understanding of microbial symbiosis and dysbiosis. They also demonstrate the power of CosmosID metagenomics in defining microbiome composition at a finer taxonomic resolution – critical information for microbiome research, epidemiological studies, microbial forensics, and outbreak investigations.
CosmosID provides the most rapid and accurate metagenomic identification and relative abundance estimation among all of the tools used in this study. Furthermore, it can detect organisms at sub-species and strain level with high accuracy – a capability largely missing from competing tools. This resolution is made possible by CosmosID’s phylogenetically organized and expert-curated genome databases. CosmosID has analyzed over 35,000 biological samples from human, animal, plant, food, water, and soil sources.
Try out the CosmosID microbial analysis platform at app.cosmosid.com. It is available both as a cloud-based application and as a command line API, providing automated identification, functional characterization (including antibiotic resistance and virulence factors), and an assortment of visualizations for individual samples and comparative analysis. CosmosID also offers in-house services, including study design, nucleic acid extraction, sequencing, bioinformatics, comparative statistical analyses, and customized publication-ready visualizations.
Questions: email@example.com or call (703) 995-9879
Solid (not SOLiD) paper on metagenomic classifiers by the @mason_lab (cross-species contamination anyone?). Also ran @CosmosID on some of gold-standard data with similar impressive results as . #ScienceItWorks (Nils Homer, @nilshomer, January 12, 2018)
1) Fukuda S, Toh H, Hase K, Oshima K, Nakanishi Y, et al. (2011) Bifidobacteria can protect from enteropathogenic infection through production of acetate. Nature 469: 543-547.
2) Gill SR, Fouts DE, Archer GL, Mongodin EF, Deboy RT, et al. (2005) Insights on evolution of virulence and resistance from the complete genome analysis of an early methicillin-resistant Staphylococcus aureus strain and a biofilm-producing methicillin-resistant Staphylococcus epidermidis strain. J Bacteriol 187: 2426-2438.
3) Iwase T, Uehara Y, Shinji H, Tajima A, Seo H, et al. (2010) Staphylococcus epidermidis Esp inhibits Staphylococcus aureus biofilm formation and nasal colonization. Nature 465: 346-349.
4) Kojima A, Nakano K, Wada K, Takahashi H, Katayama K, et al. (2012) Infection of specific strains of Streptococcus mutans, oral bacteria, confers a risk of ulcerative colitis. Sci Rep 2: 332.
5) Medina M, Izquierdo E, Ennahar S, Sanz Y (2007) Differential immunomodulatory properties of Bifidobacterium logum strains: relevance to probiotic selection and clinical applications. Clin Exp Immunol 150: 531-538.
6) Nakano K, Hokamura K, Taniguchi N, Wada K, Kudo C, et al. (2011) The collagen-binding protein of Streptococcus mutans is involved in haemorrhagic stroke. Nat Commun 2: 485.
7) Saeij JP, Boyle JP, Boothroyd JC (2005) Differences among the three major strains of Toxoplasma gondii and their specific interactions with the infected host. Trends Parasitol 21: 476-481.
8) Sharon I, Morowitz MJ, Thomas BC, Costello EK, Relman DA, et al. (2013) Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res 23: 111-120.
9) Wada K, Nakano K, Ooshima T, Kamisaki Y (2010) Bacteremia by virulent oral bacteria is a potent risk factor for stroke under endothelial cell injury condition. J Pharmacol Sci 112.
10) Foligne B, Deutsch SM, Breton J, Cousin FJ, Dewulf J, et al. (2010) Promising immunomodulatory effects of selected strains of dairy propionibacteria as evidenced in vitro and in vivo. Appl Environ Microbiol 76: 8259-8264.
11) Hernandez-Mendoza A, Garcia HS, Steele JL (2009) Screening of Lactobacillus casei strains for their ability to bind aflatoxin B1. Food Chem Toxicol 47: 1064-1068.
12) Helbig JH, Kurtz JB, Pastoris MC, Pelaz C, Luck PC (1997) Antigenic lipopolysaccharide components of Legionella pneumophila recognized by monoclonal antibodies: possibilities and limitations for division of the species into serogroups. J Clin Microbiol 35: 2841-2845.
13) Visca P, Goldoni P, Luck PC, Helbig JH, Cattani L, et al. (1999) Multiple types of Legionella pneumophila serogroup 6 in a hospital heated-water system associated with sporadic infections. J Clin Microbiol 37: 2189-2196.
14) Snitkin ES, Zelazny AM, Thomas PJ, Stock F, Group NCSP, et al. (2012) Tracking a hospital outbreak of carbapenem-resistant Klebsiella pneumoniae with whole-genome sequencing. Sci Transl Med 4: 148ra116.
15) Marx V (2016) Microbiology: the road to strain-level identification. Nat Methods 13: 401-404.
16) CosmosID results (login: firstname.lastname@example.org, pw: 1600password).
17) McIntyre A, et al. (2017) Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Genome Biol 18: 182.