Caution is needed when combining viral identification methods: Insights from benchmarking in silico approaches for viral discovery
Bridget Hegarty 1*, James Riddell 2, Melissa Duhaime 3
1. Civil and Environmental Engineering, Case Western Reserve University, Cleveland, OH, USA
2. Department of Microbiology, The Ohio State University, Columbus, OH, USA
3. Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
Viral analyses require accurately differentiating viral from cellular sequences in complex metagenomes. This remains challenging because viruses lack universal marker genes, mutate rapidly, and are poorly represented in reference databases. A growing number of bioinformatics tools have been developed to recover viral sequences from metagenomic datasets. Each of these tools has its own biases, making it difficult for researchers to know which tool(s) are best suited to a given application. Many researchers therefore combine the output of multiple viral identification tools (“tools”), aiming to exploit the unique strengths of each and distill a set of higher-confidence viral sequences. However, this multi-tool approach has yet to be rigorously benchmarked.
Hypothesizing that a multi-tool approach would discover more viruses without greatly increasing contamination, we evaluated 27 published viral identification tools. From the tools that met our preliminary requirements, we built 6 rules and benchmarked all 63 of their possible combinations (“rulesets”) using a mock environmental metagenomic dataset composed of publicly available viral, bacterial, archaeal, fungal, plasmid, and protist sequences. In addition to four single-tool rules (based on VirSorter, VirSorter2, DeepVirFinder, and VIBRANT), we developed two multi-tool “tuning” rules. These rules use Kaiju, CheckV, and VirSorter2 to refine the predictions, either adding sequences with particularly viral features (“tuning addition”) or removing sequences with particularly cellular features (“tuning removal”). We then applied these rulesets to aquatic metagenomes from different habitats (freshwater, saltwater, drinking water, wastewater) to evaluate the impact of habitat on performance.
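As a minimal sketch of how such rulesets can be assembled, the snippet below treats each rule as a set of putative viral contig IDs and enumerates all 63 non-empty rule combinations. The union-then-subtract combination logic, the handling of the tuning rules, and all names and contig IDs here are illustrative assumptions, not the study's actual implementation.

```python
from itertools import combinations

# Hypothetical per-rule predictions: each rule maps to the set of contig IDs
# it calls viral. In practice these sets would be parsed from each tool's
# output files (formats not shown here).
rules = {
    "VirSorter": {"c1", "c2"},
    "VirSorter2": {"c2", "c3"},
    "DeepVirFinder": {"c2", "c4"},
    "VIBRANT": {"c1", "c3"},
    "tuning_addition": {"c5"},  # sequences with particularly viral features
    "tuning_removal": {"c4"},   # sequences with particularly cellular features
}

def apply_ruleset(ruleset):
    """Union the viral calls of all non-removal rules in the ruleset, then
    drop anything flagged by the tuning-removal rule (if present). This
    union-then-subtract scheme is one plausible combination logic, assumed
    here for illustration."""
    viral = set()
    for rule in ruleset:
        if rule != "tuning_removal":
            viral |= rules[rule]
    if "tuning_removal" in ruleset:
        viral -= rules["tuning_removal"]
    return viral

# All 63 non-empty combinations of the 6 rules (2**6 - 1 = 63).
rulesets = [
    rs
    for k in range(1, len(rules) + 1)
    for rs in combinations(rules, k)
]
assert len(rulesets) == 63

print(apply_ruleset(("VirSorter2", "DeepVirFinder", "tuning_removal")))
# -> {'c2', 'c3'}  (c4 is dropped by the tuning-removal rule)
```

Each candidate ruleset would then be scored against the mock dataset's known viral/cellular labels to compute precision, recall, and MCC.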
We found that six rulesets had Matthews correlation coefficients (MCCs) statistically equivalent (padj ≥ 0.05) to the ruleset with the highest MCC (MCC = 0.75). These “high-MCC” rulesets all included VirSorter2; 5 of the 6 included our “tuning removal” rule; and none used more than 4 of our 6 rules. DeepVirFinder, VIBRANT, and VirSorter each appeared once in the high-MCC rulesets, but never in combination with one another. We further determined that 2- to 5-rule rulesets increased precision compared to single-rule ones (padj ≤ 0.05), and that 4- and 5-rule rulesets increased recall compared to 1- to 3-rule ones (padj ≤ 0.05). From the environmental metagenomes, we found that different rulesets were better suited to virus-enriched versus cellular-enriched metagenomes. In this talk, I will share our ruleset recommendations for different environments and research questions, and provide a blueprint for intentional, data-driven validation of viral identification tool combinations.
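For reference, the MCC summarizes all four cells of a confusion matrix in one value: MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The self-contained sketch below computes it alongside precision and recall; the counts are made up for illustration and are not the benchmark's data.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient: ranges from -1 (total disagreement)
    to +1 (perfect prediction), with 0 equivalent to random guessing. Unlike
    precision or recall alone, it weighs all four confusion-matrix cells,
    which matters when viral contigs are a small minority class."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up counts: a ruleset that recovers 750 of 1000 truly viral contigs
# while mis-calling 150 of 9000 cellular contigs as viral.
tp, fp, tn, fn = 750, 150, 8850, 250
print(round(mcc(tp, fp, tn, fn), 2))  # 0.77
print(round(tp / (tp + fp), 2))       # precision: 0.83
print(round(tp / (tp + fn), 2))       # recall: 0.75
```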
Overall, we found that caution is necessary when combining viral identification tools: while combining tools does increase viral recall, it comes at the expense of more false positives. Ultimately, by increasing the number and quality of viruses identified from metagenomes through the intentional, data-driven combination of tools, this work will improve ecological insights, with far-reaching implications for human and environmental health.