Machine learning based phage discovery in sequencing data
Anastasiya Gæde *, Thomas Sicheritz-Pontén
- University of Copenhagen, GLOBE institute, Section for Hologenomics
- University of Copenhagen, GLOBE institute, Section for Hologenomics
Anastasiya Gæde, nastyashen06@gmail.com
Phages are present in every living environment and play a crucial role in steering microbial population dynamics. They are important entities in our ecosystem, yet their true diversity remains largely unknown. Many phages are still undiscovered, and finding novel ones is challenging, time-consuming, and expensive. To expedite this process, we are developing an in silico tool that can accurately and rapidly recognize phage genomes in a reference- and host-independent manner. We have gathered all publicly available phage genomes and split each genome into fragments spanning several genes to train our machine learning model. We generated 250 genomic features to describe these fragments. Our ultimate goal is to create an easy-to-use, fast, and efficient phage prediction tool. Such a tool will enable reference-free phage discovery in sequencing data, making the process more accessible. Additionally, by extracting the most informative features selected during training, we can gain a better understanding of the genomic characteristics that distinguish phages.
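The fragmentation and feature-generation steps described above can be illustrated with a minimal sketch. It is not the tool's actual implementation: the fixed fragment length of 5000 bp (a rough stand-in for "several genes"), the function names fragment_genome and fragment_features, and the choice of GC content plus dinucleotide frequencies are all assumptions made for illustration. The abstract does not specify which 250 features are used.

```python
from itertools import product


def fragment_genome(sequence, fragment_length=5000, step=None):
    """Split a genome into fixed-length fragments (hypothetical helper).

    The fragment length is an assumed proxy for "several genes"; a trailing
    remainder shorter than fragment_length is dropped.
    """
    step = step or fragment_length
    last_start = len(sequence) - fragment_length
    return [sequence[i:i + fragment_length] for i in range(0, last_start + 1, step)]


def fragment_features(fragment, k=2):
    """Compute illustrative features: GC content and k-mer frequencies.

    These are example features only, not the feature set from the abstract.
    """
    fragment = fragment.upper()
    length = len(fragment)
    gc = (fragment.count("G") + fragment.count("C")) / length if length else 0.0
    features = {"gc_content": gc}

    # Count every k-mer over the alphabet ACGT; windows with other symbols
    # (for example N) are skipped.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(length - k + 1):
        kmer = fragment[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1

    for kmer in kmers:
        features[f"freq_{kmer}"] = counts[kmer] / total if total else 0.0
    return features


if __name__ == "__main__":
    toy_genome = "ACGT" * 5  # toy sequence, not real phage data
    for frag in fragment_genome(toy_genome, fragment_length=8):
        print(frag, fragment_features(frag)["gc_content"])
```

Each fragment becomes one row of numeric features, which is the form a standard classifier expects as input. Feature importances from a trained model could then be inspected to see which of these descriptors separate phage from non-phage fragments.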