Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan.
M&D Data Science Center, Tokyo Medical and Dental University, Tokyo 113-8510, Japan.
Bioinformatics. 2022 Sep 15;38(18):4264-4270. doi: 10.1093/bioinformatics/btac509.
Bacteriophages (phages) are viruses that infect and replicate within bacteria and archaea, and they are abundant in the human body. Identifying phages in metagenome sequences is the first step in investigating the relationship between phages and microbial communities. Currently, there are two main approaches to identifying phages: database-based (alignment-based) methods and alignment-free methods. Database-based methods typically use a large number of sequences as references, whereas alignment-free methods usually learn sequence features with machine learning and deep learning models.
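To illustrate what "learning the features of the sequences" can mean for an alignment-free method, the sketch below turns a DNA sequence into a normalized k-mer frequency vector that a classifier could take as input. This is a generic illustration, not INHERIT's feature pipeline: the function name `kmer_frequencies` and the choice of k are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return the normalized frequency of every A/C/G/T k-mer in `sequence`.

    Hypothetical helper for illustration only; not taken from INHERIT.
    Windows containing other characters (e.g. N) are ignored.
    """
    seq = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed, lexicographically ordered vocabulary of 4**k k-mers.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


features = kmer_frequencies("ACGTACGTAC", k=2)
print(len(features))
```

The vector has a fixed length of 4**k regardless of sequence length, which is what lets such features be computed without any reference database.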
We propose INHERIT, which uses a deep representation learning model to integrate database-based and alignment-free methods, combining the strengths of both. Pre-training serves as an alternative way of acquiring knowledge representations from existing databases, while the BERT-style deep learning framework retains the advantage of alignment-free methods. We compare INHERIT with four existing methods on a third-party benchmark dataset. Our experiments show that INHERIT achieves better performance, with an F1-score of 0.9932. In addition, we find that pre-training on two species separately helps the alignment-free deep learning model make more accurate predictions.
The code of INHERIT is available at: https://github.com/Celestial-Bai/INHERIT.
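For readers unfamiliar with the reported metric, the F1-score is the harmonic mean of precision and recall. The following is a minimal sketch of how it is computed for a binary phage/non-phage labelling; the function name and the toy labels are assumptions for illustration and are not taken from the INHERIT benchmark.

```python
def f1_score(y_true, y_pred, positive=1):
    """Compute the F1-score for the class labelled `positive`.

    Illustrative helper; labels below are made-up toy data.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 1 = phage, 0 = bacterium.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
print(f1_score(y_true, y_pred))
```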
Supplementary data are available at Bioinformatics online.