Voigt Benjamin, Fischer Oliver, Krumnow Christian, Herta Christian, Dabrowski Piotr Wojciech
Center for Bio-Medical image and Information processing (CBMI), HTW University of Applied Sciences, Berlin, Germany.
PLoS One. 2021 Dec 22;16(12):e0261548. doi: 10.1371/journal.pone.0261548. eCollection 2021.
Clinical metagenomics is a powerful diagnostic tool because it offers an open view of all DNA in a patient's sample. This allows the detection of pathogens that would slip through the cracks of classical, target-specific assays. However, because metagenomic sequencing is untargeted, the sequencing run generates a large amount of non-specific data, and the diagnosis takes place only at the data analysis stage, where the relevant sequences are identified. Typically, this is done by comparison to reference databases. While this approach has been optimized over the past years and works well for pathogens that are represented in the databases used, a common challenge in analysing a metagenomic patient sample arises when no pathogen sequences are found: how can one determine whether the data truly contain no evidence of a pathogen, or whether the pathogen's genome is simply absent from the database, so that its sequences in the dataset could not be classified? Here, we present a novel approach to the problem of detecting novel pathogens in metagenomic datasets: classifying the proteins, or segments of proteins, encoded by the sequences in the dataset. We train a neural network on coding sequences labeled by taxonomic domain and use it to predict the taxonomic classification of sequences that cannot be classified by comparison to a reference database, thus facilitating the detection of potential novel pathogens.
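As an illustration of what "the (segments of) proteins encoded by the sequences" could look like in practice, the following minimal Python sketch translates a DNA read in all six reading frames and keeps the stop-free stretches as candidate protein segments. This is an assumption made for illustration, not the preprocessing described by the authors: the function names, the `min_length` parameter, and the use of the standard genetic code are all hypothetical choices.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA codon by codon; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_segments(read, min_length=10):
    """Hypothetical helper: stop-free protein segments from all six frames.

    min_length is an illustrative threshold, not a value from the paper.
    """
    read = read.upper()
    segments = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            protein = translate(strand[offset:])
            # '*' marks a stop codon; split there and keep long enough pieces.
            segments.extend(s for s in protein.split("*") if len(s) >= min_length)
    return segments


if __name__ == "__main__":
    print(six_frame_segments("ATGGCCTAA", min_length=2))
```

Segments produced this way could then be encoded and passed to a classifier that predicts a taxonomic label; the abstract does not specify the network architecture or input encoding, so neither is sketched here.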