Detecting DNA of novel fungal pathogens using ResNets and a curated fungi-hosts data collection.

Affiliations

Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany.

Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany.

Publication information

Bioinformatics. 2022 Sep 16;38(Suppl_2):ii168-ii174. doi: 10.1093/bioinformatics/btac495.

DOI:10.1093/bioinformatics/btac495

PMID:36124807

Abstract

BACKGROUND

Emerging pathogens are a growing threat, but large data collections and approaches for predicting the risk associated with novel agents are limited to bacteria and viruses. Pathogenic fungi, which also pose a constant threat to public health, remain understudied. Relevant data are comparatively scarce and scattered among many different sources, hindering the development of sequencing-based detection workflows for novel fungal pathogens. No available prediction method works for agents across all three groups (fungi, bacteria and viruses), even though the cause of an infection is often difficult to identify from symptoms alone.

RESULTS

We present a curated collection of fungal host range data, comprising records on human, animal and plant pathogens, as well as other plant-associated fungi, linked to publicly available genomes. We show that it can be used to predict the pathogenic potential of novel fungal species directly from DNA sequences with either sequence homology or deep learning. We develop learned, numerical representations of the collected genomes and visualize the landscape of fungal pathogenicity. Finally, we train multi-class models predicting if next-generation sequencing reads originate from novel fungal, bacterial or viral threats.
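As a rough illustration of read-level, multi-class prediction, the sketch below one-hot encodes DNA reads and averages per-read class probabilities into a single sample-level call. It is not the authors' implementation: the class labels, the function names (`one_hot_encode`, `aggregate_read_predictions`, `predict_label`) and the mean-based aggregation are all assumptions made for the example, and the per-read probabilities are assumed to come from some already trained network.

```python
from typing import Dict, List, Sequence

# Hypothetical class labels; the abstract only names the three threat groups.
CLASSES = ("fungal", "bacterial", "viral")

_ONE_HOT: Dict[str, List[float]] = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
}


def one_hot_encode(read: str) -> List[List[float]]:
    """Encode a DNA read as a len(read) x 4 matrix.

    Bases other than A, C, G, T (for example N) become all-zero rows.
    """
    return [list(_ONE_HOT.get(base, [0.0, 0.0, 0.0, 0.0])) for base in read.upper()]


def aggregate_read_predictions(read_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-read class probabilities into one sample-level vector.

    Averaging is an assumption of this sketch, not a documented method.
    """
    if not read_probs:
        raise ValueError("at least one read-level prediction is required")
    n_reads = len(read_probs)
    return [sum(probs[i] for probs in read_probs) / n_reads for i in range(len(CLASSES))]


def predict_label(read_probs: Sequence[Sequence[float]]) -> str:
    """Return the class with the highest mean probability across reads."""
    mean_probs = aggregate_read_predictions(read_probs)
    best_index = max(range(len(mean_probs)), key=mean_probs.__getitem__)
    return CLASSES[best_index]
```

In use, each encoded read would be passed to a trained classifier, and the resulting probability vectors handed to `predict_label` to obtain one call per sample.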

CONCLUSIONS

The neural networks trained using our data collection enable accurate detection of novel fungal pathogens. A curated set of over 1400 genomes with host and pathogenicity metadata supports training of machine-learning models and sequence comparison, not limited to the pathogen detection task.

AVAILABILITY AND IMPLEMENTATION

The data, models and code are hosted at https://zenodo.org/record/5846345, https://zenodo.org/record/5711877 and https://gitlab.com/dacs-hpi/deepac.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
