Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany.
Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany.
Bioinformatics. 2022 Sep 16;38(Suppl_2):ii168-ii174. doi: 10.1093/bioinformatics/btac495.
Emerging pathogens are a growing threat, but large data collections and approaches for predicting the risk associated with novel agents are limited to bacteria and viruses. Pathogenic fungi, which also pose a constant threat to public health, remain understudied. Relevant data are comparatively scarce and scattered among many different sources, hindering the development of sequencing-based detection workflows for novel fungal pathogens. No available prediction method covers agents across all three groups (bacteria, viruses and fungi), even though the cause of an infection is often difficult to identify from symptoms alone.
We present a curated collection of fungal host range data, comprising records on human, animal and plant pathogens, as well as other plant-associated fungi, linked to publicly available genomes. We show that it can be used to predict the pathogenic potential of novel fungal species directly from DNA sequences, using either sequence homology or deep learning. We develop learned numerical representations of the collected genomes and visualize the landscape of fungal pathogenicity. Finally, we train multi-class models that predict whether next-generation sequencing reads originate from novel fungal, bacterial or viral threats.
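As a rough illustration of how read-level multi-class predictions could be combined into a sample-level call, the sketch below one-hot encodes DNA reads and averages per-read class probabilities. It is a minimal sketch and not the published implementation: the class labels in `CLASSES`, the function names, and the averaging rule are all assumptions made for this example, and the per-read probabilities would in practice come from a trained network rather than being supplied by hand.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical class labels for illustration; the label set of the
# published models may differ.
CLASSES: Tuple[str, ...] = ("fungal", "bacterial", "viral")

# One-hot vectors for the four canonical bases.
ONE_HOT: Dict[str, Tuple[int, int, int, int]] = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(read: str) -> List[Tuple[int, int, int, int]]:
    """Encode a DNA read position by position.

    Ambiguous or unknown bases (for example N) map to an all-zero vector.
    """
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in read.upper()]


def aggregate_read_predictions(
    read_probs: Sequence[Sequence[float]],
) -> List[float]:
    """Average per-read class probabilities into one sample-level vector."""
    if not read_probs:
        raise ValueError("need at least one read prediction")
    n_classes = len(read_probs[0])
    n_reads = len(read_probs)
    return [sum(p[i] for p in read_probs) / n_reads for i in range(n_classes)]


def predict_class(read_probs: Sequence[Sequence[float]]) -> str:
    """Return the label with the highest mean probability across reads."""
    mean_probs = aggregate_read_predictions(read_probs)
    best = max(range(len(mean_probs)), key=mean_probs.__getitem__)
    return CLASSES[best]
```

Averaging is only one possible aggregation rule; it is used here because it is simple and treats every read equally, under the assumption that reads from the same sample share an origin.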
The neural networks trained using our data collection enable accurate detection of novel fungal pathogens. A curated set of over 1400 genomes with host and pathogenicity metadata supports training of machine-learning models and sequence comparison, not limited to the pathogen detection task.
The data, models and code are hosted at https://zenodo.org/record/5846345, https://zenodo.org/record/5711877 and https://gitlab.com/dacs-hpi/deepac.
Supplementary data are available at Bioinformatics online.