Correcting the Estimation of Viral Taxa Distributions in Next-Generation Sequencing Data after Applying Artificial Neural Networks.

Author information

Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, D-30559 Hannover, Germany.

Publication information

Genes (Basel). 2021 Oct 31;12(11):1755. doi: 10.3390/genes12111755.

DOI:10.3390/genes12111755

PMID:34828361

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8624964/

Abstract

Estimating the taxonomic composition of viral sequences in biological samples processed by next-generation sequencing is an important step in comparative metagenomics. Mapping sequencing reads against a database of known viral reference genomes, however, fails to classify reads from novel viruses whose reference sequences are not yet available in public databases. Instead of a mapping approach, and in order to classify sequencing reads at least to a taxonomic level, the performance of artificial neural networks and other machine learning models was studied. Taxonomic and genomic data from the NCBI database were used to sample labelled sequencing reads as training data. The fitted neural network was applied to classify unlabelled reads of simulated and real-world test sets. Additional auxiliary test sets of labelled reads were used to estimate the conditional class probabilities, and to correct the prior estimation of the taxonomic distribution in the actual test set. Among the taxonomic levels, the biological order of viruses provided the most comprehensive database to generate training data. The prediction accuracy of the artificial neural network to classify test reads to their viral order was considerably higher than that of a random classification. Posterior estimation of taxa frequencies could correct the primary classification results.
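
The correction step described above can be illustrated with a minimal sketch. The abstract does not specify the exact estimator, so the code below is an assumption: it estimates the conditional class probabilities P(predicted order | true order) from a labelled auxiliary set, then uses a simple expectation-maximization update to find the order frequencies that best explain the classifier's raw predictions on the unlabelled test set. The function names `estimate_confusion` and `correct_distribution`, and the iteration count, are hypothetical choices made for illustration.

```python
from collections import Counter


def estimate_confusion(true_labels, predicted_labels, classes):
    """Estimate P(predicted = j | true = i) from a labelled auxiliary set."""
    counts = {c: Counter() for c in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    conf = {}
    for t in classes:
        total = sum(counts[t].values())
        # A class absent from the auxiliary set gets an all-zero row.
        conf[t] = {p: (counts[t][p] / total if total else 0.0) for p in classes}
    return conf


def correct_distribution(predicted_labels, conf, classes, n_iter=200):
    """Correct the raw predicted class frequencies (illustrative EM sketch).

    The observed frequency of predicted class j is modelled as
    sum_i est[i] * conf[i][j]; the loop re-estimates est until it
    explains the observed frequencies as well as possible.
    """
    n = len(predicted_labels)
    pred_counts = Counter(predicted_labels)
    observed = {c: pred_counts[c] / n for c in classes}

    # Start from a uniform guess over the classes.
    est = {c: 1.0 / len(classes) for c in classes}
    for _ in range(n_iter):
        # Predicted-class frequencies implied by the current estimate.
        mix = {j: sum(est[k] * conf[k][j] for k in classes) for j in classes}
        # Reassign each observed prediction to its likely true classes.
        est = {
            i: sum(
                observed[j] * est[i] * conf[i][j] / mix[j]
                for j in classes
                if mix[j] > 0
            )
            for i in classes
        }
    return est


if __name__ == "__main__":
    # Toy data: two viral orders, used purely as illustrative labels.
    orders = ["Mononegavirales", "Picornavirales"]
    aux_true = ["Mononegavirales"] * 10 + ["Picornavirales"] * 10
    aux_pred = (
        ["Mononegavirales"] * 8 + ["Picornavirales"] * 2
        + ["Mononegavirales"] * 3 + ["Picornavirales"] * 7
    )
    conf = estimate_confusion(aux_true, aux_pred, orders)
    raw_predictions = ["Mononegavirales"] * 75 + ["Picornavirales"] * 25
    print(correct_distribution(raw_predictions, conf, orders))
```

In this toy case the classifier labels 75% of the test reads as the first order, but because it is known from the auxiliary set to misassign reads in both directions, the corrected estimate moves away from the raw frequency. The multiplicative update keeps every estimated frequency non-negative, which a direct inversion of the confusion matrix does not guarantee.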
