VHost-Classifier: virus-host classification using natural language processing.

Author affiliations

Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada.

Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia, Vancouver, BC, Canada.

Publication information

Bioinformatics. 2019 Oct 1;35(19):3867-3869. doi: 10.1093/bioinformatics/btz151.

Abstract

MOTIVATION

When analyzing viral metagenomic sequences, it is often desirable to filter the results of a BLAST analysis by the host species of the virus. VHost-Classifier automates this procedure using a natural language processing algorithm written in Python 3. It takes as input a list of taxonomic identifiers (taxids) returned from a BLAST query of viral sequences. The taxids are binned by the evolutionary lineage of their hosts, based on string matching of the words in the viruses' English names. If VHost-Classifier cannot identify a host, it attempts to bin the sequences by the environment from which the sample originated. VHost-Classifier predicts the evolutionary lineage of the host from the virus name and does not rely on referencing taxids against a database; it is therefore not constrained by the size of a database and can classify the hosts of newly characterized viruses.
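The name-based binning described above can be illustrated with a minimal sketch. This is not the VHost-Classifier source code: the keyword tables, the function names `classify_name` and `bin_taxids`, and the placeholder taxids are all assumptions made for illustration. It only shows the general idea of matching words in a virus name against host keywords, with a fallback to environment keywords.

```python
import re
from collections import defaultdict

# Hypothetical keyword table (an assumption, not the tool's real data):
# a word that may appear in a virus name, mapped to a host lineage
# listed from broad to specific.
HOST_KEYWORDS = {
    "tomato": ("Eukaryota", "Viridiplantae", "Solanaceae"),
    "bat": ("Eukaryota", "Metazoa", "Mammalia"),
    "escherichia": ("Bacteria", "Gammaproteobacteria", "Enterobacteriaceae"),
}

# Hypothetical environment keywords, used only when no host word matches.
ENV_KEYWORDS = {"marine", "sewage", "soil"}


def classify_name(name):
    """Return (kind, label) for one virus name.

    kind is "host", "environment", or "unclassified".
    """
    words = re.findall(r"[a-z]+", name.lower())
    # First pass: look for a word that names a host.
    for word in words:
        if word in HOST_KEYWORDS:
            return ("host", HOST_KEYWORDS[word])
    # Second pass: fall back to the sampling environment.
    for word in words:
        if word in ENV_KEYWORDS:
            return ("environment", (word,))
    return ("unclassified", ())


def bin_taxids(names_by_taxid):
    """Group taxids by the classification of their virus names."""
    bins = defaultdict(list)
    for taxid, name in names_by_taxid.items():
        bins[classify_name(name)].append(taxid)
    return dict(bins)


# Placeholder taxids (1, 2, 3, 4) are not real NCBI identifiers.
example = {
    1: "Tomato mosaic virus",
    2: "Bat coronavirus",
    3: "Marine phage sp.",
    4: "Unnamed virus",
}
bins = bin_taxids(example)
```

Because the lookup works on words in the name rather than on a taxid-to-host table, a newly named virus can still be binned as long as its name contains a recognized word, which is the property the abstract emphasizes.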

RESULTS

Benchmarked on a test dataset of 1000 viral taxids randomly selected from the NCBI taxonomy database, VHost-Classifier assigned a host with 100% accuracy to the rank of Class for >93% of viruses and to the rank of Family for >37% of viruses.
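One plausible reading of these figures is that the percentage of viruses is the share that received any assignment at a given rank, and the accuracy is measured only over those assigned viruses. The sketch below computes both quantities under that assumption; the function name `coverage_and_accuracy` and the toy labels are hypothetical and are not taken from the paper's benchmark.

```python
def coverage_and_accuracy(predicted, truth):
    """Compute coverage and accuracy for host assignments at one rank.

    predicted: list of predicted labels, with None where no host was assigned.
    truth: list of true labels, same length as predicted.
    Returns (coverage, accuracy), where accuracy is computed over
    assigned items only.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not predicted:
        return (0.0, 0.0)
    # Keep only the items for which a host was assigned.
    assigned = [(p, t) for p, t in zip(predicted, truth) if p is not None]
    coverage = len(assigned) / len(predicted)
    if not assigned:
        return (coverage, 0.0)
    correct = sum(1 for p, t in assigned if p == t)
    return (coverage, correct / len(assigned))


# Toy data, invented for illustration only.
predicted = ["Mammalia", None, "Insecta", "Aves"]
truth = ["Mammalia", "Aves", "Insecta", "Aves"]
coverage, accuracy = coverage_and_accuracy(predicted, truth)
```

Under this reading, a classifier that declines to assign a host when unsure can keep accuracy high at the cost of coverage, which would be consistent with coverage dropping from Class to Family rank.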

AVAILABILITY AND IMPLEMENTATION

For more information about VHost-Classifier as well as implementation instructions, visit https://github.com/Kzra/VHost-Classifier.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
