Unsupervised learning analysis on the proteomes of Zika virus.

Author information

Lara-Ramírez Edgar E, Rivera Gildardo, Oliva-Hernández Amanda Alejandra, Bocanegra-Garcia Virgilio, López Jesús Adrián, Guo Xianwu

Affiliations

Laboratorio de Biotecnología Farmacéutica, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México.

Laboratorio de Biotecnología Experimental, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México.

Publication information

PeerJ Comput Sci. 2024 Nov 11;10:e2443. doi: 10.7717/peerj-cs.2443. eCollection 2024.

Abstract

BACKGROUND

The Zika virus (ZIKV), which is transmitted by mosquito vectors to nonhuman primates and humans, causes devastating outbreaks in the poorest tropical regions of the world. Molecular epidemiology, supported by phylogenetic clustering of sequence data (the gold-standard approach), has provided valuable information for tracking and controlling the spread of ZIKV. Unsupervised learning (UL), a class of machine learning algorithms, can be applied to datasets without requiring known (labeled) information for training.

METHODS

In this work, unsupervised Random Forest (URF) was applied first, followed by dimensionality reduction algorithms: principal component analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), t-distributed stochastic neighbor embedding (t-SNE), and autoencoders. These were used to uncover hidden patterns in polymorphic amino acid sites extracted from multiple sequence alignments of ZIKV proteomes, without the need for an underlying evolutionary model.
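The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy alignment, the helper names (`polymorphic_columns`, `one_hot`, `urf_dissimilarity`), the one-hot encoding, the column-permutation scheme for the synthetic class, and the choice to run PCA directly on the URF dissimilarity matrix are all assumptions made for illustration. It relies on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier


def polymorphic_columns(alignment):
    """Indices of alignment columns with more than one residue (gaps/X ignored)."""
    keep = []
    for j in range(len(alignment[0])):
        residues = {seq[j] for seq in alignment if seq[j] not in "-X"}
        if len(residues) > 1:
            keep.append(j)
    return keep


def one_hot(alignment, columns):
    """One binary feature per (site, character) pair; rows are sequences."""
    features = []
    for j in columns:
        for aa in sorted({seq[j] for seq in alignment}):
            features.append([1.0 if seq[j] == aa else 0.0 for seq in alignment])
    return np.array(features).T


def urf_dissimilarity(X, n_trees=100, seed=0):
    """Unsupervised random forest: real vs. column-permuted data, then 1 - proximity."""
    rng = np.random.default_rng(seed)
    # Synthetic class: shuffle each column independently, which keeps the
    # per-site frequencies but destroys the correlations between sites.
    synthetic = np.column_stack(
        [rng.permutation(X[:, k]) for k in range(X.shape[1])]
    )
    data = np.vstack([X, synthetic])
    labels = np.r_[np.ones(len(X)), np.zeros(len(synthetic))]
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(data, labels)
    # Proximity = fraction of trees in which two real samples share a leaf.
    leaves = forest.apply(X)  # shape: (n_samples, n_trees)
    proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    return 1.0 - proximity


# Hypothetical toy alignment (not real ZIKV data): columns 1, 2 and 4 vary.
toy = ["MKTAYL", "MKTAYL", "MRTAFL", "MRTAFL", "MKSAYL"]
cols = polymorphic_columns(toy)
X = one_hot(toy, cols)
D = urf_dissimilarity(X)
# Assumption: embed each sequence using its row of dissimilarities as features.
coords = PCA(n_components=2).fit_transform(D)
print(cols, X.shape, coords.shape)
```

In this sketch PCA could be swapped for UMAP, t-SNE, or an autoencoder operating on the same dissimilarity matrix; PCA is used here only because it needs no extra dependencies.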

RESULTS

The four UL algorithms revealed specific host and geographical clustering patterns for ZIKV. Among the four dimensionality reduction (DR) algorithms, UMAP performed best. All four algorithms allowed the identification of imported viruses within specific geographical clusters. The UL dimension coordinates correlated significantly with phylogenetic tree branch lengths and showed significant phylogenetic dependence in Abouheif's Cmean and Pagel's Lambda tests (p value < 0.01), indicating performance comparable to that of the phylogenetic method. This analytical strategy generalized to a large external dengue type 2 dataset.

CONCLUSION

These UL algorithms could serve as practical evolutionary analysis techniques for tracking the dispersal of viral pathogens.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/068e/11623125/91cdf2d7ae73/peerj-cs-10-2443-g001.jpg
