A Straightforward HPV16 Lineage Classification Based on Machine Learning.

Authors

Asensio-Puig Laura, Alemany Laia, Pavón Miquel Angel

Affiliations

Cancer Epidemiology Research Programme, Catalan Institute of Oncology, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, Spain.

Centro de Investigación Biomédica en Red de Epidemiología y Salud Pública (CIBERESP), Madrid, Spain.

Publication information

Front Artif Intell. 2022 Jun 23;5:851841. doi: 10.3389/frai.2022.851841. eCollection 2022.

Abstract

Human papillomavirus (HPV) is the causal agent of 5% of cancers worldwide. It is the main cause of cervical cancer and is also associated with a significant percentage of oropharyngeal and anogenital cancers. More than 60% of cervical cancers are caused by the HPV16 genotype, which has been classified into lineages (A, B, C, and D). Lineages are related to the progression of cervical cancer, and the current method to assess lineages is to build a Maximum Likelihood Tree (MLT), which is slow, cannot assess poorly sequenced samples, and requires manual annotation. In this study, we developed a new model to assess HPV16 lineage using machine learning tools. A total of 645 HPV16 genomes were analyzed using a Genome-Wide Association Study (GWAS), which identified 56 lineage-specific Single Nucleotide Polymorphisms (SNPs). From these SNPs, training-test models were constructed using different algorithms such as Random Forest (RF), Support Vector Machine (SVM), and K-nearest neighbor (KNN). A distinct set of HPV16 sequences (n = 1,028), whose lineage had previously been determined by MLT, was used for validation. The RF-based model allowed a precise assignment of HPV16 lineage, showing an accuracy of 99.5% in the samples of known lineage. Moreover, the RF model could assign a lineage to 273 samples that MLT could not determine. In terms of computing time, the RF-based model was almost 40 times faster than MLT. A fast and efficient method for assigning HPV16 lineages could facilitate the implementation of lineage classification as a triage or prognostic marker in the clinical setting.
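
The general idea of assigning a lineage from a fixed panel of lineage-specific SNPs can be illustrated with a small sketch. This is not the authors' RF model: it uses a K-nearest-neighbor vote, one of the algorithm families named in the abstract, because it fits in a few lines of standard-library Python. The training profiles, the five-position panel, the function names, and the choice to skip uncalled positions (`N`) are all hypothetical and chosen for illustration only.

```python
from collections import Counter

# Hypothetical toy data: each profile lists the alleles observed at a small
# panel of lineage-informative SNP positions (the study identified 56 SNPs;
# these five positions and all alleles below are invented).
TRAINING = [
    ("ACGTA", "A"),
    ("ACGTT", "A"),
    ("ACGCA", "A"),
    ("GCATA", "B"),
    ("GCATC", "B"),
    ("GTGTG", "C"),
    ("GTGCG", "C"),
    ("TTACG", "D"),
    ("TTACC", "D"),
]


def hamming(a, b):
    """Count positions where two equal-length SNP profiles differ.

    Assumption for this sketch: positions marked 'N' (no base call) in
    either profile are treated as uninformative and skipped.
    """
    if len(a) != len(b):
        raise ValueError("SNP profiles must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y and "N" not in (x, y))


def predict_lineage(profile, training=TRAINING, k=3):
    """Return the majority lineage among the k nearest training profiles."""
    ranked = sorted(training, key=lambda item: hamming(profile, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ("ACGTA", "TTACG", "GTNNG"):
        print(sample, predict_lineage(sample))
```

Because the classifier only compares alleles at the panel positions, a sample with some uncalled bases can still receive a vote from the positions that were read. A production model would need real training genomes, a validated SNP panel, and a held-out validation set such as the one described above.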

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b532/9260188/7b0e26df48cc/frai-05-851841-g0001.jpg
