Applying T-classifier, binary classifiers, upon high-throughput TCR sequencing output to identify cytomegalovirus exposure history.

Affiliations

Department of Mathematics, School of Mathematical Sciences, Inner Mongolia University, Hohhot, China.

Hangzhou ImmuQuad Biotechnologies, Hangzhou, China.

Publication information

Sci Rep. 2023 Mar 28;13(1):5024. doi: 10.1038/s41598-023-31013-z.

Abstract

With advances in information technology and computing speed, ever more medical data are being generated. Applying rapidly developing artificial intelligence techniques to these data, in order to address unmet needs and support the medical industry, is an active research topic. Cytomegalovirus (CMV) is widespread in nature and strictly species-specific, and the infection rate among Chinese adults exceeds 95%. Detecting CMV is important because, apart from a few patients with clinical symptoms, the vast majority of infected individuals remain in an inapparent state after infection. In this study, we present a new method for detecting CMV infection status by analyzing high-throughput sequencing results of T cell receptor beta chains (TCRβ). Using the high-throughput sequencing data of 640 subjects from cohort 1, Fisher's exact test was performed to evaluate the relationship between TCRβ sequences and CMV status. The numbers of subjects in cohort 1 and cohort 2 carrying these associated sequences to different degrees were then measured and used to build binary classifier models that identify whether a subject is CMV positive or negative. We selected four binary classification algorithms for side-by-side comparison: logistic regression (LR), support vector machine (SVM), random forest (RF), and linear discriminant analysis (LDA). Based on the performance of each algorithm at different thresholds, four optimal binary classification models were obtained. The LR algorithm performs best when the Fisher's exact test threshold is 10, with a sensitivity of 87.5% and a specificity of 96.88%. The RF algorithm also performs well at a threshold of 10, with a sensitivity of 87.5% and a specificity of 90.63%. The SVM algorithm likewise achieves high accuracy at a threshold of 10, with a sensitivity of 85.42% and a specificity of 96.88%. The LDA algorithm achieves high accuracy at a threshold of 10, with a sensitivity of 95.83% and a specificity of 90.63%. This is probably because the two-dimensional distribution of the CMV data samples is linearly separable, so linear models such as LDA are more effective, while nonlinear algorithms such as random forest separate the classes less accurately. This approach may serve as a potential diagnostic method for CMV and might be extended to other viruses, for example to detecting prior infection with the novel coronavirus.
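The two quantitative building blocks named in the abstract, Fisher's exact test on a 2x2 contingency table and the sensitivity/specificity of a binary classifier, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the table layout, and the one-sided form of the test are assumptions made for illustration. The counts in the usage lines are hypothetical. They were chosen only because they reproduce the percentages reported for logistic regression; the abstract does not state the test-set size.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value for a 2x2 table.

    Assumed (hypothetical) table layout:
                            CMV+   CMV-
      carries the sequence    a      b
      lacks the sequence      c      d

    Returns the probability, under the hypergeometric null with fixed
    margins, of seeing at least `a` CMV+ subjects among the carriers.
    """
    carriers = a + b
    positives = a + c
    total = a + b + c + d
    upper = min(carriers, positives)
    numerator = sum(
        comb(carriers, k) * comb(total - carriers, positives - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, positives)


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# A perfectly separating toy table: all 3 carriers are CMV+.
p_value = fisher_exact_greater(3, 0, 0, 3)

# Hypothetical confusion-matrix counts (an assumption, not from the abstract).
sens, spec = sensitivity_specificity(tp=42, fn=6, tn=31, fp=1)
print(f"p = {p_value}, sensitivity = {sens:.2%}, specificity = {spec:.4%}")
```

In practice a library routine such as `scipy.stats.fisher_exact` would replace the hand-written test. The standard-library version is used here only to keep the sketch dependency-free.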

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/72fd/10050212/0fd607f76bba/41598_2023_31013_Fig1_HTML.jpg
