Computational methods for recognition of cancer protein markers in saliva.

Author information

Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China.

Information Technology Research Base of Civil Aviation Administration of China, Civil Aviation University of China, Tianjin 300300, China.

Publication information

Math Biosci Eng. 2020 Feb 25;17(3):2453-2469. doi: 10.3934/mbe.2020134.

In recent years, many studies have suggested that cancer tissues can induce disease-specific changes in some salivary proteins through mediators involved in the pathogenesis of systemic diseases. These salivary proteins have the potential to serve as cancer-specific biomarkers for early diagnosis. Identifying these potential markers effectively remains a challenging problem. In this paper, we propose novel machine learning methods for recognizing cancer biomarkers in saliva in two stages. In the first stage, we recognize salivary secretory proteins, which are treated as candidate cancer biomarkers. We collected 557 salivary secretory proteins out of 20379 human proteins using public databases and the published literature. We then present a training set construction strategy that addresses the class imbalance problem so that the classifiers achieve better accuracy. Proteins belonging to the same families as the salivary secretory proteins are removed from the set of all human proteins. We then cluster the remaining proteins with the SVC-KM method and select negative samples from each cluster in proportion. Next, protein features are computed using software tools. We collect 24 protein properties covering sequence, structure and physicochemical characteristics, for a total of 1087 features. To further improve the performance of the SVM classifier, we propose an innovative feature selection procedure based on local samples. Experimental results show that the average sensitivity, specificity and accuracy of salivary secretory protein recognition on the training set using the 32 selected features are 97.09%, 98.10% and 97.61%, respectively. These methods improve recognition accuracy by addressing the unbalanced sample sizes and uneven sample distribution in the training set.
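
The negative-sample selection step described above can be illustrated with a short sketch. It assumes that "in proportion" means proportional to cluster size and that cluster labels are already available; the function name `sample_negatives`, its parameters and the protein IDs are hypothetical and are not taken from the paper, and the SVC-KM clustering itself is not reproduced here.

```python
import random
from collections import defaultdict


def sample_negatives(cluster_of, n_negatives, seed=0):
    """Pick about n_negatives protein IDs, drawing from each cluster in
    proportion to that cluster's size.

    cluster_of: dict mapping protein ID -> cluster label (hypothetical input;
    in the paper the labels would come from the SVC-KM clustering step).
    """
    total = len(cluster_of)
    if not 0 < n_negatives <= total:
        raise ValueError("n_negatives must be between 1 and the pool size")

    # Group protein IDs by their cluster label.
    clusters = defaultdict(list)
    for protein_id, label in cluster_of.items():
        clusters[label].append(protein_id)

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    chosen = []
    for label in sorted(clusters):
        members = sorted(clusters[label])
        # Quota proportional to cluster size, with at least one per cluster
        # (an assumption; rounding means the total can differ slightly).
        quota = max(1, round(n_negatives * len(members) / total))
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return chosen
```

Drawing from every cluster, rather than from the pooled set, keeps small clusters represented, which is one plausible way to address the uneven sample distribution mentioned in the abstract.
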
In the second stage, we apply the best model to 58 reported cancer markers to identify the salivary secretory proteins among them, obtaining 42 proteins that are considered usable for salivary diagnosis. We also analyze gene expression data for three types of cancer and predict that the protein products of 33 genes will appear in saliva. This study provides an important computational tool that helps biologists and researchers reduce the number of candidate proteins and the cost of research, thereby accelerating the discovery of cancer biomarkers in saliva and promoting the development of saliva-based diagnosis.
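
For reference, the sensitivity, specificity and accuracy reported for the first stage are standard confusion-matrix ratios. The sketch below shows how they are computed, assuming binary labels in which 1 marks a salivary secretory protein and 0 a negative sample; the function name and the label encoding are illustrative and do not come from the paper.

```python
def classification_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary labels,
    where 1 is the positive class and 0 the negative class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")

    sensitivity = tp / (tp + fn)  # fraction of positives recovered
    specificity = tn / (tn + fp)  # fraction of negatives rejected
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy
```

Reporting sensitivity and specificity alongside accuracy matters for imbalanced data, because accuracy alone can look high when a classifier simply favors the majority class.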
