
Big Data Analysis and Application of Liver Cancer Gene Sequence Based on Second-Generation Sequencing Technology.

Affiliations

Faculty of Hepato-Biliary-Pancreatic Surgery, Chinese People's Liberation Army (PLA) General Hospital, Beijing 100853, China.

Faculty of Hepatology Medicine, Chinese People's Liberation Army (PLA) General Hospital, Beijing 100039, China.

Publication information

Comput Math Methods Med. 2022 Aug 16;2022:4004130. doi: 10.1155/2022/4004130. eCollection 2022.

Abstract

Rapid gains in computer storage capacity and in the sophistication of algorithms have driven big data analysis forward, and the exponential growth of available data has in turn accelerated scientific and technological progress. Using omics data such as mRNA, microRNA, or DNA methylation data, this study applies traditional clustering methods, including k-means, K-nearest neighbors, hierarchical clustering, affinity propagation, and nonnegative matrix factorization, to group samples into categories. The main findings are as follows. (1) The assumption that attributes are mutually independent reduces the algorithm's classification performance to some extent. Under the multilevel grid approach, there is a one-to-one mapping from the high-dimensional space to a one-dimensional space, and encoding the hierarchical grid as a one-dimensional grid greatly reduces complexity. The algorithm's logic is relatively simple, and its classification efficiency is very stable. (2) Converting the two-dimensional representation of the data into a one-dimensional binary representation achieves dimensionality reduction and improves the efficiency of data organization and storage. The grid code expresses the spatial position of the data and preserves its original organization, without abstracting the data objects. (3) Handling nondiscrete and missing values yields a better classification effect and provides a new opportunity for identifying the protein targets of small-molecule therapies. (4) A comparison of the three models shows that Naive Bayes is the optimal model. Each iteration consists of alternating expectation and maximization steps, followed by identification and quantification by MS.
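The abstract does not specify how the grid code is constructed. As a minimal sketch of the general idea in findings (1) and (2), the Python below maps a two-dimensional point to a single binary code by bit interleaving (a Z-order, or Morton, code). The choice of bit interleaving, the function names, and the `levels` parameter are assumptions made for illustration, not the authors' method.

```python
def grid_cell(value, lo, hi, levels):
    """Map a value in [lo, hi] to an integer cell index on a grid
    with 2**levels cells per axis (hypothetical helper)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    n = 1 << levels
    frac = (value - lo) / (hi - lo)
    # Clamp so that value == hi falls in the last cell.
    return min(n - 1, max(0, int(frac * n)))


def interleave(ix, iy, levels):
    """Combine two cell indices into one integer by alternating their bits."""
    code = 0
    for bit in range(levels):
        code |= ((ix >> bit) & 1) << (2 * bit)
        code |= ((iy >> bit) & 1) << (2 * bit + 1)
    return code


def deinterleave(code, levels):
    """Invert interleave: recover the two cell indices from the code."""
    ix = iy = 0
    for bit in range(levels):
        ix |= ((code >> (2 * bit)) & 1) << bit
        iy |= ((code >> (2 * bit + 1)) & 1) << bit
    return ix, iy


def encode_point(x, y, bounds, levels):
    """Encode a 2-D point as a one-dimensional grid code.
    bounds is ((x_lo, x_hi), (y_lo, y_hi))."""
    (x_lo, x_hi), (y_lo, y_hi) = bounds
    ix = grid_cell(x, x_lo, x_hi, levels)
    iy = grid_cell(y, y_lo, y_hi, levels)
    return interleave(ix, iy, levels)


if __name__ == "__main__":
    code = encode_point(0.4, 0.7, ((0.0, 1.0), (0.0, 1.0)), levels=3)
    print(format(code, "06b"), deinterleave(code, 3))
```

At a fixed level, every grid cell receives a distinct code and `deinterleave` recovers the cell, so the mapping between cells and codes is one-to-one. Codes that share a binary prefix belong to the same coarser cell, which is one way a code can retain spatial position.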

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4796/9398858/0e10e12a2cdb/CMMM2022-4004130.001.jpg
