Revealing the Presence of a Symbolic Sequence Representing Multiple Nucleotides Based on K-Means Clustering of Oligonucleotides.

Affiliations

School of Advanced Materials Science and Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea.

School of Chemical Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea.

Publication information

Molecules. 2019 Jan 18;24(2):348. doi: 10.3390/molecules24020348.

DOI:10.3390/molecules24020348

PMID:30669407

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6359743/

Abstract

In biological systems, a few sequence differences diversify the hybridization profile of nucleotides and enable the quantitative control of cellular metabolism in a cooperative manner. In this respect, the information required for a better understanding may lie not in each individual nucleotide sequence but in representative information shared among them. Existing methodologies for nucleotide sequence design have been optimized to track the function of the genetic molecule and to predict its interaction with others. However, there has been no attempt to extract new sequence information that represents their inheritance function. Here, we tried to conceptually reveal the presence of a representative sequence from groups of nucleotides. The combined application of the K-means clustering algorithm and the social network analysis theorem enabled the effective calculation of the representative sequence. First, a "common sequence" is made that has the highest hybridization property to the analog sequences. Next, the sequence complementary to the common sequence is designated as the "representative sequence". Based on this, we obtained a representative sequence from multiple analog sequences that are 8-10 bases long. Their hybridization was empirically tested, which confirmed that the common sequence had the highest hybridization tendency and that the representative sequence showed better alignment with the analogs than a mere complementary sequence.
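
The relationship between the "common sequence" and the "representative sequence" can be illustrated with a minimal Python sketch. This is not the authors' K-means and social network analysis procedure: it swaps in a simple per-position majority vote, and it assumes equal-length DNA analogs, gapless antiparallel Watson-Crick pairing, and no thermodynamic scoring. All function names (`majority_consensus`, `common_sequence`, `representative_sequence`, `matched_pairs`) and the example sequences are hypothetical.

```python
from collections import Counter

# Watson-Crick partners for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the strand that pairs antiparallel with seq."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def majority_consensus(analogs):
    """Per-position most frequent base (a stand-in for the paper's clustering)."""
    if len({len(s) for s in analogs}) != 1:
        raise ValueError("this sketch assumes equal-length sequences")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*analogs))


def common_sequence(analogs):
    """Strand that pairs with the most analog bases at every position."""
    return reverse_complement(majority_consensus(analogs))


def representative_sequence(analogs):
    """Complement of the common sequence, as described in the abstract."""
    return reverse_complement(common_sequence(analogs))


def matched_pairs(probe, target):
    """Count Watson-Crick pairs for a gapless antiparallel alignment."""
    return sum(COMPLEMENT[p] == t for p, t in zip(probe, reversed(target)))


# Hypothetical 8-base analogs that differ at a few positions.
analogs = ["AACCGTTA", "AACCGTTC", "AACGGTTA"]
common = common_sequence(analogs)
representative = representative_sequence(analogs)
print(common, representative, [matched_pairs(common, a) for a in analogs])
```

Under these assumptions the majority vote maximizes the total number of matched base pairs between the common sequence and the analogs; a realistic design would score hybridization with a thermodynamic model instead of counting pairs.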
