Sun Shiquan, Zhang Xiongpan, Peng Qinke
Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China; Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA.
Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China.
Artif Intell Med. 2017 Jan;75:16-23. doi: 10.1016/j.artmed.2016.11.004. Epub 2016 Dec 1.
Identifying transcription factor binding sites (TFBSs) plays an important role in understanding gene regulatory processes. The mechanism underlying the specific binding of transcription factors (TFs) is still poorly understood. Previous machine learning-based approaches to identifying TFBSs commonly map a known TFBS to a one-dimensional vector using its physicochemical properties. However, when the dimension-to-sample ratio (i.e., number of dimensions/number of samples) is large, concatenating different physicochemical properties into a one-dimensional vector is not only likely to lose some structural information, but also poses significant challenges to recognition methods.
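The one-dimensional mapping described above can be sketched as follows. This is a minimal illustration, not the encoding used in the paper: the property tables `prop_a` and `prop_b` hold arbitrary placeholder numbers rather than measured physicochemical values, and the function names are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# All 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]

# Placeholder property tables (assumption): the numbers are arbitrary and
# are NOT real physicochemical measurements. Each maps a dinucleotide to
# a single value.
PROPERTY_TABLES = {
    "prop_a": {d: float(i) for i, d in enumerate(DINUCLEOTIDES)},
    "prop_b": {d: float(i % 4) for i, d in enumerate(DINUCLEOTIDES)},
}


def dinucleotide_steps(site):
    """Split a site into its overlapping dinucleotide steps."""
    return [site[i:i + 2] for i in range(len(site) - 1)]


def flatten_site(site, tables=PROPERTY_TABLES):
    """Concatenate every property profile of a site into one flat vector."""
    vector = []
    for name in sorted(tables):
        vector.extend(tables[name][step] for step in dinucleotide_steps(site))
    return vector


def dimension_sample_ratio(sites, tables=PROPERTY_TABLES):
    """Number of dimensions divided by number of samples (equal-length sites)."""
    n_dims = len(flatten_site(sites[0], tables))
    return n_dims / len(sites)
```

With only a few known sites per TF and many properties, the flat vector grows with every added property while the sample count stays fixed, so the ratio rises quickly. The flat layout also no longer records which entries belong to the same position.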
In this paper, we introduce a purely geometric representation, the tensor (also called a multidimensional array), to represent TFs using their physicochemical properties. To accompany the multidimensional array representation, we also develop a tensor-based recognition method, the tensor partial least squares classifier (abbreviated as TPLSC). Intuitively, a multidimensional array can carry more information than a one-dimensional array. The performance of each method is evaluated by the average F-measure on 51 Escherichia coli TFs from the RegulonDB database.
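The difference between the flat and the multidimensional layout, and the evaluation metric, can be sketched as below. This is an illustration under stated assumptions and does not implement TPLSC: the tensor is assumed to be arranged as sites × positions × properties, the F-measure is assumed to be the balanced F1 score, and all function names and the property tables passed in are hypothetical.

```python
def site_to_matrix(site, tables):
    """Encode one site as a (positions x properties) nested list.

    `tables` maps a property name to a dict of dinucleotide -> value;
    its contents are supplied by the caller (hypothetical here).
    """
    steps = [site[i:i + 2] for i in range(len(site) - 1)]
    names = sorted(tables)
    return [[tables[name][step] for name in names] for step in steps]


def sites_to_tensor(sites, tables):
    """Stack per-site matrices into a sites x positions x properties array."""
    return [site_to_matrix(site, tables) for site in sites]


def f_measure(tp, fp, fn):
    """Balanced F-measure (F1) from true/false positive and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f_measure(counts_per_tf):
    """Mean F-measure over TFs; each entry is a (tp, fp, fn) tuple."""
    scores = [f_measure(tp, fp, fn) for tp, fp, fn in counts_per_tf]
    return sum(scores) / len(scores)
```

Unlike a concatenated vector, each row of the per-site matrix keeps all property values for one position together, which is the structural information a flat encoding discards.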
In our first experiment, the results show that multiple nucleotide properties yield more predictive power than dinucleotide properties. In the second experiment, the results demonstrate that our method increases prediction power, with roughly a 33% improvement over the best result from existing methods.
The representation method for TFs is an important step in TFBS recognition. We illustrate the benefits of this representation in a real-data application through a series of experiments. This method can provide further insights into the mechanism of TF binding and be of great use for metabolic engineering applications.