Ji Guoli, Lin Qianmin, Long Yuqi, Ye Congting, Ye Wenbin, Wu Xiaohui
* Department of Automation, Xiamen University, Xiamen, Fujian, P. R. China.
† College of the Environment and Ecology, Xiamen University, Xiamen, Fujian, P. R. China.
J Bioinform Comput Biol. 2017 Oct;15(5):1750018. doi: 10.1142/S0219720017500184. Epub 2017 Aug 16.
Alternative polyadenylation (APA) is a pervasive mechanism that contributes to gene regulation. The growing number of sequenced poly(A) sites is placing new demands on the development of computational methods for investigating APA regulation. Cluster analysis is important for identifying groups of co-expressed genes. However, clustering of poly(A) sites has not been extensively studied, and most APA studies have failed to consider the distribution, abundance, and variation of APA sites in each gene. Here we constructed a two-layer model based on canonical correlation analysis (CCA) to explore the underlying biological mechanisms in APA regulation. The first layer quantifies the general correlation of APA sites across various conditions between each pair of genes, and the second layer identifies genes with statistically significant correlation in their APA patterns to infer APA-specific gene clusters. Using hierarchical clustering, we comprehensively compared our method with four other widely used distance measures on three performance indexes. Results showed that our method significantly enhanced clustering performance for both synthetic and real poly(A) site data and could generate clusters with more biological meaning. We have implemented the CCA-based method as a publicly available R package called PAcluster, which provides an efficient solution for clustering large APA-specific biological datasets.
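The two-layer idea in the abstract can be illustrated with a minimal sketch. This is not the PAcluster API, and the abstract does not give the exact statistics used, so several choices here are assumptions: each gene is represented as a (conditions x poly(A) sites) matrix, the "general correlation" between two genes is summarized by the first canonical correlation, significance is assessed with a simple permutation test (a stand-in, not necessarily the test used in the paper), and the distance between genes is taken as one minus the correlation. All function names are hypothetical.

```python
import numpy as np


def _orthonormal_basis(mat, tol=1e-10):
    """Orthonormal basis for the column space of the column-centered matrix."""
    centered = mat - mat.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    if s.size == 0 or s[0] <= tol:
        return u[:, :0]  # no variation at all: empty basis
    return u[:, s > tol * s[0]]  # drop numerically rank-deficient directions


def first_canonical_correlation(gene_a, gene_b):
    """Layer 1 (assumed summary): largest canonical correlation between
    two genes, each given as a (conditions x sites) expression matrix."""
    qa = _orthonormal_basis(np.asarray(gene_a, dtype=float))
    qb = _orthonormal_basis(np.asarray(gene_b, dtype=float))
    if qa.shape[1] == 0 or qb.shape[1] == 0:
        return 0.0  # a gene with constant sites carries no correlation signal
    # Singular values of Qa^T Qb are the canonical correlations.
    sv = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.clip(sv[0], 0.0, 1.0))


def permutation_p_value(gene_a, gene_b, n_perm=500, seed=0):
    """Layer 2 (assumed test): permutation p-value obtained by shuffling
    the condition order of one gene and recomputing the correlation."""
    rng = np.random.default_rng(seed)
    gene_b = np.asarray(gene_b, dtype=float)
    observed = first_canonical_correlation(gene_a, gene_b)
    hits = 0
    for _ in range(n_perm):
        shuffled = gene_b[rng.permutation(gene_b.shape[0])]
        if first_canonical_correlation(gene_a, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def cca_distance_matrix(genes):
    """Symmetric gene-by-gene distance matrix, distance = 1 - correlation."""
    n = len(genes)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            rho = first_canonical_correlation(genes[i], genes[j])
            dist[i, j] = dist[j, i] = 1.0 - rho
    return dist


# Synthetic example: genes 1 and 2 share an APA pattern, gene 3 is noise.
rng = np.random.default_rng(42)
n_conditions = 12
signal = rng.normal(size=n_conditions)
gene1 = np.column_stack([signal + 0.05 * rng.normal(size=n_conditions),
                         -signal + 0.05 * rng.normal(size=n_conditions)])
gene2 = np.column_stack([2 * signal + 0.05 * rng.normal(size=n_conditions),
                         rng.normal(size=n_conditions),
                         rng.normal(size=n_conditions)])
gene3 = rng.normal(size=(n_conditions, 2))

distances = cca_distance_matrix([gene1, gene2, gene3])
p_related = permutation_p_value(gene1, gene2)
```

Because CCA operates on whole matrices, genes with different numbers of poly(A) sites can be compared directly, which a per-vector measure such as Pearson correlation cannot do. The number of conditions should comfortably exceed the combined number of sites in the two genes; otherwise the first canonical correlation is trivially close to 1. The resulting matrix can be condensed and passed to any hierarchical clustering routine, for example `scipy.cluster.hierarchy.linkage`.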