Sarkar B K, Chakraborty Chiranjib
Department of Physics, School of Basic and Applied Sciences, Galgotias University, Greater Noida, India.
J Biosci. 2015 Oct;40(4):709-19. doi: 10.1007/s12038-015-9555-z.
We used canonical correlation analysis (CCA) as an unsupervised statistical tool that describes related views of the same semantic object in order to identify patterns. A CCA-based pattern recognition technique is proposed for locating a required genetic code in a DNA sequence. Two related but distinct objects were considered: one was a particular pattern, and the other was a test DNA sequence. CCA found correlations between these two observations of the same semantic object, the pattern and the test sequence. We conclude that the correlation reaches its maximum at the position where the pattern occurs. As a case study, the potential of CCA was demonstrated on sequences from HIV-1 preferred integration sites. The subsequences flanking the integration site on the left and right were treated as the two views, and statistically significant relationships were established between these two views to elucidate viral preference as an important factor in the correlation.
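The pattern-scanning idea can be illustrated with a minimal sketch. This is not the authors' implementation: the one-hot encoding of bases, the sliding window, the function names (`one_hot`, `first_canonical_corr`, `scan`), and the toy sequences are all assumptions made for illustration. The sketch treats the pattern and each window of the test sequence as two views, computes the first canonical correlation between them (the largest correlation achievable between linear combinations of the two views), and records one score per window position.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed encoding order; not specified in the abstract


def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) 0/1 matrix, one column per base."""
    idx = [ALPHABET.index(base) for base in seq]
    return np.eye(4)[idx]


def orthonormal_basis(M, tol=1e-10):
    """Orthonormal basis for the column space of the column-centered matrix."""
    Mc = M - M.mean(axis=0)
    U, s, _ = np.linalg.svd(Mc, full_matrices=False)
    return U[:, s > tol]  # drop directions with (numerically) zero variance


def first_canonical_corr(X, Y):
    """Largest canonical correlation between two views with the same row count."""
    Qx, Qy = orthonormal_basis(X), orthonormal_basis(Y)
    if Qx.shape[1] == 0 or Qy.shape[1] == 0:
        return 0.0  # a constant view carries no variance to correlate
    # Singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))  # clip tiny floating-point overshoot


def scan(pattern, sequence):
    """Score every window of the sequence against the pattern."""
    X = one_hot(pattern)
    m = len(pattern)
    return [
        first_canonical_corr(X, one_hot(sequence[i:i + m]))
        for i in range(len(sequence) - m + 1)
    ]


# Toy data (hypothetical): the pattern is embedded at offset 10.
pattern = "GATTACAGGCTA"
sequence = "CCGTAACGTT" + pattern + "TTGACCGATC"
scores = scan(pattern, sequence)
best = int(np.argmax(scores))
print(f"highest-scoring window starts at {best}, score {scores[best]:.3f}")
```

At the embedded offset the two views are identical, so the score reaches the maximum possible value of 1. Two caveats apply to this simplified encoding: CCA is invariant to invertible linear transformations of each view, so a window that is a consistent relabeling of the pattern's bases can also score 1, and very short patterns leave too few rows for the correlation to be informative.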