Caldwell Rachel, Lin Yan-Xia, Zhang Ren
School of Biological Sciences, University of Wollongong, Northfields Ave, NSW 2522, Australia.
Int J Data Min Bioinform. 2010;4(5):535-52. doi: 10.1504/ijdmb.2010.035899.
The growing availability of genomic DNA and cDNA sequence data has accelerated data mining in genomics. We investigate the length distributions of the non-coding and coding regions of protein-coding genes in two model organisms, Arabidopsis thaliana and Drosophila melanogaster. A non-linear functional relationship model was applied, and a strong correlation was found between the Coding Sequence (CDS) and the non-coding sequence regions, conditional on the 5' UTR data. Significant differences were found among the protein functional classes for each gene region. Taken together, the examination of these organisms points to possible correlations between their non-coding and coding regions.