Hu Hae-Jin, Goh Sung-Ho, Lee Yeon-Su
Functional Genomics Branch, Research Institute, National Cancer Center, Gyeonggi-do, Republic of Korea.
Genes Genet Syst. 2010;85(6):383-94. doi: 10.1266/ggs.85.383.
Alternative splicing is a major source of protein diversity, and aberrant splicing is known to be one of the main causes of genetic disorders such as cancer. Many statistical and computational approaches have identified several major factors that determine splicing events, such as exon/intron length, splice site strength, and the density of splicing enhancers or silencers. These factors may be correlated with one another and thus jointly result in a specific type of splicing, but there has been no systematic approach to extracting comprehensible association patterns among them. Here, we attempted to understand the decision-making process of a learning machine applied to intron retention events. We adopted a hybrid learning machine approach, combining a random forest with an association rule mining algorithm, to determine the governing factors of intron retention events and their combined effect on the decision-making process. By quantizing all candidate features into five categorical levels, we made the generated rules easier to understand. An interesting finding from the random forest algorithm is that only adenine- and thymine-based triplets such as ATA, TTA, and ATT were identified as significant features, whereas the known intronic splicing enhancer triplet GGG was not. The rules generated by the association rule mining algorithm also show that constitutive introns are generally characterized by high levels (level 3 and above) of adenine- and thymine-based triplet frequency, 3' and 5' splice site scores, exonic splicing silencer scores, and intron length, whereas retained introns are characterized by low levels of the same features.
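To make the rule-mining idea concrete, the following is a minimal sketch in plain Python, not the authors' pipeline. It covers only three pieces: counting a triplet's frequency in an intron sequence, mapping a numeric feature to five levels, and mining class association rules of the form "feature levels imply intron type". The random forest step is omitted. The function names, the rank-based (equal-frequency) binning, the support and confidence thresholds, and the toy data are all assumptions made for illustration; the abstract does not state how the five levels were defined or which thresholds were used.

```python
from collections import Counter
from itertools import combinations


def triplet_frequency(seq, triplet):
    """Fraction of overlapping 3-mers in seq that equal triplet."""
    seq = seq.upper()
    n = len(seq) - 2
    if n <= 0:
        return 0.0
    hits = sum(1 for i in range(n) if seq[i:i + 3] == triplet)
    return hits / n


def to_levels(values, n_levels=5):
    """Map values to levels 1..n_levels by rank (equal-frequency bins).

    Assumption: rank-based binning. Tied values may land in adjacent
    levels, which is acceptable for a sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, i in enumerate(order):
        levels[i] = rank * n_levels // len(values) + 1
    return levels


def mine_rules(rows, labels, min_support=0.2, min_confidence=0.8, max_len=2):
    """Return rules (antecedent, label, support, confidence).

    rows: list of dicts mapping feature name -> level.
    An antecedent is a tuple of (feature, level) pairs.
    """
    n = len(rows)
    ante_counts = Counter()
    joint_counts = Counter()
    for row, label in zip(rows, labels):
        items = sorted(row.items())
        for k in range(1, max_len + 1):
            for ante in combinations(items, k):
                ante_counts[ante] += 1
                joint_counts[(ante, label)] += 1
    rules = []
    for (ante, label), joint in joint_counts.items():
        support = joint / n
        confidence = joint / ante_counts[ante]
        if support >= min_support and confidence >= min_confidence:
            rules.append((ante, label, support, confidence))
    # Most confident rules first, then by support.
    return sorted(rules, key=lambda r: (-r[3], -r[2]))


# Toy, made-up data: ten hypothetical introns (not from the paper).
ata_freq = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
intron_len = [80, 120, 95, 150, 300, 900, 1200, 700, 2500, 4000]
labels = ["retained"] * 5 + ["constitutive"] * 5

rows = [
    {"ATA": a, "length": l}
    for a, l in zip(to_levels(ata_freq), to_levels(intron_len))
]
rules = mine_rules(rows, labels)

for ante, label, support, confidence in rules:
    conditions = " AND ".join(f"{name}=level {lvl}" for name, lvl in ante)
    print(f"{conditions} -> {label} "
          f"(support={support:.2f}, confidence={confidence:.2f})")
```

Working with levels rather than raw values is what keeps the output readable: each rule is a short statement such as "ATA=level 1 -> retained", which can be compared directly across features.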