Zhao Yingjie, Wang Zhengzhi
College of Mechatronics Engineering and Automation, National University of Defense Technology, Changsha 410073, China.
Sheng Wu Yi Xue Gong Cheng Xue Za Zhi. 2010 Aug;27(4):779-84.
In computational molecular biology, detecting non-coding RNA genes among large numbers of unlabeled sequences remains a challenging problem. Machine learning and classification methods are generally used to address it. However, only a limited number of positive training samples, together with unlabeled samples, are available. Negative samples are difficult to define appropriately, yet they are required by the usual learn-then-classify approach. Most existing non-coding RNA gene finders generate random sequences as negative samples, but such sequences may share some characteristics of the positive samples. This introduces an artificial source of uncertainty and degrades performance. In this paper, Support Vector Data Description (SVDD) is used for learning and classification, and thus for detecting non-coding RNA genes among large numbers of unlabeled sequences; the k-means clustering algorithm is applied before SVDD training to address the high false-positive rate of the SVDD results. The training samples (target samples) are experimentally validated non-coding RNA genes, and appropriate features are constructed by Principal Component Analysis (PCA). The effectiveness and performance of the method are demonstrated by tests on cases from the NONCODE database and the E. coli genome.
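The abstract does not say how k-means and SVDD are combined, so the following is only a minimal sketch of one plausible reading: cluster the target samples first, then fit one description per cluster, so that a query lying in the empty space between clusters is no longer accepted. Several things here are assumptions rather than the authors' method. The SVDD step is replaced by its simplest special case, a hard-margin, linear-kernel description, which reduces to a minimum enclosing ball and is approximated here with the Badoiu-Clarkson iteration. The 2-D feature vectors are toy values standing in for PCA-derived sequence features. All function names (`kmeans`, `enclosing_ball`, `fit_descriptions`, `is_target`) are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns the non-empty clusters as lists of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    groups = [points]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return [g for g in groups if g]


def enclosing_ball(points, iters=200):
    """Approximate minimum enclosing ball (Badoiu-Clarkson iteration).

    Stand-in for hard-margin, linear-kernel SVDD. The radius is taken as the
    largest training distance, so every training point lies inside the ball.
    """
    center = list(points[0])
    for t in range(1, iters + 1):
        far = max(points, key=lambda p: math.dist(p, center))
        center = [c + (f - c) / (t + 1) for c, f in zip(center, far)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius


def fit_descriptions(targets, k):
    """Cluster the target samples, then fit one ball per cluster."""
    return [enclosing_ball(cluster) for cluster in kmeans(targets, k)]


def is_target(x, balls):
    """Accept x if it falls inside any of the fitted balls."""
    return any(math.dist(x, c) <= r + 1e-9 for c, r in balls)


# Toy target set: two well-separated groups of 2-D feature vectors.
grid = [[0.5 * i, 0.5 * j] for i in range(-2, 3) for j in range(-2, 3)]
targets = grid + [[x + 10.0, y + 10.0] for x, y in grid]

single = fit_descriptions(targets, k=1)     # one description for everything
clustered = fit_descriptions(targets, k=2)  # one description per cluster

query = [5.0, 5.0]  # lies between the two groups, far from every target
print(is_target(query, single), is_target(query, clustered))
```

In this toy setting the single description accepts the query between the two groups, whereas the per-cluster descriptions reject it while still covering every training sample. A kernelized SVDD with slack variables, as the abstract implies, would give tighter and more flexible boundaries than the balls used here.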