Khalifa Waleed, Yousef Malik, Saçar Demirci Müşerref Duygu, Allmer Jens
Computer Science, The College of Sakhnin, Sakhnin, Israel.
The Institute of Applied Research- The Galilee Society, Shefa Amr, Israel.
PeerJ. 2016 Jun 21;4:e2135. doi: 10.7717/peerj.2135. eCollection 2016.
MicroRNAs (miRNAs) are short nucleotide sequences that form a typical hairpin structure, which is recognized by a complex enzyme machinery. This processing ultimately leads to the incorporation of 18-24 nt long mature miRNAs into the RNA-induced silencing complex (RISC), where they act as recognition keys that aid in the regulation of target mRNAs. Determining miRNAs experimentally is involved, so machine learning is used to complement such efforts. The success of machine learning depends mostly on proper input data and on appropriate features for parameterizing the data. Two-class classification (TCC) is generally used in the field, but because negative examples are hard to come by, one-class classification (OCC) has also been tried for pre-miRNA detection. Since both positive and negative examples are currently somewhat limited, feature selection can prove vital for advancing pre-miRNA detection. In this study, we compare the performance of OCC and TCC using eight feature selection methods and seven different plant species that provide positive pre-miRNA examples. Feature selection was very successful for OCC: the best feature selection method achieved an average accuracy of 95.6%, roughly 29 percentage points above the worst method, which achieved 66.9% accuracy. OCC performance is comparable to that of TCC, which performs up to 3% better. TCC, however, is much less affected by feature selection; its largest performance gap is ∼13%, which occurs for only two of the feature selection methods. We conclude that feature selection is crucially important for OCC and that OCC can perform on par with TCC given the proper set of features.
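To make the distinction between the two settings concrete, the following is a minimal toy sketch and not the study's method, data, or feature selection procedures. It assumes synthetic Gaussian features, a hypothetical split into informative and uninformative feature subsets (stand-ins for a good and a poor feature selection outcome), a one-class model that thresholds a z-score distance learned from positives only, and a nearest-centroid rule as the two-class model. All names and parameters are illustrative assumptions.

```python
import random
import statistics

random.seed(0)

# Hypothetical feature layout (an assumption for illustration only).
INFORMATIVE = [0, 1]           # features that differ between the classes
UNINFORMATIVE = [2, 3, 4, 5]   # features with no class signal


def make_sample(is_positive):
    """Draw one synthetic example: two signal features, four noise features."""
    center = 2.0 if is_positive else -2.0
    signal = [random.gauss(center, 0.5) for _ in INFORMATIVE]
    noise = [random.gauss(0.0, 3.0) for _ in UNINFORMATIVE]
    return signal + noise


def make_set(n):
    """Return n positive and n negative synthetic examples."""
    return ([make_sample(True) for _ in range(n)],
            [make_sample(False) for _ in range(n)])


def select(rows, features):
    """Keep only the chosen feature columns."""
    return [[row[j] for j in features] for row in rows]


def fit_occ(pos, accept=0.9):
    """One-class model: trained on positives only.

    Accepts a sample if its summed squared z-score is no larger than the
    score that covers about `accept` of the training positives.
    """
    cols = list(zip(*pos))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]

    def score(x):
        return sum(((v - m) / s) ** 2 for v, m, s in zip(x, means, stds))

    train_scores = sorted(score(x) for x in pos)
    threshold = train_scores[int(accept * (len(pos) - 1))]
    return lambda x: score(x) <= threshold


def fit_tcc(pos, neg):
    """Two-class model: nearest centroid, trained on positives and negatives."""
    c_pos = [statistics.fmean(c) for c in zip(*pos)]
    c_neg = [statistics.fmean(c) for c in zip(*neg)]

    def dist2(x, c):
        return sum((v - m) ** 2 for v, m in zip(x, c))

    return lambda x: dist2(x, c_pos) <= dist2(x, c_neg)


def accuracy(predict, pos, neg):
    """Fraction of positives accepted plus negatives rejected."""
    correct = sum(predict(x) for x in pos) + sum(not predict(x) for x in neg)
    return correct / (len(pos) + len(neg))


train_pos, train_neg = make_set(200)
test_pos, test_neg = make_set(200)

results = {}
for name, feats in [("informative", INFORMATIVE),
                    ("uninformative", UNINFORMATIVE)]:
    tr_p, tr_n = select(train_pos, feats), select(train_neg, feats)
    te_p, te_n = select(test_pos, feats), select(test_neg, feats)
    results[("OCC", name)] = accuracy(fit_occ(tr_p), te_p, te_n)
    results[("TCC", name)] = accuracy(fit_tcc(tr_p, tr_n), te_p, te_n)

for key, acc in sorted(results.items()):
    print(key, round(acc, 3))
```

In this toy setting the one-class model never sees a negative example, so its accuracy depends entirely on whether the retained features happen to separate the classes. The sketch says nothing about how the eight feature selection methods in the study behave.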