Li Junyi, Li Huinian, Ye Xiao, Zhang Li, Xu Qingzhe, Ping Yuan, Jing Xiaozhu, Jiang Wei, Liao Qing, Liu Bo, Wang Yadong
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China.
Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, Heilongjiang, China.
BMC Bioinformatics. 2021 May 13;22(Suppl 3):243. doi: 10.1186/s12859-020-03884-w.
The prediction of long non-coding RNAs (lncRNAs) has attracted great attention from researchers, as growing evidence indicates that various complex human diseases are closely related to lncRNAs. In the era of biomedical big data, alongside the identification of lncRNAs by biological experiments, many computational methods based on machine learning have been proposed to make better use of lncRNA sequence resources.
We developed an lncRNA prediction method that integrates information-entropy-based features with machine learning algorithms. We calculate the generalized topological entropy and derive 6 novel features for lncRNA sequences. Using these 6 features together with other features such as the open reading frame, we apply support vector machine, XGBoost, and random forest algorithms to distinguish human lncRNAs. We compare our method with one that uses more k-mer features, and the results show that our method achieves a higher area under the curve, up to 99.7905%.
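As a rough illustration of sequence-derived features of this kind, the sketch below computes the Shannon entropy of a sequence's k-mer distribution and the length of its longest forward-strand open reading frame. It is a simplified stand-in under stated assumptions, not the generalized topological entropy or the 6 features described above; the function names `kmer_entropy`, `longest_orf_length`, and `feature_vector` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter

# Standard DNA stop codons; ATG is assumed to be the only start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_entropy(seq, k=3):
    """Shannon entropy (in bits) of the k-mer frequency distribution of seq.

    Illustrative stand-in only: this is NOT the generalized topological
    entropy used in the paper.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def longest_orf_length(seq):
    """Length (nt) of the longest ATG...stop ORF in the 3 forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best


def feature_vector(seq):
    """Hypothetical feature vector: entropies for k = 1, 2, 3 plus ORF length."""
    return [kmer_entropy(seq, k) for k in (1, 2, 3)] + [longest_orf_length(seq)]
```

A vector such as `feature_vector(sequence)` could then be passed to any standard classifier (for example a support vector machine or a random forest); the choice of k values and the forward-strand-only ORF scan are simplifications made here for brevity.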
We developed an accurate and efficient method that uses novel information-entropy features to analyze and classify lncRNAs. The method can also be extended to research on other functional elements in DNA sequences.