School of Science, Minzu University of China, Beijing, 100081, China.
Biochem Biophys Res Commun. 2024 Nov 19;734:150618. doi: 10.1016/j.bbrc.2024.150618. Epub 2024 Aug 29.
DNase I hypersensitive sites (DHSs) are key markers of chromatin accessibility and are closely linked to fundamental biological processes, including the regulation of gene expression and disease pathogenesis. Efficient and accurate algorithms for DHS identification are therefore important for understanding genome function and elucidating disease mechanisms. This study presents iDHS-RGME, a DHS predictor based on Extremely Randomized Trees (Extra-Trees). iDHS-RGME uses two feature extraction methods, Reverse Complementary Kmer (RCKmer) and Geary Spatial Autocorrelation (GSA), which capture complementary aspects of the sequence. Borderline-SMOTE is applied to address class imbalance, and the Maximum Information Coefficient (MIC) is then used for feature selection. In comparative evaluations, the Extra-Trees classifier performed best among the classifiers tested and was adopted for the final model. Under five-fold cross-validation, iDHS-RGME achieved accuracies of 94.71% and 95.07% on two independent datasets, outperforming previous models in both precision and effectiveness.
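To make the RCKmer step more concrete, the following is a minimal sketch of one common way to compute reverse-complement k-mer features: each k-mer is merged with its reverse complement, and the merged counts are normalized to frequencies. The function names (`reverse_complement`, `canonical`, `rckmer_features`), the default `k=2`, and the frequency normalization are illustrative assumptions; the abstract does not specify the exact settings used in iDHS-RGME.

```python
from itertools import product

# Watson-Crick base pairing used to build the reverse complement.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer (uppercase A/C/G/T)."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def canonical(kmer):
    """Map a k-mer and its reverse complement to one shared key
    (the lexicographically smaller of the two)."""
    return min(kmer, reverse_complement(kmer))


def rckmer_features(sequence, k=2):
    """Hypothetical helper: reverse-complement k-mer frequencies.

    Returns a dict from canonical k-mer to its relative frequency.
    Windows containing characters other than A/C/G/T are skipped.
    """
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = dict.fromkeys(keys, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
            total += 1
    # Assumption: normalize counts so the features sum to 1.
    return {key: (counts[key] / total if total else 0.0) for key in keys}


if __name__ == "__main__":
    features = rckmer_features("ACGT", k=2)
    for kmer, freq in features.items():
        print(kmer, round(freq, 3))
```

Merging each k-mer with its reverse complement reduces the feature dimension (10 features instead of 16 for k=2, and 32 instead of 64 for k=3) and makes the representation independent of which DNA strand was sequenced. In a pipeline like the one described, such a vector would be concatenated with the GSA features before resampling, feature selection, and classification.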