School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China.
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China; School of Mathematics & Computer Science, Yanan University, Shanxi, 716000, China.
Comput Biol Med. 2023 Feb;153:106523. doi: 10.1016/j.compbiomed.2022.106523. Epub 2023 Jan 2.
Predicting the essential genes of a living organism is one of the central tasks in synthetic biology. Computational predictors are needed because experimental data is often unavailable. Several sequence-based predictors have recently been constructed to identify essential genes, but their predictive performance still leaves room for improvement. One key problem is how to extract sequence-based features that effectively discriminate essential genes from non-essential ones. Another problem is the imbalanced training set: in human cell lines, essential genes are fewer than non-essential genes, so predictors trained on such an imbalanced set tend to classify an unseen sequence as a non-essential gene. Here, a new over-sampling strategy called Clustering based Synthetic Minority Oversampling Technique (CSMOTE) is proposed to overcome the imbalanced data issue. Combining CSMOTE with the Z curve, the global features, and Support Vector Machines, a new protocol called iEsGene-CSMOTE is proposed to identify essential genes. Rigorous jackknife cross-validation results indicate that iEsGene-CSMOTE performs better than the competing methods. The proposed method outperformed the λ-interval Z curve by 35.48% and 11.25% in terms of sensitivity (Sn) and balanced accuracy (BACC), respectively.
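The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the Z-curve variant, the global features, or how CSMOTE clusters and samples. The sketch therefore assumes the basic three-component Z curve computed from whole-sequence base frequencies, the generic SMOTE interpolation step between two minority samples (with any clustering assumed to have been done beforehand), and the usual definitions Sn = TP/(TP+FN) and BACC = (Sn+Sp)/2. The function names `z_curve_features`, `smote_point`, and `sn_and_bacc` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def z_curve_features(seq: str) -> Tuple[float, float, float]:
    """Basic Z curve components from whole-sequence base frequencies.

    Assumption: this is the simplest Z-curve form, not necessarily
    the variant used in iEsGene-CSMOTE.
    """
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    a, c, g, t = (counts[base] / total for base in "ACGT")
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bonding
    return x, y, z


def smote_point(a: Sequence[float], b: Sequence[float],
                rng: random.Random) -> List[float]:
    """Generic SMOTE step: a random point on the segment from a to b.

    Assumption: a and b are minority-class feature vectors taken from
    the same cluster; how CSMOTE picks them is not shown here.
    """
    gap = rng.random()  # uniform in [0, 1)
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]


def sn_and_bacc(y_true: Sequence[int],
                y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity and balanced accuracy for labels 1 (essential) and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sn = tp / (tp + fn)  # sensitivity (recall on essential genes)
    sp = tn / (tn + fp)  # specificity (recall on non-essential genes)
    return sn, (sn + sp) / 2


if __name__ == "__main__":
    print(z_curve_features("AACG"))
    print(smote_point([0.0, 0.0], [2.0, 4.0], random.Random(0)))
    print(sn_and_bacc([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

Balanced accuracy averages the recall of both classes, which is why it is a more informative summary than plain accuracy when non-essential genes dominate the data set.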