BMC Genomics. 2013;14 Suppl 6(Suppl 6):S6. doi: 10.1186/1471-2164-14-S6-S6. Epub 2013 Oct 25.
Data preprocessing is a major step in data mining. In data preprocessing, several known techniques can be applied, or new ones developed, to improve data quality so that the mining results become more accurate and intelligible. Bioinformatics is one area with a high demand for the generation of comprehensive models from large datasets. In this article, we propose a context-based data preprocessing approach to mine data from molecular docking simulation results. The test cases used a fully-flexible receptor (FFR) model of the Mycobacterium tuberculosis InhA enzyme (FFR_InhA) and four different ligands.
We generated an initial set of attributes as well as their respective instances. To improve this initial set, we applied two selection strategies. The first was based on our context-based approach while the second used the CFS (Correlation-based Feature Selection) machine learning algorithm. Additionally, we produced an extra dataset containing features selected by combining our context strategy and the CFS algorithm. To demonstrate the effectiveness of the proposed method, we evaluated its performance based on various predictive (RMSE, MAE, Correlation, and Nodes) and context (Precision, Recall and FScore) measures.
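The measures named above can be illustrated with a minimal sketch. It assumes the standard textbook definitions of RMSE, MAE and Pearson correlation, and it assumes that the context measures compare a selected feature set against a reference set of context-relevant features; the abstract does not give the exact definitions, so this reading is an assumption. All function names, feature names and numbers below are hypothetical and used only for illustration. The Nodes measure is left out because it depends on the learned model.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson(actual, predicted):
    """Pearson correlation coefficient (assumes neither sequence is constant)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def precision_recall_fscore(selected, relevant):
    """Set-based precision, recall and F-score of a selected feature set.

    Assumption: 'relevant' is a reference set of context-relevant features.
    """
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


# Made-up target values and model predictions (not data from the study).
actual = [-9.0, -8.0, -7.0, -10.0]
predicted = [-8.5, -8.5, -7.0, -9.0]
print(rmse(actual, predicted), mae(actual, predicted), pearson(actual, predicted))

# Hypothetical feature names; 2 of the 3 selected features are in the reference set.
selected = {"feature_1", "feature_2", "feature_3"}
relevant = {"feature_2", "feature_3", "feature_4", "feature_5"}
print(precision_recall_fscore(selected, relevant))
```

Under this reading, lower RMSE and MAE and higher correlation indicate better prediction, while higher precision, recall and F-score indicate that the selected features agree more closely with the reference set.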
Statistical analysis of the results shows that the proposed context-based data preprocessing approach significantly improves predictive and context measures and outperforms the CFS algorithm. Context-based data preprocessing improves mining results by producing more interpretable models, which makes it well-suited for practical applications in molecular docking simulations using FFR models.