Li Yuting, Dai Zhijun, Cao Dan, Luo Feng, Chen Yuan, Yuan Zheming
Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, 410128, China
School of Computing, Clemson University, Clemson, SC, USA
RSC Adv. 2020 May 27;10(34):19852-19860. doi: 10.1039/d0ra00061b. eCollection 2020 May 26.
Quantitative structure-activity relationship models are used in toxicology to predict the effects of organic compounds on aquatic organisms. Common filter feature selection methods use correlation statistics to rank features, but this approach considers only the correlation between a single feature and the response variable and does not take into account feature redundancy. Although the minimal redundancy maximal relevance approach considers the redundancy among features, direct removal of the redundant features may result in loss of prediction accuracy, and cross-validation of training sets to select an optimal subset of features is time-consuming. In this paper, we describe the development of a feature selection method, Chi-MIC-share, which can terminate feature selection automatically and is based on an improved maximal information coefficient and a redundant allocation strategy. We validated Chi-MIC-share using three environmental toxicology datasets and a support vector regression model. The results show that Chi-MIC-share is more accurate than other feature selection methods. We also performed a significance test on the model and analyzed the single-factor effects of the reserved descriptors.
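To make the idea of redundancy-aware filter selection with automatic termination concrete, the following is a minimal Python sketch. It is not the Chi-MIC-share algorithm, whose details are not given in this abstract. As assumptions for illustration only, it uses the absolute Pearson correlation as a stand-in for the improved maximal information coefficient, discounts each candidate's relevance by its strongest dependence on an already selected feature as a stand-in for the redundant allocation strategy, and stops when the best discounted score falls below a hypothetical threshold `min_gain`. The function names `abs_corr` and `greedy_select` are likewise hypothetical.

```python
import numpy as np


def abs_corr(a, b):
    """Absolute Pearson correlation (assumed stand-in for a measure such as MIC)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return 0.0 if denom == 0 else abs(a @ b) / denom


def greedy_select(X, y, min_gain=0.05):
    """Greedy filter selection with a redundancy discount and automatic stopping.

    X is an (n_samples, n_features) array and y an (n_samples,) response.
    min_gain is a hypothetical stopping threshold, not a value from the paper.
    """
    n_features = X.shape[1]
    # Relevance of each feature to the response, computed once.
    relevance = np.array([abs_corr(X[:, j], y) for j in range(n_features)])
    selected = []
    remaining = list(range(n_features))
    while remaining:
        scores = []
        for j in remaining:
            # Strongest dependence on any feature already selected (0 if none yet).
            redundancy = max(
                (abs_corr(X[:, j], X[:, k]) for k in selected), default=0.0
            )
            # Discount relevance instead of discarding the feature outright.
            scores.append(relevance[j] * (1.0 - redundancy))
        best = int(np.argmax(scores))
        # Terminate automatically once no candidate adds enough information.
        if scores[best] < min_gain:
            break
        selected.append(remaining.pop(best))
    return selected
```

Under these assumptions, a feature that duplicates an already selected one receives a score near zero and is never added, and selection ends without cross-validating over candidate subset sizes. The returned column indices could then be passed to any regressor, such as the support vector regression model mentioned in the abstract.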