ARKA: a framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data.

Affiliation

Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India.

Publication information

Environ Sci Process Impacts. 2024 Jun 19;26(6):991-1007. doi: 10.1039/d4em00173g.

Abstract

Because experimental toxicity data for environmental chemicals are scarce, data gaps need to be filled by in silico approaches. One of the most commonly used approaches for toxicity assessment of small datasets is the Quantitative Structure-Activity Relationship (QSAR), which generates predictive models for the efficient prediction of query compounds. However, the reliability of predictions from QSARs derived from small datasets is often questionable from a statistical point of view, because the number of descriptors is large relative to the number of training compounds, which reduces the degrees of freedom of the developed model. To reduce the overall prediction error of a particular QSAR model, we propose here the computation of the novel Arithmetic Residuals in K-groups Analysis (ARKA) descriptors. We reduce the number of modeling descriptors in a supervised manner by partitioning them into K classes (K = 2 here), assigning each descriptor to the response class for which its mean normalized value is higher, thus preventing the loss of chemical information. A scatter plot of the data points using the values of the two ARKA descriptors (ARKA_2 vs. ARKA_1) can potentially identify activity cliffs, less confident data points, and less modelable data points. We applied the ARKA framework to classification modeling of five representative environmentally relevant endpoints with graded responses: skin sensitization, earthworm toxicity, milk/plasma partitioning, algal toxicity, and rodent carcinogenicity of hazardous chemicals. When models built from conventional QSAR descriptors were compared with models built from ARKA descriptors, decision criteria derived from multiple graded-data validation metrics showed that the prediction quality of the ARKA-based models was much better, signifying the potential of ARKA descriptors in ecotoxicological classification modeling of small datasets. The same holds for the Read-Across approach, since Read-Across predictions using ARKA descriptors outperform those generated from QSAR descriptors. For the ease of users, a Java-based expert system has been developed that computes the ARKA descriptors from an input of QSAR descriptors.
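
The partitioning step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the ARKA formula, so the min-max normalization, the tie-breaking rule, and the plain unweighted sum used to collapse each descriptor group into one score are all assumptions made for illustration. The function names and the toy data are hypothetical.

```python
from statistics import mean

def min_max_normalize(X):
    """Scale each descriptor column to [0, 1] (assumed normalization)."""
    scaled_cols = []
    for col in zip(*X):
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column carries no information; map it to 0.0.
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]

def partition_descriptors(Xn, y):
    """Assign each descriptor index to the class (0 or 1) whose compounds
    have the higher mean normalized value. Ties go to class 0 (assumption).
    Both classes must be present in y, otherwise mean() raises."""
    groups = {0: [], 1: []}
    for j in range(len(Xn[0])):
        m0 = mean(row[j] for row, label in zip(Xn, y) if label == 0)
        m1 = mean(row[j] for row, label in zip(Xn, y) if label == 1)
        groups[1 if m1 > m0 else 0].append(j)
    return groups

def group_scores(Xn, groups):
    """Collapse each descriptor group into one value per compound.
    An unweighted sum is used here; the published weighting may differ."""
    return [
        (sum(row[j] for j in groups[0]), sum(row[j] for j in groups[1]))
        for row in Xn
    ]

# Toy data: 4 compounds x 3 descriptors, binary response (hypothetical).
X = [
    [1.0, 10.0, 5.0],
    [2.0, 8.0, 6.0],
    [8.0, 2.0, 5.5],
    [9.0, 1.0, 7.0],
]
y = [0, 0, 1, 1]

Xn = min_max_normalize(X)
groups = partition_descriptors(Xn, y)
scores = group_scores(Xn, groups)

print(groups)
for label, (s0, s1) in zip(y, scores):
    print(label, round(s0, 3), round(s1, 3))
```

Under these assumptions, three descriptors collapse to two scores per compound, and plotting one score against the other gives the kind of two-dimensional view the abstract describes for spotting poorly separated data points.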
