Arturi Katarzyna, Harris Eliza J, Gasser Lilian, Escher Beate I, Braun Georg, Bosshard Robin, Hollender Juliane
Department of Environmental Chemistry, Swiss Federal Institute of Aquatic Science and Technology (Eawag), Überlandstrasse 133, 8600, Dübendorf, Switzerland.
Swiss Data Science Center (SDSC), Andreasstrasse 5, 8092, Zürich, Switzerland.
J Cheminform. 2025 Jan 31;17(1):14. doi: 10.1186/s13321-025-00950-4.
MLinvitroTox is an automated Python pipeline developed for high-throughput hazard-driven prioritization of toxicologically relevant signals detected in complex environmental samples through high-resolution tandem mass spectrometry (HRMS/MS). MLinvitroTox is a machine learning (ML) framework comprising 490 independent XGBoost classifiers trained on molecular fingerprints from chemical structures and target-specific endpoints from the ToxCast/Tox21 invitroDBv4.1 database. For each analyzed HRMS feature, MLinvitroTox generates a 490-bit bioactivity fingerprint used as a basis for prioritization, focusing the time-consuming molecular identification efforts on features most likely to cause adverse effects. The practical advantages of MLinvitroTox are demonstrated for groundwater HRMS data. Among the 874 features for which molecular fingerprints were derived from spectra, including 630 nontargets, 185 spectral matches, and 59 targets, around 4% of the feature/endpoint relationship pairs were predicted to be active. Cross-checking the predictions for targets and spectral matches with invitroDB data confirmed the bioactivity of 120 active and 6791 nonactive pairs while mislabeling 88 active and 56 nonactive relationships. By filtering according to bioactivity probability, endpoint scores, and similarity to the training data, the number of potentially toxic features was reduced by at least one order of magnitude. This refinement makes the analytical confirmation of the toxicologically most relevant features feasible, offering significant benefits for cost-efficient chemical risk assessment.

Scientific Contribution: In contrast to the classical ML-based approaches for toxicity prediction, MLinvitroTox predicts bioactivity for HRMS features (i.e., distinct m/z signals) based on MS2 fragmentation spectra rather than on the chemical structures of identified features.
While the original proof-of-concept study was accompanied by the release of an MLinvitroTox v1 KNIME workflow, in this study we release a Python MLinvitroTox v2 package, which, in addition to automation, expands functionality to include predicting toxicity from structures, cleaning up and generating chemical fingerprints, customizing models, and retraining on custom data. Furthermore, improvements in bioactivity data processing, implemented in the concurrently released pytcpl Python package for custom processing of the invitroDBv4.1 input data used to train MLinvitroTox, yield gains in model accuracy, coverage of biological mechanistic targets, and overall interpretability in the current release.
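
The prioritization idea and the cross-check figures in the abstract can be illustrated with a minimal sketch. This is not the MLinvitroTox API: the function names (`bioactivity_fingerprint`, `prioritize`, `agreement`), the 0.5 probability threshold, the ranking by number of active endpoints, and the synthetic probabilities are all assumptions made for illustration. Only the fingerprint length (490) and the four cross-check counts come from the abstract.

```python
import random

N_ENDPOINTS = 490  # one classifier per endpoint, as stated in the abstract


def bioactivity_fingerprint(probabilities, threshold=0.5):
    """Binarize per-endpoint activity probabilities into a 0/1 fingerprint.

    The threshold is an assumed value, not one taken from the paper.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


def prioritize(features, threshold=0.5, min_active=1):
    """Rank features by how many endpoints are predicted active (assumed rule).

    `features` maps a feature name to its list of endpoint probabilities.
    Features with fewer than `min_active` active endpoints are dropped.
    """
    ranked = []
    for name, probs in features.items():
        n_active = sum(bioactivity_fingerprint(probs, threshold))
        if n_active >= min_active:
            ranked.append((name, n_active))
    # Most active endpoints first; ties broken by name for determinism.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


def agreement(confirmed_active, confirmed_nonactive,
              mislabeled_active, mislabeled_nonactive):
    """Fraction of cross-checked pairs whose prediction matched invitroDB."""
    confirmed = confirmed_active + confirmed_nonactive
    total = confirmed + mislabeled_active + mislabeled_nonactive
    return confirmed / total


# Synthetic probabilities standing in for classifier outputs (hypothetical).
rng = random.Random(0)
features = {
    f"feature_{i}": [rng.random() ** 8 for _ in range(N_ENDPOINTS)]
    for i in range(5)
}
ranking = prioritize(features, threshold=0.5, min_active=1)

# Counts reported in the abstract: 120 + 6791 confirmed, 88 + 56 mislabeled.
overall = agreement(120, 6791, 88, 56)
print(ranking)
print(f"{overall:.4f}")  # 6911 / 7055, about 0.9796
```

The agreement figure covers only the pairs for targets and spectral matches that could be checked against invitroDB. Because most of those pairs are nonactive, it says little on its own about how well active pairs are recovered.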