Transgero Limited, Cullinagh, Newcastle West, Co. Limerick, Ireland; Dept. of Accounting and Finance, Kemmy Business School, University of Limerick, V94PH93, Ireland.
Department of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, the Netherlands.
NanoImpact. 2023 Jul;31:100475. doi: 10.1016/j.impact.2023.100475. Epub 2023 Jul 7.
The current push towards digital transformation across multiple scientific domains requires data that are Findable, Accessible, Interoperable and Reusable (FAIR). Beyond FAIR data, applying computational tools such as Quantitative Structure Activity Relationships (QSARs) requires sufficient data volume and the ability to merge sources into homogeneous digital assets. In the nanosafety domain there is a lack of available FAIR metadata.
To address this challenge, we annotated 34 datasets from the nanosafety domain and assessed their reusability with the NanoSafety Data Reusability Assessment (NSDRA) framework. Based on the results of applying the framework, eight datasets targeting the same endpoint (i.e. numerical cellular viability) were selected, processed and merged to test several hypotheses, including universal versus nanogroup-specific QSAR models (metal oxide and nanotubes) and regression versus classification Machine Learning (ML) algorithms.
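The selection and merging step can be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: the record layouts, column names, the `COLUMN_MAPS` table and the helper functions are all hypothetical, and the values are made up for demonstration.

```python
from collections import defaultdict

# Hypothetical records from two source datasets that use different column names.
dataset_a = [
    {"material": "ZnO", "group": "metal oxide", "core_size_nm": 20.0,
     "endpoint": "cell viability", "viability_pct": 45.0},
    {"material": "TiO2", "group": "metal oxide", "core_size_nm": 25.0,
     "endpoint": "cell viability", "viability_pct": 92.0},
]
dataset_b = [
    {"nm_name": "MWCNT", "nm_class": "nanotubes", "size": 12.0,
     "assay_endpoint": "cell viability", "value": 70.0},
    {"nm_name": "CuO", "nm_class": "metal oxide", "size": 40.0,
     "assay_endpoint": "ROS generation", "value": 3.1},
]

# Assumed per-source mapping from local column names to one shared schema.
COLUMN_MAPS = {
    "A": {"material": "material", "group": "nanogroup",
          "core_size_nm": "core_size_nm", "endpoint": "endpoint",
          "viability_pct": "value"},
    "B": {"nm_name": "material", "nm_class": "nanogroup",
          "size": "core_size_nm", "assay_endpoint": "endpoint",
          "value": "value"},
}


def harmonize(records, source_id):
    """Rename each record's columns to the shared schema and tag its source."""
    mapping = COLUMN_MAPS[source_id]
    rows = []
    for record in records:
        row = {shared: record[local] for local, shared in mapping.items()}
        row["source"] = source_id
        rows.append(row)
    return rows


def merge_for_endpoint(sources, endpoint):
    """Merge all sources, keeping only rows that report the given endpoint."""
    merged = []
    for source_id, records in sources.items():
        merged.extend(
            row for row in harmonize(records, source_id)
            if row["endpoint"] == endpoint
        )
    return merged


def split_by_nanogroup(rows):
    """Partition merged rows so group-specific models can be fitted separately."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["nanogroup"]].append(row)
    return dict(groups)


merged = merge_for_endpoint({"A": dataset_a, "B": dataset_b}, "cell viability")
by_group = split_by_nanogroup(merged)
```

Here `merged` would feed a universal model, while each entry of `by_group` would feed a nanogroup-specific one. Keeping the `source` tag on every row is a design choice that makes it possible to hold out an entire dataset later.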
Universal regression and classification QSARs reached an R of 0.86 and an accuracy of 0.92, respectively, on the test set. Nanogroup-specific regression models reached an R of 0.88 on the nanotubes test set, followed by metal oxide (0.78). Nanogroup-specific classification models reached an accuracy of 0.99 on the nanotubes test set, followed by metal oxide (0.91). Feature importance revealed different patterns depending on the dataset, with common influential features including core size, exposure conditions and toxicological assay. Even when the available experimental knowledge was merged, the models still failed to correctly predict the outputs of an unseen dataset, revealing the cumbersome conundrum of scientific reproducibility in realistic applications of QSAR for nanosafety. To harness the full potential of computational tools and ensure their long-term application, embracing FAIR data practices is imperative for the development of responsible QSAR models.
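The gap between test-set performance and performance on an unseen dataset can be probed with a leave-one-dataset-out evaluation, sketched below. This is an illustrative assumption rather than the protocol reported in the study: the 50% viability threshold, the class labels, the toy rows and the majority-class baseline (standing in for a trained QSAR classifier) are all hypothetical.

```python
from collections import Counter

# Toy merged rows: each carries the source dataset it came from (made-up values).
rows = [
    {"source": "A", "viability_pct": 30.0},
    {"source": "A", "viability_pct": 40.0},
    {"source": "A", "viability_pct": 80.0},
    {"source": "B", "viability_pct": 90.0},
    {"source": "B", "viability_pct": 85.0},
    {"source": "C", "viability_pct": 20.0},
    {"source": "C", "viability_pct": 60.0},
]


def to_class(viability_pct, threshold=50.0):
    """Binarize numerical viability; the threshold is an assumed cut-off."""
    return "toxic" if viability_pct < threshold else "non-toxic"


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def leave_one_dataset_out(all_rows):
    """Yield (held_out_source, train_rows, test_rows) for every source dataset."""
    for held_out in sorted({row["source"] for row in all_rows}):
        train = [r for r in all_rows if r["source"] != held_out]
        test = [r for r in all_rows if r["source"] == held_out]
        yield held_out, train, test


results = {}
for held_out, train, test in leave_one_dataset_out(rows):
    train_labels = [to_class(r["viability_pct"]) for r in train]
    test_labels = [to_class(r["viability_pct"]) for r in test]
    # Baseline "model": always predict the most frequent training class.
    majority = Counter(train_labels).most_common(1)[0][0]
    predictions = [majority] * len(test_labels)
    results[held_out] = accuracy(test_labels, predictions)
```

Unlike a random train/test split, this scheme never lets rows from the held-out dataset influence training, so the scores in `results` reflect how a model transfers to data from a genuinely new source.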
This study reveals that the reproducible digitalization of nanosafety knowledge still has a long way to go before it can be implemented successfully in practice. The workflow carried out in the study is a promising approach to increasing FAIRness across all elements of a computational study, from dataset annotation, selection and merging to FAIR model reporting. This has significant implications for future research, as it provides an example of how to use and report the different tools available in the nanosafety knowledge system while increasing the transparency of the results. One of the main benefits of this workflow is that, by making data and metadata FAIR compliant, it promotes data sharing and reuse, which are essential for advancing scientific knowledge. In addition, the increased transparency and reproducibility of the results can enhance the trustworthiness of the computational findings.