
Practical feature filter strategy to machine learning for small datasets in chemistry.

Authors

Hu Yang, Sandt Roland, Spatschek Robert

Affiliations

Institute of Energy Materials and Devices IMD-1, Forschungszentrum Jülich GmbH, 52428, Jülich, Germany.

Georesources and Materials Engineering, RWTH Aachen University, 52062, Aachen, Germany.

Publication information

Sci Rep. 2024 Sep 3;14(1):20449. doi: 10.1038/s41598-024-71342-1.

DOI:10.1038/s41598-024-71342-1
PMID:39242744
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11379859/
Abstract

Many potential use cases for machine learning in chemistry and materials science suffer from small dataset sizes, which demands special care for the model design in order to deliver reliable predictions. Hence, feature selection as the key determinant for dataset design is essential here. We propose a practical and efficient feature filter strategy to determine the best input feature candidates. We illustrate this strategy for the prediction of adsorption energies based on a public dataset and sublimation enthalpies using an in-house training dataset. The input of adsorption energies reduces the feature space from 12 dimensions to two and still delivers accurate results. For the sublimation enthalpies, three input configurations are filtered from 14 possible configurations with different dimensions for further productive predictions as being most relevant by using our feature filter strategy. The best extreme gradient boosting regression model possesses a good performance and is evaluated from statistical and theoretical perspectives, reaching a level of accuracy comparable to density functional theory computations and allowing for physical interpretations of the predictions. Overall, the results indicate that the feature filter strategy can help interdisciplinary scientists without rich professional AI knowledge and limited computational resources to establish a reliable small training dataset first, which may make the final machine learning model training easier and more accurate, avoiding time-consuming hyperparameter explorations and improper feature selection.
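The abstract does not state which criterion the feature filter uses, so the following is only a minimal sketch of the general idea of filter-style feature selection, not the authors' procedure. It assumes a generic Pearson-correlation filter: rank candidate features by their absolute correlation with the target, then greedily keep the top-ranked ones while skipping features that are nearly collinear with one already kept. The function name `filter_features`, the parameters `n_keep` and `redundancy_threshold`, and the synthetic 12-feature dataset are all hypothetical and chosen only to mirror the "12 dimensions to two" setting mentioned above.

```python
import numpy as np


def filter_features(X, y, n_keep=2, redundancy_threshold=0.9):
    """Greedy correlation filter (illustrative stand-in, not the paper's method).

    Ranks features by |Pearson r| with the target, then keeps the top-ranked
    ones while skipping any feature whose |r| with an already-kept feature
    exceeds redundancy_threshold.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]

    # Relevance: absolute Pearson correlation of each feature with the target.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    # Redundancy: absolute pairwise correlation between features.
    redundancy = np.abs(np.corrcoef(X, rowvar=False))

    kept = []
    for j in np.argsort(-relevance):  # most relevant feature first
        if all(redundancy[j, k] <= redundancy_threshold for k in kept):
            kept.append(int(j))
        if len(kept) == n_keep:
            break
    return kept


# Synthetic data (assumption): 12 candidate features, of which only
# features 0 and 3 drive the target; feature 1 nearly duplicates feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.05 * rng.normal(size=200)

selected = filter_features(X, y, n_keep=2)
print(sorted(selected))
```

The reduced matrix `X[:, selected]` could then be passed to any regressor, for example a gradient-boosting model of the kind the abstract mentions. A filter like this runs once, before any model is trained, which is why it is cheap compared with wrapper methods that retrain a model for every candidate feature subset.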


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a598/11379859/619c8994d0b7/41598_2024_71342_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a598/11379859/5e3133ce7452/41598_2024_71342_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a598/11379859/f5ff6400332b/41598_2024_71342_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a598/11379859/ac425d031e19/41598_2024_71342_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a598/11379859/0f21f8705ff0/41598_2024_71342_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a598/11379859/79084530a1c6/41598_2024_71342_Fig6_HTML.jpg

