Automatic feature selection and weighting in molecular systems using Differentiable Information Imbalance.

Authors

Wild Romina, Wodaczek Felix, Del Tatto Vittorio, Cheng Bingqing, Laio Alessandro

Affiliations

International School for Advanced Studies (SISSA), Trieste, Italy.

The Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria.

Publication information

Nat Commun. 2025 Jan 2;16(1):270. doi: 10.1038/s41467-024-55449-7.

DOI: 10.1038/s41467-024-55449-7
PMID: 39747013
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11696465/
Abstract

Feature selection is essential in the analysis of molecular systems and many other fields, but several uncertainties remain: What is the optimal number of features for a simplified, interpretable model that retains essential information? How should features with different units be aligned, and how should their relative importance be weighted? Here, we introduce the Differentiable Information Imbalance (DII), an automated method to rank information content between sets of features. Using distances in a ground truth feature space, DII identifies a low-dimensional subset of features that best preserves these relationships. Each feature is scaled by a weight, which is optimized by minimizing the DII through gradient descent. This allows simultaneously performing unit alignment and relative importance scaling, while preserving interpretability. DII can also produce sparse solutions and determine the optimal size of the reduced feature space. We demonstrate the usefulness of this approach on two benchmark molecular problems: (1) identifying collective variables that describe conformations of a biomolecule, and (2) selecting features for training a machine-learning force field. These results show the potential of DII in addressing feature selection challenges and optimizing dimensionality in various applications. The method is available in the Python library DADApy.
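The weighting idea in the abstract can be made concrete with a small NumPy sketch. The abstract does not give the DII formula, so the form used here is an assumption: DII = (2/N²) Σ_ij c_ij r_ij, where r_ij is the rank of point j among the neighbours of point i in the ground truth space and c_ij is a softmax over negative squared distances in the weighted candidate space, with a smoothing parameter `lam`. The fixed value of `lam`, the toy data, and all function names are hypothetical. A finite-difference gradient with step-size halving stands in for the analytic gradient, and none of this is the DADApy API.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def distance_ranks(X):
    """ranks[i, j] = rank of j among the neighbours of i (1 = nearest, 0 = i itself)."""
    d2 = pairwise_sq_dists(X)
    np.fill_diagonal(d2, -1.0)  # force each point to sort first in its own row
    order = np.argsort(d2, axis=1)
    n = X.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks

def dii(weights, X_in, ranks_target, lam):
    """Assumed DII form: softmax-weighted average of target ranks, scaled by 2/N."""
    d2 = pairwise_sq_dists(X_in * weights)  # each feature scaled by its weight
    np.fill_diagonal(d2, np.inf)            # a point is not its own neighbour
    logits = -d2 / lam
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    c = np.exp(logits)
    c /= c.sum(axis=1, keepdims=True)
    n = X_in.shape[0]
    return 2.0 / n ** 2 * np.sum(c * ranks_target)

def finite_difference_grad(f, w, eps=1e-4):
    """Central-difference gradient (a stand-in for an analytic gradient)."""
    grad = np.zeros_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = eps
        grad[k] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

def minimize_dii(w0, X_in, ranks_target, lam, n_steps=50, lr=0.5):
    """Gradient descent that accepts a step only if the DII decreases."""
    w = np.asarray(w0, dtype=float).copy()
    best = dii(w, X_in, ranks_target, lam)
    objective = lambda v: dii(v, X_in, ranks_target, lam)
    for _ in range(n_steps):
        trial = w - lr * finite_difference_grad(objective, w)
        value = objective(trial)
        if value < best:
            w, best = trial, value
        else:
            lr *= 0.5  # step too large: shrink it and retry
    return w, best

# Toy data (hypothetical): one informative feature plus two noise features.
rng = np.random.default_rng(0)
signal = rng.normal(size=(100, 1))
noise = rng.normal(size=(100, 2))
X_candidates = np.hstack([signal, noise])
ranks_truth = distance_ranks(signal)  # ground truth space = the signal alone
lam = 0.01                            # assumed fixed smoothing parameter

w_init = np.full(3, 0.1)
dii_init = dii(w_init, X_candidates, ranks_truth, lam)
w_opt, dii_opt = minimize_dii(w_init, X_candidates, ranks_truth, lam)
print("initial DII:", dii_init)
print("optimized weights:", w_opt, "DII:", dii_opt)
```

Under this assumed form, a DII near 0 means the weighted features reproduce the ground truth neighbourhoods, while a DII near 1 means they carry no information about them (all-zero weights give exactly 1). Weights that shrink towards zero during optimization mark features that could be dropped.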

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/138a/11696465/6e88f1bef2f0/41467_2024_55449_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/138a/11696465/e1ab9d1781f2/41467_2024_55449_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/138a/11696465/80d477b0303b/41467_2024_55449_Fig3_HTML.jpg
