Efficient and sparse feature selection for biomedical text classification via the elastic net: Application to ICU risk stratification from nursing notes.

Authors

Marafino Ben J, Boscardin W John, Dudley R Adams

Affiliations

Philip R. Lee Institute for Health Policy Studies, School of Medicine, University of California, San Francisco, United States; Center for Healthcare Value, University of California, San Francisco, United States.

Department of Epidemiology and Biostatistics, University of California, San Francisco, United States; Department of Medicine, University of California, San Francisco, United States.

Publication information

J Biomed Inform. 2015 Apr;54:114-20. doi: 10.1016/j.jbi.2015.02.003. Epub 2015 Feb 17.

Abstract

BACKGROUND AND SIGNIFICANCE

Sparsity is often a desirable property of statistical models, and various feature selection methods exist to yield sparser, more interpretable models. However, the application of these methods to biomedical text classification, particularly to mortality risk stratification among intensive care unit (ICU) patients, has not been thoroughly studied.

OBJECTIVE

To develop and characterize sparse classifiers based on the free text of nursing notes in order to predict ICU mortality risk and to discover text features most strongly associated with mortality.

METHODS

We selected nursing notes from the first 24 h of ICU admission for 25,826 adult ICU patients from the MIMIC-II database. We then developed a pair of stochastic gradient descent-based classifiers with elastic-net regularization, one using the log loss and one using the hinge loss. We also studied the performance-sparsity tradeoffs of both classifiers as their regularization parameters were varied.
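The abstract does not give implementation details, so the following is only a minimal sketch of the general technique named above: a linear classifier trained by stochastic gradient descent under an elastic-net penalty, with either the log loss or the hinge loss. The toy notes, the binary bag-of-words features, the proximal (soft-thresholding) treatment of the L1 term, the scaling of the objective, and all hyperparameter values are assumptions made for illustration, not the authors' setup.

```python
import math
import random

# Hypothetical toy notes and labels (+1 = died, -1 = survived); NOT MIMIC-II data.
NOTES = [
    "pt intubated pressors started family meeting",
    "pt unresponsive comfort care family at bedside",
    "comfort care pressors weaned family meeting",
    "pt alert oriented tolerating diet",
    "pt ambulating alert pain controlled",
    "tolerating diet ambulating pain controlled",
]
LABELS = [1, 1, 1, -1, -1, -1]


def build_vocab(docs):
    """Map each distinct lowercase token to a feature index."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def featurize(doc, vocab):
    """Binary bag-of-words: sorted indices of the known tokens present in doc."""
    return sorted({vocab[t] for t in doc.lower().split() if t in vocab})


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t, clipping at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def train_elastic_net_sgd(X, y, n_features, loss="log", alpha=0.05,
                          l1_ratio=0.5, eta=0.1, epochs=50, seed=0):
    """SGD on mean loss + alpha * (l1_ratio * |w|_1 + (1 - l1_ratio) / 2 * |w|_2^2).

    Assumed convention: l1_ratio = 0 is pure L2, l1_ratio = 1 is pure L1.
    The bias term b is left unpenalized.
    """
    rng = random.Random(seed)
    w = [0.0] * n_features
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(w[j] for j in X[i]) + b)
            if loss == "log":
                # derivative of log(1 + exp(-y * z)) with respect to z
                g = -y[i] / (1.0 + math.exp(min(margin, 50.0)))
            else:
                # subgradient of the hinge loss max(0, 1 - y * z)
                g = -y[i] if margin < 1.0 else 0.0
            # L2 part: multiplicative shrinkage of every weight
            shrink = 1.0 - eta * alpha * (1.0 - l1_ratio)
            w = [wj * shrink for wj in w]
            # loss part: only features present in this note are updated
            for j in X[i]:
                w[j] -= eta * g
            b -= eta * g
            # L1 part: soft-thresholding produces exact zeros (sparsity)
            t = eta * alpha * l1_ratio
            w = [soft_threshold(wj, t) for wj in w]
    return w, b


def count_nonzero(w):
    return sum(1 for wj in w if wj != 0.0)


vocab = build_vocab(NOTES)
X = [featurize(doc, vocab) for doc in NOTES]

for loss in ("log", "hinge"):
    for l1_ratio in (0.0, 0.5, 1.0):
        w, b = train_elastic_net_sgd(X, LABELS, len(vocab),
                                     loss=loss, l1_ratio=l1_ratio)
        print(loss, l1_ratio, "nonzero weights:", count_nonzero(w), "of", len(vocab))
```

In this sketch, moving l1_ratio toward 1 tends to leave fewer nonzero weights, which is the kind of performance-sparsity tradeoff the study examines; the exact counts depend entirely on the assumed toy data and hyperparameters.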

RESULTS

The best-performing classifier achieved a 10-fold cross-validated AUC of 0.897 under the log loss function and full L2 regularization, while full L1 regularization used just 0.00025% of candidate input features and resulted in an AUC of 0.889. The log loss (AUC range 0.889-0.897) yielded better performance than the hinge loss (AUC range 0.850-0.876), but the hinge loss yielded even sparser models.
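For readers unfamiliar with the metric, the AUC reported above can be read as the probability that a randomly chosen patient who died receives a higher risk score than a randomly chosen survivor. A minimal sketch of that computation and of a 10-fold split follows; the function names, the toy scores, and the shuffled round-robin fold assignment are illustrative assumptions, and the abstract does not say how the authors aggregated AUC across folds.

```python
import random


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, anything else for the negative class.
    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint held-out folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Toy example with made-up scores (not results from the study).
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.1]
print("toy AUC:", roc_auc(toy_labels, toy_scores))

folds = kfold_indices(25, k=10)
print("fold sizes:", [len(f) for f in folds])
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based formulation would be the usual choice for a cohort of tens of thousands of patients.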

DISCUSSION

Most features selected by both classifiers appear clinically relevant and correspond to predictors already present in existing ICU mortality models. The sparser classifiers were also able to discover a number of informative, albeit nonclinical, features.

CONCLUSION

The elastic-net-regularized classifiers perform reasonably well and are capable of reducing the number of features required by over a thousandfold, with only a modest impact on performance.
