School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.
J Am Med Inform Assoc. 2014 Feb;21(e1):e136-42. doi: 10.1136/amiajnl-2013-001792. Epub 2013 Sep 27.
Electronic health records contain critical predictive information for machine-learning-based diagnostic aids. However, many traditional machine learning methods cannot readily integrate textual data into the prediction process because text is high-dimensional. In this paper, we present a supervised method using Laplacian Eigenmaps that enables existing machine learning methods to estimate, at the same time, both low-dimensional representations of textual data and accurate predictors based on those representations.
We present a supervised Laplacian Eigenmap method to enhance predictive models by embedding textual predictors into a low-dimensional latent space, which preserves the local similarities among textual data in high-dimensional space. The proposed implementation performs alternating optimization using gradient descent. For the evaluation, we applied our method to over 2000 patient records from a large single-center pediatric cardiology practice to predict if patients were diagnosed with cardiac disease. In our experiments, we consider relatively short textual descriptions because of data availability. We compared our method with latent semantic indexing, latent Dirichlet allocation, and local Fisher discriminant analysis. The results were assessed using four metrics: the area under the receiver operating characteristic curve (AUC), Matthews correlation coefficient (MCC), specificity, and sensitivity.
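To make the idea of "preserving local similarities" concrete, the following is a minimal sketch of the locality-preserving cost that standard (unsupervised) Laplacian Eigenmaps minimizes. It is not the authors' supervised implementation: the scale constraint of the original eigenproblem, the coupling with the predictor, and the alternating gradient-descent updates are all omitted. The toy term-count vectors, the function names, and the parameters `k` and `t` are assumptions made for illustration only.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_heat_weights(X, k=1, t=1.0):
    """Symmetric k-nearest-neighbor graph with heat-kernel edge weights.

    W[i][j] = exp(-||x_i - x_j||^2 / t) if i is among the k nearest
    neighbors of j or vice versa, and 0 otherwise.
    """
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the k nearest neighbors of point i, excluding i itself.
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sq_dist(X[i], X[j]),
        )[:k]
        for j in neighbors:
            w = math.exp(-sq_dist(X[i], X[j]) / t)
            W[i][j] = W[j][i] = w
    return W


def locality_cost(W, Y):
    """0.5 * sum_ij W_ij * ||y_i - y_j||^2 for a candidate embedding Y.

    A low value means points that are neighbors in the original space
    stay close together in the low-dimensional space.
    """
    n = len(Y)
    return 0.5 * sum(
        W[i][j] * sq_dist(Y[i], Y[j]) for i in range(n) for j in range(n)
    )


# Hypothetical term counts for four short texts over a four-word vocabulary.
X = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
W = knn_heat_weights(X, k=1, t=1.0)

# Two candidate one-dimensional embeddings of the same four texts.
keeps_neighbors_close = [[0.0], [0.1], [1.0], [1.1]]
splits_neighbors = [[0.0], [1.0], [0.1], [1.1]]

cost_close = locality_cost(W, keeps_neighbors_close)
cost_split = locality_cost(W, splits_neighbors)
print(cost_close < cost_split)
```

In this toy example, the embedding that places similar texts next to each other has the lower cost. A supervised variant would, in addition, have to trade this cost off against the prediction loss.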
The results indicate that supervised Laplacian Eigenmaps was the highest performing method in our study, achieving 0.782 and 0.374 for AUC and MCC, respectively. Supervised Laplacian Eigenmaps showed an increase of 8.16% in AUC and 20.6% in MCC over the baseline that excluded textual data and a 2.69% and 5.35% increase in AUC and MCC, respectively, over unsupervised Laplacian Eigenmaps.
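For reference, the four evaluation metrics can be computed as in the sketch below, which uses the standard definitions of sensitivity, specificity, MCC, and the rank-based (Mann-Whitney) form of AUC. The confusion counts and scores are hypothetical values chosen for illustration and are not data from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and MCC from binary confusion counts.

    Assumes at least one actual positive and one actual negative,
    so the sensitivity and specificity denominators are non-zero.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical confusion counts and classifier scores.
m = binary_metrics(tp=30, fp=20, tn=40, fn=10)
a = auc(scores=[0.9, 0.8, 0.4, 0.3], labels=[1, 0, 1, 0])
print(m, a)
```

Unlike sensitivity and specificity taken individually, MCC combines all four confusion counts into a single value, which makes it informative when the two classes are imbalanced.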
As a solution, we present a supervised Laplacian Eigenmap method to embed textual predictors into a low-dimensional Euclidean space. This method allows many existing machine learning predictors to effectively and efficiently capture the potential of textual predictors, especially those based on short texts.