IEEE Trans Nanobioscience. 2018 Jul;17(3):165-171. doi: 10.1109/TNB.2018.2841053. Epub 2018 May 28.
Data mapping plays an important role in data integration and exchange among institutions and organizations that use different data standards. However, traditional rule-based approaches and machine learning methods have not achieved satisfactory results on the data mapping problem. In this paper, we propose a novel deep learning framework for data mapping called the mixture feature embedding convolutional neural network (MfeCNN). The MfeCNN model recasts the data mapping task as a multiclass classification problem. The model incorporates multimodal learning and multiview embedding into a CNN for mixture feature tensor generation and classification prediction. Multimodal features were extracted from various linguistic spaces with a medical natural language processing package, and feature embeddings were then learned by the CNN. A softmax prediction layer based on multiview embedding classified as many as 10 classes simultaneously. On unbalanced data, MfeCNN achieved the best results (average F1 score, 82.4%), outperforming traditional state-of-the-art machine learning models and a CNN without mixture feature embedding. Our model also outperformed a very deep CNN with 29 layers that took free text as input. Combining mixture feature embedding with a deep neural network can achieve high accuracy for data mapping and multiclass classification.
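The pipeline described above (several feature views embedded, combined into one tensor, passed through a CNN, and classified by a softmax layer) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the view names (`word`, `pos`, `semantic`), all sizes, the single convolution layer, the max-over-time pooling, and the random untrained weights are assumptions chosen only to show the data flow. Only the 10-class softmax output is taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and view names; none of these come from the paper,
# except N_CLASSES = 10, which matches the abstract.
SEQ_LEN, N_CLASSES = 12, 10
VOCAB = {"word": 50, "pos": 8, "semantic": 6}    # one vocabulary per feature view
EMB_DIM = {"word": 16, "pos": 4, "semantic": 4}  # embedding width per view
KERNEL, N_FILTERS = 3, 32

# One (untrained, random) embedding table per view.
tables = {v: rng.normal(0, 0.1, (VOCAB[v], EMB_DIM[v])) for v in VOCAB}
d_total = sum(EMB_DIM.values())
conv_w = rng.normal(0, 0.1, (KERNEL, d_total, N_FILTERS))
conv_b = np.zeros(N_FILTERS)
out_w = rng.normal(0, 0.1, (N_FILTERS, N_CLASSES))
out_b = np.zeros(N_CLASSES)


def softmax(z):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def forward(ids):
    """ids maps each view name to an int array of length SEQ_LEN."""
    # Embed each view and concatenate along the feature axis:
    # the result plays the role of a "mixture feature" tensor.
    mixed = np.concatenate([tables[v][ids[v]] for v in VOCAB], axis=1)
    # Valid 1D convolution over token positions, followed by ReLU.
    windows = np.stack(
        [mixed[i:i + KERNEL] for i in range(SEQ_LEN - KERNEL + 1)]
    )  # shape: (n_windows, KERNEL, d_total)
    feats = np.maximum(np.einsum("wkd,kdf->wf", windows, conv_w) + conv_b, 0.0)
    # Max-over-time pooling, then a softmax layer over the classes.
    pooled = feats.max(axis=0)
    return softmax(pooled @ out_w + out_b)


# Random ids stand in for the output of a feature extractor.
example = {v: rng.integers(0, VOCAB[v], SEQ_LEN) for v in VOCAB}
probs = forward(example)
predicted_class = int(probs.argmax())
```

Here `probs` is a length-10 probability vector and `predicted_class` is the index of the most likely target class. Concatenating per-view embeddings is only one way to combine views; the abstract does not specify how MfeCNN builds its mixture feature tensor.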