Luo Yuan, Uzuner Ozlem
Massachusetts Institute of Technology.
Massachusetts Institute of Technology; State University of New York at Albany.
AMIA Jt Summits Transl Sci Proc. 2014 Apr 7;2014:67-75. eCollection 2014.
The UMLS Semantic Network is constructed by experts and requires periodic expert review to stay up to date. We propose and implement a semi-supervised approach for automatically identifying UMLS semantic relations from narrative text in PubMed. Our method analyzes biomedical narrative text to collect semantic entity pairs, and extracts multiple semantic, syntactic and orthographic features for the collected pairs. We experiment with seeded k-means clustering under various distance metrics. We create and annotate a ground truth corpus according to the top two levels of the UMLS semantic relation hierarchy. We evaluate our system on this corpus and characterize the learning curves of different clustering configurations. KL divergence consistently performs best as the distance metric on the held-out test data. With full seeding, we obtain macro-averaged F-measures above 70% for clustering the top-level UMLS relations (2-way), and above 50% for clustering the second-level relations (7-way).
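The clustering and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each entity pair has already been turned into a probability distribution over features, that seeds only initialize the centroids (seeded points may later be reassigned), and that centroids are updated as arithmetic means. The function names `kl_divergence`, `seeded_kmeans`, and `macro_f1`, and the toy data, are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; values are clipped to avoid log(0)."""
    return sum(max(pi, eps) * math.log(max(pi, eps) / max(qi, eps))
               for pi, qi in zip(p, q))

def mean_vector(rows):
    """Component-wise arithmetic mean of a non-empty list of equal-length vectors."""
    if not rows:
        raise ValueError("cannot average an empty set of vectors")
    return [sum(col) / len(rows) for col in zip(*rows)]

def seeded_kmeans(X, seeds, k, n_iter=50):
    """Seeded k-means sketch. `seeds` maps row index -> label in range(k).

    Assumption: every label has at least one seed; seeds are used only to
    initialize centroids and are not held fixed afterwards.
    """
    centroids = [mean_vector([X[i] for i, lab in seeds.items() if lab == c])
                 for c in range(k)]
    assign = [-1] * len(X)
    for _ in range(n_iter):
        # Assign each point to the centroid with the smallest KL(point || centroid).
        new_assign = [min(range(k), key=lambda c: kl_divergence(x, centroids[c]))
                      for x in X]
        if new_assign == assign:
            break  # converged: no assignment changed
        assign = new_assign
        for c in range(k):
            members = [x for x, a in zip(X, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return assign, centroids

def macro_f1(y_true, y_pred, k):
    """Unweighted mean of the per-class F-measures."""
    scores = []
    for c in range(k):
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / k

# Toy 2-way example with made-up feature distributions (not data from the paper).
X = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
     [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
gold = [0, 0, 0, 1, 1, 1]
assign, _ = seeded_kmeans(X, seeds={0: 0, 3: 1}, k=2)
print(assign, macro_f1(gold, assign, k=2))
```

Because the seeds fix which centroid corresponds to which relation label, cluster indices can be compared directly with gold labels when computing the macro-averaged F-measure. Swapping `kl_divergence` for another function with the same two-argument form is one way to compare distance metrics in this sketch.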