Marciniak Małgorzata, Mykowiecka Agnieszka
Institute of Computer Science PAS, Jana Kazimierza 5, 01-248 Warsaw, Poland.
J Biomed Semantics. 2014 May 31;5:24. doi: 10.1186/2041-1480-5-24. eCollection 2014.
Hospital documents contain free text describing the most important facts relating to patients and their illnesses. These documents are written in a specific language containing medical terminology related to hospital treatment. Processing them automatically can help in verifying the consistency of hospital documentation and in obtaining statistical data. To perform this task, we need information on the phrases we are looking for. At the moment, Polish clinical resources are sparse. The existing terminologies, such as the Polish Medical Subject Headings (MeSH), do not provide sufficient coverage for clinical tasks. It would therefore be helpful to prepare automatically, on the basis of a data sample, an initial set of terms which, after manual verification, could be used for information extraction.
Using a combination of linguistic and statistical methods to process over 1200 children's hospital discharge records, we obtained a list of single and multiword terms used in hospital discharge documents written in Polish. The phrases are ordered according to their presumed importance in domain texts, measured by the frequency of use of a phrase and the variety of its contexts. The evaluation showed that the automatically identified phrases cover about 84% of terms in domain texts. Among the 400 terms at the top of the ranked list, only 4% were incorrect, while among the final 200, 20% of expressions were either not domain related or syntactically incorrect. We also observed that 70% of the obtained terms are not included in the Polish MeSH.
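The ordering described above, by frequency of use and variety of contexts, can be illustrated with a minimal sketch. The scoring formula here (frequency multiplied by the logarithm of the number of distinct contexts) is an assumption chosen for illustration and is not the ranking procedure evaluated in this work; the function names `rank_phrases` and `share_missing` and the toy data are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def rank_phrases(occurrences):
    """Rank candidate phrases by frequency and context variety.

    occurrences: iterable of (left_word, phrase, right_word) tuples,
    one per occurrence of a candidate phrase in the corpus.
    The score is an illustrative assumption, not the paper's measure:
    score = frequency * log2(1 + number of distinct contexts).
    """
    freq = Counter()
    contexts = defaultdict(set)
    for left, phrase, right in occurrences:
        freq[phrase] += 1
        contexts[phrase].add((left, right))
    scores = {p: freq[p] * math.log2(1 + len(contexts[p])) for p in freq}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))


def share_missing(terms, reference):
    """Fraction of extracted terms absent from a reference terminology."""
    reference = set(reference)
    if not terms:
        return 0.0
    missing = [t for t in terms if t not in reference]
    return len(missing) / len(terms)


if __name__ == "__main__":
    # Made-up toy occurrences, not data from the study.
    toy = [
        ("with", "blood glucose", "level"),
        ("high", "blood glucose", "was"),
        ("the", "insulin pump", "was"),
        ("the", "insulin pump", "was"),
        ("a", "next visit", "in"),
    ]
    ranked = rank_phrases(toy)
    print(ranked)
    print(share_missing(ranked, ["insulin pump"]))
```

With equal frequencies, a phrase seen in more distinct contexts ranks higher under this score, which mirrors the intuition that a term used in varied surroundings is more likely to be an independent domain term. The `share_missing` helper corresponds to the kind of comparison behind the figure for terms not found in the Polish MeSH, assuming exact string matching against the reference list.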
Automatic terminology extraction can give results of a quality high enough to serve as a starting point for building domain-related terminological dictionaries or ontologies. This approach can be useful for preparing terminological resources for very specific subdomains for which no relevant terminologies yet exist. The evaluation showed that none of the tested ranking procedures was able to filter out all improperly constructed noun phrases from the top of the list. Careful choice of noun phrases is crucial to the usefulness of the created terminological resource in applications such as lexicon construction or acquisition of semantic relations from texts.