Kastrin A, Peterlin B, Hristovski D
Institute of Medical Genetics, University Medical Centre Ljubljana, Ljubljana, Slovenia.
Methods Inf Med. 2010;49(4):371-8. doi: 10.3414/ME09-01-0009. Epub 2010 Jan 20.
Text categorization has been used in biomedical informatics to identify documents containing relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood that MEDLINE citations contain genetically relevant topics.
Our procedure requires the construction of a genetic and a nongenetic domain document corpus. We used the MeSH descriptors assigned to MEDLINE citations for this categorization task. We compared the frequencies of MeSH descriptors between the two corpora using the chi-square test. A MeSH descriptor was considered a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all citations, with the highest scores given to those citations containing MeSH descriptors typical of the genetic domain.
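The scoring step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how per-descriptor statistics are combined into a citation score, so the sketch assumes a signed sum of Pearson chi-square values (without continuity correction) over a citation's descriptors. The function names `chi_square_2x2`, `descriptor_weights`, and `score_citation` are hypothetical.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Map each MeSH descriptor to a signed chi-square weight.

    Each corpus is a list of citations; each citation is a list of
    descriptor strings. The sign is positive when the descriptor is
    relatively more frequent in the genetic corpus, negative otherwise.
    """
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    if n_gen == 0 or n_non == 0:
        raise ValueError("both corpora must contain at least one citation")
    # Count each descriptor at most once per citation.
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a = gen_counts[desc]  # genetic citations with the descriptor
        c = non_counts[desc]  # nongenetic citations with the descriptor
        chi2 = chi_square_2x2(a, n_gen - a, c, n_non - c)
        sign = 1.0 if a / n_gen > c / n_non else -1.0
        weights[desc] = sign * chi2
    return weights


def score_citation(descriptors, weights):
    """Assumed aggregation: sum of signed weights; unseen descriptors add 0."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))
```

Ranking citations by `score_citation` in descending order then places those with descriptors typical of the genetic corpus at the top of the list.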
Validation was performed on a set of 734 manually annotated MEDLINE citations. The method achieved a predictive accuracy of 0.87, with a recall of 0.69 and a precision of 0.64. We evaluated the method by comparing it to three machine-learning algorithms (support vector machines, decision trees, naïve Bayes). Although the differences were not statistically significant, the results showed that our chi-square scoring performs as well as the compared machine-learning algorithms.
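For reference, the three reported measures can be computed from scored citations and manual annotations as in the sketch below. It assumes that a citation is predicted to be genetic when its score reaches a cutoff; the abstract does not state how the cutoff was chosen, so `threshold` and the function name `evaluate` are hypothetical.

```python
def evaluate(scores, labels, threshold):
    """Return (accuracy, precision, recall) for thresholded scores.

    scores: citation scores; labels: True for manually annotated
    genetic citations; threshold: assumed decision cutoff.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    tp = fp = tn = fn = 0
    for score, is_genetic in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_genetic:
            tp += 1
        elif predicted and not is_genetic:
            fp += 1
        elif not predicted and is_genetic:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(scores)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```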
We suggest that chi-square scoring is an effective solution for categorizing MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for the gene symbol disambiguation process.