The La Jolla Institute for Allergy and Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA.
BMC Bioinformatics. 2011 Dec 19;12:482. doi: 10.1186/1471-2105-12-482.
The Immune Epitope Database (IEDB) project manually curates information from published journal articles that describe immune epitopes derived from a wide variety of organisms and associated with different diseases. In the past, abstracts of scientific articles were retrieved by broad keyword queries of PubMed, and were classified as relevant (curatable) or irrelevant (not curatable) to the scope of the database by a Naïve Bayes classifier. The curatable abstracts were subsequently manually classified into categories corresponding to different disease domains. Over the past four years, we have examined how to further improve this approach in order to enhance classification performance and to reduce the need for manual intervention.
Using 89,884 abstracts classified by a domain expert as curatable or uncuratable, we found that an SVM classifier outperformed the previously used Naïve Bayes classifier for curatability predictions (AUC of 0.899 versus 0.854). Next, we trained SVM classifiers on 22,833 curatable abstracts that had been manually classified into three levels of disease-specific categories, and demonstrated that applying the classifiers hierarchically outperformed applying them non-hierarchically for categorization. Finally, to optimize the hierarchical SVM classifiers' error profile for the curation process, we developed cost sensitivity functions to avoid serious misclassifications. We tested our design on a benchmark dataset of 1,388 references and achieved an overall category prediction accuracy of 94.4%, 93.9%, and 82.1% at the three levels of categorization, respectively.
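The AUC figures quoted above compare how well each classifier ranks curatable abstracts ahead of uncuratable ones. As a minimal sketch of that metric (not the authors' evaluation code), the following computes AUC from classifier scores using the rank-sum identity. The function name `auc`, the scores, and the labels are all invented for illustration and do not come from the IEDB data.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Equals the probability that a randomly chosen positive example is scored
    above a randomly chosen negative example, counting ties as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (invented for illustration): 1 = curatable, 0 = uncuratable.
labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]  # hypothetical classifier A
scores_b = [0.9, 0.3, 0.4, 0.5, 0.2, 0.6]  # hypothetical classifier B

auc_a = auc(scores_a, labels)  # 8 of 9 positive/negative pairs ranked correctly
auc_b = auc(scores_b, labels)  # 5 of 9 positive/negative pairs ranked correctly
print(f"A: {auc_a:.3f}  B: {auc_b:.3f}")
```

Because AUC depends only on the ordering of scores, it allows classifiers with differently scaled outputs, such as SVM decision values and Naïve Bayes probabilities, to be compared on the same footing.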
A hierarchical application of SVM algorithms with cost-sensitive output weighting enabled high-quality reference classification with few serious misclassifications. This enabled us to significantly reduce the manual component of abstract categorization. Our findings are relevant to other databases that are developing their own document classifier schema, and the datasets we make available provide large-scale, real-life benchmark sets for method developers.
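The abstract does not specify the form of the cost sensitivity functions, so the following is only a sketch of one common way to weight classifier outputs by cost: choose the category with the lowest expected misclassification cost, where errors near the root of the hierarchy cost more than errors at a leaf. The category names, probabilities, cost values, and function names are hypothetical and are not the authors' method or data.

```python
def divergence_cost(predicted, actual, level_costs=(4.0, 2.0, 1.0)):
    """Cost of predicting `predicted` when `actual` is the true category.

    Both arguments are tuples of category names, one per hierarchy level.
    The cost is set by the first level at which they differ, so errors near
    the root are penalised most. The cost values are illustrative assumptions.
    """
    for level, (p, a) in enumerate(zip(predicted, actual)):
        if p != a:
            return level_costs[level]
    return 0.0


def min_expected_cost(probabilities):
    """Return the category path with the lowest expected misclassification cost."""
    def expected_cost(candidate):
        return sum(prob * divergence_cost(candidate, actual)
                   for actual, prob in probabilities.items())
    return min(probabilities, key=expected_cost)


# Hypothetical classifier output for one abstract (probabilities sum to 1).
probabilities = {
    ("Allergy", "Food", "Peanut"): 0.40,
    ("Infectious", "Viral", "Influenza"): 0.35,
    ("Infectious", "Viral", "HIV"): 0.25,
}

most_probable = max(probabilities, key=probabilities.get)
chosen = min_expected_cost(probabilities)
print("most probable:", most_probable)
print("lowest expected cost:", chosen)
```

In this toy case the single most probable category sits in a branch that holds only 40% of the probability mass, so the cost-weighted rule prefers a category in the other branch: if that choice is wrong, the error is more likely to be a minor one at the lowest level than a serious one at the top level.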