Olsen Lars, Johan Kudahl Ulrich, Winther Ole, Brusic Vladimir
BMC Genomics. 2013;14 Suppl 5(Suppl 5):S14. doi: 10.1186/1471-2164-14-S5-S14. Epub 2013 Oct 16.
As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature.
We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded a classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature.
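The classification step can be illustrated with a minimal sketch of a k-NN text classifier evaluated by five-fold cross-validation. This is not the authors' implementation: the abstract does not state the feature representation, distance measure, or value of k, so the bag-of-words counts, cosine similarity, k = 3, the function names, and the ten toy "abstracts" below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed feature representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def cross_validate(texts, labels, folds=5, k=3):
    """Mean accuracy over `folds` held-out partitions (needs len(texts) >= folds)."""
    data = [(vectorize(t), y) for t, y in zip(texts, labels)]
    accuracies = []
    for f in range(folds):
        test = data[f::folds]  # every folds-th item is held out
        train = [item for i, item in enumerate(data) if i % folds != f]
        correct = sum(knn_predict(train, vec, k) == y for vec, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds


# Hypothetical toy data standing in for labeled PubMed abstracts.
relevant = [
    "tumor antigen epitope recognized by cytotoxic T cells",
    "melanoma tumor antigen epitope presented on HLA",
    "novel T cell epitope from tumor antigen",
    "tumor antigen peptide epitope identified in carcinoma",
    "HLA restricted epitope of a tumor antigen",
]
irrelevant = [
    "soil bacteria diversity measured across farmland",
    "farmland soil bacteria respond to drought",
    "bacteria communities within soil samples",
    "drought alters soil bacteria metabolism",
    "soil bacteria sequencing survey",
]
# Interleave the classes so each held-out fold contains one of each label.
texts = [t for pair in zip(relevant, irrelevant) for t in pair]
labels = ["relevant", "irrelevant"] * 5

accuracy = cross_validate(texts, labels)
print(accuracy)
```

The toy classes use disjoint vocabularies, so they separate trivially; the score printed here says nothing about performance on real abstracts, where preprocessing, term weighting, and the choice of k would all need tuning.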
We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.