Center for Molecular Imaging Research, Massachusetts General Hospital, Harvard Medical School, Charlestown, MA, USA.
BMC Bioinformatics. 2009 Oct 3;10:317. doi: 10.1186/1471-2105-10-317.
The breadth of biological databases and their information content continue to increase exponentially. Unfortunately, our ability to query such sources often remains suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries, and to retrieve them efficiently.
Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature results into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best. No other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly.
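To illustrate how a bagged ensemble can yield a class probability estimate, the following is a minimal sketch and not the classifier used in the study. It assumes abstracts are reduced to keyword sets, replaces full decision trees with one-keyword decision stumps, and takes the fraction of ensemble members voting "related" as the confidence. All function names and the toy abstracts are hypothetical.

```python
import random
from collections import Counter


def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def train_stump(samples):
    """Fit a one-keyword decision stump (a stand-in for a full decision tree).

    `samples` is a list of (keyword_set, is_related) pairs.
    """
    overall = majority([label for _, label in samples], True)
    best_word, best_rule, best_correct = None, (overall, overall), -1
    vocabulary = sorted({w for words, _ in samples for w in words})
    for word in vocabulary:
        present = majority([l for ws, l in samples if word in ws], overall)
        absent = majority([l for ws, l in samples if word not in ws], overall)
        correct = sum((present if word in ws else absent) == l
                      for ws, l in samples)
        if correct > best_correct:
            best_word, best_rule, best_correct = word, (present, absent), correct
    return best_word, best_rule


def predict_stump(stump, words):
    word, (present, absent) = stump
    return present if word in words else absent


def train_bagged(samples, n_models=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(samples) for _ in samples])
            for _ in range(n_models)]


def positive_probability(ensemble, words):
    """Fraction of ensemble members that vote 'related' for this abstract."""
    return sum(predict_stump(s, words) for s in ensemble) / len(ensemble)


# Made-up toy abstracts for the hypothetical category "cancer-related".
TRAINING = [
    ({"tumor", "peptide", "targeting"}, True),
    ({"tumor", "imaging", "probe"}, True),
    ({"carcinoma", "tumor", "binding"}, True),
    ({"enzyme", "peptide", "assay"}, False),
    ({"bacterial", "binding", "assay"}, False),
    ({"enzyme", "kinetics", "probe"}, False),
]

if __name__ == "__main__":
    ensemble = train_bagged(TRAINING)
    print(positive_probability(ensemble, {"tumor", "peptide"}))
```

A value near 0 or 1 indicates that the resampled models agree, while a value near 0.5 indicates disagreement; a number of this kind is what a heat map of classification confidence could display.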
Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at http://pepbank.mgh.harvard.edu.
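One way to picture a database that "drives" vote aggregation is a trigger that updates tallies and queues a reclassification whenever a vote row is inserted. The sketch below uses SQLite from Python purely for illustration; the abstract does not specify the database system or schema behind PepBank, so every table, column, and function name here is an assumption.

```python
import sqlite3

# Hypothetical schema: none of these names come from PepBank itself.
SCHEMA = """
CREATE TABLE entry_label (
    entry_id   INTEGER NOT NULL,
    category   TEXT    NOT NULL,
    up_votes   INTEGER NOT NULL DEFAULT 0,
    down_votes INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (entry_id, category)
);
CREATE TABLE vote (
    entry_id INTEGER NOT NULL,
    category TEXT    NOT NULL,
    agrees   INTEGER NOT NULL CHECK (agrees IN (0, 1))
);
CREATE TABLE reclassify_queue (
    category TEXT PRIMARY KEY
);
-- The database, not the application, aggregates each vote and
-- records that the affected category needs reclassification.
CREATE TRIGGER aggregate_vote AFTER INSERT ON vote
BEGIN
    INSERT OR IGNORE INTO entry_label (entry_id, category)
        VALUES (NEW.entry_id, NEW.category);
    UPDATE entry_label
       SET up_votes   = up_votes + NEW.agrees,
           down_votes = down_votes + (1 - NEW.agrees)
     WHERE entry_id = NEW.entry_id AND category = NEW.category;
    INSERT OR IGNORE INTO reclassify_queue (category)
        VALUES (NEW.category);
END;
"""


def open_db():
    """Create an in-memory database with the illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def cast_vote(conn, entry_id, category, agrees):
    """The application only inserts a vote; the trigger does the rest."""
    conn.execute("INSERT INTO vote VALUES (?, ?, ?)",
                 (entry_id, category, int(agrees)))
    conn.commit()


def tally(conn, entry_id, category):
    """Return (up_votes, down_votes) for one entry and category."""
    return conn.execute(
        "SELECT up_votes, down_votes FROM entry_label "
        "WHERE entry_id = ? AND category = ?",
        (entry_id, category),
    ).fetchone()


def pending_categories(conn):
    """Categories whose classifier should be retrained."""
    rows = conn.execute("SELECT category FROM reclassify_queue ORDER BY category")
    return [row[0] for row in rows]
```

Because the queue holds at most one row per category, many concurrent votes on the same category collapse into a single pending reclassification, which is one plausible reading of how such a design conserves computation.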