Karp Peter D
Bioinformatics Research Group, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA. Tel: 650-859-4358; Fax: 650-859-3735; E-mail:
Database (Oxford). 2016 Dec 26;2016. doi: 10.1093/database/baw150. Print 2016.
Can programs for automated or semi-automated information extraction from scientific texts serve as practical alternatives to professional curation? I show that the error rates of current information extraction programs are too high to replace professional curation today. Furthermore, current information extraction programs extract single narrow slivers of information, such as individual protein interactions; they cannot extract the broad range of information that professional curators extract for databases such as EcoCyc. Nor can they arbitrate among conflicting statements in the literature, as curators can. Therefore, funding agencies should not hobble the curation efforts of existing databases on the assumption that a problem that has stymied Artificial Intelligence researchers for more than 60 years will be solved tomorrow. Semi-automated extraction techniques appear to have significantly more potential, based on a review of recent tools that enhance curator productivity, but a full cost-benefit analysis for these tools is lacking. Without such an analysis, it is possible to expend significant effort developing information-extraction tools that automate small parts of the overall curation workflow without achieving a significant decrease in curation costs.