De Sa Christopher, Ratner Alex, Ré Christopher, Shin Jaeho, Wang Feiran, Wu Sen, Zhang Ce
Stanford University.
SIGMOD Rec. 2016 Mar;45(1):60-67. Epub 2016 Feb 6.
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources, including emails, webpages, and PDF reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are central tools for attacking classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.