Olivier Binette, Rebecca C. Steorts
Department of Statistical Science, Duke University, Durham, NC, USA.
Department of Statistical Science, Computer Science, Biostatistics and Bioinformatics, the Rhodes Information Initiative at Duke (iiD) and the Social Science Research Institute (SSRI), Duke University, Durham, NC, USA.
Sci Adv. 2022 Mar 25;8(12):eabi8021. doi: 10.1126/sciadv.abi8021.
Whether the goal is to estimate the number of people who live in a congressional district, to estimate the number of individuals who have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications share a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, a task commonly known as structured entity resolution (record linkage or deduplication). Here, we review motivating applications and seminal papers that have led to the growth of this area. We review modern probabilistic and Bayesian methods from statistics, computer science, machine learning, database management, economics, political science, and other disciplines that are used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Last, we discuss current research topics of practical importance.