Department of Physics, The University of Houston, Houston, Texas, USA.
IMT Institute for Advanced Studies, Lucca, Italy.
Sci Data. 2017 May 16;4:170064. doi: 10.1038/sdata.2017.64.
Patent data represent a significant source of information on innovation, knowledge production, and the evolution of technology through networks of citations, co-invention, and co-assignment. A major obstacle to extracting useful information from these data is the problem of name disambiguation: linking alternate spellings of individuals or institutions to a single identifier to uniquely determine the parties involved in knowledge production and diffusion. In this paper, we describe a new algorithm that uses high-resolution geolocation to disambiguate both inventors and assignees on about 8.5 million patents found in the European Patent Office (EPO), under the Patent Cooperation Treaty (PCT), and in the US Patent and Trademark Office (USPTO). We show that this disambiguation is consistent with a number of ground-truth benchmarks for both assignees and inventors, significantly outperforming the use of undisambiguated names to identify unique entities. A significant benefit of this work is high-quality assignee disambiguation with worldwide coverage, coupled with an inventor disambiguation that is competitive with other state-of-the-art approaches, in multiple patent offices.
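The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of combining name similarity with geolocation. It is not the authors' method. The function names (`normalize`, `haversine_km`, `disambiguate`), the record layout `(name, latitude, longitude)`, and the thresholds `min_sim` and `max_km` are all hypothetical choices made for illustration.

```python
import math
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and drop punctuation (illustrative normalization only)."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def disambiguate(records, min_sim=0.85, max_km=50.0):
    """Assign a cluster id to each (name, lat, lon) record.

    Two records are linked when their normalized names are similar AND their
    locations are close; links are merged transitively with union-find.
    Both thresholds are assumed values, not taken from the paper.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison; fine for a toy example, not for millions of records.
    for i in range(len(records)):
        name_i, lat_i, lon_i = records[i]
        for j in range(i + 1, len(records)):
            name_j, lat_j, lon_j = records[j]
            sim = SequenceMatcher(None, normalize(name_i), normalize(name_j)).ratio()
            if sim >= min_sim and haversine_km(lat_i, lon_i, lat_j, lon_j) <= max_km:
                parent[find(j)] = find(i)

    return [find(i) for i in range(len(records))]
```

In this sketch, two spellings of the same name recorded a few kilometres apart share a cluster id, while an identical name recorded on another continent stays separate. That is the intuition behind using location to split common names. A pipeline at the scale of millions of patents would need blocking or indexing rather than an all-pairs comparison.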