Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA.
Computational Epidemiology Lab, Boston Children's Hospital, Boston, MA, USA.
Sci Rep. 2024 Oct 16;14(1):24306. doi: 10.1038/s41598-024-73318-7.
Author affiliations are essential in bibliometric studies, which must extract relevant information from free-text affiliation strings. Precisely determining an author's location from their affiliation is crucial for understanding research networks, collaborations, and geographic distribution. Existing geoparsing tools based on regular expressions struggle with unstructured and ambiguous affiliations and produce erroneous locations, especially for unconventional variations or misspellings. They also handle large datasets inefficiently, which hampers large-scale bibliometric studies. Machine learning-based geoparsers exist, but they depend on explicit location information and therefore struggle when detailed geographic data is absent. To address these issues, we developed and evaluated a natural language processing (NLP) model that predicts the city, state, and country from an author's free-text affiliation. Our model automates location inference, overcoming the drawbacks of existing methods. Trained and tested on MapAffil, a publicly available geoparsed dataset of PubMed affiliations up to 2018, our model accurately retrieves high-resolution locations even when the affiliation does not explicitly mention a city, state, or country. Leveraging NLP techniques and the LinearSVC algorithm, our machine learning model achieves superior accuracy across several validation datasets. This research demonstrates a practical application of text classification to inferring specific geographic locations from free-text affiliations, helping researchers and institutions analyze the distribution of research output.
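The approach described above, text classification of affiliation strings with LinearSVC, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the abstract does not state how features are extracted, so the character n-gram TF-IDF step, the `C` value, the combined `city|state|country` label, the helper `predict_location`, and the six toy training rows are all assumptions made for the example. The real model is trained on MapAffil, which is not used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training rows for illustration only (NOT drawn from MapAffil).
train_texts = [
    "Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA.",
    "Computational Epidemiology Lab, Boston Children's Hospital, Boston, MA, USA.",
    "Department of Physics, University of Oxford, Oxford, UK.",
    "Nuffield Department of Medicine, University of Oxford, UK.",
    "Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.",
    "Institute of Medical Science, The University of Tokyo, Japan.",
]
# Assumption: city, state, and country are packed into one class label.
train_labels = [
    "Boston|MA|USA",
    "Boston|MA|USA",
    "Oxford|England|UK",
    "Oxford|England|UK",
    "Tokyo|Tokyo|Japan",
    "Tokyo|Tokyo|Japan",
]

# Assumption: character n-grams are used so that misspellings and
# unconventional variants still share features with known affiliations.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True),
    LinearSVC(C=10.0, random_state=0),
)
model.fit(train_texts, train_labels)


def predict_location(affiliation: str) -> dict:
    """Predict a location for one free-text affiliation (hypothetical helper)."""
    city, state, country = model.predict([affiliation])[0].split("|")
    return {"city": city, "state": state, "country": country}


# The query names an institution but no city, state, or country.
print(predict_location("Harvard Medical School"))
```

With only six rows the prediction is not meaningful; the point is the shape of the pipeline. Because the classifier learns from institution names as well as place names, it can in principle assign a location to an affiliation that never mentions one, which is the behaviour the abstract reports on the full dataset.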