Location inference for hidden population with online text analysis.

Affiliations

College of Systems Engineering, National University of Defense Technology, Changsha, 410073, China.

School of Software Engineering, Shenzhen Institute of Information Technology, Shenzhen, 518172, China.

Publication information

Int J Health Geogr. 2020 Dec 9;19(1):57. doi: 10.1186/s12942-020-00245-x.

Abstract

BACKGROUND

Understanding the geographic distribution of hidden populations, such as men who have sex with men (MSM), sex workers, or injecting drug users, is of great importance for the adequate deployment of intervention strategies and for public health decision making. However, because these populations are hard to access (for example, there is no sampling frame, the topics are sensitive, and reports are error-prone), traditional survey methods are of limited use when studying them. Using data extracted from a very active online community of MSM in China, in this study we adopt and develop location inference methods to map the users of this community at high resolution at the national level.

METHODS

We collect a comprehensive dataset from the largest sub-community related to MSM topics in Baidu Tieba, covering 628,360 MSM-related users. Based on users' publicly available posts, we evaluate and compare the performance of mainstream location inference algorithms on the problem of locating the Chinese MSM population online. To improve inference accuracy, we introduce other natural language processing approaches, such as context analysis and pattern recognition, into the location extraction. In addition, we develop a hybrid voting algorithm (HVA-LI) in which different approaches vote to determine the best inference result, with the aim of making location inference for hidden populations more effective.
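The voting idea can be illustrated with a minimal Python sketch. The abstract does not specify the voting rule used in HVA-LI, so the majority vote and the tie-break by method order below are assumptions, and the names `vote_location` and `LocationMethod` are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# A location method maps a user's posts to a location label,
# or returns None when it cannot infer anything (it abstains).
LocationMethod = Callable[[Sequence[str]], Optional[str]]


def vote_location(posts: Sequence[str],
                  methods: Sequence[LocationMethod]) -> Optional[str]:
    """Return the label most methods agree on.

    Assumed rule (not taken from the paper): simple majority vote,
    with ties broken in favour of the earliest-listed method.
    """
    guesses = [method(posts) for method in methods]
    votes = Counter(g for g in guesses if g is not None)
    if not votes:
        return None  # every method abstained
    top = max(votes.values())
    # Walk the guesses in method order so the tie-break is deterministic.
    for guess in guesses:
        if guess is not None and votes[guess] == top:
            return guess
    return None
```

Listing the methods in order of trust lets the most reliable one decide when the votes are split evenly; a weighted vote would be another reasonable choice.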

RESULTS

Comparing the performance of popular inference algorithms, we find that the classic gazetteer-based algorithm achieves better results than the others. Among the HVA-LI variants, the hybrid of the simple gazetteer-based method and named entity recognition (NER) proves best at inferring users' locations disclosed in short texts in online communities, improving inference accuracy from 50.3% to 71.3% on the MSM-related dataset.
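As a rough illustration of what a gazetteer-based method does, the Python sketch below counts mentions of known place names across a user's posts and returns the most frequent one. The three-entry gazetteer, the function name `gazetteer_location`, and the plain substring matching are all assumptions made for the example, not the authors' implementation; the NER component is not shown.

```python
from collections import Counter
from typing import Dict, Optional, Sequence

# Hypothetical toy gazetteer: place name as written in posts -> canonical label.
# A real gazetteer would cover place names nationwide.
GAZETTEER: Dict[str, str] = {
    "长沙": "Changsha",
    "深圳": "Shenzhen",
    "北京": "Beijing",
}


def gazetteer_location(posts: Sequence[str],
                       gazetteer: Dict[str, str] = GAZETTEER) -> Optional[str]:
    """Return the most frequently mentioned gazetteer place, or None."""
    counts: Counter = Counter()
    for post in posts:
        for name, label in gazetteer.items():
            hits = post.count(name)  # plain substring match
            if hits:
                counts[label] += hits
    if not counts:
        return None  # no known place name appears in any post
    return counts.most_common(1)[0][0]
```

Substring matching is easy to apply to short texts but cannot tell a place of residence from a place merely mentioned, which is one reason to combine it with other methods.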

CONCLUSIONS

In this study, we have explored the feasibility of location inference by analyzing textual content posted by online users. We propose a more effective hybrid algorithm, the Gazetteer & NER algorithm, which helps overcome the sparse location labeling problem in user profiles and can be extended to the inference of geo-statistics for other hidden populations.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/da37/7724834/17f143e8cc7f/12942_2020_245_Fig1_HTML.jpg
