Department of Statistics, Pennsylvania State University, University Park, PA, United States of America.
Information and Computer Science Department, King Fahad University of Petroleum and Minerals, Dhahran, Saudi Arabia.
PLoS One. 2022 Jun 30;17(6):e0270034. doi: 10.1371/journal.pone.0270034. eCollection 2022.
The HIV prevention and treatment needs of female sex workers remain poorly understood in many parts of the world. Systematic reviews of the existing literature can help fill this gap; however, well-conducted systematic reviews are time-consuming and labor-intensive. Here, we propose an automatic document classification approach to systematic review that significantly reduces the effort of reviewing documents and helps optimize empirical decision making. We first describe a manual document classification procedure used to curate a pertinent training dataset, and then propose three classifiers: a keyword-guided method, a cluster analysis-based method, and a random forest approach that uses a large set of feature tokens. The approach is used to identify documents studying female sex workers that contain content relevant to either HIV or experienced violence. We compared the performance of the three classifiers by cross-validation, in terms of the area under the receiver operating characteristic curve and precision-recall plots, and found that the random forest approach reduced the amount of manual reading for our example by 80%. In a sensitivity analysis, we found that even when trained with only 10% of the data, the classifier could still avoid reading 75% of future documents (68% of the total) while retaining 80% of the relevant documents. In sum, the automated document classification procedure presented here could improve both the precision and efficiency of systematic reviews and facilitate live reviews, in which reviews are updated regularly. We expect that a reasonable classifier can be obtained by taking 20% of the retrieved documents as training samples. The proposed classifier could also be used to assemble literature more meaningfully in other research areas and for rapid document screening on a tight schedule, such as COVID-related work during the crisis.
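As a rough illustration of two ideas in the abstract, the sketch below scores documents with a keyword-guided rule, evaluates the ranking by the area under the ROC curve, and reproduces the workload arithmetic (10% used for training, 75% of the remainder skipped). It is a minimal sketch under stated assumptions, not the authors' implementation: the keyword list, the toy documents and labels, and the function names `keyword_score`, `roc_auc`, and `total_reading_avoided` are all hypothetical.

```python
from typing import Sequence


def keyword_score(text: str, keywords: Sequence[str]) -> int:
    """Count keyword occurrences in the lower-cased text (plain substring match)."""
    lowered = text.lower()
    return sum(lowered.count(k.lower()) for k in keywords)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a relevant document outscores an
    irrelevant one, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both relevant and irrelevant documents are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def total_reading_avoided(train_fraction: float, skipped_of_rest: float) -> float:
    """Share of ALL documents never read manually, given that the training
    fraction must be read and a share of the remainder is skipped."""
    return (1.0 - train_fraction) * skipped_of_rest


# Hypothetical keyword list and toy corpus (label 1 = relevant), for illustration only.
KEYWORDS = ["hiv", "violence"]
DOCS = [
    ("HIV prevalence among female sex workers", 1),
    ("Violence and HIV risk in sex work settings", 1),
    ("HIV vaccine trial in infants", 0),
    ("Crop yield under drought", 0),
    ("Soil nutrients in maize fields", 0),
]

scores = [keyword_score(text, KEYWORDS) for text, _ in DOCS]
labels = [label for _, label in DOCS]
auc = roc_auc(scores, labels)
avoided = total_reading_avoided(0.10, 0.75)

print(f"keyword scores: {scores}")
print(f"AUC: {auc:.3f}")
print(f"share of all documents not read: {avoided:.1%}")
```

The third toy document shows why a pure keyword rule is imperfect: it mentions HIV but does not concern female sex workers, so it ties with a relevant document and lowers the AUC. The last line reflects the abstract's arithmetic, where skipping 75% of the remaining 90% of documents corresponds to about 68% of the total.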