Knisely Benjamin M, Pavliscsak Holly H
Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD 21702 USA.
Scientometrics. 2023;128(5):3197-3224. doi: 10.1007/s11192-023-04689-3. Epub 2023 Apr 8.
Funding institutions often solicit text-based research proposals to evaluate potential recipients. Leveraging the information contained in these documents could help institutions understand the supply of research within their domain. In this work, an end-to-end methodology for semi-supervised document clustering is introduced to partially automate classification of research proposals based on thematic areas of interest. The methodology consists of three stages: (1) manual annotation of a document sample; (2) semi-supervised clustering of documents; (3) evaluation of cluster results using quantitative metrics and qualitative expert ratings (coherence, relevance, distinctiveness). The methodology is described in detail to encourage replication and is demonstrated on a real-world data set. This demonstration sought to categorize proposals submitted to the US Army Telemedicine and Advanced Technology Research Center (TATRC) related to technological innovations in military medicine. A comparative analysis of method features was performed, including unsupervised vs. semi-supervised clustering, several document vectorization techniques, and several cluster result selection strategies. Outcomes suggest that pretrained Bidirectional Encoder Representations from Transformers (BERT) embeddings were better suited for the task than older text embedding techniques. When comparing expert ratings between algorithms, semi-supervised clustering produced coherence ratings approximately 25% better on average than standard unsupervised clustering, with negligible differences in cluster distinctiveness. Finally, it was shown that a cluster result selection strategy that balances internal and external validity produced ideal results. With further refinement, this methodological framework shows promise as a useful analytical tool for institutions to unlock hidden insights from untapped archives and similar administrative document repositories.
The online version contains supplementary material available at 10.1007/s11192-023-04689-3.
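The clustering and result-selection stages described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the abstract does not name the clustering algorithm, so a seeded, constrained k-means is swapped in; the toy 2-D vectors stand in for BERT document embeddings; the mean silhouette coefficient stands in for internal validity; agreement with held-out annotations stands in for external validity; and the equal weighting in `balanced_score` is arbitrary. All function and variable names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def seeded_kmeans(vectors, seeds, k, max_iter=50):
    """Seeded, constrained k-means (an assumed stand-in algorithm).

    seeds maps document index -> cluster id for annotated documents.
    Assumes every cluster id in range(k) has at least one seed.
    """
    # Initialise each centre from the annotated documents of that theme.
    centers = [centroid([vectors[i] for i, c in seeds.items() if c == cid])
               for cid in range(k)]
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: dist(v, centers[c]))
                      for v in vectors]
        for i, c in seeds.items():  # annotated documents keep their label
            new_assign[i] = c
        if new_assign == assign:
            break
        assign = new_assign
        centers = [centroid([v for v, a in zip(vectors, assign) if a == cid])
                   for cid in range(k)]
    return assign

def silhouette(vectors, assign):
    """Mean silhouette coefficient (internal validity), in [-1, 1]."""
    ids = sorted(set(assign))
    if len(ids) < 2:
        return 0.0
    total = 0.0
    for i, v in enumerate(vectors):
        mean_dist = {}
        for cid in ids:
            others = [dist(v, w)
                      for j, (w, a) in enumerate(zip(vectors, assign))
                      if a == cid and j != i]
            if others:
                mean_dist[cid] = sum(others) / len(others)
        if assign[i] not in mean_dist:
            continue  # singleton cluster contributes 0
        a_i = mean_dist[assign[i]]
        b_i = min(d for cid, d in mean_dist.items() if cid != assign[i])
        if max(a_i, b_i) > 0:
            total += (b_i - a_i) / max(a_i, b_i)
    return total / len(vectors)

def heldout_agreement(assign, heldout):
    """Fraction of held-out annotated documents placed in their labelled cluster."""
    hits = sum(1 for i, label in heldout.items() if assign[i] == label)
    return hits / len(heldout)

def balanced_score(vectors, assign, heldout, weight=0.5):
    """Weighted mix of internal validity (rescaled to [0, 1]) and external validity."""
    internal = (silhouette(vectors, assign) + 1) / 2
    external = heldout_agreement(assign, heldout)
    return weight * internal + (1 - weight) * external

# Toy data: nine "documents" in three well-separated themes.
vectors = [
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1],   # theme 0
    [5.0, 5.1], [5.1, 4.9], [4.9, 5.0],   # theme 1
    [0.0, 5.0], [0.1, 5.2], [0.2, 4.9],   # theme 2
]
seeds = {0: 0, 3: 1, 6: 2}     # annotations used to guide clustering
heldout = {1: 0, 4: 1, 7: 2}   # annotations reserved for evaluation

assign = seeded_kmeans(vectors, seeds, k=3)

# Select among candidate results by the balanced score.
candidates = {"seeded": assign, "arbitrary": [0, 1, 2, 0, 1, 2, 0, 1, 2]}
best = max(candidates,
           key=lambda name: balanced_score(vectors, candidates[name], heldout))
print(best, assign)
```

Keeping the evaluation annotations separate from the seeds matters in this sketch: because seeded documents are pinned to their labelled cluster, scoring on them would always report perfect agreement.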