Liang Yilun, Zhang Gongbo, Sun Edward, Idnay Betina, Fang Yilu, Chen Fangyi, Ta Casey, Peng Yifan, Weng Chunhua
Department of Biomedical Informatics, Columbia University, New York, NY, USA.
Tandon School of Engineering, New York University, Brooklyn, NY, USA.
ArXiv. 2025 Aug 19:arXiv:2508.15834v1.
Research profiles highlight scientists' research focus, enabling talent discovery and fostering collaborations, but they are often outdated. Automated, scalable methods are urgently needed to keep these profiles current.
In this study, we design and evaluate two Large Language Model (LLM)-based methods to generate scientific interest profiles—one summarizing researchers' PubMed abstracts and the other generating a summary from their publications' Medical Subject Headings (MeSH) terms—and compare these machine-generated profiles with researchers' self-summarized interests. We collected the titles, MeSH terms, and abstracts of PubMed publications for 595 faculty members affiliated with Columbia University Irving Medical Center (CUIMC), for 167 of whom we also obtained human-written online research profiles. GPT-4o-mini, a state-of-the-art LLM, was then prompted to summarize each researcher's interests. Both manual and automated evaluations were conducted to characterize the similarities and differences between the machine-generated and self-written research profiles.
The similarity study showed low ROUGE-L, BLEU, and METEOR scores, reflecting little lexical overlap between machine-generated and self-written profiles. BERTScore analysis revealed moderate semantic similarity between machine-generated and reference summaries (F1: 0.542 for MeSH-based, 0.555 for abstract-based) despite the low lexical overlap. In a validation step, paraphrased versions of the human-written summaries achieved a higher F1 of 0.851 against the originals; this comparison highlights the limitations of such lexical metrics. Kullback-Leibler (KL) divergence of term frequency-inverse document frequency (TF-IDF) distributions (8.56 and 8.58 for profiles derived from MeSH terms and abstracts, respectively) suggests that machine-generated summaries employ different keywords than human-written summaries. Manual review further showed that 77.78% of ratings judged the overall impression of MeSH-based profiling as "good" or "excellent," with readability rated favorably in 93.44% of cases, though granularity and factual accuracy varied. Overall, panel reviewers preferred the MeSH-based machine-generated profile over the abstract-based one in 67.86% of cases.
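To make the automated metrics concrete, the sketch below computes two of them from scratch: ROUGE-L F1 via longest common subsequence, and KL divergence between smoothed TF-IDF distributions over a shared vocabulary. This is an illustrative reimplementation under simple assumptions (whitespace tokenization, sklearn-style smoothed IDF, additive smoothing before KL), not the authors' evaluation code, which presumably used standard metric libraries.

```python
import math
from collections import Counter

def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    # ROUGE-L: F1 of LCS-based precision and recall over word tokens.
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

def tfidf(docs):
    # Per-document TF-IDF weights with smoothed IDF (sklearn-style).
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: (tf[t] / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
                     for t in tf})
    return vecs

def kl_divergence(p_vec, q_vec, eps=1e-9):
    # KL(P || Q) over the union vocabulary, with additive smoothing
    # so that terms absent from one summary do not yield log(0).
    vocab = set(p_vec) | set(q_vec)
    p = [p_vec.get(t, 0.0) + eps for t in vocab]
    q = [q_vec.get(t, 0.0) + eps for t in vocab]
    zp, zq = sum(p), sum(q)
    return sum((pi / zp) * math.log((pi / zp) / (qi / zq)) for pi, qi in zip(p, q))
```

Because KL divergence grows sharply when one summary's keywords are missing from the other, profiles built from different vocabularies (as reported here) yield large values even when the texts are semantically related, which is why BERTScore was used as a complementary, embedding-based measure.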
LLMs promise to automate scientific interest profiling at scale. Profiles derived from MeSH terms are more readable than profiles derived from abstracts. Overall, machine-generated summaries differ from human-written ones in their choice of concepts, with human-written profiles introducing more novel ideas.