Wang Jing, Deng Huan, Liu Bangtao, Hu Anbin, Liang Jun, Fan Lingye, Zheng Xu, Wang Tong, Lei Jianbo
School of Medical Informatics and Engineering, Southwest Medical University, Luzhou, China.
IT Center, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China.
J Med Internet Res. 2020 Jan 23;22(1):e16816. doi: 10.2196/16816.
Natural language processing (NLP) is a long-established field in computer science, but its application in medical research has faced many challenges. With the extensive digitalization of medical information worldwide and the growing importance of understanding and mining big data in medicine, NLP is becoming increasingly important.
The goal of this research was to perform a systematic review of the use of NLP in medical research in order to understand global progress in terms of research outcomes, content, methods, and the research groups involved.
A systematic review was conducted using the PubMed database as a search platform. All published studies on the application of NLP in medicine (except biomedicine) during the 20 years between 1999 and 2018 were retrieved. The data obtained from these published studies were cleaned and structured. Excel (Microsoft Corp) and VOSviewer (Nees Jan van Eck and Ludo Waltman) were used to perform bibliometric analysis of publication trends, author orders, countries, institutions, collaboration relationships, research hot spots, diseases studied, and research methods.
A total of 3498 articles were obtained during initial screening, and 2336 articles were found to meet the study criteria after manual screening. The number of publications increased every year, with significant growth after 2012 (the number of publications ranged from 148 to a maximum of 302 annually). The United States has held the leading position since the inception of the field and published the largest number of articles, contributing 63.01% (1472/2336) of all publications, followed by France (5.44%, 127/2336) and the United Kingdom (3.51%, 82/2336). The author with the largest number of published articles was Hongfang Liu (70), while Stéphane Meystre (17) and Hua Xu (33) published the largest number of articles as first author and corresponding author, respectively. Among first authors' affiliated institutions, Columbia University published the largest number of articles, accounting for 4.54% (106/2336) of the total. Approximately one-fifth (17.68%, 413/2336) of the articles involved research on specific diseases, and the subject areas primarily focused on mental illness (16.46%, 68/413), breast cancer (5.81%, 24/413), and pneumonia (4.12%, 17/413).
NLP is in a period of robust development in the medical field, with an average of approximately 100 publications annually. Electronic medical records were the most commonly used research material, but social media such as Twitter have become an important source of research material since 2015. Cancer (24.94%, 103/413) was the most common subject area in NLP-assisted medical research on diseases, with breast cancer (23.30%, 24/103) and lung cancer (14.56%, 15/103) accounting for the highest proportions of studies. Columbia University and the researchers trained there were the most active and prolific research force in medical NLP.
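The shares reported above are simple ratios of article counts, so they can be recomputed directly from the numbers in the abstract. The following is a minimal sketch in Python, not the authors' workflow (the study used Excel and VOSviewer); the helper name `share` and the dictionary layout are assumptions made for illustration, while all counts are copied from the text.

```python
# Recompute the percentage shares quoted in the abstract from the raw counts.
# The helper `share` and the `reported` layout are hypothetical; the counts
# and reported percentages are taken verbatim from the abstract.

def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

TOTAL = 2336    # articles meeting the study criteria
DISEASE = 413   # articles on specific diseases
CANCER = 103    # disease-specific articles on cancer

# label -> (numerator, denominator, percentage reported in the abstract)
reported = {
    "United States": (1472, TOTAL, 63.01),
    "France": (127, TOTAL, 5.44),
    "United Kingdom": (82, TOTAL, 3.51),
    "Columbia University (first-author affiliation)": (106, TOTAL, 4.54),
    "Specific diseases": (DISEASE, TOTAL, 17.68),
    "Mental illness": (68, DISEASE, 16.46),
    "Breast cancer (of disease articles)": (24, DISEASE, 5.81),
    "Pneumonia": (17, DISEASE, 4.12),
    "Cancer": (CANCER, DISEASE, 24.94),
    "Breast cancer (of cancer articles)": (24, CANCER, 23.30),
    "Lung cancer (of cancer articles)": (15, CANCER, 14.56),
}

for label, (part, whole, pct) in reported.items():
    print(f"{label}: {share(part, whole):.2f}% (reported {pct:.2f}%)")
```

Note that breast cancer appears with two different denominators: 24 of the 413 disease-specific articles, and 24 of the 103 cancer articles.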