Division of Cell and Developmental Biology, School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK.
Division of Population Health and Genomics, Ninewells Hospital, School of Medicine, University of Dundee, Dundee, DD1 9SY, UK.
Sci Rep. 2021 Aug 3;11(1):15747. doi: 10.1038/s41598-021-94897-9.
Target identification and prioritisation are prominent first steps in modern drug discovery. Traditionally, individual scientists have used their expertise to manually interpret scientific literature and prioritise opportunities. However, increasing publication rates and the wider routine coverage of human genes by omic-scale research make it difficult to maintain meaningful overviews from which to identify promising new trends. Here we propose an automated yet flexible pipeline that identifies trends in the scientific corpus that align with the specific interests of a researcher and facilitates an initial prioritisation of opportunities. In a procedure based on co-citation networks and machine learning, genes and diseases are first parsed from PubMed articles with a novel named entity recognition system, together with publication date and supporting information. Recurrent neural networks are then trained to predict the publication dynamics of all human genes. For a user-defined therapeutic focus, genes generating more publications or citations are identified as high-interest targets. We also use topic detection routines to help understand why a gene is trendy and implement a system to propose the most prominent review articles for a potential target. This TrendyGenes pipeline detects emerging targets and pathways and provides a new way to explore the literature for individual researchers, pharmaceutical companies and funding agencies.
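The idea of flagging genes that generate more publications over time can be illustrated with a minimal sketch. This is not the TrendyGenes implementation: the abstract describes recurrent neural networks trained on publication dynamics, whereas the sketch below substitutes a plain least-squares slope over yearly publication counts. The function names (`yearly_counts`, `trend_slope`, `rank_genes`), the gene labels and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def yearly_counts(records):
    """Count publications per gene per year from (gene, year) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for gene, year in records:
        counts[gene][year] += 1
    return counts


def trend_slope(series, years):
    """Least-squares slope of publication count against year.

    Assumes `years` holds at least two distinct values; years with no
    publications for the gene count as zero.
    """
    n = len(years)
    ys = [series.get(y, 0) for y in years]
    mean_x = sum(years) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, ys))
    den = sum((x - mean_x) ** 2 for x in years)
    return num / den


def rank_genes(records, years):
    """Rank genes by slope, steepest growth first (a simple stand-in
    for the learned publication-dynamics model in the abstract)."""
    counts = yearly_counts(records)
    slopes = {g: trend_slope(c, years) for g, c in counts.items()}
    return sorted(slopes.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy data: one (gene, year) pair per publication mentioning the gene.
records = (
    [("GENE_A", 2018)] * 1 + [("GENE_A", 2019)] * 3 + [("GENE_A", 2020)] * 6
    + [("GENE_B", 2018)] * 5 + [("GENE_B", 2019)] * 5 + [("GENE_B", 2020)] * 5
    + [("GENE_C", 2018)] * 4 + [("GENE_C", 2019)] * 2 + [("GENE_C", 2020)] * 1
)
ranking = rank_genes(records, [2018, 2019, 2020])
for gene, slope in ranking:
    print(f"{gene}: {slope:+.2f} publications/year")
```

In this toy example the gene whose yearly count grows fastest is ranked first, the flat one second and the declining one last. A real pipeline would first need the named entity recognition step to produce the (gene, year) pairs, and would restrict the records to the user-defined therapeutic focus before ranking.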