Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA, USA.
Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, Maastricht, The Netherlands.
Genome Biol. 2020 Nov 9;21(1):273. doi: 10.1186/s13059-020-02181-2.
Thousands of pathway diagrams are published each year as static figures inaccessible to computational queries and analyses. Using a combination of machine learning, optical character recognition, and manual curation, we identified 64,643 pathway figures published between 1995 and 2019 and extracted 1,112,551 instances of human genes, comprising 13,464 unique NCBI genes, participating in a wide variety of biological processes. This collection represents an order of magnitude more genes than found in the text of the same papers, and thousands of genes missing from other pathway databases, thus presenting new opportunities for discovery and research.