Ozyurt Ibrahim Burak, Bandrowski Anita
FDI Lab, University of California, San Diego, 9500 Gilman Drive, M/C 0608, La Jolla, CA 92093-0608, USA.
bioRxiv. 2024 Oct 17:2024.10.15.618379. doi: 10.1101/2024.10.15.618379.
Tables are useful information artifacts that make data "missingness" easy for humans to detect, and several publishers have deployed them to increase the amount of information reported for key resources and reagents, such as antibodies, cell lines, and other tools that constitute the inputs to a study. The STAR*Methods tables, specifically, have increased the "findability" of these key resources, but they have not been commonly available outside of the Cell Press journal family. To make such tables available in the broader biomedical literature, we have attempted to automatically process bioRxiv preprints, either creating tables from text or recognizing tables already created by authors and structuring them for later use by publishers and search systems. Extracting key resource tables from PDF files with best-in-class tools resulted in a Grid Table Similarity (GriTS) score of 0.12. We therefore created several multimodal pipelines that combine machine learning approaches for key resource table page identification, Table Transformer models for table detection and table structure recognition, and a new table-specific language model for handling row over-segmentation. These pipelines improve the extraction of text from tables created by biomedical authors and published on bioRxiv to a GriTS score of around 0.90, enabling the deployment of automated research resource extraction tools on bioRxiv.
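To make the row over-segmentation problem concrete, the following is a minimal sketch of the post-processing step, not the authors' method. The abstract states that a table-specific language model is used for this step; here a simple stand-in heuristic (an assumption) treats any row with an empty first cell as a spilled-over fragment of the row above. The function name `merge_oversegmented_rows`, the `key_col` parameter, and the sample values are all hypothetical.

```python
from typing import List

Row = List[str]


def merge_oversegmented_rows(rows: List[Row], key_col: int = 0) -> List[Row]:
    """Merge continuation rows into the row above them.

    Assumption for this sketch: a row whose key column is empty is a
    fragment produced by over-segmentation of the previous row. The paper
    uses a table-specific language model for this decision; the empty-cell
    rule below is only an illustrative stand-in. Rows are assumed to have
    the same number of cells.
    """
    merged: List[Row] = []
    for row in rows:
        is_continuation = bool(merged) and not row[key_col].strip()
        if is_continuation:
            previous = merged[-1]
            # Join each cell with the matching cell of the previous row.
            merged[-1] = [f"{a} {b}".strip() for a, b in zip(previous, row)]
        else:
            merged.append([cell.strip() for cell in row])
    return merged


# Made-up rows as a structure recognizer might emit them for a
# key resources table: the identifier cell was split across two rows.
example_rows = [
    ["Antibody X", "Vendor A", "Cat# 123;"],
    ["", "", "RRID:AB_0000000"],
    ["Cell line Y", "Vendor B", "Cat# 456"],
]

merged_rows = merge_oversegmented_rows(example_rows)
for merged_row in merged_rows:
    print(merged_row)
```

In this sketch the three detected rows collapse into two logical rows, with the split identifier rejoined in a single cell. A learned model would replace the `is_continuation` test, since real tables also contain legitimately empty first cells.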