Biostatistics, Yale School of Public Health, Yale University, New Haven, CT 06510, United States.
Integrative Genomics, Princeton University, Princeton, NJ 08540, United States.
J Am Med Inform Assoc. 2024 Jun 20;31(7):1463-1470. doi: 10.1093/jamia/ocae097.
ModelDB (https://modeldb.science) is a discovery platform for computational neuroscience, containing over 1850 published model codes with standardized metadata. These codes were supplied mainly through unsolicited submissions from model authors, but this approach is inherently limited. For example, we estimate that we have captured only around one-third of NEURON models, the most common type of model in ModelDB. To characterize the state of computational neuroscience modeling work more completely, we aim to identify works containing results derived from computational neuroscience approaches, together with their associated standardized metadata (eg, cell types, research topics).
Our study included known computational neuroscience work from ModelDB and neuroscience work identified through PubMed queries. After pre-screening with SPECTER2 (a free document embedding method), GPT-3.5 and GPT-4 were used to identify likely computational neuroscience work and relevant metadata.
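The pre-screening step can be illustrated with a minimal sketch. The abstract does not state how the SPECTER2 embeddings were compared, so the approach here (maximum cosine similarity to any known ModelDB paper, with a fixed cutoff) is an assumption, as are the function names `cosine` and `prescreen`, the threshold of 0.8, and the toy three-dimensional vectors, which stand in for real document embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def prescreen(candidates, known, threshold=0.8):
    """Keep candidate IDs whose best similarity to any known embedding meets the threshold.

    candidates: dict mapping a paper ID to its embedding (hypothetical structure)
    known: list of embeddings for papers already known to be computational
    threshold: assumed cutoff, not a value reported in the abstract
    """
    kept = []
    for paper_id, vec in candidates.items():
        best = max(cosine(vec, k) for k in known)
        if best >= threshold:
            kept.append(paper_id)
    return kept


# Toy embeddings standing in for document vectors of known ModelDB papers.
known_models = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]

# Toy embeddings standing in for candidate papers retrieved from PubMed.
candidates = {
    "paper_a": [0.85, 0.15, 0.05],  # close to the known papers
    "paper_b": [0.0, 0.1, 0.95],    # far from the known papers
}

shortlist = prescreen(candidates, known_models)
print(shortlist)
```

In a pipeline like the one described, only the shortlisted papers would then be passed to the more expensive GPT-3.5 or GPT-4 classification step.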
SPECTER2, GPT-4, and GPT-3.5 demonstrated varied but high ability to identify computational neuroscience work. GPT-4 achieved 96.9% accuracy, and GPT-3.5 improved from 54.2% to 85.5% through instruction-tuning and chain-of-thought prompting. GPT-4 also showed high potential in identifying relevant metadata annotations.
Accuracy in identification and extraction might be further improved by resolving ambiguity about what counts as a computational element, by including more information from papers (eg, the Methods section), and by improving prompts.
Natural language processing and large language model techniques can be added to ModelDB to facilitate further model discovery and will contribute to a more standardized and comprehensive framework for establishing domain-specific resources.