CASBERT: BERT-based retrieval for compositely annotated biosimulation model entities.

Author information

Munarko Yuda, Rampadarath Anand, Nickerson David P

Affiliations

Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand.

The New Zealand Institute for Plant & Food Research Ltd., Auckland, New Zealand.

Publication information

Front Bioinform. 2023 Feb 14;3:1107467. doi: 10.3389/fbinf.2023.1107467. eCollection 2023.

Abstract

Maximising the FAIRness of biosimulation models requires a comprehensive description of model entities such as reactions, variables, and components. The COmputational Modeling in BIology NEtwork (COMBINE) community encourages the use of the Resource Description Framework (RDF) with composite annotations that draw on ontologies to ensure completeness and accuracy. These annotations help scientists find models or detailed information to inform further reuse, such as model composition, reproduction, and curation. SPARQL has been recommended as a key standard for accessing semantic annotations in RDF, and it retrieves entities precisely. However, SPARQL is unsuitable for most repository users, who explore biosimulation models freely without adequate knowledge of ontologies, RDF structure, and SPARQL syntax. We propose here a text-based information retrieval approach, CASBERT, that is easy to use and can present candidate relevant entities from models across a repository's contents. CASBERT adapts Bidirectional Encoder Representations from Transformers (BERT): each composite annotation about an entity is converted into an entity embedding and stored in a list of entity embeddings. For entity lookup, a query is transformed into a query embedding and compared to the entity embeddings, and the entities are then displayed in order of similarity. The list structure makes it possible to implement CASBERT as an efficient search engine product, with inexpensive addition, modification, and insertion of entity embeddings. To demonstrate and test CASBERT, we created a test dataset of query-entity pairs from the Physiome Model Repository and a static export of the BioModels database. Measured using Mean Average Precision and Mean Reciprocal Rank, we found that our approach can perform better than the traditional bag-of-words method.
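
The lookup described in the abstract (embed each annotation, embed the query, rank entities by similarity) and the two evaluation metrics can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_embed` is a hypothetical hashed-token stand-in for a BERT encoder, used only so the example runs without external models, and `EntityIndex`, `reciprocal_rank`, and `average_precision` are illustrative names. Cosine similarity is assumed as the comparison measure; the abstract does not specify one.

```python
import hashlib
import math

DIM = 64  # hypothetical embedding size for the toy encoder


def toy_embed(text, dim=DIM):
    """Stand-in for a BERT encoder (assumption): hashed token counts, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    # Inputs are already L2-normalised, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


class EntityIndex:
    """Flat list of entity embeddings; adding an entity is an append."""

    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.ids = []
        self.vectors = []

    def add(self, entity_id, annotation_text):
        self.ids.append(entity_id)
        self.vectors.append(self.embed(annotation_text))

    def search(self, query, top_k=5):
        q = self.embed(query)
        scored = [(cosine(q, v), eid) for eid, v in zip(self.ids, self.vectors)]
        # Stable sort: entities with equal scores keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [eid for _, eid in scored[:top_k]]


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant entity, or 0.0 if none is retrieved."""
    for rank, eid in enumerate(ranked_ids, start=1):
        if eid in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant entity appears."""
    hits, total = 0, 0.0
    for rank, eid in enumerate(ranked_ids, start=1):
        if eid in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


index = EntityIndex()
index.add("e1", "sodium channel current")   # hypothetical annotation texts
index.add("e2", "glucose transport rate")
results = index.search("sodium channel current")
```

Averaging `reciprocal_rank` and `average_precision` over all query-entity pairs gives Mean Reciprocal Rank and Mean Average Precision. Swapping `toy_embed` for a real sentence encoder would leave the rest of the sketch unchanged, since the index only requires a function from text to a normalised vector.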

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1238/9971925/f4d42cb888cb/fbinf-03-1107467-g001.jpg
