Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
Sci Data. 2024 Sep 27;11(1):1032. doi: 10.1038/s41597-024-03841-9.
We present a system that keeps curators in the loop to develop a dataset and a model for detecting residue-level structural features and functional annotations in standard publication text. Our approach integrates data from multiple resources, including PDBe, Europe PMC, PubMed Central, and PubMed, and combines them with annotation guidelines from UniProt, using LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually annotated named entities in ten articles, which we used to train an initial PubMedBERT model from HuggingFace. Using a human-in-the-loop annotation system, we then iteratively refined the model; the best version achieved a precision of 0.90, a recall of 0.92, and an F1-measure of 0.91. The system shows how machine learning techniques and human expertise can be combined to curate a dataset of residue-level functional annotations and protein structure features. The results suggest potential for broader applications in protein research, bridging the gap between advanced machine learning models and the insights of domain experts.
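As a reading aid for the reported metrics, the following is a minimal sketch of how precision, recall, and F1 are commonly computed for named entity spans under exact matching. It is not the authors' evaluation code: the function names, the span tuples, and the labels ("residue", "mutant", "site") are hypothetical and chosen only for illustration. The final lines check that the reported precision and recall are consistent with the reported F1.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)  # spans matching in offsets and label
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical spans for one sentence: (start offset, end offset, label).
gold = {(0, 5, "residue"), (10, 18, "mutant"), (25, 30, "site")}
predicted = {(0, 5, "residue"), (10, 18, "mutant"), (40, 44, "site")}

p, r, f = span_prf(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Consistency check on the values reported in the abstract.
reported_f1 = round(f1_from(0.90, 0.92), 2)
print(reported_f1)
```

Exact matching is only one possible scoring choice; partial-overlap or token-level scoring would give different numbers for the same predictions.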