Upadhyay Utkarsh, Pucci Fabrizio, Herold Julian, Schug Alexander
John von Neumann Institute for Computing, Jülich Supercomputing Centre, 52428 Jülich, Germany.
Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium.
NAR Genom Bioinform. 2025 Mar 18;7(1):lqaf021. doi: 10.1093/nargab/lqaf021. eCollection 2025 Mar.
Computational structure prediction of biomolecules complements wet-lab experiments, which are often laborious. Unlike protein structure prediction, RNA structure prediction remains a significant challenge in bioinformatics, primarily because annotated RNA structure data are scarce and vary in quality. Many methods have used these limited data to train deep learning models, but redundancy, data leakage, and poor data quality hamper their performance. In this work, we present NucleoSeeker, a tool designed to curate high-quality, tailored datasets from the Protein Data Bank (PDB). It is a unified framework that combines multiple tools and streamlines an otherwise complicated data curation process. It offers multiple filters at the structure, sequence, and annotation levels, giving researchers full control over data curation. We also present several use cases. In particular, we demonstrate how NucleoSeeker allows the creation of a nonredundant RNA structure dataset to assess AlphaFold3's performance for RNA structure prediction. This demonstrates NucleoSeeker's effectiveness in curating valuable, nonredundant, tailored datasets both to train novel methods and to evaluate existing ones. NucleoSeeker is easy to use and highly flexible, and it can significantly increase the quality of RNA structure datasets.
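The abstract describes filtering at the structure, sequence, and annotation levels, followed by redundancy removal. The following Python sketch illustrates that general idea only; it is not NucleoSeeker's API or implementation. The `RnaEntry` record, the function names, the default thresholds, and the use of `difflib.SequenceMatcher` as a crude stand-in for alignment-based sequence identity are all assumptions made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass
class RnaEntry:
    """Hypothetical record for one RNA chain (not NucleoSeeker's data model)."""
    pdb_id: str
    sequence: str
    resolution: float        # in angstroms; lower is better
    method: str              # experimental method, e.g. "X-RAY DIFFRACTION"
    family: Optional[str]    # annotation such as an RNA family ID, if any


def passes_structure(entry: RnaEntry, max_resolution: float = 3.5,
                     methods: Tuple[str, ...] = ("X-RAY DIFFRACTION",
                                                 "ELECTRON MICROSCOPY")) -> bool:
    # Structure-level filter: resolution cutoff and accepted methods (assumed values).
    return entry.resolution <= max_resolution and entry.method in methods


def passes_sequence(entry: RnaEntry, min_len: int = 10, max_len: int = 500) -> bool:
    # Sequence-level filter: keep chains within an assumed length range.
    return min_len <= len(entry.sequence) <= max_len


def passes_annotation(entry: RnaEntry) -> bool:
    # Annotation-level filter: require that some family annotation is present.
    return entry.family is not None


def similarity(a: str, b: str) -> float:
    # Crude string similarity; a real pipeline would use an alignment tool.
    return SequenceMatcher(None, a, b).ratio()


def remove_redundancy(entries: List[RnaEntry],
                      max_similarity: float = 0.8) -> List[RnaEntry]:
    # Greedy selection: visit entries from best to worst resolution and keep
    # one only if it is sufficiently dissimilar to everything already kept.
    kept: List[RnaEntry] = []
    for entry in sorted(entries, key=lambda e: e.resolution):
        if all(similarity(entry.sequence, k.sequence) < max_similarity for k in kept):
            kept.append(entry)
    return kept


def curate(entries: List[RnaEntry]) -> List[RnaEntry]:
    # Apply the three filter levels, then drop near-duplicate sequences.
    filtered = [e for e in entries
                if passes_structure(e) and passes_sequence(e) and passes_annotation(e)]
    return remove_redundancy(filtered)
```

Calling `curate` on a list of `RnaEntry` objects returns the surviving entries ordered by resolution. Removing near-duplicates before splitting data into training and test sets is one common way to limit the data leakage the abstract mentions.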