Cannon Ethalinda K S, Molik David C, Wright Adam J, Zhang Huiting, Honaas Loren, Chougule Kapeel, Dyer Sarah
USDA Agricultural Research Service-Corn Insects and Crop Genetics Research Unit, Crop Genome Informatics Lab, 819 Wallace Rd, Ames, IA 50012, USA.
USDA Agricultural Research Service-Arthropod-borne Animal Diseases Research Unit, Center for Grain and Animal Health Research, 1515 College Avenue, Manhattan, KS 66502, USA.
Genetics. 2025 Mar 17;229(3). doi: 10.1093/genetics/iyaf006.
The rapid increase in the number of reference-quality genome assemblies presents significant new opportunities for genomic research. However, the absence of standardized naming conventions for genome assemblies and annotations across datasets creates substantial challenges. Inconsistent naming hinders the identification of correct assemblies, complicates the integration of bioinformatics pipelines, and makes it difficult to link assemblies across multiple resources. To address this, we developed a specification for standardizing the naming of reference genome assemblies, to improve consistency across datasets and facilitate interoperability. This specification was created with FAIR (Findable, Accessible, Interoperable, and Reusable) practices in mind, ensuring that reference assemblies are easier to locate, access, and reuse across research communities. Additionally, it has been designed to comply with primary genomic data repositories, including members of the International Nucleotide Sequence Database Collaboration consortium, ensuring compatibility with widely used databases. While initially tailored to the agricultural genomics community, the specification is adaptable for use across different taxa. Widespread adoption of this standardized nomenclature would streamline assembly management, better enable cross-species analyses, and improve the reproducibility of research. It would also enhance natural language processing applications that depend on consistent reference assembly names in genomic literature, promoting greater integration and automated analysis of genomic data. This is a good time to consider more consistent genomic data nomenclature as many research communities and data resources are now finding themselves juggling multiple datasets from multiple data providers.