Wang Ting, Cui Yunpeng, Sun Tan, Li Huan, Wang Chao, Hou Ying, Wang Mo, Chen Li, Wu Jinming
Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China.
Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China.
Int J Mol Sci. 2025 Mar 4;26(5):2281. doi: 10.3390/ijms26052281.
Feature engineering for whole-genome DNA sequences plays a critical role in predicting plant phenotypic traits. However, because of limits on model capacity and computational resources, existing methods are predominantly SNP-based: they typically extract genetic variation sites to reduce dimensionality before feature extraction. These methods suffer from incomplete locus coverage and insufficient genetic information, and they overlook the relationships between nucleotides, which restricts the accuracy of phenotypic trait prediction. Given the parallels between gene sequences and natural language, large language models (LLMs) offer a new way to construct genome-wide feature representations at nucleotide granularity. This study proposes FE-WDNA, a whole-genome DNA sequence feature engineering method built by fine-tuning HyenaDNA on whole-genome data from 1000 soybean samples. The method captures contextual and long-range dependencies among nucleotide sites to derive comprehensive genome-wide feature vectors. We further evaluated FE-WDNA in agronomic trait prediction, examining factors such as the context window length of the DNA input, the feature vector dimension, and the trait prediction method, and achieved significant improvements over existing SNP-based approaches. FE-WDNA provides high-quality DNA sequence feature engineering at nucleotide resolution, which can be transferred to other plants and applied directly to various computational breeding tasks.
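The pipeline summarized above (split the genome into context windows, embed each window at nucleotide resolution, pool into one fixed-length feature vector, then predict a trait) can be illustrated with a minimal Python sketch. This is not the authors' implementation. `PlaceholderEncoder` is a hypothetical stand-in that uses a random projection of one-hot nucleotides where a fine-tuned HyenaDNA model would go; mean pooling and ridge regression are likewise assumptions chosen for brevity, as are all function names and parameters.

```python
import numpy as np

NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as an (L, 4) array; unknown bases such as N stay all-zero."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        j = index.get(base)
        if j is not None:
            out[pos, j] = 1.0
    return out


def split_windows(seq, window):
    """Cut a sequence into non-overlapping context windows; the last may be shorter."""
    return [seq[i:i + window] for i in range(0, len(seq), window)]


class PlaceholderEncoder:
    """Hypothetical stand-in for a fine-tuned DNA language model (NOT HyenaDNA)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: a fixed random projection replaces the learned model.
        self.proj = rng.standard_normal((4, dim)).astype(np.float32)

    def embed(self, window_seq):
        # Per-nucleotide embeddings of shape (L, dim), mean-pooled to (dim,).
        return (one_hot(window_seq) @ self.proj).mean(axis=0)


def genome_features(seq, encoder, window):
    """Embed every window and average the window vectors into one feature vector."""
    embeddings = [encoder.embed(w) for w in split_windows(seq, window)]
    return np.mean(embeddings, axis=0)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept in the last weight."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0  # do not shrink the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict_ridge(X, weights):
    """Apply the fitted weights (including intercept) to new feature vectors."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ weights


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    encoder = PlaceholderEncoder(dim=16)
    # Toy "genomes" and a synthetic trait, used only to exercise the code path.
    genomes = ["".join(rng.choice(list(NUCLEOTIDES), size=200)) for _ in range(30)]
    X = np.stack([genome_features(g, encoder, window=64) for g in genomes])
    y = rng.standard_normal(30)
    weights = fit_ridge(X, y, alpha=1.0)
    print(predict_ridge(X[:3], weights))
```

In this sketch, `window` and `dim` play the roles of the context window length and feature vector dimension that the abstract lists as evaluated factors, and the ridge step could be swapped for any other trait prediction method.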