National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Cell Genom. 2024 Jan 10;4(1):100466. doi: 10.1016/j.xgen.2023.100466. Epub 2023 Dec 15.
The data-intensive fields of genomics and machine learning (ML) are in an early stage of convergence. Genomics researchers increasingly seek to harness ML methods to extract knowledge from their data; conversely, ML scientists recognize that genomics offers a wealth of large, complex, and well-annotated datasets that can serve as a substrate for developing biologically relevant algorithms and applications. The National Human Genome Research Institute (NHGRI) consulted researchers working in these two fields to identify common challenges and to gather recommendations on how to better support genomic research that uses ML approaches. The recommendations included increasing the amount and variety of training datasets by integrating genomic data with multiomics, context-specific (e.g., by cell type), and social determinants of health datasets; reducing the inherent biases of training datasets; prioritizing transparency and interpretability of ML methods; and developing privacy-preserving technologies for research participants' data.