Tree sequences as a general-purpose tool for population genetic inference.

Authors

Whitehouse Logan S, Ray Dylan, Schrider Daniel R

Affiliation

Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA, 120 Mason Farm Rd, Chapel Hill, NC 27514.

Publication information

bioRxiv. 2024 Oct 5:2024.02.20.581288. doi: 10.1101/2024.02.20.581288.

DOI:10.1101/2024.02.20.581288

PMID:39185244

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11343121/

Abstract

As population genetic data sets grow in size, new methods have been developed to store genetic information efficiently, such as tree sequences. These data structures are efficient in both computation and storage, but they are not interchangeable with the data structures that many existing population genetic inference methods rely on, such as the alignments used as input to convolutional neural networks (CNNs). To make better use of these new data structures, we propose and implement a graph convolutional network (GCN) that learns directly from tree sequence topology and node data, allowing neural networks to be applied without the intermediate step of converting tree sequences to population genetic alignment format. We then compare our approach to standard CNN approaches on a set of previously defined benchmarking tasks, including recombination rate estimation, positive selection detection, introgression detection, and demographic model parameter inference. We show that a GCN can learn directly from tree sequences and performs well on these common population genetic inference tasks, with accuracies roughly matching or even exceeding those of a CNN-based method. As tree sequences become more widely used in population genetics research, we foresee further development and optimization of this work providing a foundation for population genetic inference moving forward.
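To make the idea of learning from tree topology and node data concrete, the following is a minimal sketch of a single graph-convolution step on one small tree. It is not the authors' architecture: it assumes the standard normalized propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W), and the five-node tree, the use of node time as the only node feature, and the weight values are all hypothetical choices made for illustration.

```python
import math

def gcn_layer(edges, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(features)
    # Undirected adjacency matrix with self-loops (A + I).
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for parent, child in edges:
        adj[parent][child] = 1.0
        adj[child][parent] = 1.0
    # Node degrees, counting the self-loop.
    deg = [sum(row) for row in adj]
    # Symmetric normalization D^-1/2 (A + I) D^-1/2.
    norm = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    f_in, f_out = len(features[0]), len(weights[0])
    # Aggregate each node's own and neighboring features.
    agg = [[sum(norm[i][k] * features[k][c] for k in range(n))
            for c in range(f_in)] for i in range(n)]
    # Linear transform followed by ReLU.
    return [[max(0.0, sum(agg[i][c] * weights[c][o] for c in range(f_in)))
             for o in range(f_out)] for i in range(n)]

# Hypothetical tree: samples 0, 1, 2; node 3 is the parent of 0 and 1;
# node 4 (the root) is the parent of 3 and 2.
edges = [(3, 0), (3, 1), (4, 3), (4, 2)]
# Hypothetical node feature: node time (samples at 0, ancestors older).
node_times = [[0.0], [0.0], [0.0], [1.0], [2.0]]
# Hypothetical weights mapping 1 input feature to 2 hidden features.
weights = [[1.0, -1.0]]

hidden = gcn_layer(edges, node_times, weights)
for node_id, row in enumerate(hidden):
    print(node_id, [round(x, 3) for x in row])
```

After one step, each sample node's representation already reflects the age of its parent, which is the kind of genealogical signal an alignment-based CNN has to infer indirectly. A full model would stack several such layers with learned weights and pool over the many trees in a tree sequence.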
