Annu Int Conf IEEE Eng Med Biol Soc. 2022 Jul;2022:3558-3562. doi: 10.1109/EMBC48229.2022.9870905.
We analyze dog genotypes (i.e., positions in the dog DNA sequence that often vary between dogs) to predict the corresponding phenotypes (i.e., observed characteristics). More specifically, given chromosome data from a dog, we aim to predict its breed, height, and weight. We explore a variety of linear and non-linear classification and regression techniques for these three tasks. We also investigate a neural network (in both linear and non-linear modes) for breed classification and compare its performance with that of traditional statistical methods. We show that linear methods generally match or outperform non-linear methods for breed classification, whereas the reverse holds for height and weight regression. Finally, we evaluate all of these methods as a function of the number of input features. We conduct experiments using different fractions of the full genomic sequences, with inputs ranging from 20 single-nucleotide polymorphisms (SNPs) to ∼200k SNPs, which lets us explore the impact of using a very limited number of SNPs for prediction. Our experiments demonstrate that these phenotypes can be predicted using as few as 0.5% of the SNPs, selected at random (i.e., 992 SNPs), and that dog breeds can be classified with 50% balanced accuracy using as few as 0.02% of the SNPs (i.e., 40 SNPs).
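The two quantities the abstract relies on, the size of a random SNP subset and balanced accuracy, can be illustrated with a minimal sketch. It is not the authors' code. The total of 198,400 SNPs is an assumption inferred from "0.5% = 992 SNPs" (the abstract states only ∼200k), the function names are hypothetical, and balanced accuracy is taken in its usual sense of mean per-class recall, which may differ in detail from the paper's implementation.

```python
import random
from collections import defaultdict

# Assumption: inferred from 992 SNPs being 0.5% of the total;
# the abstract itself only says "~200k SNPs".
TOTAL_SNPS = 198_400


def subset_size(fraction, total=TOTAL_SNPS):
    """Number of SNPs kept when using `fraction` of the full set."""
    return max(1, round(fraction * total))


def sample_snp_indices(fraction, total=TOTAL_SNPS, seed=0):
    """Draw a random subset of SNP column indices without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total), subset_size(fraction, total)))


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare breeds weigh as much as common ones."""
    seen = defaultdict(int)     # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        seen[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / seen[c] for c in seen) / len(seen)


if __name__ == "__main__":
    print(subset_size(0.005), subset_size(0.0002))
    # A classifier that always predicts the majority breed has 75% plain
    # accuracy on this toy sample but a much lower balanced accuracy.
    truth = ["labrador", "labrador", "labrador", "pug"]
    preds = ["labrador", "labrador", "labrador", "labrador"]
    print(balanced_accuracy(truth, preds))
```

Under the assumed total, the fractions 0.5% and 0.02% give 992 and 40 SNPs, matching the counts in the abstract. The toy labels show why balanced accuracy is the more informative metric when breeds are unevenly represented.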