Hernandez Margarita, Shenk Mary K, Perry George H
Department of Anthropology, Pennsylvania State University, University Park, PA 16802, USA.
Department of Biology, Pennsylvania State University, University Park, PA 16802, USA.
R Soc Open Sci. 2020 Sep 30;7(9):201206. doi: 10.1098/rsos.201206. eCollection 2020 Sep.
Scholars have noted major disparities in the extent of scientific research conducted among taxonomic groups. Such trends may cascade if future scientists gravitate towards study species with more data and resources already available. As new technologies emerge, do research studies employing these technologies continue these disparities? Here, using non-human primates as a case study, we identified disparities in massively parallel genomic sequencing data and conducted interviews with scientists who produced these data to learn their motivations when selecting study species. We tested whether variables including publication history and conservation status were significantly correlated with publicly available sequence data in the NCBI Sequence Read Archive (SRA). Of the 179.6 terabases (Tb) of sequence data in SRA for 519 non-human primate species, 135 Tb (approx. 75%) were from only five species: rhesus macaques, olive baboons, green monkeys, chimpanzees and crab-eating macaques. The strongest predictors of the amount of genomic data were the total number of non-medical publications (linear regression; r² = 0.37; p = 6.15 × 10^[…]) and number of medical publications (r² = 0.27; p = 9.27 × 10^[…]). In a generalized linear model, the number of non-medical publications (p = 0.00064) and closer phylogenetic distance to humans (p = 0.024) were the most predictive of the amount of genomic sequence data. We interviewed 33 authors of genomic data-producing publications and analysed their responses using grounded theory. Consistent with our quantitative results, authors mentioned their choice of species was motivated by sample accessibility, prior published work and relevance to human medicine. Our mixed-methods approach helped identify and contextualize some of the driving factors behind species-uneven patterns of scientific research, which can now be considered by funding agencies, scientific societies and research teams aiming to align their broader goals with future data generation efforts.