Stepwise Distributed Open Innovation Contests for Software Development: Acceleration of Genome-Wide Association Analysis.

Authors

Hill Andrew, Loh Po-Ru, Bharadwaj Ragu B, Pons Pascal, Shang Jingbo, Guinan Eva, Lakhani Karim, Kilty Iain, Jelinsky Scott A

Affiliations

Research Business Technology, Pfizer Research, 1 Portland Street, Cambridge, Massachusetts, 02139 USA.

Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

Publication information

Gigascience. 2017 May 1;6(5):1-10. doi: 10.1093/gigascience/gix009.

Abstract

BACKGROUND

The association of differing genotypes with disease-related phenotypic traits offers great potential to both help identify new therapeutic targets and support stratification of patients who would gain the greatest benefit from specific drug classes. Development of low-cost genotyping and sequencing has made collecting large-scale genotyping data routine in population and therapeutic intervention studies. In addition, a range of new technologies is being used to capture numerous new and complex phenotypic descriptors. As a result, genotype and phenotype datasets have grown exponentially. Genome-wide association studies associate genotypes and phenotypes using methods such as logistic regression. As existing tools for association analysis limit the efficiency by which value can be extracted from increasing volumes of data, there is a pressing need for new software tools that can accelerate association analyses on large genotype-phenotype datasets.
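To make the per-variant computation concrete, here is a minimal sketch of a logistic regression of case/control status on a single variant's genotype, fitted by Newton-Raphson. It is an illustration only and is not PLINK's code: the function name `fit_logistic_snp`, the additive 0/1/2 genotype coding, the absence of covariates, and the toy data are all assumptions made for this example.

```python
import math

def fit_logistic_snp(genotypes, phenotypes, max_iter=25, tol=1e-8):
    """Fit logit P(case) = b0 + b1 * genotype by Newton-Raphson.

    genotypes: allele counts per subject (assumed additive coding).
    phenotypes: 0 for control, 1 for case.
    Returns (b0, b1, se_b1), with se_b1 taken from the information
    matrix evaluated at the last iteration.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix entries
        for x, y in zip(genotypes, phenotypes):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1

# Hypothetical toy data: genotype 0 -> 30 controls, 10 cases;
# genotype 1 -> 20 controls, 20 cases.
toy_genotypes = [0] * 40 + [1] * 40
toy_phenotypes = [0] * 30 + [1] * 10 + [0] * 20 + [1] * 20

b0, b1, se_b1 = fit_logistic_snp(toy_genotypes, toy_phenotypes)
print(f"b1 = {b1:.4f}, se = {se_b1:.4f}, z = {b1 / se_b1:.3f}")
```

A genome-wide scan repeats a fit of this kind once per variant, which is why the cost of the inner loop dominates the run time on large datasets.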

RESULTS

Using open innovation (OI) and contest-based crowdsourcing, the logistic regression analysis in a leading, community-standard genetics software package (PLINK 1.07) was substantially accelerated. OI allowed us to do this in <6 months by providing rapid access to highly skilled programmers with specialized, difficult-to-find skill sets. Through a crowd-based contest a combination of computational, numeric, and algorithmic approaches was identified that accelerated the logistic regression in PLINK 1.07 by 18- to 45-fold. Combining contest-derived logistic regression code with coarse-grained parallelization, multithreading, and associated changes to data initialization code further developed through distributed innovation, we achieved an end-to-end speedup of 591-fold for a data set size of 6678 subjects by 645 863 variants, compared to PLINK 1.07's logistic regression. This represents a reduction in run time from 4.8 hours to 29 seconds. Accelerated logistic regression code developed in this project has been incorporated into the PLINK2 project.
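As a quick sanity check on the figures above, the short calculation below derives the speedup and per-second throughput from the rounded run times quoted in the abstract. The variable names are illustrative. Because 4.8 hours and 29 seconds are rounded values, the ratio comes out at roughly 596 rather than exactly the reported 591; the difference is presumably due to rounding of the published run times.

```python
# Figures quoted in the abstract (rounded as published).
baseline_seconds = 4.8 * 3600      # PLINK 1.07 logistic regression run time
accelerated_seconds = 29           # end-to-end run time after acceleration
n_variants = 645_863               # variants in the benchmark dataset

# Speedup implied by the rounded run times.
speedup = baseline_seconds / accelerated_seconds

# Variants processed per second before and after acceleration.
baseline_rate = n_variants / baseline_seconds
accelerated_rate = n_variants / accelerated_seconds

print(f"implied speedup: {speedup:.0f}-fold")
print(f"baseline throughput: {baseline_rate:.1f} variants/s")
print(f"accelerated throughput: {accelerated_rate:.0f} variants/s")
```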

CONCLUSIONS

Using iterative competition-based OI, we have developed a new, faster implementation of logistic regression for genome-wide association studies analysis. We present lessons learned and recommendations on running a successful OI process for bioinformatics.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c17/5467032/da0a74db8ee1/gix009fig1.jpg
