Stepwise Distributed Open Innovation Contests for Software Development: Acceleration of Genome-Wide Association Analysis.

Authors

Hill Andrew, Loh Po-Ru, Bharadwaj Ragu B, Pons Pascal, Shang Jingbo, Guinan Eva, Lakhani Karim, Kilty Iain, Jelinsky Scott A

Affiliations

Research Business Technology, Pfizer Research, 1 Portland Street, Cambridge, Massachusetts, 02139 USA.

Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

Publication

Gigascience. 2017 May 1;6(5):1-10. doi: 10.1093/gigascience/gix009.

DOI: 10.1093/gigascience/gix009
PMID: 28327993
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5467032/
Abstract

BACKGROUND

The association of differing genotypes with disease-related phenotypic traits offers great potential to both help identify new therapeutic targets and support stratification of patients who would gain the greatest benefit from specific drug classes. Development of low-cost genotyping and sequencing has made collecting large-scale genotyping data routine in population and therapeutic intervention studies. In addition, a range of new technologies is being used to capture numerous new and complex phenotypic descriptors. As a result, genotype and phenotype datasets have grown exponentially. Genome-wide association studies associate genotypes and phenotypes using methods such as logistic regression. As existing tools for association analysis limit the efficiency by which value can be extracted from increasing volumes of data, there is a pressing need for new software tools that can accelerate association analyses on large genotype-phenotype datasets.
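To make the workload concrete, the following is a minimal sketch of a per-variant logistic regression scan, assuming genotypes coded as 0/1/2 allele counts and a binary case/control phenotype. It is not PLINK's implementation; the function names `fit_logistic` and `gwas_scan` are hypothetical, and the Newton-Raphson fit is a textbook choice made for illustration. The point is that one small regression is fitted for every variant, so total cost grows with the number of variants.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit a logistic regression by Newton-Raphson.

    Returns the coefficient vector and its standard errors.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # predicted probabilities
        w = p * (1.0 - p)                       # per-subject weights
        grad = X.T @ (y - p)                    # score vector
        hess = X.T @ (X * w[:, None])           # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the inverse information at the final estimate.
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta, np.sqrt(np.diag(cov))

def gwas_scan(genotypes, phenotype):
    """Run one logistic regression per variant and return Wald z-scores.

    genotypes: (n_subjects, n_variants) array of allele counts (0, 1, 2).
    phenotype: (n_subjects,) array of 0/1 case status.
    """
    n, m = genotypes.shape
    z = np.empty(m)
    for j in range(m):
        # Design matrix: intercept plus the genotype of variant j.
        X = np.column_stack([np.ones(n), genotypes[:, j]])
        beta, se = fit_logistic(X, phenotype)
        z[j] = beta[1] / se[1]
    return z
```

This sketch omits covariates, missing-data handling, and monomorphic variants (which would make the information matrix singular); a production tool has to deal with all three.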

RESULTS

Using open innovation (OI) and contest-based crowdsourcing, the logistic regression analysis in a leading, community-standard genetics software package (PLINK 1.07) was substantially accelerated. OI allowed us to do this in <6 months by providing rapid access to highly skilled programmers with specialized, difficult-to-find skill sets. Through a crowd-based contest a combination of computational, numeric, and algorithmic approaches was identified that accelerated the logistic regression in PLINK 1.07 by 18- to 45-fold. Combining contest-derived logistic regression code with coarse-grained parallelization, multithreading, and associated changes to data initialization code further developed through distributed innovation, we achieved an end-to-end speedup of 591-fold for a data set size of 6678 subjects by 645 863 variants, compared to PLINK 1.07's logistic regression. This represents a reduction in run time from 4.8 hours to 29 seconds. Accelerated logistic regression code developed in this project has been incorporated into the PLINK2 project.
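As a quick arithmetic check on the reported figures, the sketch below recomputes the speedup from the two run times given in the abstract and derives an average time per variant. The variable names are illustrative. The ratio of the rounded times comes out slightly above 591, which is presumably because 4.8 hours is itself a rounded value; that explanation is an assumption, not something the abstract states.

```python
baseline_seconds = 4.8 * 3600      # PLINK 1.07 run time reported as 4.8 hours
accelerated_seconds = 29           # accelerated run time reported as 29 seconds
n_variants = 645_863               # variants in the benchmark data set

speedup = baseline_seconds / accelerated_seconds
print(round(speedup))              # 596, close to the reported 591-fold

# Average wall-clock time per variant before and after acceleration.
per_variant_before = baseline_seconds / n_variants
per_variant_after = accelerated_seconds / n_variants
print(f"{per_variant_before * 1e3:.1f} ms -> {per_variant_after * 1e6:.1f} us per variant")
```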

CONCLUSIONS

Using iterative competition-based OI, we have developed a new, faster implementation of logistic regression for genome-wide association studies analysis. We present lessons learned and recommendations on running a successful OI process for bioinformatics.

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c17/5467032/da0a74db8ee1/gix009fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c17/5467032/5202e80028bf/gix009fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c17/5467032/84ff8b00f6f2/gix009fig3.jpg
