Freimuth Robert R, Stormo Gary D, McLeod Howard L
Department of Medicine, Washington University School of Medicine, St. Louis, Missouri 63110, USA.
Hum Mutat. 2005 Feb;25(2):110-7. doi: 10.1002/humu.20123.
Pharmacogenomic and disease-association studies rely on identifying a comprehensive set of polymorphisms within candidate genes. Public SNP databases are a rich source of polymorphism data, but mining them effectively requires overcoming at least four challenges: ensuring accurate annotations for genes and polymorphisms, eliminating both inter- and intra-database redundancy, integrating data from multiple public sources with data generated locally, and prioritizing the variants for further study. PolyMAPr (Polymorphism Mining and Annotation Programs) was developed to overcome these challenges and to improve the efficiency of database mining and polymorphism annotation. PolyMAPr takes as input a file containing a list of genes to be processed and files containing each annotated gene sequence. Polymorphic sequences obtained from public databases (dbSNP, CGAP, and JSNP) or through local SNP discovery efforts, as well as oligonucleotide sequences (e.g., PCR primers), are mapped to the annotated gene sequences and named according to suggested nomenclature guidelines. The functional effects of nonsynonymous coding-region SNPs (cSNPs) and any variants that might alter exon splicing enhancer (ESE) sites, putative transcription factor binding sites, or intron-exon splice sites are predicted. The output files are accessible through a browser interface. The results are also provided in Extensible Markup Language (XML) format to facilitate uploading them into a local relational database. PolyMAPr increases the efficiency of mining public databases for genetic variants within candidate genes and provides a mechanism by which data from multiple sources (both public and private) can be uniformly integrated, thereby significantly reducing the effort required to obtain a comprehensive set of polymorphisms for pharmacogenomic and disease-association studies. PolyMAPr can be obtained from http://pharmacogenomics.wustl.edu.
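The abstract does not describe how PolyMAPr is implemented, so the following Python sketch is only an illustration of three of the steps it names: mapping polymorphic sequences onto an annotated gene sequence, collapsing redundant records from different sources, and flagging nonsynonymous cSNPs. Everything in it is an assumption made for illustration: the names `Snp`, `map_snp`, `classify`, and `annotate` are hypothetical, mapping is done by exact matching of flanking sequence on one strand, and the coding region is treated as a single contiguous interval with no introns.

```python
from dataclasses import dataclass

# Standard genetic code, built from the canonical T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


@dataclass(frozen=True)
class Snp:
    """Hypothetical record: a variant described by its flanking sequence."""
    source: str     # e.g. "dbSNP" or "local"
    flank5: str     # sequence immediately 5' of the variant base
    alleles: tuple  # observed alleles, e.g. ("T", "C")
    flank3: str     # sequence immediately 3' of the variant base


def map_snp(gene, snp):
    """Return the 0-based variant position, or None if not uniquely matched."""
    hits = []
    start = gene.find(snp.flank5)
    while start != -1:
        pos = start + len(snp.flank5)
        if (pos < len(gene) and gene[pos] in snp.alleles
                and gene.startswith(snp.flank3, pos + 1)):
            hits.append(pos)
        start = gene.find(snp.flank5, start + 1)
    return hits[0] if len(hits) == 1 else None


def classify(gene, cds_start, cds_end, pos, alt):
    """Label a substitution, assuming one intron-free CDS [cds_start, cds_end)."""
    if not cds_start <= pos < cds_end:
        return "noncoding"
    offset = (pos - cds_start) % 3
    ref_codon = gene[pos - offset:pos - offset + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    residue = (pos - cds_start) // 3 + 1
    return f"nonsynonymous {ref_aa}{residue}{alt_aa}"


def annotate(gene, cds_start, cds_end, snps):
    """Map SNPs, merge records that hit the same site, and classify each."""
    merged = {}
    for snp in snps:
        pos = map_snp(gene, snp)
        if pos is None:
            continue  # unmapped or ambiguous records are skipped in this sketch
        merged.setdefault((pos, frozenset(snp.alleles)), set()).add(snp.source)
    report = []
    for (pos, alleles), sources in sorted(merged.items(), key=lambda kv: kv[0][0]):
        ref = gene[pos]
        for alt in sorted(alleles - {ref}):
            effect = classify(gene, cds_start, cds_end, pos, alt)
            report.append((pos, ref, alt, sorted(sources), effect))
    return report


if __name__ == "__main__":
    # Toy gene: 4-base 5' UTR, CDS ATG GCT GAA TAA, 2-base 3' UTR.
    gene = "GGCC" + "ATGGCTGAATAA" + "TT"
    snps = [
        Snp("dbSNP", "ATGGC", ("T", "C"), "GAATA"),
        Snp("local", "GGC", ("C", "T"), "GAA"),  # same site, shorter flanks
        Snp("local", "GCT", ("G", "A"), "AATAA"),
    ]
    for row in annotate(gene, 4, 16, snps):
        print(row)
```

In this toy input the first two records describe the same site with different flank lengths, so they collapse into one entry carrying both sources. A real pipeline would also need to handle the reverse strand, inexact matches, insertions and deletions, and multi-exon genes, none of which this sketch attempts.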