SEXCMD: Development and validation of sex marker sequences for whole-exome/genome and RNA sequencing.

Author information

Jeong Seongmun, Kim Jiwoong, Park Won, Jeon Hongmin, Kim Namshin

Affiliations

Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea.

Quantitative Biomedical Research Center, Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, United States of America.

Publication information

PLoS One. 2017 Sep 8;12(9):e0184087. doi: 10.1371/journal.pone.0184087. eCollection 2017.

DOI:10.1371/journal.pone.0184087

PMID:28886064

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5590872/

Abstract

Over the last decade, a large number of nucleotide sequences have been generated by next-generation sequencing technologies and deposited to public databases. However, most of these datasets do not specify the sex of individuals sampled because researchers typically ignore or hide this information. Male and female genomes in many species have distinctive sex chromosomes, XX/XY and ZW/ZZ, and expression levels of many sex-related genes differ between the sexes. Herein, we describe how to develop sex marker sequences from syntenic regions of sex chromosomes and use them to quickly identify the sex of individuals being analyzed. Array-based technologies routinely use either known sex markers or the B-allele frequency of X or Z chromosomes to deduce the sex of an individual. The same strategy has been used with whole-exome/genome sequence data; however, all reads must be aligned onto a reference genome to determine the B-allele frequency of the X or Z chromosomes. SEXCMD is a pipeline that can extract sex marker sequences from reference sex chromosomes and rapidly identify the sex of individuals from whole-exome/genome and RNA sequencing after training with a known dataset through a simple machine learning approach. The pipeline counts total numbers of hits from sex-specific marker sequences and identifies the sex of the individuals sampled based on the fact that XX/ZZ samples do not have Y or W chromosome hits. We have successfully validated our pipeline with mammalian (Homo sapiens; XY) and avian (Gallus gallus; ZW) genomes. Typical calculation time when applying SEXCMD to human whole-exome or RNA sequencing datasets is a few minutes, and analyzing human whole-genome datasets takes about 10 minutes. Another important application of SEXCMD is as a quality control measure to avoid mixing samples before bioinformatics analysis. SEXCMD comprises simple Python and R scripts and is freely available at https://github.com/lovemun/SEXCMD.
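The counting rule described in the abstract (tally hits against sex-specific marker sequences, then use the near absence of Y or W hits to recognize XX/ZZ samples) can be illustrated with a minimal sketch. This is not the SEXCMD implementation: the function names `count_marker_hits` and `call_sex`, the exact k-mer matching used in place of read alignment, and the `max_ratio` and `min_hits` cutoffs are all assumptions made for illustration. The abstract states that SEXCMD derives its decision rule by training on a known dataset, so fixed cutoffs like these would need to be replaced by learned values in practice.

```python
def build_kmer_index(markers, k):
    """Map each chromosome label to the set of k-mers found in its marker sequences."""
    index = {}
    for chrom, seqs in markers.items():
        index[chrom] = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    return index


def count_marker_hits(reads, markers, k=8):
    """Count reads that share at least one exact k-mer with each chromosome's markers.

    Exact k-mer matching is a simplified stand-in for aligning reads to the
    marker sequences; it is an assumption of this sketch.
    """
    index = build_kmer_index(markers, k)
    hits = {chrom: 0 for chrom in markers}
    for read in reads:
        read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        for chrom, kmers in index.items():
            if read_kmers & kmers:
                hits[chrom] += 1
    return hits


def call_sex(hits, homo="X", hetero="Y", max_ratio=0.05, min_hits=10):
    """Classify a sample from marker hit counts.

    homo/hetero are the chromosome labels ("X"/"Y" or "Z"/"W").
    max_ratio and min_hits are hypothetical cutoffs, not values from the paper.
    """
    total = hits[homo] + hits[hetero]
    if total < min_hits:
        return "undetermined"  # too few marker hits to make a call
    ratio = hits[hetero] / total
    # XX or ZZ samples should show (almost) no Y or W hits.
    return "homogametic" if ratio < max_ratio else "heterogametic"


if __name__ == "__main__":
    # Toy marker sequences, invented for this example only.
    markers = {"X": ["AAAACCCCGGGG"], "Y": ["TTTTGGGGCCAA"]}
    reads = ["AAAACCCCGG"] * 8 + ["TTTTGGGGCC"] * 6
    hits = count_marker_hits(reads, markers, k=8)
    print(hits, call_sex(hits))
```

In an XY system the homogametic call corresponds to female (XX) and the heterogametic call to male (XY); in a ZW system the mapping is reversed (ZZ male, ZW female), which is why the sketch returns a gametic class rather than a sex label.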

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9a0f/5590872/1a8290c1a4fd/pone.0184087.g001.jpg

Similar articles

Fully automated pipeline for detection of sex linked genes using RNA-Seq data.

BMC Bioinformatics. 2015 Mar 11;16(1):78. doi: 10.1186/s12859-015-0509-0.

Identification of sex-linked SNP markers using RAD sequencing suggests ZW/ZZ sex determination in Pistacia vera L.

BMC Genomics. 2015 Feb 18;16(1):98. doi: 10.1186/s12864-015-1326-6.

Extraction and annotation of human mitochondrial genomes from 1000 Genomes Whole Exome Sequencing data.

BMC Genomics. 2014;15 Suppl 3(Suppl 3):S2. doi: 10.1186/1471-2164-15-S3-S2. Epub 2014 May 6.

Sex-linked markers in the North American green frog (Rana clamitans) developed using DArTseq provide early insight into sex chromosome evolution.

BMC Genomics. 2016 Oct 28;17(1):844. doi: 10.1186/s12864-016-3209-x.

Identifying, understanding, and correcting technical artifacts on the sex chromosomes in next-generation sequencing data.

Gigascience. 2019 Jul 1;8(7). doi: 10.1093/gigascience/giz074.

A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments.

BMC Bioinformatics. 2017 Mar 14;18(Suppl 3):43. doi: 10.1186/s12859-017-1471-9.

High-density linkage mapping aided by transcriptomics documents ZW sex determination system in the Chinese mitten crab Eriocheir sinensis.

Heredity (Edinb). 2015 Sep;115(3):206-15. doi: 10.1038/hdy.2015.26. Epub 2015 Apr 15.

Y and W Chromosome Assemblies: Approaches and Discoveries.

Trends Genet. 2017 Apr;33(4):266-282. doi: 10.1016/j.tig.2017.01.008. Epub 2017 Feb 22.

Integrated gene mapping and synteny studies give insights into the evolution of a sex proto-chromosome in Solea senegalensis.

Chromosoma. 2017 Mar;126(2):261-277. doi: 10.1007/s00412-016-0589-2. Epub 2016 Apr 14.

Cited by

The genomic prehistory of the Indigenous peoples of Uruguay.

PNAS Nexus. 2022 Apr 21;1(2):pgac047. doi: 10.1093/pnasnexus/pgac047. eCollection 2022 May.

Considerations and challenges for sex-aware drug repurposing.

Biol Sex Differ. 2022 Mar 25;13(1):13. doi: 10.1186/s13293-022-00420-8.

The avian W chromosome is a refugium for endogenous retroviruses with likely effects on female-biased mutational load and genetic incompatibilities.

Philos Trans R Soc Lond B Biol Sci. 2021 Sep 13;376(1833):20200186. doi: 10.1098/rstb.2020.0186. Epub 2021 Jul 26.

Bioinformatics services for analyzing massive genomic datasets.

Genomics Inform. 2020 Mar;18(1):e8. doi: 10.5808/GI.2020.18.1.e8. Epub 2020 Mar 31.

Alternatives to amelogenin markers for sex determination in humans and their forensic relevance.

Mol Biol Rep. 2020 Mar;47(3):2347-2360. doi: 10.1007/s11033-020-05268-y. Epub 2020 Jan 25.

Novel human sex-typing strategies based on the autism candidate gene NLGN4X and its male-specific gametologue NLGN4Y.

Biol Sex Differ. 2019 Dec 18;10(1):62. doi: 10.1186/s13293-019-0279-x.
