Jeong Seongmun, Kim Jiwoong, Park Won, Jeon Hongmin, Kim Namshin
Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea.
Quantitative Biomedical Research Center, Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, United States of America.
PLoS One. 2017 Sep 8;12(9):e0184087. doi: 10.1371/journal.pone.0184087. eCollection 2017.
Over the last decade, next-generation sequencing technologies have generated a large number of nucleotide sequences, which have been deposited in public databases. However, most of these datasets do not specify the sex of the individuals sampled, because researchers typically omit or withhold this information. In many species, male and female genomes carry distinctive sex chromosomes (XX/XY or ZZ/ZW), and the expression levels of many sex-related genes differ between the sexes. Herein, we describe how to develop sex marker sequences from syntenic regions of sex chromosomes and use them to quickly identify the sex of the individuals being analyzed. Array-based technologies routinely use either known sex markers or the B-allele frequency of the X or Z chromosome to deduce the sex of an individual. The same strategy has been used with whole-exome/genome sequence data; however, all reads must first be aligned to a reference genome to determine the B-allele frequency of the X or Z chromosome. SEXCMD is a pipeline that extracts sex marker sequences from reference sex chromosomes and, after training with a known dataset through a simple machine learning approach, rapidly identifies the sex of individuals from whole-exome/genome and RNA sequencing data. The pipeline counts the total number of hits to sex-specific marker sequences and identifies the sex of each sampled individual based on the fact that XX/ZZ samples have no Y or W chromosome hits. We have successfully validated the pipeline with mammalian (Homo sapiens; XY) and avian (Gallus gallus; ZW) genomes. Applying SEXCMD to human whole-exome or RNA sequencing datasets typically takes a few minutes, and analyzing human whole-genome datasets takes about 10 minutes. Another important application of SEXCMD is as a quality control measure to avoid sample mix-ups before bioinformatics analysis. SEXCMD comprises simple Python and R scripts and is freely available at https://github.com/lovemun/SEXCMD.
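The hit-counting idea described above can be illustrated with a minimal Python sketch. This is not SEXCMD's actual code: the function name `classify_sex`, the `min_fraction` and `min_total` parameters, and their default values are hypothetical choices made for illustration only, and the fixed cutoff stands in for whatever decision boundary the trained model would supply. The sketch assumes that hits to the shared (X or Z) markers and to the sex-specific (Y or W) markers have already been counted.

```python
def classify_sex(shared_hits: int, specific_hits: int, system: str = "XY",
                 min_fraction: float = 0.05, min_total: int = 10) -> str:
    """Call sex from marker hit counts (illustrative sketch only).

    shared_hits   -- reads hitting X (or Z) marker sequences
    specific_hits -- reads hitting Y (or W) marker sequences
    system        -- "XY" (e.g. human) or "ZW" (e.g. chicken)
    min_fraction  -- hypothetical cutoff on the Y/W share of all marker hits;
                     in practice this would be learned from labelled samples
    min_total     -- hypothetical minimum number of hits needed to make a call
    """
    if system not in ("XY", "ZW"):
        raise ValueError("system must be 'XY' or 'ZW'")
    if shared_hits < 0 or specific_hits < 0:
        raise ValueError("hit counts must be non-negative")

    total = shared_hits + specific_hits
    if total < min_total:
        # Too few marker hits to say anything reliable.
        return "undetermined"

    # XX/ZZ samples should show (almost) no Y/W marker hits, so a low
    # fraction indicates the homogametic sex.
    has_specific = (specific_hits / total) >= min_fraction

    if system == "XY":
        return "male" if has_specific else "female"    # XY male, XX female
    return "female" if has_specific else "male"        # ZW female, ZZ male


if __name__ == "__main__":
    print(classify_sex(500, 0, system="XY"))
    print(classify_sex(300, 150, system="ZW"))
```

A small non-zero cutoff is used instead of requiring exactly zero Y or W hits because a few spurious matches can occur in real data; the appropriate value would depend on the marker set and the sequencing type.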