Center for Human Genome Variation, Duke University School of Medicine, Box 91009, Durham, NC 27708, USA.
Genome Biol. 2010;11(5):R57. doi: 10.1186/gb-2010-11-5-r57. Epub 2010 May 28.
There is considerable interest in developing methods to efficiently identify all coding variants present in large sample sets of humans. Three approaches are possible: whole-genome sequencing, whole-exome sequencing using exon capture methods, and RNA-Seq. While whole-genome sequencing is the most complete, it remains sufficiently expensive that cost-effective alternatives are important.
Here we provide a systematic exploration of how well RNA-Seq can identify human coding variants by comparing variants identified through high-coverage whole-genome sequencing to those identified by high-coverage RNA-Seq in the same individual. This comparison allowed us to directly evaluate the sensitivity and specificity of RNA-Seq in identifying coding variants, and to evaluate how key parameters such as the degree of coverage and the expression levels of genes interact to influence performance. We find that although only 40% of exonic variants identified by whole-genome sequencing were captured using RNA-Seq, this number rose to 81% when concentrating on genes known to be well-expressed in the source tissue. We also find that a high false positive rate can be problematic when working with RNA-Seq data, especially at higher levels of coverage.
We conclude that as long as a tissue relevant to the trait under study is available and suitable quality control screens are implemented, RNA-Seq is a fast and inexpensive alternative approach for finding coding variants in genes with sufficiently high expression levels.
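The comparison described above, in which whole-genome calls serve as the reference and RNA-Seq calls are checked against them, can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Variant` type, the `compare_calls` function, and the toy coordinates are all hypothetical, and it assumes both call sets have already been restricted to the same exonic regions and normalized to the same representation.

```python
from typing import NamedTuple, Set, Dict


class Variant(NamedTuple):
    """Hypothetical minimal variant key: chromosome, position, ref and alt alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def compare_calls(wgs: Set[Variant], rnaseq: Set[Variant]) -> Dict[str, float]:
    """Compare RNA-Seq calls against whole-genome calls treated as the reference.

    sensitivity       = fraction of WGS variants also called by RNA-Seq
    rna_only_fraction = fraction of RNA-Seq calls absent from WGS
                        (candidate false positives, under the assumption
                        that the WGS call set is complete and correct)
    """
    shared = wgs & rnaseq
    rna_only = rnaseq - wgs
    sensitivity = len(shared) / len(wgs) if wgs else float("nan")
    rna_only_fraction = len(rna_only) / len(rnaseq) if rnaseq else float("nan")
    return {"sensitivity": sensitivity, "rna_only_fraction": rna_only_fraction}


# Toy data for illustration only; these are not variants from the study.
wgs_calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 400, "G", "A"),
    Variant("chr2", 900, "T", "C"),
    Variant("chr3", 120, "A", "T"),
}
rnaseq_calls = {
    Variant("chr1", 100, "A", "G"),   # shared with WGS
    Variant("chr2", 400, "G", "A"),   # shared with WGS
    Variant("chr5", 777, "C", "G"),   # RNA-Seq only
}

result = compare_calls(wgs_calls, rnaseq_calls)
print(result)
```

Restricting `wgs_calls` to variants in genes that are well expressed in the source tissue before calling `compare_calls` would give the kind of expression-stratified sensitivity the abstract reports. The expression filter itself is outside this sketch.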