Clavell-Revelles Pau, Reese Fairlie, Carbonell-Sala Sílvia, Degalez Fabien, Oliveros Winona, Arnan Carme, Guigó Roderic, Melé Marta
Life Sciences Department, Barcelona Supercomputing Center (BSC), Barcelona, Catalonia.
Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Catalonia.
bioRxiv. 2025 Mar 17:2025.03.14.643250. doi: 10.1101/2025.03.14.643250.
Accurate gene annotations are fundamental for interpreting genetic variation, cellular function, and disease mechanisms. However, current human gene annotations are largely derived from transcriptomic data of individuals with European ancestry, introducing potential biases that remain uncharacterized. Here, we generate over 800 million full-length reads with long-read RNA-seq in 43 lymphoblastoid cell line samples from eight genetically diverse human populations and build a cross-ancestry gene annotation. We show that transcripts from non-European samples are underrepresented in reference gene annotations, leading to systematic biases in allele-specific transcript usage analyses. Furthermore, we show that personal genome assemblies enhance transcript discovery compared to the generic GRCh38 reference assembly, even though genomic regions unique to each individual are heavily depleted of genes. These findings underscore the urgent need for a more inclusive gene annotation framework that accurately represents global transcriptome diversity.