Vialle Ricardo A, Yu Lei, Li Yan, Raittz Roberto T, Farfel Jose M, De Jager Philip L, Schneider Julie A, Barnes Lisa L, Tasaki Shinya, Bennett David A
Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA.
Graduate Program in Bioinformatics, Professional and Technical Education Sector (SEPT), Universidade Federal do Paraná (UFPR), Curitiba, Paraná, Brazil.
medRxiv. 2025 Apr 25:2025.04.23.25326276. doi: 10.1101/2025.04.23.25326276.
The TOMM40'523 poly-T repeat polymorphism (rs10524523), located in the TOMM40 gene and in linkage disequilibrium with APOE, has been associated with cognitive decline and Alzheimer's disease (AD) progression. Accurate genotyping of this polymorphism is crucial for understanding its role in neurodegeneration. Because such repeats are difficult to call from whole-genome sequencing (WGS) data, genotyping them has traditionally required additional PCR and targeted sequencing assays. Here, we introduce a novel computational pipeline that integrates multiple short tandem repeat (STR) detection tools in an ensemble machine learning model. This approach combines STR tool predictions, k-mer counts, and related features to improve poly-T repeat length estimation. Using a sample of 1,202 participants from four cohort studies, we benchmarked our method against PCR-based measures. Our ensemble model outperformed individual STR tools, improving repeat length estimation accuracy (R = 0.92) and achieving an accuracy rate of 93.2% with PCR-derived genotypes as the gold standard. Additionally, we validated our WGS-derived genotypes by replicating previously reported associations between TOMM40'523 variants and cognitive decline. Our results suggest that computational genotyping from WGS data is a scalable and reliable alternative to PCR-based assays, enabling broader investigations of TOMM40 variation in studies where WGS data are available.
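As a rough illustration of how STR tool predictions and k-mer counts might be combined into one feature vector for an ensemble model, consider the minimal sketch below. It is not the pipeline described in the abstract: the function names, the caller names (`caller_a`, `caller_b`, `caller_c`), the choice of k values, and the extra `longest_polyT_run` and `est_median` features are all assumptions made for illustration, and the model that would consume these features is not shown.

```python
import re
from statistics import median

# Matches maximal uninterrupted runs of T in a read.
POLY_T = re.compile(r"T+")


def longest_poly_t(read):
    """Return the length of the longest uninterrupted T run in a read (0 if none)."""
    return max((len(m.group()) for m in POLY_T.finditer(read.upper())), default=0)


def poly_t_kmer_counts(reads, ks=(10, 15, 20)):
    """For each k, count overlapping occurrences of the all-T k-mer across reads.

    A T run of length n contains n - k + 1 overlapping all-T k-mers when n >= k.
    """
    counts = {}
    for k in ks:
        total = 0
        for read in reads:
            for m in POLY_T.finditer(read.upper()):
                run = len(m.group())
                if run >= k:
                    total += run - k + 1
        counts[k] = total
    return counts


def build_features(reads, tool_estimates, ks=(10, 15, 20)):
    """Combine per-tool repeat-length estimates with poly-T k-mer features.

    `tool_estimates` maps a (hypothetical) STR caller name to its estimated
    repeat length for one allele; `reads` are reads overlapping the locus.
    """
    kmers = poly_t_kmer_counts(reads, ks)
    features = {f"est_{name}": float(v) for name, v in sorted(tool_estimates.items())}
    features.update({f"polyT_{k}mer_count": float(kmers[k]) for k in ks})
    features["longest_polyT_run"] = float(
        max((longest_poly_t(r) for r in reads), default=0)
    )
    # A simple consensus of the callers, included as one more input feature.
    features["est_median"] = float(median(tool_estimates.values()))
    return features


# Toy reads (synthetic, not real data) and made-up caller estimates.
reads = ["ACG" + "T" * 16 + "GCA", "GG" + "T" * 12 + "A", "ACGT"]
features = build_features(reads, {"caller_a": 16, "caller_b": 14, "caller_c": 17})
```

In a setting like the one described, one such feature vector per sample would be paired with the PCR-derived repeat length as the training target for the ensemble model.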