Liebold Jeanine, Neuhaus Fabian, Geiser Janina, Kurtz Stefan, Baumbach Jan, Newaz Khalique
Institute for Computational Systems Biology, Universität Hamburg, Hamburg 22761, Germany.
Faculty of Mathematics, Informatics and Natural Sciences, ZBH-Center for Bioinformatics, Universität Hamburg, Hamburg 22761, Germany.
Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae762.
Transcription factors (TFs) are DNA-binding proteins that regulate gene expression. Traditional methods predict a protein as a TF if the protein contains any DNA-binding domains (DBDs) of known TFs. However, this approach fails to identify a novel TF that does not contain any known DBDs. Recently proposed TF prediction methods do not rely on DBDs. Such methods use features of protein sequences to train a machine learning model, and then use the trained model to predict whether a protein is a TF or not. Because the 3-dimensional (3D) structure of a protein captures more information than its sequence, using 3D protein structures will likely allow for more accurate prediction of novel TFs.
We propose a deep learning-based TF prediction method (StrucTFactor), which is the first method to utilize 3D secondary structural information of proteins. We compare StrucTFactor with recent state-of-the-art TF prediction methods based on ∼525 000 proteins across 12 datasets, capturing different aspects of data bias (including sequence redundancy) possibly influencing a method's performance. We find that StrucTFactor significantly (P-value < 0.001) outperforms the existing TF prediction methods, improving the performance over its closest competitor by up to 17% based on Matthews correlation coefficient.
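For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of how the Matthews correlation coefficient (MCC) is computed from a binary confusion matrix. It illustrates the standard formula only and is not taken from the StrucTFactor code base; the function name `mcc` and the example counts are hypothetical.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier.

    tp/tn/fp/fn are the confusion-matrix counts (here: TF vs. non-TF).
    Returns a value in [-1, 1]; 1 is perfect, 0 is chance-level.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By common convention, MCC is set to 0 when any marginal sum is zero.
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical counts for an imbalanced TF / non-TF test set (illustration only).
score = mcc(tp=90, tn=880, fp=20, fn=10)
print(f"MCC = {score:.3f}")
```

Because MCC uses all four cells of the confusion matrix, it stays informative on imbalanced data, where TFs are a small fraction of all proteins and accuracy alone can look deceptively high.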
Data and source code are available at https://github.com/lieboldj/StrucTFactor and on our website at https://apps.cosy.bio/StrucTFactor.