DeepPTM: Protein Post-translational Modification Prediction from Protein Sequences by Combining Deep Protein Language Model with Vision Transformers



Abstract

Introduction: More recent self-supervised deep language models, such as Bidirectional Encoder Representations from Transformers (BERT), have performed best on some language tasks by contextualizing word embeddings into better dynamic representations. Their protein-specific versions, such as ProtBERT, generate dynamic protein sequence embeddings, which have improved performance on several bioinformatics tasks. In addition, a number of protein post-translational modifications play prominent roles in cellular processes such as development and differentiation. Current biological experiments can detect these modifications, but only over long durations and at significant cost.
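
As a minimal sketch of the ProtBERT feature-extraction step described above (not necessarily the authors' exact pipeline), per-residue contextual embeddings can be obtained with the HuggingFace Transformers library; the publicly available Rostlab/prot_bert checkpoint is assumed:

```python
# Hedged sketch: extract per-residue ProtBERT embeddings for one protein sequence.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

def protbert_embed(sequence: str) -> torch.Tensor:
    """Return per-residue contextual embeddings of shape (len(sequence), 1024)."""
    # ProtBERT expects space-separated residues; rare amino acids are mapped to X.
    seq = " ".join(re.sub(r"[UZOB]", "X", sequence))
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, L + 2, 1024)
    return hidden[0, 1:-1]  # drop the [CLS] and [SEP] special tokens

emb = protbert_embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(emb.shape)  # torch.Size([33, 1024])
```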

Methods: In this paper, to characterize the accompanying biological processes concisely and more rapidly, we propose DEEPPTM to predict protein post-translational modification (PTM) sites from protein sequences more efficiently. Unlike current methods, DEEPPTM improves modification prediction performance by integrating specialized ProtBERT-based protein embeddings with attention-based vision transformers (ViT), and it reveals associations between different modification types and protein sequence content. Additionally, it can infer several different modification types across different species.
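
A hypothetical sketch of this general idea, feeding the window of ProtBERT residue embeddings around a candidate site into an attention-based, ViT-style encoder for binary site classification, is given below; the window length, layer sizes, and treatment of each residue embedding as one patch-like token are illustrative assumptions, not the paper's exact configuration:

```python
# Hedged sketch of a ViT-style site classifier over ProtBERT embedding windows.
import torch
import torch.nn as nn

class ViTSiteClassifier(nn.Module):
    def __init__(self, embed_dim=1024, window=33, depth=4, heads=8):
        super().__init__()
        # Each residue embedding in the window acts as one "patch" token.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, window + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, heads, dim_feedforward=2048,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, 1)  # logit: modified vs. unmodified site

    def forward(self, x):            # x: (batch, window, embed_dim) ProtBERT features
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])    # classify from the class token

model = ViTSiteClassifier()
logits = model(torch.randn(2, 33, 1024))  # two dummy embedding windows
print(logits.shape)                       # torch.Size([2, 1])
```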

Results: Human and mouse ROC AUCs for predicting succinylation modifications were 0.793 and 0.661, respectively, under 10-fold cross-validation. Similarly, we obtained ROC AUC scores of 0.776, 0.764, and 0.734 for inferring ubiquitination, crotonylation, and glycation sites, respectively. According to detailed computational experiments, DEEPPTM reduces the time spent on laboratory experiments while outperforming competing methods and baselines in inferring all four modification types. In our experiments, attention-based deep learning methods such as vision transformers learned from ProtBERT features more effectively than more traditional deep learning and machine learning techniques.
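
The evaluation protocol described above (10-fold cross-validation scored by ROC AUC) can be outlined as follows; the classifier and data here are simple placeholders, not DEEPPTM or its datasets:

```python
# Hedged sketch: 10-fold cross-validated ROC AUC on placeholder site features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X = np.random.randn(200, 64)           # placeholder per-site feature vectors
y = np.random.randint(0, 2, size=200)  # 1 = modified site, 0 = unmodified

aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                           random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))
print(f"mean ROC AUC over 10 folds: {np.mean(aucs):.3f}")
```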

Conclusion: Moreover, the protein-specific ProtBERT model is more effective than the original BERT embeddings for PTM prediction tasks. Our code and datasets can be found at https://github.com/seferlab/deepptm.

About the authors

Necla Soylu

Department of Computer Science, Ozyegin University

Email: info@benthamscience.net

Emre Sefer

Department of Computer Science, Ozyegin University

Author for correspondence.
Email: info@benthamscience.net



Copyright (c) 2024 Bentham Science Publishers