Thorough Assessment of Machine Learning Techniques for Predicting Protein-Nucleic Acid Binding Hot Spots

Xianzhe Zou; Chen Zhang; Mingyan Tang; Lei Deng

doi:10.2174/1574893618666230913090436

Authors: Zou X.¹, Zhang C.¹, Tang M.¹, Deng L.¹
Affiliations:
1. School of Computer Science and Engineering, Central South University
Issue: Vol 19, No 2 (2024)
Pages: 144-161
Section: Life Sciences
URL: https://jdigitaldiagnostics.com/1574-8936/article/view/643793
DOI: https://doi.org/10.2174/1574893618666230913090436
ID: 643793

Abstract

Background: Proteins and nucleic acids are vital biomolecules that contribute significantly to biological life. The precise and efficient identification of hot spots at protein-nucleic acid interfaces is crucial for guiding drug development, advancing protein engineering, and exploring the underlying molecular recognition mechanisms. Because experimental methods such as alanine scanning mutagenesis are time-consuming and expensive, a growing number of machine learning techniques are being employed to predict hot spots. However, existing approaches suffer from a lack of uniform standards, scarce data, and widely varying feature sets. No comprehensive overview or evaluation of this field is currently available, so a systematic review and assessment is needed.

Methods: In this study, we present an overview of state-of-the-art machine learning approaches for hot spot prediction in protein-nucleic acid complexes. We also outline the feature categories currently in use, derived from relevant biological data sources, and assess conventional feature selection methods on 600 extracted features. In addition, we create two new benchmark datasets, PDHS87 and PRHS48, and build separate binary classification models on them to evaluate the advantages and disadvantages of various machine learning techniques.
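As a rough illustration of the labeling and feature selection steps described above, the following sketch turns alanine-scanning measurements into binary labels and ranks features with a simple correlation filter. Everything specific here is an assumption made for illustration: the 1.0 kcal/mol cutoff is a commonly used convention rather than a value stated in this abstract, the feature names and numbers are invented toy data, and the correlation filter is a stand-in that is not necessarily one of the selection methods assessed in the study.

```python
from math import sqrt

# Assumed cutoff: residues whose alanine mutation changes the binding free
# energy by at least 1.0 kcal/mol are labeled hot spots. This is a common
# convention, not a value taken from the abstract.
DDG_CUTOFF = 1.0


def label_hot_spots(ddg_values, cutoff=DDG_CUTOFF):
    """Turn measured ddG values into binary labels (1 = hot spot)."""
    return [1 if ddg >= cutoff else 0 for ddg in ddg_values]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(rows, labels, names):
    """Rank feature names by |correlation| with the label, highest first."""
    scores = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), name))
    return [name for _, name in sorted(scores, key=lambda s: -s[0])]


# Toy data: hypothetical feature names and made-up values.
names = ["rel_asa", "conservation", "noise"]
rows = [
    [0.10, 0.90, 0.5],
    [0.20, 0.80, 0.1],
    [0.70, 0.30, 0.4],
    [0.80, 0.20, 0.2],
    [0.15, 0.85, 0.3],
    [0.75, 0.40, 0.6],
]
ddg = [2.1, 1.4, 0.2, 0.1, 1.8, 0.5]

labels = label_hot_spots(ddg)
ranking = rank_features(rows, labels, names)
top_two = ranking[:2]  # keep the two highest-ranked features
```

A filter of this kind scores each feature independently of any classifier, which keeps it cheap on small datasets; wrapper or embedded methods would instead score features through a trained model.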

Results: Predicting hot spots in protein-nucleic acid interactions is a challenging task. The study demonstrates that structural neighborhood features play a crucial role in identifying hot spots. Prediction performance can be improved by choosing effective feature selection and machine learning methods. Among the existing prediction methods, XGBPRH performs best.
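To make the idea of a structural neighborhood feature concrete, here is a minimal sketch of one common way to build such a feature: averaging a per-residue property over all residues within a fixed distance of the target residue. The function name, the distance cutoff, the choice to exclude the target residue itself, and the toy coordinates are all assumptions for illustration; the abstract does not specify how the neighborhood features in the study are defined.

```python
from math import dist


def neighborhood_feature(coords, values, target, radius=10.0):
    """Mean of `values` over residues within `radius` angstroms of `target`.

    The target residue itself is excluded; 0.0 is returned if it has no
    neighbors. The default radius is a hypothetical choice.
    """
    center = coords[target]
    neighbors = [
        values[i]
        for i, xyz in enumerate(coords)
        if i != target and dist(center, xyz) <= radius
    ]
    return sum(neighbors) / len(neighbors) if neighbors else 0.0


# Toy example: one representative 3D coordinate per residue (for instance a
# C-alpha atom) and a made-up per-residue property such as relative
# solvent accessibility.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (20.0, 0.0, 0.0)]
values = [0.2, 0.6, 0.4, 0.9]

near = neighborhood_feature(coords, values, target=0, radius=5.0)
isolated = neighborhood_feature(coords, values, target=3, radius=5.0)
```

In this toy case, residue 0 has two neighbors within 5 angstroms (residues 1 and 2), while residue 3 has none, so its feature falls back to 0.0.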

Conclusion: It is crucial to continue studying hot spot theories, discover new and effective features, add accurate experimental data, and make use of DNA/RNA information. Semi-supervised learning, transfer learning, and ensemble learning can improve predictive ability. Combining computational docking with machine learning methods could further improve predictive performance.

Keywords

Hot spots, protein-nucleic acid interfaces, feature selection, machine learning, protein-DNA complex, protein-RNA complex.
