A Novel Natural Graph for Efficient Clustering of Virus Genome Sequences
- Authors: Song H. (1), Sun N. (2), Yu W. (3), Yau S. (4)
- Affiliations:
  - (1) Walnut High School
  - (2) Department of Mathematical Sciences, Tsinghua University
  - (3) College of Artificial Intelligence, Tianjin University of Science and Technology
  - (4) Department of Mathematical Science, Tsinghua University
- Issue: Vol 19, No 8 (2024)
- Pages: 687-703
- Section: Life Sciences
- URL: https://jdigitaldiagnostics.com/1574-8936/article/view/644011
- DOI: https://doi.org/10.2174/0115748936269106231025064143
- ID: 644011
Abstract
Background: This study addresses the need for analyzing viral genome sequences and understanding their genetic relationships. The focus is on introducing a novel natural graph approach as a solution.
Objective: The objective of this study is to demonstrate the effectiveness and advantages of the proposed natural graph approach in clustering viral genome sequences into distinct clades, subtypes, or districts. Additionally, the aim is to explore its interpretability, potential applications, and implications for pandemic control and public health interventions.
Methods: The study utilizes the proposed natural graph algorithm to cluster viral genome sequences. The results are compared with existing methods and multidimensional scaling to evaluate the performance and effectiveness of the approach.
Results: The natural graph approach successfully clusters viral genome sequences, providing valuable insights into viral evolution and transmission dynamics. The ability to generate directed connections between nodes enhances the interpretability of the results, facilitating the investigation of transmission pathways and viral fitness.
Conclusion: The findings highlight the potential applications of the natural graph algorithm in pandemic control, transmission tracing, and vaccine design. Future research directions may involve scaling up the analysis to larger datasets and incorporating additional genetic features for improved resolution.
The natural graph approach presents a promising tool for viral genomics research with implications for public health interventions.
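The abstract does not spell out how the natural graph is built, so the following is only a minimal sketch of the general idea of clustering sequences through directed connections between nodes. It assumes, purely for illustration, that each sequence is encoded as a k-mer count vector, that every node points to its nearest neighbor under Euclidean distance, and that clusters are the weakly connected components of the resulting directed graph. The function names (`kmer_vector`, `nearest_neighbor_edges`, `weak_components`) and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k=2):
    """Count vector over all 4**k DNA k-mers, in a fixed lexicographic order.

    Assumption: k-mer counts stand in for whatever encoding the paper uses.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[km] for km in kmers]


def nearest_neighbor_edges(vectors):
    """Directed edges (i, j) where j is the closest other vector to i."""
    edges = []
    for i, v in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        nearest = min(others, key=lambda j: math.dist(v, vectors[j]))
        edges.append((i, nearest))
    return edges


def weak_components(n, edges):
    """Group nodes into weakly connected components using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy input (hypothetical): two A-rich and two G-rich sequences.
seqs = ["AAAAAAAAAA", "AAAAAAAAAC", "GGGGGGGGGG", "GGGGGGGGGT"]
vectors = [kmer_vector(s) for s in seqs]
edges = nearest_neighbor_edges(vectors)
clusters = weak_components(len(seqs), edges)
```

On this toy input the two A-rich sequences point at each other, as do the two G-rich ones, so they fall into separate components. Real genome data would call for a more careful choice of encoding and distance, and the paper's actual construction may differ.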
About the authors
Harris Song
Walnut High School
Email: info@benthamscience.net
Nan Sun
Department of Mathematical Sciences, Tsinghua University
Email: info@benthamscience.net
Wenping Yu
College of Artificial Intelligence, Tianjin University of Science and Technology
Author for correspondence.
Email: info@benthamscience.net
Stephen Yau
Department of Mathematical Science, Tsinghua University
Author for correspondence.
Email: info@benthamscience.net
