Clustering of Large Human Protein Sequences Using Spark Based Hierarchical Clustering Approach
Document Type
Article
Publication Date
2-2026
Original Citation
Paul S, Banerjee U, Sengupta K, Singh M, Bandyopadhyay S. Clustering of Large Human Protein Sequences Using Spark Based Hierarchical Clustering Approach. LNNS: Springer Science and Business Media Deutschland GmbH; 2026. p. 261-70.
Keywords
JGM
JAX Source
LNNS: Springer Science and Business Media Deutschland GmbH; 2026. p. 261-70.
DOI
https://doi.org/10.1007/978-981-96-9494-5_22
Abstract
Modern high-throughput sequencing techniques have led to a significant increase in the accumulation of metagenomic sequences, potentially improving large-scale functional annotation. Processing these large and redundant sequence collections has become a significant difficulty for researchers. Clustering based on similarity is a key step in reducing redundancy and analyzing large biological sequence datasets. The n-gram feature representation, often used in sequence clustering and classification, creates high-dimensional input spaces for larger values of n. As the dimension increases, clustering these large sequence sets becomes infeasible. In the current era of big data, frameworks like Apache Spark have revolutionized the field of data analysis, offering unparalleled scalability and efficiency in handling massive datasets. In this paper, we propose a Spark-based hierarchical clustering algorithm to cluster large-scale human protein sequences by exploring n-gram features (n ≤ 3). Visualization through dendrograms enhances interpretability, and detailed cluster analysis, including singleton identification, provides insights into protein groupings. Experimental results on human protein sequence data show an efficient speedup in clustering, with almost 100% non-singleton clusters.
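The n-gram representation and the singleton analysis mentioned in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation: the 20-letter amino acid alphabet, relative-frequency features, Euclidean distance, average linkage, the distance threshold, and the toy sequences are all assumptions chosen only for illustration. It shows how the feature dimension grows as 20^n and how singleton clusters can be counted after a naive agglomerative pass.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: 20 standard residues


def ngram_vocabulary(n):
    """All possible n-grams over the alphabet, in a fixed order (20**n entries)."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]


def ngram_features(sequence, n):
    """Relative frequency of each overlapping n-gram in the sequence."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values()) or 1  # avoid division by zero for short input
    return [counts.get(gram, 0) / total for gram in ngram_vocabulary(n)]


def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def agglomerative(vectors, threshold):
    """Naive average-linkage clustering.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold` (a hypothetical cut-off).
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        pair_sum = sum(euclidean(vectors[i], vectors[j]) for i in a for j in b)
        return pair_sum / (len(a) * len(b))

    while len(clusters) > 1:
        dist, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Feature dimension for n = 1, 2, 3 is 20, 400, 8000.
dimensions = [len(ngram_vocabulary(n)) for n in (1, 2, 3)]

# Toy sequences, invented for this example only.
toy_sequences = [
    "MKTAYIAKQR",
    "MKTAYIAKQK",
    "GGGGSGGGGS",
    "GGGGSGGGGT",
    "WWPPCCHHDD",
]
vectors = [ngram_features(seq, 2) for seq in toy_sequences]
clusters = agglomerative(vectors, threshold=0.3)
singletons = [c for c in clusters if len(c) == 1]
print(dimensions, clusters, len(singletons))
```

The pairwise search above is quadratic in the number of sequences at every merge step, which is the kind of cost that motivates distributing the computation on a framework such as Apache Spark for large datasets.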