Clustering of Large Human Protein Sequences Using Spark Based Hierarchical Clustering Approach

Document Type

Article

Publication Date

2-2026

Keywords

JGM

JAX Source

LNNS: Springer Science and Business Media Deutschland GmbH; 2026. p. 261-70.

DOI

https://doi.org/10.1007/978-981-96-9494-5_22

Abstract

Modern high-throughput sequencing techniques have led to a significant increase in metagenomic sequence accumulation, potentially improving large-scale functional annotation. Processing these large and redundant sequence sets has become a significant difficulty for researchers. Clustering based on similarity is a key step in reducing redundancy and analyzing big biological sequence data. The n-gram feature representation, often used in sequence clustering and classification, creates high-dimensional input spaces for larger values of n. As the dimension increases, clustering these large sequence sets becomes infeasible. In the current era of big data, frameworks like Apache Spark have revolutionized the field of data analysis, offering scalability and efficiency in handling massive datasets. In this paper we propose a Spark-based hierarchical clustering algorithm to cluster large-scale human protein sequences by exploring n-gram features (n ≤ 3). Visualization through dendrograms enhances interpretability, and detailed cluster analysis, including singleton identification, provides insights into protein groupings. Experimental results on human protein sequence data show efficient speedup in clustering and almost 100% non-singleton clusters.
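
To illustrate why the n-gram representation grows quickly with n, the following is a minimal sketch in plain Python, not the paper's Spark implementation. It assumes a 20-letter amino-acid alphabet with no ambiguity codes, so the feature space has 20^n dimensions (20, 400, and 8000 for n = 1, 2, 3). The function names and the toy sequence are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino-acid one-letter codes, no ambiguity codes (X, B, Z).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_counts(sequence, n):
    """Count overlapping n-grams in a protein sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def feature_vector(sequence, n):
    """Map a sequence to a dense count vector over all 20**n possible n-grams."""
    counts = ngram_counts(sequence, n)
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=n))
    return [counts.get(gram, 0) for gram in vocabulary]


if __name__ == "__main__":
    toy = "MKVLAAGMKV"  # hypothetical toy sequence, not from the paper's dataset
    for n in (1, 2, 3):
        vec = feature_vector(toy, n)
        # Vector length is 20**n; the counts sum to len(toy) - n + 1.
        print(n, len(vec), sum(vec))
```

Most entries of such a vector are zero for any single protein, which is one reason a sparse representation is usually preferred at larger n.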

Please contact the Joan Staats Library for information regarding this document.
