Posted on 2025-05-11, 07:42. Authored by Ahmed Shamsul Arefin.
The amount of data in our world has been exploding. Computer-based methods used to analyze data ten years ago are impractical today, because continuously evolving data acquisition technologies produce more raw data than these methods can handle. For instance, today’s high-throughput technologies such as DNA microarrays can produce millions of data elements from a single experiment, whereas most of the relevant analysis tools are designed to work with only a few tens of thousands. Even though the scalability of these methods and tools may be improved by porting their implementations to an expensive supercomputer or a cluster of computers, their existing fully connected data representation model can still impose many other restrictions. In this work, instead of using the traditional distance-matrix-based microarray data analysis model, we propose a novel, fast and scalable κ-Nearest Neighbor (κNN) graph-based approach. Moreover, instead of constructing the graph on an expensive system, we show its construction on graphics processing units (GPUs), which are now widely available as inexpensive, highly parallel devices. The outcome of our κNN graph construction method (termed GPU-FS-κNN) can be used to carry out many other important computational tasks. In particular, we demonstrate its application in two popular data analysis methods: clustering and centrality analysis. To do this, we first propose a fast GPU-based method for constructing minimum spanning trees (MSTs) from the κNN graphs (termed κNN-Borůvka) and a method for partitioning the trees in an agglomerative fashion (termed κNN-Borůvka-Agglomerative). Then, we demonstrate the use of κNN graphs in accelerating and scaling the computation of two degree-based (degree and eigenvector) and three shortest-path-based (closeness, eccentricity and betweenness) centrality metrics.
Finally, we integrate the developed methods and apply them jointly to two publicly available gene-expression data sets (Alzheimer’s disease and breast cancer) and their large-scale artificial expansions. Our investigations show that the proposed integrated approach can find both numerically and biologically significant results. We also demonstrate the method’s application in extracting a robust set of gene markers that may warrant further investigation, owing to their conspicuous positions in our results.
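As a rough illustration of the pipeline the abstract describes (a κNN graph in place of a full distance matrix, followed by a Borůvka-style spanning tree built on that sparse graph), the following is a minimal CPU-only sketch in Python. It is not the thesis's GPU-FS-κNN or κNN-Borůvka implementation: the function names `knn_graph` and `boruvka_forest`, the brute-force neighbour search, and the Euclidean distance are all assumptions made here for illustration.

```python
import heapq
from math import dist  # Euclidean distance, Python 3.8+


def knn_graph(points, k):
    """Hypothetical helper: map each point index to its k nearest (distance, index) pairs.

    Brute force, O(n^2) distance evaluations; only k entries per point are stored.
    """
    graph = {}
    for i, p in enumerate(points):
        candidates = ((dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = heapq.nsmallest(k, candidates)
    return graph


def boruvka_forest(n, graph):
    """Hypothetical helper: Boruvka's algorithm on the undirected kNN graph.

    Returns (u, v, distance) edges of a minimum spanning forest; the result is
    a single tree only if the kNN graph is connected.
    """
    # Treat every directed kNN edge as undirected and drop duplicates.
    edges = sorted({(d, min(i, j), max(i, j))
                    for i, nbrs in graph.items() for d, j in nbrs})
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    while True:
        # For each component, find its cheapest outgoing edge.
        cheapest = {}
        for edge in edges:
            _, u, v = edge
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for root in (ru, rv):
                if root not in cheapest or edge < cheapest[root]:
                    cheapest[root] = edge
        if not cheapest:
            break  # no edge joins two different components any more
        # Merge components along the selected edges.
        for d, u, v in sorted(set(cheapest.values())):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                forest.append((u, v, d))
    return forest


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
    print(boruvka_forest(len(pts), knn_graph(pts, k=2)))
```

In this sketch, a small k can leave the κNN graph disconnected, in which case the result is a forest rather than a single tree. Comparing whole `(distance, u, v)` tuples gives a strict ordering of edges, which is one simple way to keep Borůvka's merging step free of cycles when distances tie.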
Year awarded
2013
Thesis category
Doctoral Degree
Degree
Computer Science
Supervisors
Moscato, Pablo (The University of Newcastle); Berretta, Regina (The University of Newcastle); Riveros, Carlos (The University of Newcastle)
Language
en, English
College/Research Centre
Faculty of Engineering and Built Environment
School
School of Electrical Engineering and Computer Science