Open Research Newcastle

An integrated, fast and scalable approach for large-scale biological network analysis

thesis
posted on 2025-05-11, 07:42 authored by Ahmed Shamsul Arefin
The amount of data in our world has been exploding. Computer-based methods used to analyze data ten years ago are impractical today, because continuously evolving data-acquisition technologies produce more raw data than these methods can handle. For instance, today's high-throughput technologies such as DNA microarrays can produce millions of data elements from a single experiment, whereas most of the relevant analysis tools are designed to work with only a few tens of thousands. Even though the scalability of these methods and tools may be improved by porting their implementations to an expensive supercomputer or a cluster of computers, their existing fully connected data representation model can still impose many other restrictions. In this work, instead of using the traditional distance-matrix-based model for microarray data analysis, we propose a novel, fast and scalable κ-Nearest Neighbor (κNN) graph-based approach. Moreover, instead of constructing the graph/network on an expensive system, we show its construction on graphics processing units (GPUs), which are now widely available as inexpensive, highly parallel devices. The outcome of our κNN graph construction method (termed GPU-FS-κNN) can be used to carry out many other important computational tasks. In particular, we demonstrate its application in two popular data analysis methods: clustering and centrality analysis. To do this, we first propose a fast GPU-based method for constructing minimum spanning trees (MSTs) from the κNN graphs (termed κNN-Borůvka) and a method for partitioning the trees in an agglomerative fashion (termed κNN-Borůvka-Agglomerative). Then, we demonstrate the use of κNN graphs in accelerating and scaling the computation of two degree-based (degree and eigenvector) and three shortest-path-based (closeness, eccentricity and betweenness) centrality metrics.
Finally, we integrate the developed methods and apply them together to two publicly available gene-expression data sets (Alzheimer's disease and breast cancer) and their large-scale artificial expansions. Our investigations show that the proposed integrated approach can find both numerically and biologically significant results. We also demonstrate the method's application in extracting a robust set of gene markers that may warrant further investigation, owing to their conspicuous positions in our results.
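
The pipeline outlined in the abstract (κNN graph, then a Borůvka-style spanning forest, then a degree-based centrality) can be illustrated with a small CPU-only sketch. This is not the thesis's GPU-FS-κNN or κNN-Borůvka implementation: it is a minimal pure-Python illustration under stated assumptions. The function names `knn_graph`, `boruvka_forest` and `degree_centrality` are hypothetical, Euclidean distance is assumed as the dissimilarity measure (the thesis may use another), and the neighbor search is brute force rather than parallel.

```python
from math import dist  # Euclidean distance, Python 3.8+


def knn_graph(points, k):
    """Brute-force kNN graph (assumed Euclidean metric).

    Returns an undirected edge map {(u, v): weight} with u < v; an edge is
    kept if either endpoint lists the other among its k nearest neighbors.
    """
    n = len(points)
    edges = {}
    for u in range(n):
        # Rank every other point by distance to u and keep the k closest.
        ranked = sorted((dist(points[u], points[v]), v) for v in range(n) if v != u)
        for w, v in ranked[:k]:
            edges[(min(u, v), max(u, v))] = w
    return edges


def boruvka_forest(n, edges):
    """Boruvka's algorithm on a sparse graph.

    A kNN graph may be disconnected, so the result is a minimum spanning
    forest: one tree per connected component, as (u, v, weight) tuples.
    """
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    while True:
        cheapest = {}  # component root -> lightest edge leaving that component
        for (u, v), w in edges.items():
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # both endpoints already in the same tree
            key = (w, u, v)  # endpoints break weight ties consistently
            for r in (ru, rv):
                if r not in cheapest or key < cheapest[r]:
                    cheapest[r] = key
        if not cheapest:
            break  # no edge connects two different components any more
        for w, u, v in sorted(set(cheapest.values())):
            ru, rv = find(u), find(v)
            if ru != rv:  # re-check: an earlier merge this round may have joined them
                parent[ru] = rv
                forest.append((u, v, w))
    return forest


def degree_centrality(n, edges):
    """Unnormalized degree of each node in the undirected kNN graph."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


# Two well-separated groups of three points each.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
graph = knn_graph(points, k=2)
forest = boruvka_forest(len(points), graph)
degrees = degree_centrality(len(points), graph)
```

With `k=2` the two groups never link to each other, so the forest contains one tree per group. This loosely mirrors how a sparse κNN graph can expose cluster structure without a full distance matrix, although the brute-force search above still computes all pairwise distances and is only meant to show the data flow.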

History

Year awarded

2013

Thesis category

  • Doctoral Degree

Degree

Computer Science

Supervisors

Moscato, Pablo (The University of Newcastle); Berretta, Regina (The University of Newcastle); Riveros, Carlos (The University of Newcastle)

Language

  • en, English

College/Research Centre

Faculty of Engineering and Built Environment

School

School of Electrical Engineering and Computer Science

Rights statement

Copyright 2013 Ahmed Shamsul Arefin
