Open Research Newcastle
Browse

A new feature selection approach based on proximity graphs and evolutionary computation

thesis
posted on 2025-05-11, 13:16 authored by Amer Abu Zaher
Feature selection is an important step for generating new knowledge from datasets that originate from high-throughput biotechnologies. The requirements posed by highdimensional data arising in several bioinformatics’ algorithmic pipelines require new ideas to be tested. This research aims at reducing the dimensionality of a dataset by a search process which iteratively removing features that do not compromise the classification of a set of samples. To accomplish that, this work present a new type of supervised and multivariate feature selection approach that incorporates the use of proximity graphs. Three proximity graphs are studied in this work: the minimum spanning tree (MST), the K-nearest neighbours graph (KNN) and the relative neighbourhood graph (RNG). The input is assumed to be a matrix where each row represents a feature, and each column, a sample. The class of each sample is also given. The method works by interactively calculating a distance matrix between samples, but this calculation is performed only considering a subset of the features that have been selected. A proximity graph is computed based on the distance matrix. The number of edges in the proximity graph connecting the nodes of different classes, the size of selected features and a score that is calculated from the computed proximity graph are then used in designing four different merit functions to measure the quality of the feature subset were investigated as scores to guide the search process. Evolutionary computation methods are then employed to select the subset of features that minimizes the number of edges of the proximity graph that are connecting nodes of different classes; other score functions have been tested as a comparison. The work also explores the use of different algorithms that are based on evolutionary computation (EC): the steady state genetic algorithm (SSGA), the generational genetic algorithm (GGA) and a memetic algorithm (MA). Four publicly available datasets are used (one also includes a training and test set independent set of samples) for evaluating different algorithms. The classification performance is evaluated using a total of 49 methods from the highly popular open source data mining and machine learning package known as WEKA; this selection helps to the reproducibility of our methods and results. In addition, the performance of the final feature selection algorithm is compared against that one of a panel of nine well-known feature selection methods. The proposed method provides an alternative to other types of feature selection methods and highlights the potential of using proximity graphs as a proxy for quality of discrimination between pairs of labelled samples in high-dimensional spaces.

History

Year awarded

2017.0

Thesis category

  • Doctoral Degree

Degree

Doctor of Philosophy (PhD)

Supervisors

Berretta, Regina (University of Newcastle); Noman, Nasimul (University of Newcastle); Moscato, Pablo (University of Newcastle)

Language

  • en, English

College/Research Centre

Faculty of Engineering and Built Environment

School

School of Electrical Engineering and Computer Science

Rights statement

Copyright 2017 Amer Abu Zaher

Usage metrics

    Theses

    Categories

    No categories selected

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC