posted on 2025-05-11, 13:16authored byAmer Abu Zaher
Feature selection is an important step for generating new knowledge from datasets that originate from high-throughput biotechnologies. The requirements posed by highdimensional data arising in several bioinformatics’ algorithmic pipelines require new ideas to be tested. This research aims at reducing the dimensionality of a dataset by a search process which iteratively removing features that do not compromise the classification of a set of samples. To accomplish that, this work present a new type of supervised and multivariate feature selection approach that incorporates the use of proximity graphs. Three proximity graphs are studied in this work: the minimum spanning tree (MST), the K-nearest neighbours graph (KNN) and the relative neighbourhood graph (RNG). The input is assumed to be a matrix where each row represents a feature, and each column, a sample. The class of each sample is also given. The method works by interactively calculating a distance matrix between samples, but this calculation is performed only considering a subset of the features that have been selected. A proximity graph is computed based on the distance matrix. The number of edges in the proximity graph connecting the nodes of different classes, the size of selected features and a score that is calculated from the computed proximity graph are then used in designing four different merit functions to measure the quality of the feature subset were investigated as scores to guide the search process. Evolutionary computation methods are then employed to select the subset of features that minimizes the number of edges of the proximity graph that are connecting nodes of different classes; other score functions have been tested as a comparison. The work also explores the use of different algorithms that are based on evolutionary computation (EC): the steady state genetic algorithm (SSGA), the generational genetic algorithm (GGA) and a memetic algorithm (MA). Four publicly available datasets are used (one also includes a training and test set independent set of samples) for evaluating different algorithms. The classification performance is evaluated using a total of 49 methods from the highly popular open source data mining and machine learning package known as WEKA; this selection helps to the reproducibility of our methods and results. In addition, the performance of the final feature selection algorithm is compared against that one of a panel of nine well-known feature selection methods. The proposed method provides an alternative to other types of feature selection methods and highlights the potential of using proximity graphs as a proxy for quality of discrimination between pairs of labelled samples in high-dimensional spaces.
History
Year awarded
2017.0
Thesis category
Doctoral Degree
Degree
Doctor of Philosophy (PhD)
Supervisors
Berretta, Regina (University of Newcastle); Noman, Nasimul (University of Newcastle); Moscato, Pablo (University of Newcastle)
Language
en, English
College/Research Centre
Faculty of Engineering and Built Environment
School
School of Electrical Engineering and Computer Science