Open Research Newcastle

Clustering meta-validation and instance spaces

thesis
posted on 2025-10-23, 22:56 authored by Connor Simpson
<p dir="ltr">Clustering is an unsupervised learning technique that relies heavily on internal validation measures to assess the quality of candidate solutions. Selecting appropriate clustering validity indexes is therefore critical to the clustering process. The abundance of available indexes for validation, each with distinct strengths and weaknesses, presents an ongoing challenge in determining the most suitable index for a given task. This thesis aims to address this challenge by providing a comprehensive investigation into the behaviour and performance of internal validity indexes, exploring their relationships with different problem characteristics, clustering algorithms, and external validation metrics.</p><p dir="ltr">Despite the importance of this topic, existing literature on clustering validation remains limited and outdated, particularly in comparison to other areas of machine learning. Current benchmarking efforts often rely on small, simplistic problem sets and prioritise overall rankings of indexes rather than context-specific performance. In contrast, other fields of machine learning, including clustering algorithm selection, have made significant advances through the use of meta-learning approaches that account for algorithm-problem relationships in greater depth.</p><p dir="ltr">To address this gap, this thesis introduces updated and extended benchmarking methodologies in addition to a novel extension to the Instance Space Analysis meta-learning framework. Using these tools, this thesis presents two extensive studies of clustering validity indexes: first, a more traditional benchmark study presenting novel evaluation scenarios, and second, an exploratory analysis using meta-learning techniques. These studies conduct a more nuanced examination of validity index behaviour across a wide variety of clustering scenarios. 
Using these advancements, the performance and behaviour of clustering validity indexes are investigated to provide more accurate data-driven recommendations and to highlight important relationships between clustering indexes and clustering problem meta-features.</p><p dir="ltr">In the first study, this thesis demonstrates that the context in which a validity index is used plays a crucial role in its effectiveness at identifying good solutions. Each aspect of clustering was demonstrated to be important to the performance of clustering algorithms, including how an index was used, with differences found between identifying the best solution and discriminating between good and bad solutions. This work further identified the impact of clustering algorithms on clustering validation and examined how ground-truth properties affect validation. Experiments also revealed important limitations in the use of traditional external validity indexes, which, under some circumstances, failed to accurately reflect solution quality due to their lack of sensitivity to the geometric properties of clustering problems. Notably, this work highlights previously unreported non-linear relationships between internal and external indexes, contributing new fundamental insights to the clustering validation literature.</p><p dir="ltr">The second study provided empirical evidence that each validity index has its own regions of clustering problems where it performs well, with both the choice of clustering algorithm and the dataset affecting the performance of the validity indexes. It was demonstrated that these areas of performance could be identified through measurable meta-features, providing practical guidance. Traditional overall measures were found to overlook important considerations for index performance, as the diversity of problems covered was often not tied to the index with the highest average performance. 
Similarly, an index that performed well overall, or even covered a diverse range of problems, did not always perform well in regions of problems identified as difficult. This reinforced the long-held understanding that no single index is applicable to all situations.</p><p dir="ltr">Together, these studies provide empirical evidence and practical tools for improving clustering validation. By integrating meta-learning approaches and recognising the critical importance of context in index selection, this thesis offers a more nuanced and effective framework for understanding and applying internal clustering validation techniques. This represents a significant step forward in addressing the limitations of conventional evaluation practices, which often overlook the diversity and complexity of real-world clustering scenarios. The contributions of this work extend beyond theoretical insight: they deliver actionable strategies for selecting appropriate indexes based on problem characteristics, thereby enhancing the reliability and interpretability of clustering outcomes in applied settings. As such, this thesis serves not only as a valuable guide for practitioners but also as a foundational reference for advancing future research in this under-explored area of machine learning. By highlighting the intricate relationships between clustering algorithms, data characteristics and validation measures, this work provides strong foundations for more adaptive, context-aware approaches to unsupervised learning.</p>

Year awarded

2025

Thesis category

  • Doctoral Degree

Degree

Doctor of Philosophy (PhD)

Supervisors

Stojanovski, Elizabeth (University of Newcastle); Campello, Ricardo Jose Gabrielli Barreto (University of Southern Denmark)

Language

  • en, English

College/Research Centre

College of Engineering, Science & Environment

School

School of Information and Physical Sciences

Rights statement

Copyright 2025, Connor Simpson
