Open Research Newcastle

Source code plagiarism detection in the presence of pervasive plagiarism-hiding source code modifications

thesis
posted on 2025-05-09, 17:47 authored by Hayden John Cheers
Source code similarity is a well-studied area of software engineering. One notable application in education is the detection of source code plagiarism by undergraduate computing students. Many prior works have proposed automated tools that identify indications of plagiarism in undergraduate programming assignments. However, existing work on the development and evaluation of these tools has three important problems. Firstly, evaluations of source code plagiarism detection tools are commonly not reproducible. Secondly, the tools do not indicate which assignment submissions are suspected of plagiarism. Thirdly, there are no comprehensive studies of how the source code modifications used to hide plagiarism affect these tools. This thesis first addresses these three problems, and then proposes a novel source code plagiarism detection tool that is more robust and accurate than existing tools.
Firstly, evaluations are not reproducible because evaluation data sets are not released and the proposed tools are not made available for reuse. Neither factor can be addressed directly. Instead, this work presents tools for automatically generating evaluation data sets, and a pipeline that automates the evaluation of source code plagiarism detection tools. Together these afford a semi-automatic and reproducible method of evaluation.
Secondly, an approach for identifying assignment submissions suspected of plagiarism is presented. The approach applies clustering to group assignment submissions with similar source code similarity scores. The relations between clustered scores are then analysed to identify groups of submissions that are suspected of plagiarism. This affords a semi-automatic method of suggesting groups of students suspected of plagiarising in their assignment submissions.
Thirdly, an empirical evaluation of source code plagiarism detection tools against pervasive source code modifications representative of undergraduate plagiarisers is presented. The evaluation measures the performance of available tools against a selection of 14 source code transformations and the injection of 4 different types of source code fragment. The results indicate that existing tools are not robust against pervasive plagiarism-hiding source code modifications and, as a result, can suffer from poor accuracy.
Finally, to address the poor robustness and accuracy identified in the empirical evaluation, this work presents the design and evaluation of a novel source code plagiarism detection tool. The tool identifies indications of plagiarism by analysing the runtime behaviour of assignment submissions. This approach is demonstrated to be both more robust and more accurate than currently available source code plagiarism detection tools.
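As a rough illustration of the second contribution (clustering similarity scores to suggest suspicious groups), the sketch below shows one simple way such a step could work. It is not the method used in the thesis, whose details the abstract does not give. The function names, the largest-gap split used as a stand-in for clustering, and the input format (a dictionary of pairwise scores between 0 and 1) are all assumptions made for illustration.

```python
def threshold_from_largest_gap(scores):
    """Split 1-D scores into a low and a high cluster at the widest gap.

    Hypothetical stand-in for a real clustering step: returns the midpoint
    of the widest gap between consecutive distinct scores, or None when
    there are fewer than two distinct scores.
    """
    ordered = sorted(set(scores))
    if len(ordered) < 2:
        return None
    # Each entry is (gap width, lower score, upper score).
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    _, low, high = max(gaps)
    return (low + high) / 2


def suspicious_groups(pair_scores):
    """Group submissions linked by pairs that fall in the high-score cluster.

    pair_scores maps (submission_a, submission_b) tuples to a similarity
    score. Returns a sorted list of sorted submission-name lists.
    """
    threshold = threshold_from_largest_gap(pair_scores.values())
    if threshold is None:
        return []

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two submissions of every pair scored above the threshold.
    for (a, b), score in pair_scores.items():
        if score > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in groups.values())
```

Given six pairwise scores among four submissions where only two pairs score highly, the function returns those two pairs as separate groups. A largest-gap split always produces a high cluster, even when no plagiarism is present, so any flagged group would still need manual review, in keeping with the semi-automatic framing in the abstract.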

Year awarded

2021

Thesis category

  • Doctoral Degree

Degree

Doctor of Philosophy (PhD)

Supervisors

Lin, Yuqing (University of Newcastle); Smith, Shamus (University of Newcastle)

Language

  • en, English

College/Research Centre

College of Engineering, Science and Environment

School

School of Engineering

Rights statement

Copyright 2021 Hayden John Cheers
