Open Research Newcastle

Improving sample efficiency in deep reinforcement learning based control of dynamic systems

thesis
posted on 2025-05-10, 19:47 authored by Chayan Banerjee
Actor-critic (AC) algorithms are a class of model-free deep reinforcement learning (DRL) algorithms that have proven their efficacy in diverse domains. Being model-free, they require many agent-environment interactions, i.e., samples, for policy learning. AC algorithms thus suffer from low sample efficiency, making them unsuitable for many real-world applications where samples are costly or hazardous to obtain. Resolving this issue has been a topic of active research for some time. Approaches to mitigating sample inefficiency are typically classified as on-policy, off-policy, or exploration-boosting. Despite their effectiveness, these approaches suffer from several limitations and constraints. Our research centres on mitigating the low sample efficiency issue: we contribute to the aforementioned classes of traditional mitigation approaches and design new, more efficient algorithms. The thesis presents three works, each of which addresses limitations of one of the three classes of approaches. First, we improve the sample efficiency of an on-policy algorithm by optimizing the training dataset used for the optimal policy network. The optimization comprises a best-episode-only operation, a policy parameter-fitness model, and a genetic algorithm module. Next, we introduce crucial modifications to boost the performance of an off-policy AC algorithm. The resulting algorithm features a novel prioritization scheme for selecting better samples from the experience replay buffer, and it trains the policy and value function networks on a mixture of the prioritized off-policy data and the latest on-policy data. Finally, regarding the exploration-boosting approach, we propose a new algorithm that boosts exploration through an intrinsic reward based on a measure of a state's novelty and the associated benefit of exploring that state (with regard to policy optimization), together called plausible novelty.
This algorithm can be paired with any off-policy AC algorithm to improve its sample efficiency. All proposed algorithms were extensively evaluated on the OpenAI Gym platform's benchmark environments, where they performed substantially better than their conventional counterparts and successfully improved sample efficiency.
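The mixed-batch idea in the second contribution — combining prioritized off-policy replay samples with the latest on-policy data — can be sketched roughly as follows. This is an illustrative sketch only: the buffer class, the TD-error-magnitude priority rule, and the 50/50 mixing ratio are assumptions for demonstration, not the thesis's actual scheme.

```python
import random
from collections import deque

class MixedReplayBuffer:
    """Toy replay buffer mixing prioritized off-policy samples
    with the most recent (approximately on-policy) transitions."""

    def __init__(self, capacity=10000, recent_k=64):
        self.buffer = deque(maxlen=capacity)  # holds (transition, priority)
        self.recent_k = recent_k              # size of the "latest data" slice

    def add(self, transition, td_error):
        # Stand-in priority: |TD error| plus a small constant so that
        # every transition keeps a nonzero sampling probability.
        self.buffer.append((transition, abs(td_error) + 1e-6))

    def sample(self, batch_size, on_policy_frac=0.5):
        """Return a batch mixing recent on-policy data with
        priority-weighted off-policy data."""
        n_on = int(batch_size * on_policy_frac)
        n_off = batch_size - n_on
        # Uniformly sample from the most recent transitions.
        recent = [t for t, _ in list(self.buffer)[-self.recent_k:]]
        on_batch = random.sample(recent, min(n_on, len(recent)))
        # Priority-weighted sampling (with replacement) from the full buffer.
        transitions = [t for t, _ in self.buffer]
        weights = [p for _, p in self.buffer]
        off_batch = random.choices(transitions, weights=weights, k=n_off)
        return on_batch + off_batch
```

In a full implementation the priorities would be updated after each gradient step, and importance-sampling weights would correct for the non-uniform sampling; both are omitted here for brevity.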

History

Year awarded

2023

Thesis category

  • Doctoral Degree

Degree

Doctor of Philosophy (PhD)

Supervisors

Chen, Zhiyong (University of Newcastle); Noman, Nasimul (University of Newcastle)

Language

  • English

College/Research Centre

College of Engineering, Science and Environment

School

School of Engineering

Rights statement

Copyright 2023 Chayan Banerjee
