posted on 2025-05-10, 19:47, authored by Chayan Banerjee
Actor-critic (AC) algorithms are a class of model-free deep reinforcement learning (DRL) algorithms that have proven effective in diverse domains. Being model-free, they require many agent-environment interactions (samples) to learn a policy. AC algorithms thus suffer from low sample efficiency, which makes them unsuitable for many real-world applications where samples are costly or hazardous to obtain.
Resolving this issue has long been a topic of active research. Approaches to mitigating sample inefficiency typically fall into three classes: on-policy, off-policy, and exploration-boosting. Despite their effectiveness, these approaches suffer from several limitations and constraints. Our research centers on mitigating low sample efficiency: we contribute to each of these classes of traditional mitigation approaches and design new, more efficient algorithms. The thesis presents three works, each of which addresses certain limitations of one of the three classes. First, we improve the sample efficiency of an on-policy algorithm by optimizing the training dataset meant for the optimal policy network. The optimization comprises a best-episode-only operation, a policy parameter-fitness model, and a genetic algorithm module. Next, we introduce crucial modifications to boost the performance of an off-policy AC algorithm. The resulting algorithm features a novel prioritization scheme for selecting better samples from the experience replay buffer. It also trains the policy and value function networks on a mixture of the prioritized off-policy data and the latest on-policy data. Finally, for the exploration-boosting approach, we propose a new algorithm that boosts exploration through an intrinsic reward based on a measure of a state's novelty and the associated benefit of exploring that state (with regard to policy optimization), together called plausible novelty. The algorithm can be paired with any off-policy AC algorithm to improve sample efficiency. All algorithms were extensively evaluated on benchmark environments from the OpenAI Gym platform. All the proposed algorithms performed substantially better than their conventional counterparts and improved sample efficiency.
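To make the idea of training on a mixture of prioritized off-policy data and recent on-policy data concrete, the following is a minimal illustrative sketch only. The abstract does not specify the thesis's prioritization scheme or mixing ratio, so everything here is an assumption: the function name `sample_mixed_batch`, the `on_policy_fraction` parameter, and the use of simple priority-proportional sampling are all hypothetical stand-ins.

```python
import random


def sample_mixed_batch(replay_buffer, priorities, on_policy_data,
                       batch_size, on_policy_fraction=0.25, rng=None):
    """Build one training batch from two sources (hypothetical sketch).

    replay_buffer      : list of stored off-policy transitions
    priorities         : non-negative weights, one per buffer entry
                         (assumed given; not the thesis's scheme)
    on_policy_data     : transitions from the latest policy, oldest first
    on_policy_fraction : assumed share of the batch taken on-policy
    """
    rng = rng or random.Random()

    # Number of on-policy transitions, capped by what is available.
    n_on = min(len(on_policy_data), int(batch_size * on_policy_fraction))
    n_off = batch_size - n_on

    # Off-policy part: sample with replacement, proportional to priority.
    off_part = rng.choices(replay_buffer, weights=priorities, k=n_off) if n_off else []

    # On-policy part: the most recent n_on transitions.
    on_part = list(on_policy_data[-n_on:]) if n_on else []

    return off_part + on_part
```

A call such as `sample_mixed_batch(buffer, priorities, latest_rollout, batch_size=64)` would return 48 priority-sampled replay transitions followed by the 16 most recent on-policy transitions, assuming the rollout holds at least 16 entries.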
History
Year awarded
2023
Thesis category
Doctoral Degree
Degree
Doctor of Philosophy (PhD)
Supervisors
Chen, Zhiyong (University of Newcastle); Noman, Nasimul (University of Newcastle)