Optimizing large language model utilization through scheduling strategies
Large Language Models (LLMs) have garnered significant attention within Machine-Learning-as-a-Service (MLaaS) offerings due to their remarkable capabilities. As the variety of available LLMs grows, users face the challenge of selecting the LLMs that best balance cost and performance for their needs.
This thesis investigates the cost-effective allocation of jobs to LLMs, aiming to simultaneously increase the percentage of correctly processed jobs and reduce costs. The study begins with an empirical exploration of the potential of scheduling optimization to improve the performance and cost of LLM utilization. Since correctness cannot be determined until an LLM's output is received, we employ a method that combines prediction and optimization: based on predicted accuracy and cost, search-based algorithms select the most suitable LLM for each job, as sketched below. The results show that while scheduling demonstrates significant potential to improve performance and cost efficiency, improvements are needed in both prediction accuracy and the search algorithms.
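For illustration, the following is a minimal predict-then-optimize sketch in Python. The Candidate fields, the stub predictor, and the weighted scoring rule are assumptions made for exposition, not the thesis's exact algorithm.

```python
# Minimal predict-then-optimize sketch. The Candidate fields, the stub
# predictor, and the weighted scoring rule are illustrative assumptions,
# not the thesis's exact algorithm.
from collections.abc import Callable
from dataclasses import dataclass

@dataclass
class Candidate:
    llm: str
    predicted_accuracy: float  # estimated probability the job is processed correctly
    predicted_cost: float      # estimated invocation cost (e.g., USD per call)

def allocate(jobs: list[str],
             predict: Callable[[str], list[Candidate]],
             tradeoff: float = 0.5) -> dict[str, Candidate]:
    """Greedily assign each job to the LLM with the best predicted
    accuracy/cost trade-off (higher score is better)."""
    schedule: dict[str, Candidate] = {}
    for job in jobs:
        candidates = predict(job)  # one Candidate per available LLM
        schedule[job] = max(
            candidates,
            key=lambda c: tradeoff * c.predicted_accuracy
                          - (1.0 - tradeoff) * c.predicted_cost,
        )
    return schedule

# Stub predictor standing in for the learned prediction model.
def stub_predict(job: str) -> list[Candidate]:
    return [Candidate("small-model", 0.72, 0.001),
            Candidate("large-model", 0.91, 0.030)]

print(allocate(["summarize log #1"], stub_predict, tradeoff=0.7))
```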
To address these challenges, we propose OptLLM. OptLLM operates in two modes: optimizing a single objective (either accuracy or cost), or generating a set of non-dominated solutions that strike a good balance between the two. It predicts the performance of candidate LLMs for each job using a multi-label classification model with uncertainty estimation, and iteratively refines the allocation schedule through destruction and reconstruction. Although OptLLM can produce efficient schedules, collecting training data for its prediction module is costly, particularly with diverse task types and multiple available LLMs: creating training data often requires submitting the same job to all candidate LLMs, incurring substantial computational and financial costs. We therefore propose CPLS, which adapts training data from one task to another via transfer learning, improving the practicality of the prediction model in real-world scenarios. Despite these benefits, both OptLLM and CPLS are static frameworks with predictive scheduling, which may not fully adapt to dynamic real-world conditions; moreover, they consider only invocation costs while overlooking uncertain generation costs. To address these limitations, we further propose SLM, a dynamic optimization framework. SLM incorporates an Adaptive Cache Manager, a Performance-Cost Optimized Scheduler, and a Dynamic Update Manager to achieve dynamic optimization through periodic prediction and optimization. By leveraging real-world feedback, SLM updates the cache and retrains the prediction model, ensuring continuous improvement. The sketches below illustrate the destruction-and-reconstruction refinement and a periodic update loop.
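As a rough illustration of destruction and reconstruction, the following is a generic ruin-and-recreate loop. The destroy fraction, the improve-only acceptance rule, and the rebuild/evaluate interfaces are assumptions for illustration, not OptLLM's exact procedure.

```python
# Generic ruin-and-recreate refinement loop. The destroy fraction, the
# improve-only acceptance rule, and the rebuild/evaluate interfaces are
# illustrative assumptions, not OptLLM's exact procedure.
import random

def refine(schedule: dict, rebuild, evaluate,
           iterations: int = 200, destroy_fraction: float = 0.2,
           seed: int = 0) -> dict:
    """schedule maps job -> LLM; rebuild(job, partial) proposes an LLM for a
    freed job; evaluate(schedule) returns a score to maximize."""
    rng = random.Random(seed)
    best, best_score = dict(schedule), evaluate(schedule)
    for _ in range(iterations):
        trial = dict(best)
        # Destruction: free a random subset of job assignments.
        k = max(1, int(destroy_fraction * len(trial)))
        for job in rng.sample(sorted(trial), k):
            del trial[job]
        # Reconstruction: reassign each freed job given the partial schedule.
        for job in best.keys() - trial.keys():
            trial[job] = rebuild(job, trial)
        score = evaluate(trial)
        if score > best_score:  # keep only improving moves
            best, best_score = trial, score
    return best
```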
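Similarly, here is a hedged sketch of a periodic serve-then-update loop in the spirit of SLM's dynamic design. The cache, predictor, and schedule_batch interfaces are illustrative assumptions, not the thesis's actual components.

```python
# Hedged sketch of a periodic serve-then-update loop in the spirit of SLM;
# the cache, predictor, and schedule_batch interfaces are illustrative
# assumptions, not the thesis's actual components.
def serve(job_batches, cache, predictor, schedule_batch,
          update_every: int = 100):
    """Answer cache hits directly, schedule misses with the current
    predictor, and periodically fold observed outcomes back in."""
    feedback, seen = [], 0
    for batch in job_batches:
        misses = []
        for job in batch:
            output = cache.lookup(job)
            if output is not None:
                yield job, output            # served from cache, no LLM cost
            else:
                misses.append(job)
        # Predict-then-optimize over the current batch of cache misses.
        for job, llm, output in schedule_batch(misses, predictor):
            cache.store(job, output)
            feedback.append((job, llm, output))
            yield job, output
        seen += len(batch)
        if seen >= update_every and feedback:
            predictor.retrain(feedback)      # learn from real-world feedback
            feedback.clear()
            seen = 0
```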
In summary, this thesis presents comprehensive approaches to LLM allocation that enhance performance and reduce costs. Through extensive experiments on various LLM-based tasks, we validate the effectiveness of the proposed methods, demonstrating their potential to address both static and dynamic optimization challenges in LLM utilization.
History
Year awarded: 2025
Thesis category: Doctoral Degree
Degree: Doctor of Philosophy (PhD)
Supervisors: Hongyu Zhang (University of Newcastle); Sky Miao (University of Newcastle)
Language: English
College/Research Centre: College of Engineering, Science & Environment
School: School of Information and Physical Sciences
Open access: Open Access