Published on May 23, 2024

Cost Reduction Methods for Running LLMs


The report provides a detailed overview of the methods and tools that organizations using Large Language Model (LLM) APIs can use to balance usage costs against inference performance.

Large Language Models (LLMs) have transformed natural language processing and have become essential systems for improving business operations and decision-making. However, as their usage grows, so do the associated costs. Optimizing expenditure while maintaining performance is therefore critical to sustainable LLM use. According to AIM and other media reports, it costs OpenAI about $700,000 per day to run ChatGPT.

Since Stanford University researchers published a paper on LLM cost reduction in May 2023, the cost reduction strategy it proposed, FrugalGPT, has been widely discussed in the Generative AI community. Drawing on FrugalGPT's concepts and on resources from major organizations such as Microsoft, Databricks, and Google, this report provides an in-depth guide to cost reduction methods and tools specifically for running LLMs.

The report covers a range of strategies, from prompt compression to innovative hosting solutions, that make LLM usage more financially viable. By implementing these strategies, LLM application developers and organizations can balance the inference performance of LLMs against their economic impact.

Key Findings:

1. Cost and Performance Trade-offs:

Reducing prompt size to cut costs can lower the accuracy of results.
Therefore, understanding the trade-off between cost and performance metrics such as accuracy, latency, and cold start time is critical.

2. Organizations focus on Token Optimization to reduce LLM costs:

As techniques such as chain-of-thought (CoT) prompting and in-context learning (ICL) evolve, the prompts sent to LLMs are growing longer, sometimes exceeding tens of thousands of tokens.
Because LLM APIs typically charge per token, the input cost of a query grows linearly with prompt size, so shortening lengthy prompts directly reduces that cost.

3. Prompt Compression, Model Routing / Cascade, LLM Caching, Optimizing Server Utilization, and Cost Monitoring and Analysis are identified as the key methods for reducing the cost of running LLMs.
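To make findings 2 and 3 more concrete, here is a minimal Python sketch under loudly stated assumptions. Input cost is modeled as token count times a per-token price, so it is linear in prompt length; a whitespace split stands in for a real tokenizer; the per-token prices are hypothetical cost units, not any provider's rates; and the two "models" are hard-coded toy functions rather than real LLM APIs. The class `CachedCascade`, the scorer, and the threshold are likewise hypothetical names. The sketch combines two of the listed methods, an exact-match LLM cache and a cheapest-first model cascade loosely in the spirit of FrugalGPT, and is not the report's or FrugalGPT's implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a "model" maps a prompt to an answer; a scorer maps
# (prompt, answer) to a confidence in [0, 1].
Model = Callable[[str], str]
Scorer = Callable[[str, str], float]


def count_tokens(text: str) -> int:
    # ASSUMPTION: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def prompt_cost(prompt: str, price_per_token: float) -> float:
    # Input cost is linear in prompt length: halving the tokens halves the cost.
    return count_tokens(prompt) * price_per_token


class CachedCascade:
    """Hypothetical exact-match cache in front of a cheapest-first model cascade."""

    def __init__(self, models: List[Tuple[str, Model, float]], scorer: Scorer, threshold: float):
        self.models = models          # (name, model, price per token), cheapest first
        self.scorer = scorer          # judges whether a cheaper model's answer is good enough
        self.threshold = threshold    # minimum score needed to stop escalating
        self.cache: Dict[str, str] = {}
        self.total_cost = 0.0
        self.answered_by = ""

    def answer(self, prompt: str) -> str:
        # LLM caching: an exact repeat of an earlier prompt costs nothing.
        if prompt in self.cache:
            self.answered_by = "cache"
            return self.cache[prompt]
        # Model cascade: call models from cheapest to most expensive and stop at the
        # first answer the scorer accepts; the last model's answer is always kept.
        answer = ""
        for i, (name, model, price) in enumerate(self.models):
            answer = model(prompt)
            self.total_cost += prompt_cost(prompt, price)
            self.answered_by = name
            if i == len(self.models) - 1 or self.scorer(prompt, answer) >= self.threshold:
                break
        self.cache[prompt] = answer
        return answer


# Toy stand-ins for a small (cheap) and a large (expensive) LLM.
def small_model(prompt: str) -> str:
    return {"What is 2 + 2?": "4"}.get(prompt, "I don't know")


def large_model(prompt: str) -> str:
    answers = {"What is 2 + 2?": "4", "What is the capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


def toy_scorer(prompt: str, answer: str) -> float:
    # Toy rule: distrust the small model only when it admits it does not know.
    return 0.0 if answer == "I don't know" else 1.0


if __name__ == "__main__":
    cascade = CachedCascade(
        models=[("small", small_model, 1.0), ("large", large_model, 30.0)],
        scorer=toy_scorer,
        threshold=0.5,
    )
    queries = ["What is 2 + 2?", "What is the capital of France?", "What is the capital of France?"]
    for q in queries:
        print(q, "->", cascade.answer(q), f"({cascade.answered_by})")
    print("Total input cost (units):", cascade.total_cost)
```

With these toy numbers the first query (5 tokens) is settled by the small model for 5.0 units, the second (6 tokens) escalates to the large model for 6.0 + 180.0 units, and the repeat is served from the cache, for a total of 191.0 units; sending all three queries straight to the large model would cost 510.0 units. In practice the scorer is the hard part, and a poor one trades accuracy for savings, which is the trade-off described in finding 1.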

Table of Contents:

Executive Summary
Research Methodology
Key Findings
Cost Reduction Methods and Tools for Running LLMs
The Potential of Different Cost Reduction Methods
Introduction to Cost Reduction Methods for LLMs
Deep Dive into Cost Reduction Methods and Tools
- Prompt Compression
- LLM Caching
- Model Routing / Cascade
- Optimizing Server Utilization (Serverless LLM Hosting, Continuous Batching)
- Cost Monitoring and Analysis
Implementation of Cost Reduction Methods and Tools
Conclusion