Engineering

Token Cost Optimization: Reducing Your LLM API Bill by 80%


Pablo Inigo · Founder & Engineer · 6 min read
Chart showing a dramatic decrease in LLM API costs after token optimization implementation.

Large Language Models (LLMs) are revolutionizing how we build applications, but their API costs can quickly spiral out of control. We’ve been wrestling with this challenge at MisuJob, where our AI-powered job matching processes 1M+ job listings to connect professionals with the right opportunities. Through rigorous experimentation and optimization, we’ve managed to slash our LLM API expenses by 80% – and we’re sharing our strategies to help you do the same.

The Token Cost Problem: A Real-World Example

LLMs charge based on “tokens,” which are roughly equivalent to words or parts of words. The more tokens you send in your prompt and receive in the response, the higher the cost. Consider a scenario where we’re using an LLM to extract key skills from job descriptions aggregated from multiple sources.

Initially, we naively fed the entire job description into the LLM. A typical job description might be 2000 words (around 2500 tokens). At current LLM pricing (e.g., GPT-4), this could cost several cents per job description. When processing 1M+ listings, these costs become unsustainable.
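Back-of-the-envelope arithmetic makes the problem concrete. The prices below are illustrative assumptions for this sketch, not current vendor rates:

```python
# Rough cost estimate for naively sending full job descriptions to the LLM.
# PRICE_PER_1K_INPUT_TOKENS is an assumed figure for illustration only.
PRICE_PER_1K_INPUT_TOKENS = 0.03   # assumed USD price per 1K input tokens
TOKENS_PER_DESCRIPTION = 2500      # ~2000-word description
LISTINGS = 1_000_000

cost_per_listing = TOKENS_PER_DESCRIPTION / 1000 * PRICE_PER_1K_INPUT_TOKENS
total = cost_per_listing * LISTINGS
print(f"{cost_per_listing:.3f} USD per listing, {total:,.0f} USD total")
# → 0.075 USD per listing, 75,000 USD total
```

Even at a few cents per listing, the bill reaches tens of thousands of dollars per full pass over the corpus.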

We quickly realized we needed to optimize. Here’s how we tackled the problem.

Strategy 1: Prompt Engineering for Brevity

The most impactful change we made was refining our prompts. A poorly crafted prompt can lead to longer responses and unnecessary processing.

From Verbose to Concise: An Iterative Approach

Our initial prompt looked something like this:

Extract all the skills mentioned in the following job description.  List them as comma-separated values.  Also, provide a brief (one sentence) explanation of why each skill is important for the role.  The job description is: [JOB_DESCRIPTION]

This prompt is overly verbose and asks for explanations, which significantly increases the response length. We replaced it with a much simpler prompt:

Extract the key skills from the following job description. List them as comma-separated values. Job Description: [JOB_DESCRIPTION]

This simple change dramatically reduced the output token count. We further refined the prompt by adding constraints:

  • Limit the number of skills: “Extract a maximum of 5 key skills…”
  • Specify the output format: “Output a JSON array of skills…”
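Putting both constraints together, the refined prompt might be assembled like this (a sketch; the exact wording is an illustrative choice, not our production template):

```python
def build_prompt(job_description: str) -> str:
    # Combine both constraints: cap the skill count and force JSON output.
    return (
        "Extract a maximum of 5 key skills from the following job description. "
        "Output a JSON array of skill strings and nothing else. "
        f"Job Description: {job_description}"
    )
```

Capping the skill count bounds the output token spend per call, which makes per-listing costs predictable at scale.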

The Power of Structured Output

Forcing the LLM to output data in a structured format like JSON is crucial. It not only simplifies parsing but also limits the LLM’s freedom to generate verbose responses.

import json

def extract_skills(llm_response):
    """Parse a JSON array of skills from the LLM response."""
    try:
        return json.loads(llm_response)
    except json.JSONDecodeError:
        print("Error decoding JSON. Check the LLM response.")
        return []

Strategy 2: Pre-processing and Context Reduction

Before sending data to the LLM, we aggressively pre-process it to remove irrelevant information. The less text you send, the lower the cost.

Removing Boilerplate and Unnecessary Text

Many job descriptions contain boilerplate text, legal disclaimers, and company overviews that are irrelevant for skill extraction. We developed a set of regular expressions and keyword filters to remove this noise.

import re

def remove_boilerplate(text):
    """Strip boilerplate that carries no skill information.

    Note: without re.DOTALL, each pattern matches within a single line only.
    """
    # Remove legal disclaimers
    text = re.sub(r"©.*All rights reserved.*", "", text)
    # Remove application instructions
    text = re.sub(r"To apply, please visit.*", "", text)
    # Remove company overview headers (keyword-based)
    text = re.sub(r"(About the company:|Company mission:).*", "", text, flags=re.IGNORECASE)
    return text

By removing this unnecessary text, we reduced the average job description length by 30%, leading to a direct reduction in token cost.
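To verify savings like this, a rough token estimate is enough. The four-characters-per-token heuristic below is a common rule of thumb we assume here for illustration; a real tokenizer such as tiktoken gives exact counts:

```python
def estimate_tokens(text: str) -> int:
    # Crude approximation: ~4 characters per token for English prose.
    return max(1, len(text) // 4)

def savings_pct(original: str, cleaned: str) -> float:
    # Percentage of estimated tokens removed by pre-processing.
    before, after = estimate_tokens(original), estimate_tokens(cleaned)
    return 100.0 * (before - after) / before
```

Running this over a sample of raw and cleaned descriptions gives a quick sanity check before committing a filter to the pipeline.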

Selective Data Input: The Chunking Approach

Instead of sending the entire job description at once, we experimented with breaking it into smaller chunks. We focused on sections most likely to contain skills, such as the “Responsibilities” and “Requirements” sections.

We developed a heuristic-based algorithm to identify these key sections. This approach significantly reduced the input token count without sacrificing accuracy.
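One way to sketch such a heuristic: treat lines ending in a colon as section headers and keep only the sections whose names suggest skills. The section names below are assumptions for illustration; real postings vary widely in structure:

```python
# Headers whose sections are likely to contain skills (illustrative set).
KEY_SECTIONS = {"responsibilities", "requirements", "qualifications"}

def extract_key_sections(text: str) -> str:
    keep, keeping = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):  # treat "Header:" lines as section starts
            keeping = stripped[:-1].strip().lower() in KEY_SECTIONS
        elif keeping and stripped:
            keep.append(stripped)
    return "\n".join(keep)
```

Everything outside the kept sections (company blurbs, benefits, legal text) never reaches the LLM at all.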

Strategy 3: Model Selection and Fine-tuning

Choosing the right LLM is critical. More powerful models like GPT-4 offer better accuracy but come at a higher price. For many tasks, a smaller, less expensive model like GPT-3.5 or a fine-tuned open-source model can provide sufficient performance.

Benchmark Your Options

We rigorously benchmarked different LLMs on our skill extraction task. We measured both accuracy (precision and recall) and token cost. This allowed us to identify the most cost-effective model for our specific use case.

We found that while GPT-4 provided slightly better accuracy, GPT-3.5 Turbo offered a significantly better price-performance ratio. For our use case, the small accuracy difference didn’t justify the higher cost.
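Scoring each model needs only a small gold-labeled set of descriptions; per-description precision and recall over skill sets can be computed like this:

```python
def precision_recall(predicted: set, gold: set):
    # Precision: fraction of predicted skills that are correct.
    # Recall: fraction of gold skills that were found.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

Averaging these over the benchmark set, alongside the token cost per call, gives the price-performance comparison directly.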

The Potential of Fine-tuning

Fine-tuning an open-source LLM on a dataset of job descriptions and skills can further improve performance and reduce costs. This involves training a smaller model on your specific task, allowing it to learn the nuances of your data.

We are currently exploring fine-tuning options using models like Llama 2. This requires a significant investment in data preparation and training infrastructure but can offer substantial long-term cost savings.

Strategy 4: Caching and Deduplication

Avoid redundant API calls by implementing caching and deduplication strategies. If you’ve already processed a job description, store the results in a cache and reuse them when the same description appears again.

Implementing a Cache Layer

We implemented a cache layer using Redis to store the results of our skill extraction process. We used the job description’s hash as the cache key.

import redis
import hashlib
import json

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def get_skills_from_cache(job_description):
    job_hash = hashlib.sha256(job_description.encode('utf-8')).hexdigest()
    cached_skills = redis_client.get(job_hash)
    if cached_skills:
        return json.loads(cached_skills.decode('utf-8'))
    return None

def store_skills_in_cache(job_description, skills):
    job_hash = hashlib.sha256(job_description.encode('utf-8')).hexdigest()
    redis_client.set(job_hash, json.dumps(skills))

This simple caching mechanism significantly reduced the number of API calls, especially for listings that reappear across crawls.
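Tying the helpers together, a cache-aware wrapper pays for the API only on a miss. This sketch uses an in-memory dict standing in for Redis, and `call_llm` is a placeholder for the actual API call:

```python
import hashlib
import json

_cache = {}  # in-memory stand-in for the Redis client shown above

def extract_skills_cached(job_description: str, call_llm) -> list:
    # Hash the description so byte-identical duplicates share one key.
    key = hashlib.sha256(job_description.encode("utf-8")).hexdigest()
    if key in _cache:
        return json.loads(_cache[key])
    skills = call_llm(job_description)  # only pay for a cache miss
    _cache[key] = json.dumps(skills)
    return skills
```

The same structure works unchanged against Redis: swap the dict access for `get`/`set` on the client.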

Deduplication at Scale

MisuJob aggregates from multiple sources, so duplicate job postings are common. Before sending a job description to the LLM, we compare it to existing descriptions in our database. If we find a near-duplicate, we reuse the previously extracted skills.
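Exact hashing misses postings that differ by only a few words; shingle-based Jaccard similarity is one simple way to catch near-duplicates. The 3-word shingle size and 0.9 threshold below are illustrative choices, not our tuned production values:

```python
def shingles(text: str, n: int = 3) -> set:
    # Sets of n-word windows; shared windows indicate shared phrasing.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return a == b
    jaccard = len(sa & sb) / len(sa | sb)
    return jaccard >= threshold
```

Pairwise comparison does not scale to millions of documents; techniques like MinHash/LSH approximate the same similarity in sub-linear time.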

The Results: An 80% Reduction in Costs

By implementing these strategies, we achieved a remarkable 80% reduction in our LLM API costs. This translates to significant savings, allowing us to scale our AI-powered job matching without breaking the bank.

Here’s a breakdown of the cost reduction:

Strategy                      Cost Reduction (%)
Prompt Engineering            30%
Pre-processing                20%
Model Selection               15%
Caching and Deduplication     15%
Total                         80%

Salary Data: Powered by Optimized LLM Processing

Our ability to efficiently process large volumes of job data allows us to provide valuable salary insights to our users. Here’s a sample of salary ranges for Software Engineers in various European countries:

Country           Average Salary (€)    Salary Range (€)
Germany           65,000                50,000 - 85,000
United Kingdom    60,000                45,000 - 80,000
Netherlands       62,000                48,000 - 82,000
France            55,000                42,000 - 70,000
Switzerland       90,000                70,000 - 120,000
Spain             40,000                30,000 - 55,000
Sweden            58,000                45,000 - 75,000

Note: These are average salaries and can vary based on experience, location within the country, and company size.

This data is constantly updated and refined, thanks to our optimized LLM processing pipeline. By extracting relevant information from job descriptions efficiently, we can provide accurate and up-to-date salary information to job seekers across Europe.

Conclusion

Optimizing LLM API costs is crucial for building sustainable AI-powered applications. By focusing on prompt engineering, pre-processing, model selection, and caching, we significantly reduced our expenses without sacrificing accuracy. These strategies are applicable to a wide range of LLM use cases and can help you unlock the full potential of LLMs while staying within your budget.

We at MisuJob are committed to pushing the boundaries of AI-powered job matching, and cost optimization is a key enabler of this mission. We hope these strategies help you on your own LLM journey.

Pablo Inigo

Founder & Engineer

Building MisuJob - an AI-powered job matching platform processing 1M+ job listings daily.
