Engineering

Token Cost Optimization: Reducing Your LLM API Bill by 80%


Pablo Inigo · Founder & Engineer · 6 min read
Chart showing a dramatic decrease in LLM API costs after token optimization implementation.

Large Language Models (LLMs) are revolutionizing how we build applications, but their API costs can quickly spiral out of control. We’ve been wrestling with this challenge at MisuJob, where our AI-powered job matching processes 1M+ job listings to connect professionals with the right opportunities. Through rigorous experimentation and optimization, we’ve managed to slash our LLM API expenses by 80% – and we’re sharing our strategies to help you do the same.

The Token Cost Problem: A Real-World Example

LLMs charge based on “tokens,” which are roughly equivalent to words or parts of words. The more tokens you send in your prompt and receive in the response, the higher the cost. Consider a scenario where we’re using an LLM to extract key skills from job descriptions aggregated from multiple sources.

Initially, we naively fed the entire job description into the LLM. A typical job description might be 2000 words (around 2500 tokens). At current LLM pricing (e.g., GPT-4), this could cost several cents per job description. When processing 1M+ listings, these costs become unsustainable.
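Back-of-the-envelope arithmetic makes the problem concrete. The prices below are illustrative assumptions for this sketch, not current vendor rates:

```python
# Rough cost estimate for naively sending full job descriptions to the LLM.
# PRICE_PER_1K_INPUT_TOKENS is an assumed figure for illustration only.
PRICE_PER_1K_INPUT_TOKENS = 0.03   # assumed USD price per 1K input tokens
TOKENS_PER_DESCRIPTION = 2500      # ~2000-word description
LISTINGS = 1_000_000

cost_per_listing = TOKENS_PER_DESCRIPTION / 1000 * PRICE_PER_1K_INPUT_TOKENS
total = cost_per_listing * LISTINGS
print(f"{cost_per_listing:.3f} USD per listing, {total:,.0f} USD total")
# → 0.075 USD per listing, 75,000 USD total
```

Even at a few cents per listing, the bill reaches tens of thousands of dollars per full pass over the corpus.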

We quickly realized we needed to optimize. Here’s how we tackled the problem.

Strategy 1: Prompt Engineering for Brevity

The most impactful change we made was refining our prompts. A poorly crafted prompt can lead to longer responses and unnecessary processing.

From Verbose to Concise: An Iterative Approach

Our initial prompt looked something like this:

Extract all the skills mentioned in the following job description.  List them as comma-separated values.  Also, provide a brief (one sentence) explanation of why each skill is important for the role.  The job description is: [JOB_DESCRIPTION]

This prompt is overly verbose and asks for explanations, which significantly increases the response length. We replaced it with a much simpler prompt:

Extract the key skills from the following job description. List them as comma-separated values. Job Description: [JOB_DESCRIPTION]

This simple change dramatically reduced the output token count. We further refined the prompt by adding constraints:

  • Limit the number of skills: “Extract a maximum of 5 key skills…”
  • Specify the output format: “Output a JSON array of skills…”
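Putting both constraints together, the refined prompt might be assembled like this (a sketch; the exact wording is an illustrative choice, not our production template):

```python
def build_prompt(job_description: str) -> str:
    # Combine both constraints: cap the skill count and force JSON output.
    return (
        "Extract a maximum of 5 key skills from the following job description. "
        "Output a JSON array of skill strings and nothing else. "
        f"Job Description: {job_description}"
    )
```

Capping the skill count bounds the output token spend per call, which makes per-listing costs predictable at scale.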

The Power of Structured Output

Forcing the LLM to output data in a structured format like JSON is crucial. It not only simplifies parsing but also limits the LLM’s freedom to generate verbose responses.

import json

def extract_skills(llm_response):
    """Parse a JSON array of skills from the LLM response."""
    try:
        return json.loads(llm_response)
    except json.JSONDecodeError:
        print("Error decoding JSON. Check the LLM response.")
        return []

Strategy 2: Pre-processing and Context Reduction

Before sending data to the LLM, we aggressively pre-process it to remove irrelevant information. The less text you send, the lower the cost.

Removing Boilerplate and Unnecessary Text

Many job descriptions contain boilerplate text, legal disclaimers, and company overviews that are irrelevant for skill extraction. We developed a set of regular expressions and keyword filters to remove this noise.

import re

def remove_boilerplate(text):
    """Strip boilerplate that carries no skill information.

    Note: without re.DOTALL, each pattern matches within a single line only.
    """
    # Remove legal disclaimers
    text = re.sub(r"©.*All rights reserved.*", "", text)
    # Remove application instructions
    text = re.sub(r"To apply, please visit.*", "", text)
    # Remove company overview headers (keyword-based)
    text = re.sub(r"(About the company:|Company mission:).*", "", text, flags=re.IGNORECASE)
    return text

By removing this unnecessary text, we reduced the average job description length by 30%, leading to a direct reduction in token cost.
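To verify savings like this, a rough token estimate is enough. The four-characters-per-token heuristic below is a common rule of thumb we assume here for illustration; a real tokenizer such as tiktoken gives exact counts:

```python
def estimate_tokens(text: str) -> int:
    # Crude approximation: ~4 characters per token for English prose.
    return max(1, len(text) // 4)

def savings_pct(original: str, cleaned: str) -> float:
    # Percentage of estimated tokens removed by pre-processing.
    before, after = estimate_tokens(original), estimate_tokens(cleaned)
    return 100.0 * (before - after) / before
```

Running this over a sample of raw and cleaned descriptions gives a quick sanity check before committing a filter to the pipeline.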

Selective Data Input: The Chunking Approach

Instead of sending the entire job description at once, we experimented with breaking it into smaller chunks. We focused on sections most likely to contain skills, such as the “Responsibilities” and “Requirements” sections.

We developed a heuristic-based algorithm to identify these key sections. This approach significantly reduced the input token count without sacrificing accuracy.
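One way to sketch such a heuristic: treat lines ending in a colon as section headers and keep only the sections whose names suggest skills. The section names below are assumptions for illustration; real postings vary widely in structure:

```python
# Headers whose sections are likely to contain skills (illustrative set).
KEY_SECTIONS = {"responsibilities", "requirements", "qualifications"}

def extract_key_sections(text: str) -> str:
    keep, keeping = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):  # treat "Header:" lines as section starts
            keeping = stripped[:-1].strip().lower() in KEY_SECTIONS
        elif keeping and stripped:
            keep.append(stripped)
    return "\n".join(keep)
```

Everything outside the kept sections (company blurbs, benefits, legal text) never reaches the LLM at all.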

Strategy 3: Model Selection and Fine-tuning

Choosing the right LLM is critical. More powerful models like GPT-4 offer better accuracy but come at a higher price. For many tasks, a smaller, less expensive model like GPT-3.5 or a fine-tuned open-source model can provide sufficient performance.

Benchmark Your Options

We rigorously benchmarked different LLMs on our skill extraction task. We measured both accuracy (precision and recall) and token cost. This allowed us to identify the most cost-effective model for our specific use case.

We found that while GPT-4 provided slightly better accuracy, GPT-3.5 Turbo offered a significantly better price-performance ratio. For our use case, the small accuracy difference didn’t justify the higher cost.
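Scoring each model needs only a small gold-labeled set of descriptions; per-description precision and recall over skill sets can be computed like this:

```python
def precision_recall(predicted: set, gold: set):
    # Precision: fraction of predicted skills that are correct.
    # Recall: fraction of gold skills that were found.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

Averaging these over the benchmark set, alongside the token cost per call, gives the price-performance comparison directly.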

The Potential of Fine-tuning

Fine-tuning an open-source LLM on a dataset of job descriptions and skills can further improve performance and reduce costs. This involves training a smaller model on your specific task, allowing it to learn the nuances of your data.

We are currently exploring fine-tuning options using models like Llama 2. This requires a significant investment in data preparation and training infrastructure but can offer substantial long-term cost savings.

Strategy 4: Caching and Deduplication

Avoid redundant API calls by implementing caching and deduplication strategies. If you’ve already processed a job description, store the results in a cache and reuse them when the same description appears again.

Implementing a Cache Layer

We implemented a cache layer using Redis to store the results of our skill extraction process. We used the job description’s hash as the cache key.

import redis
import hashlib
import json

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def get_skills_from_cache(job_description):
    job_hash = hashlib.sha256(job_description.encode('utf-8')).hexdigest()
    cached_skills = redis_client.get(job_hash)
    if cached_skills:
        return json.loads(cached_skills.decode('utf-8'))
    return None

def store_skills_in_cache(job_description, skills):
    job_hash = hashlib.sha256(job_description.encode('utf-8')).hexdigest()
    redis_client.set(job_hash, json.dumps(skills))

This simple caching mechanism significantly reduced the number of API calls, especially for listings that reappear across crawls.
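Tying the helpers together, a cache-aware wrapper pays for the API only on a miss. This sketch uses an in-memory dict standing in for Redis, and `call_llm` is a placeholder for the actual API call:

```python
import hashlib
import json

_cache = {}  # in-memory stand-in for the Redis client shown above

def extract_skills_cached(job_description: str, call_llm) -> list:
    # Hash the description so byte-identical duplicates share one key.
    key = hashlib.sha256(job_description.encode("utf-8")).hexdigest()
    if key in _cache:
        return json.loads(_cache[key])
    skills = call_llm(job_description)  # only pay for a cache miss
    _cache[key] = json.dumps(skills)
    return skills
```

The same structure works unchanged against Redis: swap the dict access for `get`/`set` on the client.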

Deduplication at Scale

MisuJob aggregates from multiple sources, so duplicate job postings are common. Before sending a job description to the LLM, we compare it to existing descriptions in our database. If we find a near-duplicate, we reuse the previously extracted skills.
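Exact hashing misses postings that differ by only a few words; shingle-based Jaccard similarity is one simple way to catch near-duplicates. The 3-word shingle size and 0.9 threshold below are illustrative choices, not our tuned production values:

```python
def shingles(text: str, n: int = 3) -> set:
    # Sets of n-word windows; shared windows indicate shared phrasing.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return a == b
    jaccard = len(sa & sb) / len(sa | sb)
    return jaccard >= threshold
```

Pairwise comparison does not scale to millions of documents; techniques like MinHash/LSH approximate the same similarity in sub-linear time.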

The Results: An 80% Reduction in Costs

By implementing these strategies, we achieved a remarkable 80% reduction in our LLM API costs. This translates to significant savings, allowing us to scale our AI-powered job matching without breaking the bank.

Here’s a breakdown of the cost reduction:

Strategy                      Cost Reduction (%)
Prompt Engineering            30%
Pre-processing                20%
Model Selection               15%
Caching and Deduplication     15%
Total                         80%

Salary Data: Powered by Optimized LLM Processing

Our ability to efficiently process large volumes of job data allows us to provide valuable salary insights to our users. Here’s a sample of salary ranges for Software Engineers in various European countries:

Country           Average Salary (€)    Salary Range (€)
Germany           65,000                50,000 - 85,000
United Kingdom    60,000                45,000 - 80,000
Netherlands       62,000                48,000 - 82,000
France            55,000                42,000 - 70,000
Switzerland       90,000                70,000 - 120,000
Spain             40,000                30,000 - 55,000
Sweden            58,000                45,000 - 75,000

Note: These are average salaries and can vary based on experience, location within the country, and company size.

This data is constantly updated and refined, thanks to our optimized LLM processing pipeline. By extracting relevant information from job descriptions efficiently, we can provide accurate and up-to-date salary information to job seekers across Europe.

Conclusion

Optimizing LLM API costs is crucial for building sustainable AI-powered applications. By focusing on prompt engineering, pre-processing, model selection, and caching, we significantly reduced our expenses without sacrificing accuracy. These strategies are applicable to a wide range of LLM use cases and can help you unlock the full potential of LLMs while staying within your budget.

We at MisuJob are committed to pushing the boundaries of AI-powered job matching, and cost optimization is a key enabler of this mission. We hope these strategies help you on your own LLM journey.

Pablo Inigo

Founder & Engineer

Building MisuJob - an AI-powered job matching platform processing 1M+ job listings daily.
