Large Language Models (LLMs) are revolutionizing how we build applications, but their API costs can quickly spiral out of control. We’ve been wrestling with this challenge at MisuJob, where our AI-powered job matching processes 1M+ job listings to connect professionals with the right opportunities. Through rigorous experimentation and optimization, we’ve managed to slash our LLM API expenses by 80% – and we’re sharing our strategies to help you do the same.
The Token Cost Problem: A Real-World Example
LLMs charge based on “tokens,” which are roughly equivalent to words or parts of words. The more tokens you send in your prompt and receive in the response, the higher the cost. Consider a scenario where we’re using an LLM to extract key skills from job descriptions aggregated from multiple sources.
Initially, we naively fed the entire job description into the LLM. A typical job description might be 2000 words (around 2500 tokens). At current LLM pricing (e.g., GPT-4), this could cost several cents per job description. When processing 1M+ listings, these costs become unsustainable.
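The arithmetic is easy to check with a back-of-the-envelope estimate. The price below is illustrative, not any provider's actual rate:

```python
def estimate_cost(num_listings, tokens_per_listing, price_per_1k_tokens):
    """Back-of-the-envelope input-token cost for a batch of job descriptions."""
    return num_listings * tokens_per_listing / 1000 * price_per_1k_tokens

# 1M listings at ~2500 input tokens each, at an assumed $0.03 per 1K tokens
print(estimate_cost(1_000_000, 2500, 0.03))  # 75000.0 -> tens of thousands of dollars per full pass
```

Even small per-listing savings compound dramatically at this scale.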
We quickly realized we needed to optimize. Here’s how we tackled the problem.
Strategy 1: Prompt Engineering for Brevity
The most impactful change we made was refining our prompts. A poorly crafted prompt can lead to longer responses and unnecessary processing.
From Verbose to Concise: An Iterative Approach
Our initial prompt looked something like this:
Extract all the skills mentioned in the following job description. List them as comma-separated values. Also, provide a brief (one sentence) explanation of why each skill is important for the role. The job description is: [JOB_DESCRIPTION]
This prompt is overly verbose and asks for explanations, which significantly increases the response length. We replaced it with a much simpler prompt:
Extract the key skills from the following job description. List them as comma-separated values. Job Description: [JOB_DESCRIPTION]
This simple change dramatically reduced the output token count. We further refined the prompt by adding constraints:
- Limit the number of skills: “Extract a maximum of 5 key skills…”
- Specify the output format: “Output a JSON array of skills…”
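Putting these constraints together, the refined prompt can be built from a template like the following. The exact wording is illustrative, not our production prompt:

```python
# Constrained template: caps the skill count and pins the output format.
PROMPT_TEMPLATE = (
    "Extract a maximum of 5 key skills from the following job description. "
    "Output a JSON array of skills, with no explanations. "
    "Job Description: {job_description}"
)

def build_prompt(job_description: str) -> str:
    """Fill the constrained template with a (pre-processed) job description."""
    return PROMPT_TEMPLATE.format(job_description=job_description)
```

Both constraints cut output tokens: the cap bounds list length, and the JSON requirement suppresses free-form prose.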
The Power of Structured Output
Forcing the LLM to output data in a structured format like JSON is crucial. It not only simplifies parsing but also limits the LLM’s freedom to generate verbose responses.
```python
import json

def extract_skills(llm_response):
    """Parse the LLM's JSON response into a list of skills."""
    try:
        skills = json.loads(llm_response)
        return skills
    except json.JSONDecodeError:
        print("Error decoding JSON. Check LLM response.")
        return []
```
Strategy 2: Pre-processing and Context Reduction
Before sending data to the LLM, we aggressively pre-process it to remove irrelevant information. The less text you send, the lower the cost.
Removing Boilerplate and Unnecessary Text
Many job descriptions contain boilerplate text, legal disclaimers, and company overviews that are irrelevant for skill extraction. We developed a set of regular expressions and keyword filters to remove this noise.
```python
import re

def remove_boilerplate(text):
    """Strip text that is irrelevant for skill extraction."""
    # Remove legal disclaimers
    text = re.sub(r"©.*All rights reserved.*", "", text)
    # Remove application instructions
    text = re.sub(r"To apply, please visit.*", "", text)
    # Remove company overview sections (matched by their headers)
    text = re.sub(r"(About the company:|Company mission:).*", "", text, flags=re.IGNORECASE)
    return text
```
By removing this unnecessary text, we reduced the average job description length by 30%, leading to a direct reduction in token cost.
Selective Data Input: The Chunking Approach
Instead of sending the entire job description at once, we experimented with breaking it into smaller chunks. We focused on sections most likely to contain skills, such as the “Responsibilities” and “Requirements” sections.
We developed a heuristic-based algorithm to identify these key sections. This approach significantly reduced the input token count without sacrificing accuracy.
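A minimal sketch of such a heuristic, assuming section headers like "Requirements" appear on their own lines (the header list and the header-detection regex are illustrative, not our production rules):

```python
import re

# Headers whose sections are likely to contain skills (illustrative list)
SECTION_HEADERS = ("responsibilities", "requirements", "qualifications", "skills")

def extract_key_sections(job_description: str) -> str:
    """Keep only the lines under headers likely to contain skills."""
    kept, keep = [], False
    for line in job_description.splitlines():
        stripped = line.strip().rstrip(":").lower()
        if stripped in SECTION_HEADERS:
            keep = True   # entering a relevant section
        elif re.match(r"^[A-Z][A-Za-z ]{2,40}:?$", line.strip()):
            keep = False  # a new, unrelated section header
        if keep:
            kept.append(line)
    return "\n".join(kept)
```

Only the kept sections are sent to the LLM, shrinking the input without discarding the skill-bearing text.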
Strategy 3: Model Selection and Fine-tuning
Choosing the right LLM is critical. More powerful models like GPT-4 offer better accuracy but come at a higher price. For many tasks, a smaller, less expensive model like GPT-3.5 or a fine-tuned open-source model can provide sufficient performance.
Benchmark Your Options
We rigorously benchmarked different LLMs on our skill extraction task. We measured both accuracy (precision and recall) and token cost. This allowed us to identify the most cost-effective model for our specific use case.
We found that while GPT-4 provided slightly better accuracy, GPT-3.5 Turbo offered a significantly better price-performance ratio. For our use case, the small accuracy difference didn’t justify the higher cost.
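A benchmark like this reduces to scoring each model's extractions against a labeled gold set and tallying token spend. A minimal sketch of the scoring side (function names and the per-token price are illustrative):

```python
def precision_recall(predicted, expected):
    """Precision and recall for extracted skills vs. a gold-label set."""
    predicted, expected = set(predicted), set(expected)
    tp = len(predicted & expected)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    return precision, recall

def score_model(results, price_per_1k_tokens):
    """Aggregate accuracy and cost over (predicted, expected, tokens) runs."""
    precisions, recalls, tokens = [], [], 0
    for predicted, expected, token_count in results:
        p, r = precision_recall(predicted, expected)
        precisions.append(p)
        recalls.append(r)
        tokens += token_count
    n = len(results)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "cost": tokens / 1000 * price_per_1k_tokens,
    }
```

Running the same labeled sample through each candidate model yields directly comparable accuracy/cost pairs.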
The Potential of Fine-tuning
Fine-tuning an open-source LLM on a dataset of job descriptions and skills can further improve performance and reduce costs. This involves training a smaller model on your specific task, allowing it to learn the nuances of your data.
We are currently exploring fine-tuning options using models like Llama 2. This requires a significant investment in data preparation and training infrastructure but can offer substantial long-term cost savings.
Strategy 4: Caching and Deduplication
Avoid redundant API calls by implementing caching and deduplication strategies. If you’ve already processed a job description, store the results in a cache and reuse them when the same description appears again.
Implementing a Cache Layer
We implemented a cache layer using Redis to store the results of our skill extraction process. We used the job description’s hash as the cache key.
```python
import hashlib
import json

import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def get_skills_from_cache(job_description):
    """Look up previously extracted skills by the description's hash."""
    job_hash = hashlib.sha256(job_description.encode('utf-8')).hexdigest()
    cached_skills = redis_client.get(job_hash)
    if cached_skills:
        return json.loads(cached_skills.decode('utf-8'))
    return None

def store_skills_in_cache(job_description, skills):
    """Store extracted skills under the description's hash."""
    job_hash = hashlib.sha256(job_description.encode('utf-8')).hexdigest()
    redis_client.set(job_hash, json.dumps(skills))
```
This simple caching mechanism significantly reduced the number of API calls, especially for listings that reappear across repeated crawls of our sources.
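The two helpers naturally combine into a cache-aside wrapper. The sketch below abstracts the cache behind `get`/`set` so it runs against any store with that interface (a Redis client fits it); `extract_fn` stands in for the actual LLM call:

```python
import hashlib
import json

def get_or_extract(job_description, extract_fn, cache):
    """Cache-aside: return cached skills if present, else call the LLM and store."""
    key = hashlib.sha256(job_description.encode("utf-8")).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    skills = extract_fn(job_description)   # the expensive LLM call
    cache.set(key, json.dumps(skills))
    return skills
```

Every repeated description after the first costs a hash and a cache lookup instead of an API call.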
Deduplication at Scale
MisuJob aggregates from multiple sources, so duplicate job postings are common. Before sending a job description to the LLM, we compare it to existing descriptions in our database. If we find a near-duplicate, we reuse the previously extracted skills.
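One simple way to flag near-duplicates is Jaccard similarity over word shingles. This is a generic sketch of the idea, not our exact pipeline, and the 0.9 threshold is an illustrative choice:

```python
def shingles(text, k=5):
    """Set of k-word shingles from a normalized job description."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(desc_a, desc_b, threshold=0.9):
    """True if the two descriptions share enough shingles to be treated as one."""
    return jaccard(shingles(desc_a), shingles(desc_b)) >= threshold
```

At larger scale, exact pairwise comparison gets expensive; locality-sensitive hashing schemes such as MinHash are the usual next step.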
The Results: An 80% Reduction in Costs
By implementing these strategies, we achieved a remarkable 80% reduction in our LLM API costs. This translates to significant savings, allowing us to scale our AI-powered job matching without breaking the bank.
Here’s a breakdown of the cost reduction:
| Strategy | Cost Reduction (%) |
|---|---|
| Prompt Engineering | 30% |
| Pre-processing | 20% |
| Model Selection | 15% |
| Caching and Deduplication | 15% |
| Total | 80% |
Salary Data: Powered by Optimized LLM Processing
Our ability to efficiently process large volumes of job data allows us to provide valuable salary insights to our users. Here’s a sample of salary ranges for Software Engineers in various European countries:
| Country | Average Salary (€) | Salary Range (€) |
|---|---|---|
| Germany | 65,000 | 50,000 - 85,000 |
| United Kingdom | 60,000 | 45,000 - 80,000 |
| Netherlands | 62,000 | 48,000 - 82,000 |
| France | 55,000 | 42,000 - 70,000 |
| Switzerland | 90,000 | 70,000 - 120,000 |
| Spain | 40,000 | 30,000 - 55,000 |
| Sweden | 58,000 | 45,000 - 75,000 |
Note: These are average salaries and can vary based on experience, location within the country, and company size.
This data is constantly updated and refined, thanks to our optimized LLM processing pipeline. By extracting relevant information from job descriptions efficiently, we can provide accurate and up-to-date salary information to job seekers across Europe.
Conclusion
Optimizing LLM API costs is crucial for building sustainable AI-powered applications. By focusing on prompt engineering, pre-processing, model selection, and caching, we significantly reduced our expenses without sacrificing accuracy. These strategies are applicable to a wide range of LLM use cases and can help you unlock the full potential of LLMs while staying within your budget.
We at MisuJob are committed to pushing the boundaries of AI-powered job matching, and cost optimization is a key enabler of this mission. We hope these strategies help you on your own LLM journey.

