Engineering

Building a CV Parser with LLMs: Lessons from Production

Learn how MisuJob built a production-ready CV parser using LLMs to extract insights from resumes and power AI job matching, along with the lessons learned and best practices.

Pablo Inigo · Founder & Engineer
8 min read
Diagram showing the flow of data from CV to parsed information using LLMs.

Building a reliable CV parser is a critical step in improving the job search experience, especially when you process 1M+ job listings. At MisuJob, we’ve moved beyond traditional methods and embraced Large Language Models (LLMs) to extract valuable insights from CVs, powering our AI job matching and the career advice we provide to tech professionals in Europe.

The Challenge: From Pixels to Parsable Data

CVs come in all shapes and sizes – PDFs, DOCs, even images. Each format presents its own unique challenges. Early attempts at rule-based parsing yielded inconsistent results, often missing key skills or misinterpreting job titles. This led to suboptimal job recommendations and a frustrating experience for our users. We needed a solution that was robust, adaptable, and could handle the inherent ambiguity of human language. The DACH region, with its diverse languages and CV styles, added another layer of complexity.

Why LLMs?

Traditional methods often rely on regular expressions and fixed templates. These are brittle and require constant maintenance as CV formats evolve. LLMs, on the other hand, offer several advantages:

  • Contextual Understanding: LLMs can understand the context of words and phrases, allowing them to accurately identify skills, experience, and education even when presented in different formats.
  • Adaptability: LLMs can be fine-tuned to specific domains, such as the tech industry, improving their accuracy and relevance.
  • Scalability: Once trained, LLMs can process large volumes of CVs quickly and efficiently.

Building Our LLM-Powered Parser: A Phased Approach

We adopted a phased approach to building our CV parser, starting with a proof-of-concept and gradually iterating towards a production-ready system.

Phase 1: Proof of Concept

We started by experimenting with pre-trained LLMs available through cloud-based APIs. We fed them sample CVs and evaluated their ability to extract key information such as:

  • Personal Information (Name, Email, Phone Number)
  • Work Experience (Job Title, Company, Dates, Description)
  • Education (Degree, University, Dates)
  • Skills (Programming Languages, Tools, Frameworks)
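For concreteness, the target structure can be sketched as a set of dataclasses. The field names here are illustrative, not our exact production schema:

```python
from dataclasses import dataclass, field

@dataclass
class WorkExperience:
    job_title: str
    company: str
    start_date: str  # kept as strings, e.g. "2019-03", to tolerate partial dates
    end_date: str
    description: str

@dataclass
class ParsedCV:
    name: str
    email: str
    phone: str
    experience: list[WorkExperience] = field(default_factory=list)
    education: list[str] = field(default_factory=list)
    skills: list[str] = field(default_factory=list)

# A minimal parsed record; most fields simply default to empty lists.
cv = ParsedCV(name="Jane Doe", email="jane@example.com",
              phone="+49 000 0000", skills=["Python"])
```

Having an explicit schema like this also gives you something concrete to validate the LLM's output against.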

The initial results were promising, but accuracy was inconsistent. We observed issues with:

  • Hallucinations: The LLM sometimes invented information that wasn’t actually present in the CV.
  • Inconsistent Formatting: The extracted data was not always consistently formatted, making it difficult to process.
  • Difficulty with Uncommon Skills: The LLM struggled to identify skills that were not commonly found in its training data.
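One mitigation worth sketching (a simplified illustration, not our production validator) is to check the model's structured output against the source text: reject responses that are missing required fields, and drop extracted skills that never appear in the CV at all.

```python
import json

# Required top-level fields -- a small subset, chosen for illustration.
EXPECTED_FIELDS = {"name", "email", "skills"}

def validate_extraction(raw_output: str, cv_text: str) -> dict:
    """Parse the LLM's JSON output, enforce required fields, and
    filter out skills with no grounding in the original CV text."""
    data = json.loads(raw_output)

    # Guard against inconsistent formatting: fail fast on missing fields.
    missing = EXPECTED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Missing fields: {missing}")

    # Guard against hallucinations: keep only skills literally present in the CV.
    data["skills"] = [s for s in data["skills"] if s.lower() in cv_text.lower()]
    return data

cv = "Jane Doe, jane@example.com. Skills: Python, Docker."
raw = '{"name": "Jane Doe", "email": "jane@example.com", "skills": ["Python", "Kubernetes"]}'
result = validate_extraction(raw, cv)  # "Kubernetes" is filtered out
```

Substring matching is crude (it misses paraphrases and rewordings), but even this coarse check catches the most blatant inventions.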

Phase 2: Fine-Tuning and Data Augmentation

To address these issues, we decided to fine-tune a pre-trained LLM using a dataset of labeled CVs. We created our own dataset by manually annotating hundreds of CVs, focusing on the key information we wanted to extract. This process was time-consuming, but it significantly improved the accuracy of the LLM.

We also implemented data augmentation techniques to increase the size and diversity of our training data. This included:

  • Synonym Replacement: Replacing words with their synonyms to create variations of the same sentence.
  • Back Translation: Translating sentences to another language and then back to English to introduce subtle changes.
  • Random Insertion/Deletion: Randomly inserting or deleting words from sentences.

Here’s an example of how we used synonym replacement in Python:

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time corpus download

def replace_synonyms(text):
    words = text.split()
    new_words = []
    for word in words:
        synsets = wordnet.synsets(word)
        replacement = word
        if synsets:
            # Pick the first lemma that actually differs from the original word;
            # WordNet's first lemma is often the word itself.
            for lemma in synsets[0].lemmas():
                name = lemma.name().replace("_", " ")  # multi-word lemmas use underscores
                if name.lower() != word.lower():
                    replacement = name
                    break
        new_words.append(replacement)
    return " ".join(new_words)

example_text = "Experienced software engineer with proficiency in Python."
augmented_text = replace_synonyms(example_text)
print(f"Original: {example_text}")
print(f"Augmented: {augmented_text}")

This code snippet demonstrates a simple way to augment text data by replacing words with their synonyms using the NLTK library. This helps the LLM generalize better to different writing styles.
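Random deletion follows the same pattern; here is a minimal sketch (parameter names are our own, not from a library):

```python
import random

def random_deletion(text, p=0.1, seed=None):
    """Randomly drop each word with probability p, always keeping at least one word."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    # If everything was dropped, fall back to a single random word.
    return " ".join(kept) if kept else rng.choice(words)

print(random_deletion("Experienced software engineer with proficiency in Python",
                      p=0.2, seed=42))
```

Keeping the deletion probability low (around 0.1) matters: aggressive deletion can remove exactly the skill tokens the parser is supposed to learn.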

Phase 3: Production Deployment and Monitoring

Once we were satisfied with the performance of the fine-tuned LLM, we deployed it to production using a cloud-based inference service. We implemented robust monitoring and alerting to detect any issues with the parser.

We tracked key metrics such as:

  • Extraction Accuracy: The percentage of key information fields that were correctly extracted.
  • Latency: The time it took to process a CV.
  • Error Rate: The percentage of CVs that resulted in an error.
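Extraction accuracy, for instance, can be computed field-by-field against a labeled sample. This is a simplified sketch; real comparisons usually normalize values (whitespace, casing, date formats) before matching:

```python
def extraction_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of expected fields whose predicted value matches exactly."""
    if not expected:
        return 1.0
    correct = sum(1 for k, v in expected.items() if predicted.get(k) == v)
    return correct / len(expected)

# One of two fields matches, so accuracy is 0.5.
score = extraction_accuracy(
    {"name": "Jane Doe", "email": "jane@example.org"},
    {"name": "Jane Doe", "email": "jane@example.com"},
)
```

Aggregating this per field (rather than per CV) also tells you which fields degrade first when CV formats shift.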

We also implemented a feedback mechanism that allowed users to report any errors or inaccuracies in the extracted data. This feedback was used to further improve the LLM.

Optimizing for Speed and Cost

LLMs can be computationally expensive to run, so we explored several optimization techniques to reduce latency and cost.

Model Quantization

We used model quantization to shrink the LLM, which improved its inference speed and reduced memory consumption. Quantization converts the model’s weights from 32-bit floating-point numbers to lower-precision integers (typically int8). This sacrifices some numerical precision, but usually with negligible impact on accuracy and a significant gain in speed and memory footprint.
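The underlying arithmetic can be illustrated with a toy per-tensor int8 quantizer (a sketch of the idea, not the production toolchain, which operates on whole model layers):

```python
def quantize_int8(weights):
    """Map float weights onto int8 values using a single per-tensor scale.
    Assumes at least one non-zero weight."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.01]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)  # close to the original, at a quarter of the memory
```

The rounding step is where precision is lost: every weight is snapped to one of 255 representable levels, which is why quantized models are smaller and faster but slightly less exact.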

Caching

We implemented caching to store the results of frequently requested CVs. This reduced the number of times the LLM had to be invoked, saving both time and money. We used a distributed cache to ensure that the cached data was available to all instances of the parser.

Here’s a simplified example using Redis as a cache (using Python):

import redis
import json
import hashlib

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def parse_cv(cv_content):
    # Generate a unique key based on the CV content
    key = hashlib.sha256(cv_content.encode('utf-8')).hexdigest()

    # Check if the result is in the cache
    cached_result = redis_client.get(key)
    if cached_result:
        print("Result found in cache!")
        return json.loads(cached_result)  # Deserialize safely (never eval cached data)

    # If not in cache, parse the CV using the LLM
    print("Parsing CV using LLM...")
    parsed_data = llm_parse(cv_content)  # Replace with your LLM parsing function

    # Store the result in the cache, serialized as JSON
    redis_client.set(key, json.dumps(parsed_data))
    return parsed_data

# Example Usage
cv_text = "John Doe\nSoftware Engineer at Google\nSkills: Python, Java"
parsed_cv_data = parse_cv(cv_text)
print(parsed_cv_data)

This example shows how to cache the results of CV parsing using Redis. The CV content is hashed to create a unique key, and the parsed data is stored in the cache. Subsequent requests for the same CV will be served from the cache, avoiding the need to invoke the LLM again.

Batch Processing

We processed CVs in batches to improve throughput. This allowed us to amortize the cost of invoking the LLM over multiple CVs.

Here’s an example of how we used batch processing in Python:

import asyncio

async def process_cv(cv_content):
    # Simulate LLM processing (replace with actual LLM call)
    await asyncio.sleep(0.1) # Simulate latency
    return f"Processed: {cv_content}"

async def process_batch(cv_batch):
    tasks = [process_cv(cv) for cv in cv_batch]
    results = await asyncio.gather(*tasks)
    return results

async def main():
    cv_list = ["CV 1", "CV 2", "CV 3", "CV 4", "CV 5"]
    batch_size = 2
    for i in range(0, len(cv_list), batch_size):
        batch = cv_list[i:i + batch_size]
        results = await process_batch(batch)
        print(f"Batch Results: {results}")

if __name__ == "__main__":
    asyncio.run(main())

This code snippet demonstrates how to use asynchronous programming to process CVs in batches. This allows us to process multiple CVs concurrently, improving throughput. In a real-world scenario, process_cv would make a call to your LLM-based CV parsing service.

Lessons Learned: From Prototype to Production

Building a CV parser with LLMs has been a challenging but rewarding experience. Here are some of the key lessons we learned:

  • Data Quality is Paramount: The accuracy of the LLM depends heavily on the quality of the training data. Invest time and effort in creating a high-quality, diverse dataset.
  • Fine-Tuning is Essential: Pre-trained LLMs provide a good starting point, but fine-tuning is necessary to achieve optimal performance in a specific domain.
  • Monitoring is Crucial: Continuously monitor the performance of the parser and implement a feedback mechanism to identify and address any issues.
  • Optimization is Key: LLMs can be computationally expensive, so explore various optimization techniques to reduce latency and cost.
  • Embrace Iteration: Building a CV parser is an iterative process. Start with a proof-of-concept, gradually iterate towards a production-ready system, and continuously improve the parser based on user feedback and performance data.

Impact on MisuJob Users and Data

Our enhanced CV parser significantly improves the accuracy of our AI-powered job matching. We’ve seen a 25% increase in the number of relevant job recommendations presented to users. This leads to a better user experience and a higher likelihood of finding the right job. Furthermore, the enriched data extracted from CVs allows us to provide more accurate salary insights. For example, we can now provide more granular salary ranges based on specific skills and experience levels in the DACH region:

Skill        | Years of Experience | Average Salary (EUR) | Range (EUR)
Python       | 3-5                 | 75,000               | 65,000-85,000
Java         | 5-7                 | 80,000               | 70,000-90,000
React        | 2-4                 | 70,000               | 60,000-80,000
AWS          | 1-3                 | 72,000               | 62,000-82,000
Data Science | 4-6                 | 85,000               | 75,000-95,000

This table provides specific salary benchmarks, helping tech professionals in Europe negotiate their salaries more effectively. This level of insight is only possible because of the accuracy and detail provided by our LLM-powered CV parser as it processes 1M+ job listings and associated data.

Conclusion

Building a CV parser with LLMs is a complex but rewarding endeavor. By embracing a phased approach, focusing on data quality, and implementing robust monitoring and optimization techniques, we’ve created a system that powers our AI-powered job matching and provides valuable career advice to tech professionals in Europe. As LLMs continue to evolve, we’re excited to explore new ways to leverage them to further improve the job search experience on MisuJob.

Tags: llm · cv parsing · nlp · resume parsing · machine learning