Building a Job Matching Algorithm: Skill Extraction Meets Vector Search
Finding the perfect job feels like finding a needle in a haystack. At MisuJob, we’re building the tools to make that haystack significantly smaller and that needle much easier to find.
Our mission at MisuJob is to connect professionals with the right opportunities across Europe. With our AI-powered job matching, we aim to move beyond keyword searches and towards a deeper, more semantic understanding of both job requirements and candidate skills. This post dives into the technical details of how we approach this challenge, focusing on skill extraction and vector search. MisuJob processes 1M+ job listings, aggregating from multiple sources to create a comprehensive view of the European job market.
The Problem: Keyword Matching Isn’t Enough
Traditional job search relies heavily on keyword matching. While simple to implement, this approach has significant limitations. For example, a search for “Python developer” might miss candidates who are proficient in Python but use related terms like “data science” or “backend engineering” in their profiles. Similarly, job descriptions often use different terms for the same skills, leading to mismatches.
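The pitfall is easy to demonstrate. In this minimal sketch (the profiles are made-up examples, not MisuJob data), a naive substring search misses a candidate who clearly works with Python but never writes the word:

```python
profiles = [
    "Backend engineer, 5 years of Python and Django",
    "Data scientist focused on pandas and scikit-learn",  # uses Python daily, never says "Python"
]

def keyword_match(query: str, documents: list[str]) -> list[str]:
    """Naive keyword search: case-insensitive substring match."""
    return [doc for doc in documents if query.lower() in doc.lower()]

# Only the first profile matches, even though both candidates are Python users.
print(keyword_match("python", profiles))
```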
We needed a system that could understand the semantic meaning of both job descriptions and candidate profiles, moving beyond surface-level keyword overlap. This led us to explore skill extraction and vector search.
Skill Extraction: Identifying What Matters
The first step in building a robust job matching algorithm is accurately identifying the skills required for a job. We employ a multi-faceted approach to skill extraction, combining natural language processing (NLP) techniques with a curated knowledge base of skills and their synonyms.
Our skill extraction pipeline involves the following steps:
- Text Preprocessing: Cleaning and normalizing the job description text, including removing irrelevant characters, handling punctuation, and converting text to lowercase.
- Named Entity Recognition (NER): Identifying entities in the text that represent skills, technologies, tools, and other relevant concepts. We use pre-trained NER models fine-tuned on a large dataset of job descriptions.
- Skill Normalization: Mapping identified entities to a standardized skill vocabulary. This involves resolving synonyms and handling variations in skill names (e.g., “JavaScript” vs. “JS”). We maintain a comprehensive knowledge graph of skills and their relationships to facilitate this process.
- Contextual Analysis: Using contextual information to disambiguate skill mentions and identify the most relevant skills for a given job. For example, the word “Java” might refer to the programming language or the island of Java, depending on the context.
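The normalization step above can be sketched with a toy synonym map; the map below is illustrative, not our production skill vocabulary:

```python
# Toy synonym map: lowercase surface form -> canonical skill name (illustrative only).
SKILL_SYNONYMS = {
    "js": "JavaScript",
    "javascript": "JavaScript",
    "py": "Python",
    "python": "Python",
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
}

def normalize_skills(mentions: list[str]) -> list[str]:
    """Map raw skill mentions to canonical names, deduplicating and dropping unknowns."""
    canonical_skills = []
    for mention in mentions:
        canonical = SKILL_SYNONYMS.get(mention.strip().lower())
        if canonical and canonical not in canonical_skills:
            canonical_skills.append(canonical)
    return canonical_skills

print(normalize_skills(["JS", "JavaScript", "Postgres"]))  # ['JavaScript', 'PostgreSQL']
```

In production this lookup is backed by the knowledge graph rather than a flat dictionary, so relationships between skills (not just synonyms) can inform the mapping.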
Here’s an example of how we might represent extracted skills in JSON format:
```json
{
  "job_id": "12345",
  "skills": [
    { "name": "Python", "relevance_score": 0.95 },
    { "name": "Data Analysis", "relevance_score": 0.80 },
    { "name": "SQL", "relevance_score": 0.75 }
  ]
}
```
This JSON structure includes the extracted skill name and a relevance score, indicating the importance of the skill for the job. This score is derived from the frequency of the skill mention, its context within the job description, and its overall importance within our knowledge graph.
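That derivation can be sketched as a weighted combination of the three component signals; the weights and component values below are illustrative placeholders, not our production formula:

```python
def relevance_score(frequency: float, context: float, graph_importance: float,
                    weights: tuple[float, float, float] = (0.4, 0.35, 0.25)) -> float:
    """Combine normalized component scores (each in [0, 1]) into one relevance score.

    The weights are hypothetical; in practice they would be tuned on labeled data.
    """
    w_f, w_c, w_g = weights
    return round(w_f * frequency + w_c * context + w_g * graph_importance, 2)

# A skill mentioned often, in a strong context, with high knowledge-graph importance:
print(relevance_score(frequency=1.0, context=0.9, graph_importance=0.95))  # 0.95
```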
Vector Search: Finding Semantic Similarity
Once we have extracted skills from job descriptions and candidate profiles, we need a way to efficiently find jobs that match a candidate’s skillset. This is where vector search comes in. Vector search allows us to represent skills as vectors in a high-dimensional space, where the distance between two vectors represents the semantic similarity between the corresponding skills.
We use word embeddings, specifically transformer-based models like Sentence Transformers, to generate these vectors. These models are trained on a massive corpus of text and can capture the nuances of language and the relationships between words.
Here’s a simplified example of how we generate embeddings using Python and Sentence Transformers:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Example skills
skills = ["Python programming", "Data analysis with Pandas", "SQL database management"]

# Generate embeddings
embeddings = model.encode(skills)

# Print the shape of the embeddings
print(f"Shape of embeddings: {embeddings.shape}")  # Output: Shape of embeddings: (3, 384)

# Calculate cosine similarity between two skill embeddings
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Cosine similarity between 'Python programming' and 'Data analysis with Pandas': "
      f"{cosine_similarity(embeddings[0], embeddings[1])}")
```
In this example, each skill is represented by a 384-dimensional vector. The cosine_similarity function computes the cosine of the angle between two vectors; a higher value indicates greater semantic similarity between the corresponding skills.
We store these embeddings in a vector database, such as Pinecone or Milvus, which allows for efficient similarity search. When a candidate searches for a job or updates their profile, we generate an embedding for their skills and use the vector database to find jobs with similar skill requirements.
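In production this lookup runs inside the vector database, but the core operation it performs can be sketched in NumPy as a brute-force nearest-neighbor search over normalized embeddings (the 4-dimensional vectors below stand in for real 384-dimensional ones):

```python
import numpy as np

def top_k_similar(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k corpus vectors with the highest cosine similarity."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    similarities = c @ q  # cosine similarity as a dot product of unit vectors
    return np.argsort(similarities)[::-1][:k]

# Toy job embeddings: jobs 0 and 1 point in a similar direction, job 2 does not.
jobs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
candidate = np.array([1.0, 0.05, 0.0, 0.0])
print(top_k_similar(candidate, jobs, k=2))  # jobs 0 and 1 rank above job 2
```

A real vector database replaces this exhaustive scan with an approximate index, which is what makes the same operation feasible over millions of vectors.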
Putting It All Together: The MisuJob Matching Engine
Our job matching engine combines skill extraction and vector search to provide highly relevant job recommendations to our users. The process works as follows:
- Candidate Profile: The candidate creates a profile on MisuJob, listing their skills and experience.
- Skill Extraction: Our skill extraction pipeline processes the candidate’s profile, identifying their skills and generating relevance scores.
- Embedding Generation: We generate a vector embedding for the candidate’s skills using Sentence Transformers.
- Vector Search: We use the vector database to find jobs with similar skill requirements, based on the cosine similarity between the candidate’s embedding and the job embeddings.
- Ranking and Filtering: We rank the matching jobs based on their similarity scores and apply additional filters based on location, salary expectations, and other criteria.
- Recommendations: We present the candidate with a list of highly relevant job recommendations.
This approach allows us to provide more accurate and personalized job recommendations than traditional keyword-based search.
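The ranking-and-filtering step above can be sketched as follows; the job records and filter fields (salary, location) are illustrative, not our actual schema:

```python
def rank_and_filter(matches: list[dict], min_salary: int, location: str,
                    top_n: int = 5) -> list[dict]:
    """Apply hard filters to vector-search matches, then rank by similarity score."""
    eligible = [m for m in matches
                if m["salary"] >= min_salary and m["location"] == location]
    return sorted(eligible, key=lambda m: m["similarity"], reverse=True)[:top_n]

matches = [
    {"job_id": "a", "similarity": 0.91, "salary": 70000, "location": "Berlin"},
    {"job_id": "b", "similarity": 0.88, "salary": 50000, "location": "Berlin"},
    {"job_id": "c", "similarity": 0.95, "salary": 80000, "location": "Paris"},
]
# Only job "a" survives the salary and location filters.
print(rank_and_filter(matches, min_salary=60000, location="Berlin"))
```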
Optimizing for Performance and Scalability
Building a job matching engine that can handle millions of job listings and candidate profiles requires careful attention to performance and scalability. We have implemented several optimizations to ensure that our system can handle the load:
- Vector Database Indexing: We use efficient indexing techniques in our vector database to speed up similarity searches. We continuously monitor query performance and adjust the indexing parameters as needed.
- Caching: We cache frequently accessed data, such as skill embeddings and job descriptions, to reduce the load on our database.
- Asynchronous Processing: We use asynchronous processing to handle long-running tasks, such as skill extraction and embedding generation, without blocking the main application thread.
- Horizontal Scaling: We have designed our system to be horizontally scalable, allowing us to add more servers as needed to handle increasing traffic.
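The caching idea can be sketched with Python's built-in functools.lru_cache; the trivial character-based "embedding" below is a stand-in for a real (and expensive) model.encode() call:

```python
from functools import lru_cache

calls = 0  # counts how often the "expensive" embedding function actually runs

@lru_cache(maxsize=10_000)
def cached_embedding(skill: str) -> tuple[float, ...]:
    """Embed a skill string, memoizing the result per unique input.

    Returns a tuple so the cached value is immutable and safe to share.
    """
    global calls
    calls += 1
    # Stand-in for a real model call: a deterministic toy "embedding".
    return tuple(float(ord(ch)) for ch in skill[:4])

cached_embedding("Python")
cached_embedding("Python")  # served from the cache; the function body runs once
print(calls)  # 1
```

In our system the same idea applies at a larger scale, with an external cache in front of the embedding service rather than an in-process memo table.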
Real-World Impact: Improved Job Matching Accuracy
The implementation of skill extraction and vector search has significantly improved the accuracy of our job matching engine. We have seen a noticeable increase in the number of candidates who find relevant jobs through MisuJob.
Here’s an example of how vector search improved our matching accuracy compared to keyword-based search for a “Data Scientist” role in Berlin:
| Metric | Keyword Search | Vector Search | Improvement |
|---|---|---|---|
| Relevant Job Recommendations | 3 | 8 | +167% |
| Click-Through Rate (CTR) | 1.2% | 2.8% | +133% |
| Application Rate | 0.4% | 1.1% | +175% |
This table shows that vector search resulted in a significant increase in relevant job recommendations, click-through rate, and application rate compared to keyword-based search. This demonstrates the effectiveness of our approach in improving job matching accuracy.
Salary Insights and Market Trends
Beyond job matching, the data we process allows us to provide valuable salary insights and market trends to our users. We can analyze salary ranges for different roles across various European countries, helping candidates negotiate their salaries and make informed career decisions.
Here’s a sample of salary ranges for Data Scientists across several European countries:
| Country | Average Salary (EUR) | Salary Range (EUR) |
|---|---|---|
| Germany | 75,000 | 60,000 - 90,000 |
| United Kingdom | 70,000 | 55,000 - 85,000 |
| Netherlands | 72,000 | 58,000 - 88,000 |
| France | 65,000 | 52,000 - 80,000 |
| Spain | 55,000 | 45,000 - 65,000 |
These figures are based on our analysis of job postings and salary surveys. They provide a general overview of the salary landscape for Data Scientists in Europe.
We also track market trends, such as the demand for specific skills and the growth of different industries. This information helps candidates identify promising career paths and develop the skills needed to succeed in the job market.
Future Directions
We are constantly working to improve our job matching engine and provide even more value to our users. Some of our future directions include:
- Personalized Skill Recommendations: Recommending skills that candidates should learn to improve their job prospects, based on their existing skills and the demands of the job market.
- Career Path Prediction: Predicting potential career paths for candidates based on their skills and experience, helping them plan their career development.
- Improved Contextual Understanding: Enhancing our NLP models to better understand the context of job descriptions and candidate profiles, leading to more accurate skill extraction and matching.
- Explainable AI: Providing explanations for why a particular job was recommended to a candidate, building trust and transparency in our system.
Scaling Vector Search with Quantization
As MisuJob grows, the number of job postings and candidate profiles increases, so we must optimize our vector search for speed and efficiency. One technique we use is quantization, which reduces the memory footprint of our embeddings at the cost of some precision. Scalar quantization stores each vector component as an 8-bit integer (int8) instead of a 32-bit float (float32), cutting memory consumption by a factor of four; vector quantization goes further, replacing each embedding with the index of its nearest entry in a small learned codebook.
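The savings from the int8-versus-float32 trade are easy to quantify; the corpus size below is illustrative:

```python
n_vectors = 1_000_000  # illustrative corpus size
dims = 384             # all-MiniLM-L6-v2 embedding dimensionality

float32_bytes = n_vectors * dims * 4  # 4 bytes per float32 component
int8_bytes = n_vectors * dims * 1     # 1 byte per int8 component

print(f"float32: {float32_bytes / 1e9:.2f} GB")  # 1.54 GB
print(f"int8:    {int8_bytes / 1e9:.2f} GB")     # 0.38 GB
```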
Here’s how we might implement quantization using a Python library like scikit-learn:
```python
from sklearn.cluster import MiniBatchKMeans
import numpy as np

# Assume embeddings is a numpy array of float32 vectors
n_clusters = 256  # number of quantization levels (fits in a single uint8 index)
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42,
                         batch_size=256, n_init=10)  # increased n_init for stability
kmeans.fit(embeddings)

# Replace each vector with the index of its closest centroid
quantized_embeddings = kmeans.predict(embeddings).astype(np.uint8)

# To reconstruct the approximate vector:
reconstructed_embeddings = kmeans.cluster_centers_[quantized_embeddings]

print(f"Original embeddings dtype: {embeddings.dtype}")
print(f"Quantized embeddings dtype: {quantized_embeddings.dtype}")
```
This code quantizes the embeddings by clustering them and then replacing each vector with the index of its nearest cluster centroid. The kmeans.cluster_centers_ array stores the centroids, allowing us to reconstruct an approximation of the original vectors when needed. Using this method, we have achieved a significant reduction in memory usage and improved search speed, without a substantial drop in search accuracy.
Fine-Tuning Sentence Transformers for Job-Specific Language
General-purpose Sentence Transformers work well, but we’ve found that fine-tuning them on our specific dataset of job descriptions and candidate profiles can further improve performance. This involves training the model on a dataset of paired job descriptions and candidate profiles that are known to be a good match. The model learns to generate embeddings that are more similar for matching pairs and more dissimilar for non-matching pairs.
We create a training dataset by pairing job descriptions with candidate profiles that resulted in successful hires or high engagement. We also include negative examples by randomly pairing job descriptions with candidate profiles that are unlikely to be a good fit.
The loss function we use for fine-tuning is typically a contrastive loss or a triplet loss. These loss functions encourage the model to generate similar embeddings for positive pairs and dissimilar embeddings for negative pairs.
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create training data (list of InputExample objects)
train_examples = [
    InputExample(texts=['Job description 1', 'Candidate profile 1'], label=1.0),  # matching pair
    InputExample(texts=['Job description 2', 'Candidate profile 2'], label=1.0),  # matching pair
    InputExample(texts=['Job description 3', 'Candidate profile 3'], label=0.0),  # non-matching pair
    # ... more examples
]

# Define a loss function (e.g., CosineSimilarityLoss)
train_loss = losses.CosineSimilarityLoss(model)

# Use DataLoader for efficient batching
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Configure the training
num_epochs = 3
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)  # 10% of training steps for warm-up

# Train the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path='fine_tuned_model'  # save the fine-tuned model
)
```
This code snippet demonstrates how to fine-tune a Sentence Transformer model with the sentence-transformers library. We create a PyTorch DataLoader to efficiently batch the training data and use CosineSimilarityLoss to pull the embeddings of matching pairs together. Fine-tuning our models in this manner has resulted in a significant improvement in the quality of our job recommendations.
Key Takeaways
- Keyword matching is insufficient for modern job search; skill extraction and vector search provide a more semantic understanding.
- Skill extraction involves NLP techniques like NER, skill normalization, and contextual analysis.
- Vector search uses word embeddings to represent skills in a high-dimensional space, enabling efficient similarity search.
- Optimizations like vector database indexing, caching, and asynchronous processing are crucial for performance and scalability.
- Quantization and fine-tuning significantly improve vector search efficiency and accuracy.
- MisuJob’s AI-powered job matching helps professionals across Europe find the right opportunities.