Engineering

Text Similarity at Scale: Cosine Similarity, TF-IDF & Sentence Transformers

Explore text similarity at scale using Cosine Similarity, TF-IDF, and Sentence Transformers. Learn how to match jobs and candidates effectively with AI.

Pablo Inigo · Founder & Engineer · 8 min read
Code snippets showing implementations of Cosine Similarity, TF-IDF and Sentence Transformers.

Understanding the semantic similarity between job descriptions and candidate profiles is crucial for powering effective AI-driven job matching. At MisuJob, where we process 1M+ job listings from across Europe, we’ve learned a thing or two about scaling text similarity algorithms.


Finding the perfect job is often about more than just matching keywords. It’s about understanding the underlying meaning of both the job description and the candidate’s skillset. This is where text similarity algorithms come into play. We’ve explored various methods to tackle this challenge, from traditional approaches like TF-IDF and cosine similarity to more advanced techniques using sentence transformers. Each has its strengths and weaknesses, particularly when scaling across millions of documents.

Why Text Similarity Matters for Job Matching

At MisuJob, our mission is to connect talented individuals with the right opportunities across Europe. To achieve this, we rely on AI-powered job matching that goes beyond simple keyword searches. Text similarity enables us to:

  • Improve Match Accuracy: Identify candidates whose skills and experience align with the intent of the job description, not just the specific keywords used.
  • Surface Hidden Opportunities: Recommend jobs that candidates might not have found through traditional keyword searches.
  • Personalize Recommendations: Tailor job suggestions to each user’s unique background and career goals.
  • Understand Skillset Equivalence: Determine how well a candidate’s “Python” skills match a job requiring “data analysis experience”, for example.

TF-IDF and Cosine Similarity: A Baseline Approach

One of the simplest, yet surprisingly effective methods for text similarity is using Term Frequency-Inverse Document Frequency (TF-IDF) combined with cosine similarity.

  • TF-IDF: This technique assigns a weight to each word in a document based on its frequency within that document (TF) and its rarity across the entire corpus (IDF). Common words like “the” are down-weighted, while more specific terms receive higher scores.
  • Cosine Similarity: After converting documents into TF-IDF vectors, cosine similarity measures the cosine of the angle between these vectors. A value closer to 1 (a smaller angle) indicates higher similarity.
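To make the definition concrete, here is a minimal numpy sketch of cosine similarity: the dot product of two vectors divided by the product of their norms.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # parallel to a, so similarity is 1.0
print(cosine(a, b))  # → 1.0
```

Orthogonal vectors score 0, and only direction matters: scaling a vector does not change its cosine similarity to others, which is why it pairs well with TF-IDF weights of varying magnitude.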

Here’s a Python example using scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Software Engineer with Python and Java experience.",
    "Data Scientist specializing in machine learning and Python.",
    "Senior Java Developer with experience in backend systems.",
    "Marketing Manager with expertise in digital marketing."
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Calculate cosine similarity between the first and second documents
similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])

print(f"Cosine similarity between document 1 and 2: {similarity[0][0]}")

While TF-IDF and cosine similarity are relatively easy to implement, they have limitations:

  • Semantic Understanding: They treat words as independent entities and don’t capture semantic relationships (e.g., “machine learning” and “artificial intelligence” are treated as distinct).
  • Out-of-Vocabulary Words: They struggle with words not seen during training.
  • Scalability: Calculating cosine similarity for all pairs of documents in a large corpus can be computationally expensive.
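To make the scalability point concrete, here is a small sketch of the all-pairs computation that becomes expensive: calling cosine_similarity on the full TF-IDF matrix produces an n × n matrix, so the cost grows quadratically with corpus size.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Software Engineer with Python and Java experience.",
    "Data Scientist specializing in machine learning and Python.",
    "Senior Java Developer with experience in backend systems.",
]

tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# n x n similarity matrix: O(n^2) pairs to compute and store
pairwise = cosine_similarity(tfidf_matrix)
print(pairwise.shape)  # (3, 3)
```

At three documents this is trivial; at a million, the result alone would hold a trillion entries, which is why approximate search (covered below) becomes necessary.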

Sentence Transformers: Capturing Semantic Meaning

Sentence transformers offer a more sophisticated approach by encoding entire sentences into dense vector representations that capture semantic meaning. These vectors are designed such that semantically similar sentences are located close to each other in the vector space.

We use sentence transformers trained on large datasets to achieve high accuracy in understanding the context of both job descriptions and candidate profiles.

Here’s an example using the sentence-transformers library:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-mpnet-base-v2') # A good general-purpose model

sentences = [
    "Software Engineer with Python and Java experience.",
    "Data Scientist specializing in machine learning and Python.",
    "Senior Java Developer with experience in backend systems.",
    "Marketing Manager with expertise in digital marketing."
]

embeddings = model.encode(sentences)

# Calculate cosine similarity between the first and second sentences
similarity = cosine_similarity(embeddings[0].reshape(1, -1), embeddings[1].reshape(1, -1))

print(f"Cosine similarity between sentence 1 and 2: {similarity[0][0]}")

Sentence transformers offer several advantages:

  • Semantic Understanding: They capture semantic relationships between words and phrases.
  • Contextual Embeddings: They generate different embeddings for the same word depending on the context.
  • Pre-trained Models: Pre-trained models are available, reducing the need for extensive training.

However, they also have drawbacks:

  • Computational Cost: Encoding sentences can be more computationally expensive than TF-IDF.
  • Model Size: Sentence transformer models can be large, requiring significant memory.

Scaling Text Similarity with FAISS

To address the scalability challenges of calculating cosine similarity across millions of job listings and candidate profiles, we leverage Facebook AI Similarity Search (FAISS). FAISS is a library specifically designed for efficient similarity search in high-dimensional spaces.

FAISS allows us to:

  • Index Embeddings: Create an index of sentence embeddings for fast retrieval.
  • Approximate Nearest Neighbor Search: Find the most similar embeddings without exhaustively comparing all pairs.
  • GPU Acceleration: Leverage GPUs for significant performance improvements.

Here’s a simplified example of how we might use FAISS:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-mpnet-base-v2')

# Example data (replace with actual job descriptions)
job_descriptions = [
    "Senior Software Engineer specializing in backend development with Java and Spring Boot.",
    "Data Scientist with experience in machine learning, Python, and data visualization.",
    "Frontend Developer proficient in React and JavaScript.",
    "Full-Stack Engineer with expertise in Node.js, React, and PostgreSQL."
]

# Generate sentence embeddings
embeddings = model.encode(job_descriptions)
dimension = embeddings.shape[1]  # embedding dimension

# Build the FAISS index
index = faiss.IndexFlatL2(dimension) # Using L2 distance
index.add(embeddings)

# Example query (replace with a candidate profile embedding)
query_embedding = model.encode(["Experienced Software Engineer with Python and machine learning skills."])

# Perform the search
k = 2  # Number of nearest neighbors to retrieve
distances, indices = index.search(query_embedding, k)

print("Distances:", distances)  # distances between the query and retrieved vectors
print("Indices:", indices)      # indices of retrieved vectors in job_descriptions

This example creates an IndexFlatL2 index, which performs an exact search using L2 (Euclidean) distance. For even faster search, especially with very large datasets, you can explore approximate nearest neighbor indices like IndexIVFFlat or IndexHNSWFlat.

Performance Comparison and Tuning

We conducted extensive benchmarks to compare the performance of different text similarity methods on our dataset of European job listings. Here’s a simplified table illustrating the relative performance (lower time is better):

| Method | Accuracy (Relative) | Indexing Time (Relative) | Query Time (Relative) | Memory Usage (Relative) |
| --- | --- | --- | --- | --- |
| TF-IDF + Cosine Similarity | 0.75 | 1.0 | 1.0 | 1.0 |
| Sentence Transformer + FAISS | 0.90 | 5.0 | 2.0 | 4.0 |
| Sentence Transformer + Exact | 0.90 | N/A | 10.0 | 4.0 |

Important Notes:

  • “Accuracy” is a relative score based on our internal evaluation metrics.
  • “Relative” times are normalized to the TF-IDF + Cosine Similarity baseline.
  • Exact search using sentence transformers is extremely slow and not feasible for large datasets.

As you can see, sentence transformers with FAISS offer a good balance between accuracy and performance. We continuously tune our FAISS indexes and experiment with different sentence transformer models to optimize for both speed and accuracy. We also consider other similarity metrics like dot product when appropriate, especially as FAISS supports many.

Salary Insights with Text Similarity

Text similarity can also be used to gain insights into salary expectations for different roles and skillsets. By clustering similar job descriptions based on their embeddings, we can analyze the salary ranges associated with each cluster.

For example, we might find that “Data Scientist” roles with experience in “deep learning” and “natural language processing” command a higher salary than those focused solely on “statistical modeling.”
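A minimal sketch of that clustering step, assuming sentence embeddings and a salary per listing are already available (random placeholder data stands in for both here), using scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
embeddings = rng.random((200, 384)).astype(np.float32)  # placeholder for real sentence embeddings
salaries = rng.integers(40_000, 100_000, size=200)      # placeholder salary per listing

# Group semantically similar job descriptions into clusters
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(embeddings)

# Summarize the salary range associated with each cluster
for cluster_id in range(5):
    mask = kmeans.labels_ == cluster_id
    print(cluster_id, int(salaries[mask].mean()))
```

With real embeddings, each cluster tends to collect one family of roles (e.g. NLP-heavy data science), and its salary distribution becomes directly comparable to neighboring clusters.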

Here’s a table showcasing average salary ranges for Software Engineers with different specializations across various European countries, derived from aggregated job listing data.

| Specialization | Germany (€) | UK (£) | Netherlands (€) | France (€) | Spain (€) |
| --- | --- | --- | --- | --- | --- |
| Backend (Java/Spring) | 65,000–85,000 | 55,000–75,000 | 60,000–80,000 | 50,000–70,000 | 40,000–60,000 |
| Frontend (React/JavaScript) | 60,000–80,000 | 50,000–70,000 | 55,000–75,000 | 45,000–65,000 | 35,000–55,000 |
| Full-Stack (Node/React) | 70,000–90,000 | 60,000–80,000 | 65,000–85,000 | 55,000–75,000 | 45,000–65,000 |
| DevOps (AWS/Kubernetes) | 75,000–95,000 | 65,000–85,000 | 70,000–90,000 | 60,000–80,000 | 50,000–70,000 |
| Machine Learning (Python/TensorFlow) | 80,000–100,000 | 70,000–90,000 | 75,000–95,000 | 65,000–85,000 | 55,000–75,000 |

Disclaimer: These are average salary ranges and can vary based on experience, location within the country, company size, and other factors. The data is derived from MisuJob’s analysis of aggregated job data from multiple sources across Europe.

Future Directions

We are continuously exploring new ways to improve our text similarity algorithms. Some of our future directions include:

  • Fine-tuning Sentence Transformers: Fine-tuning pre-trained models on our specific dataset of job listings and candidate profiles.
  • Cross-Lingual Embeddings: Developing models that can handle multiple languages seamlessly to better serve the European job market.
  • Knowledge Graphs: Incorporating knowledge graphs to represent relationships between skills, technologies, and industries.
  • Personalized Embeddings: Creating personalized embeddings for each user based on their past job search behavior and preferences.

Conclusion

Text similarity is a powerful tool for AI-powered job matching. By combining techniques like TF-IDF, cosine similarity, sentence transformers, and FAISS, we can effectively understand the semantic meaning of job descriptions and candidate profiles at scale. This enables us to improve match accuracy, surface hidden opportunities, and personalize recommendations, ultimately connecting talented individuals with the right opportunities across Europe.

Key Takeaways:

  • Text similarity is critical for understanding the meaning of job descriptions and candidate profiles, not just matching keywords.
  • TF-IDF and cosine similarity provide a simple but limited baseline.
  • Sentence transformers capture semantic relationships but require more computational resources.
  • FAISS enables efficient similarity search at scale.
  • By leveraging these technologies, MisuJob improves job matching accuracy and provides personalized recommendations.
Tags: text similarity · cosine similarity · tf-idf · sentence transformers · nlp
Pablo Inigo

Founder & Engineer

Building MisuJob - an AI-powered job matching platform processing 1M+ job listings daily.
