Engineering Originally on dev.to

The Unsolved Problem: Deduplicating Job Listings at Scale

The same job appears 10+ times across sources. Our multi-layer approach catches 95% of duplicates -- and why 100% is impossible.

P
Pablo Inigo · Founder & Engineer
2 min read
Multiple duplicate documents merging into a single unique document

“Senior React Developer” at Spotify. Posted on one ATS. Syndicated to LinkedIn. Aggregated on 3 job boards. Imported by 5 platforms. That’s the same job appearing 10+ times in your database.

Deduplicating job listings is harder than you think. Here’s our multi-layer approach that catches 95% of duplicates — and why 100% is probably impossible.

Why Simple Matching Fails

-- This misses most duplicates
WHERE lower(title) = lower($1) AND lower(company) = lower($2)

Company name variations:

  • “Deutsche Telekom” = “T-Systems” = “Telekom Deutschland”
  • “Alphabet” = “Google” = “Google LLC” = “Google Ireland”

Title variations:

  • “Sr. React Dev” = “Senior React Developer” = “React.js Engineer (Senior)”
  • “SWE III” = “Software Engineer 3” = “Senior Software Engineer”

Same job, different IDs:

  • Company archives and reposts with identical content but a new URL

Our Multi-Layer Approach

Layer 1: URL Match (100% precision, low recall)

const urlMatch = await pool.query(
  'SELECT id FROM jobs WHERE url = $1 AND is_active = true',
  [job.url]
);

If the exact URL exists, it’s definitely a duplicate. But the same job on different sources has different URLs, so this only catches republished jobs.

Layer 2: Title + Company (high precision, medium recall)

const contentMatch = await pool.query(`
  SELECT id FROM jobs
  WHERE lower(title) = lower($1)
    AND lower(company) = lower($2)
    AND is_active = true
`, [job.title, job.company]);

Catches identical postings across sources. Misses title and company name variations.

Layer 3: Content Hash (medium precision, catches reposts)

const descHash = crypto
  .createHash('md5')
  .update(job.description.substring(0, 500).toLowerCase())
  .digest('hex');

const hashMatch = await pool.query(
  'SELECT id FROM jobs WHERE content_hash = $1 AND is_active = true',
  [descHash]
);

Catches reposted jobs with different titles/URLs but identical descriptions. Using only the first 500 chars avoids false negatives from minor footer changes.

Performance: The Index That Matters

Without this index, Layer 2 does a full table scan on 1M rows:

CREATE INDEX idx_jobs_content_dedup
ON jobs (lower(title), lower(company))
WHERE is_active = true;

Dedup checks went from 2+ seconds to < 10ms each. At 10,000 new jobs/day, that’s the difference between 5.5 hours and 1.7 minutes of dedup time.

What We Still Can’t Catch

  • “Google” vs “Alphabet” (parent company mapping would help, but maintaining it is a full-time job)
  • “Senior Engineer” vs “Staff Engineer” for the same role
  • Jobs reposted with AI-rewritten descriptions (increasingly common)

95% accuracy is where we’ve landed. The remaining 5% would require NLP-based semantic matching, which isn’t worth the complexity at our scale.

This runs at MisuJob, keeping 1M+ listings clean across 50+ sources.


How do you handle dedup in your data pipeline? We’d love to hear alternative approaches.

Algorithms Database Backend Data Engineering
Share
P
Pablo Inigo

Founder & Engineer

Building MisuJob — an AI-powered job matching platform processing 1M+ tech job listings daily.

Engineering updates

Technical deep dives delivered to your inbox.

Find your next role with AI

Upload your CV. Get matched to 50,000+ jobs. Auto-apply to the best fits.

Get Started Free

User

Dashboard Profile Subscription