The Unsolved Problem: Deduplicating Job Listings at Scale

“Senior React Developer” at Spotify. Posted on one ATS. Syndicated to LinkedIn. Aggregated on 3 job boards. Imported by 5 platforms. That’s the same job appearing 10+ times in your database.

Deduplicating job listings is harder than you think. Here’s our multi-layer approach that catches 95% of duplicates — and why 100% is probably impossible.

Why Simple Matching Fails

-- This misses most duplicates
WHERE lower(title) = lower($1) AND lower(company) = lower($2)

Company name variations:

“Deutsche Telekom” = “T-Systems” = “Telekom Deutschland”
“Alphabet” = “Google” = “Google LLC” = “Google Ireland”

Title variations:

“Sr. React Dev” = “Senior React Developer” = “React.js Engineer (Senior)”
“SWE III” = “Software Engineer 3” = “Senior Software Engineer”

Same job, different IDs:

Company archives and reposts with identical content but a new URL

Our Multi-Layer Approach

Layer 1: URL Match (100% precision, low recall)

const urlMatch = await pool.query(
  'SELECT id FROM jobs WHERE url = $1 AND is_active = true',
  [job.url]
);

If the exact URL exists, it’s definitely a duplicate. But the same job on different sources has different URLs, so this only catches republished jobs.

Layer 2: Title + Company (high precision, medium recall)

const contentMatch = await pool.query(`
  SELECT id FROM jobs
  WHERE lower(title) = lower($1)
    AND lower(company) = lower($2)
    AND is_active = true
`, [job.title, job.company]);

Catches identical postings across sources. Misses title and company name variations.

Layer 3: Content Hash (medium precision, catches reposts)

const descHash = crypto
  .createHash('md5')
  .update(job.description.substring(0, 500).toLowerCase())
  .digest('hex');

const hashMatch = await pool.query(
  'SELECT id FROM jobs WHERE content_hash = $1 AND is_active = true',
  [descHash]
);

Catches reposted jobs with different titles/URLs but identical descriptions. Using only the first 500 chars avoids false negatives from minor footer changes.

Performance: The Index That Matters

Without this index, Layer 2 does a full table scan on 1M rows:

CREATE INDEX idx_jobs_content_dedup
ON jobs (lower(title), lower(company))
WHERE is_active = true;

Dedup checks went from 2+ seconds to < 10ms each. At 10,000 new jobs/day, that’s the difference between 5.5 hours and 1.7 minutes of dedup time.

What We Still Can’t Catch

“Google” vs “Alphabet” (parent company mapping would help, but maintaining it is a full-time job)
“Senior Engineer” vs “Staff Engineer” for the same role
Jobs reposted with AI-rewritten descriptions (increasingly common)

95% accuracy is where we’ve landed. The remaining 5% would require NLP-based semantic matching, which isn’t worth the complexity at our scale.

This runs at MisuJob, keeping 1M+ listings clean across 50+ sources.

How do you handle dedup in your data pipeline? We’d love to hear alternative approaches.