Originally published on dev.to

How We Process 1M+ Job Listings from 50+ ATS Platforms (Node.js)

Real-world battle scars from building a system that ingests 1M+ job listings from 50+ ATS platforms.

Pablo Inigo · Founder & Engineer
7 min read
[Figure: data pipeline funnel processing job listings through multiple stages]

Last year we built MisuJob, an AI-powered job matching platform. The idea was simple: aggregate tech jobs from 50+ ATS platforms and match them to candidates using AI.

The reality? We spent 80% of our time figuring out how each platform actually worked, fixing broken importers at 3 AM, and learning that deduplicating job listings is effectively an unsolved problem.

Here’s everything we learned.

The Architecture (Before It All Went Wrong)

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Schedulers  │────>│  Bull Queues │────>│   Workers    │
│  (node-cron) │     │  (Redis)     │     │  (3 pools)   │
└──────────────┘     └──────────────┘     └──────────────┘
                            ┌─────────────────────┘
                            v
                     ┌──────────────┐     ┌───────────────┐
                     │  PostgreSQL  │     │  Google       │
                     │  1M+ jobs    │     │  Indexing API │
                     └──────────────┘     └───────────────┘

Each ATS platform has its own importer. Some have public APIs, others require parsing their public career pages. Every importer feeds into Bull queues backed by Redis.

Simple, right? Let me tell you what went wrong.

Mistake #1: One Queue To Rule Them All

Our first design had a single import queue with 3 workers. It worked great until we onboarded a large ATS with 1,275 companies.

One of our largest ATS sources has a public API, but it only returns basic listing info — no full descriptions, no structured skills data. To get complete listings, we needed to call the API for each company and enrich the data. We had 1,275 companies to process. Each one takes 5-30 minutes (fetch listings, paginate through all jobs, extract and normalize structured data).

So at 3 AM, the scheduler enqueued 1,275 company-level import jobs. They filled the entire queue. All other ATS imports were blocked behind 1,275 bulk jobs that would take days to clear.

The fix: Queue isolation.

// QueueManager.ts - The fix that saved my sanity
import Queue from 'bull';

const REDIS_URL = process.env.REDIS_URL || 'redis://127.0.0.1:6379';

export class QueueManager {
  private queues: Map<string, Queue> = new Map();

  constructor() {
    // Core imports - NEVER blocked by bulk operations
    this.createQueue('import', { concurrency: 3 });

    // Heavy bulk operations - isolated
    this.createQueue('import-bulk', { concurrency: 2 });

    // Rate-limited sources get their own queue
    this.createQueue('rate-limited', { concurrency: 2 });
  }

  private createQueue(name: string, opts: { concurrency: number }) {
    const queue = new Queue(name, REDIS_URL);
    // Bull applies concurrency when the processor is attached:
    // queue.process(opts.concurrency, handler)
    this.queues.set(name, queue);
  }
}

Now bulk operations run in import-bulk (2 workers) while API-based imports run in import (3 workers), completely isolated. Rate-limited sources get their own queue so a single 429 doesn’t block everything else.

Lesson: Never put bulk operations in the same queue as time-sensitive imports.
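To make the routing concrete, here's a minimal sketch of how an importer might pick a queue. The SourceProfile shape, the queueNameFor helper, and the 100-company threshold are illustrative, not our exact production code.

```typescript
// Sketch: route each import to the right queue based on source traits.
// The SourceProfile shape and the thresholds are illustrative.
interface SourceProfile {
  companies: number;     // how many companies this source fans out to
  rateLimited: boolean;  // known to return 429s under load
}

function queueNameFor(source: SourceProfile): string {
  if (source.rateLimited) return 'rate-limited';
  // Large fan-out sources go to the isolated bulk queue
  if (source.companies > 100) return 'import-bulk';
  return 'import';
}

// Usage with a Bull-backed QueueManager would look roughly like:
// queues.get(queueNameFor(source)).add({ sourceId: source.id });
```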

Mistake #2: Fetching More Data Than You Need

Some ATS platforms expose rich career pages with images, custom fonts, and tracking scripts. When you’re importing from thousands of companies, all that unnecessary data adds up fast.

Our first implementation fetched everything — full responses with all embedded resources.

Day one: 3 GB of bandwidth consumed just on imports. When you’re paying per-GB for infrastructure, that’s not sustainable.

The fix: Strip responses down to only the structured data you actually need. Ignore images, stylesheets, fonts, and media assets — you only need the job title, description, skills, and metadata.

Result: ~30% less bandwidth. We only need the text data — not the company logos, hero images, or Google Fonts.
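As a sketch of what "strip everything but the text" can look like: many ATS career pages embed schema.org JobPosting data as JSON-LD, so you can parse just those blocks and ignore every other asset. The extractJobPostings helper below is illustrative, not our production extractor.

```typescript
// Sketch: pull only the structured JobPosting data out of a career page,
// ignoring images, fonts, and scripts. Names here are illustrative.
interface JobFields {
  title: string;
  description: string;
}

function extractJobPostings(html: string): JobFields[] {
  const out: JobFields[] = [];
  // Many ATS career pages embed schema.org JobPosting as JSON-LD
  const re = /<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    try {
      const data = JSON.parse(m[1]);
      if (data['@type'] === 'JobPosting') {
        out.push({ title: data.title ?? '', description: data.description ?? '' });
      }
    } catch {
      // Malformed JSON-LD: skip it, don't crash the importer
    }
  }
  return out;
}
```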

Mistake #3: The Deduplication Nightmare

When you import from 50+ sources, the same job appears everywhere. “Senior React Developer at Spotify” might be posted on one ATS, show up on LinkedIn, appear on another platform, and get syndicated to three aggregators.

Our first approach: match on title + company.

-- Naive dedup - doesn't work
SELECT id FROM jobs
WHERE lower(title) = lower($1)
  AND lower(company) = lower($2)
  AND is_active = true;

Problem 1: Same company, different names. “Deutsche Telekom” vs “T-Systems” vs “Telekom Deutschland” - all the same parent company.

Problem 2: Same job, different titles. “Sr. React Dev” vs “Senior React Developer” vs “React.js Engineer (Senior)”.

Problem 3: Reposted jobs. Company archives a listing and reposts it with a new ID but identical content.

The solution that actually works is a multi-layer approach:

import crypto from 'crypto';

// `pool` is the shared pg Pool instance used across importers
async function isDuplicate(job: JobData): Promise<boolean> {
  // Layer 1: Exact URL match (fastest)
  if (job.url) {
    const urlMatch = await pool.query(
      'SELECT id FROM jobs WHERE url = $1 AND is_active = true',
      [job.url]
    );
    if (urlMatch.rows.length > 0) return true;
  }

  // Layer 2: Title + Company exact match
  const contentMatch = await pool.query(`
    SELECT id FROM jobs
    WHERE lower(title) = lower($1)
      AND lower(company) = lower($2)
      AND is_active = true
  `, [job.title, job.company]);
  if (contentMatch.rows.length > 0) return true;

  // Layer 3: Content hash for reposted jobs
  // (more expensive, only run if layers 1-2 found no match)
  if (job.description) {
    const descHash = crypto
      .createHash('md5')
      .update(job.description.substring(0, 500).toLowerCase())
      .digest('hex');

    const hashMatch = await pool.query(
      'SELECT id FROM jobs WHERE content_hash = $1 AND is_active = true',
      [descHash]
    );
    if (hashMatch.rows.length > 0) return true;
  }

  return false;
}

This catches ~95% of duplicates. The remaining 5% are edge cases like jobs reposted with slightly reworded descriptions.
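For some of those edge cases, normalizing titles before comparison helps. Here's a hedged sketch; the normalizeTitle helper and its abbreviation map are illustrative, not exhaustive.

```typescript
// Sketch: normalize titles before comparing, so "Sr. React Dev" and
// "Senior React Developer" reduce to the same key. The abbreviation
// map is illustrative, not exhaustive.
const ABBREVIATIONS: Record<string, string> = {
  sr: 'senior',
  jr: 'junior',
  dev: 'developer',
  eng: 'engineer',
};

function normalizeTitle(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')   // drop punctuation like "." and "()"
    .split(/\s+/)
    .filter(Boolean)
    .map((w) => ABBREVIATIONS[w] ?? w)
    .sort()                          // word order no longer matters
    .join(' ');
}
```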

Critical index for the title + company lookup (without it, dedup checks take 2+ seconds each on 1M rows); the url and content_hash lookups need their own indexes too:

CREATE INDEX idx_jobs_content_dedup
ON jobs (lower(title), lower(company))
WHERE is_active = true;

Mistake #4: Not Handling ATS Rate Limits Gracefully

Each ATS platform has different tolerance for request volume. Some APIs are generous, others return 429s after moderate traffic, and a few provide helpful Retry-After headers.

Our first implementation: exponential backoff on everything.

The problem? A single 429 put the worker to sleep on backoff, stalling the entire shared queue - not just the source that was rate-limited.

Better approach: Per-source rate limiting with Bull’s built-in limiter:

// Each source gets its own rate limit profile
const RATE_LIMITS: Record<string, { max: number; duration: number }> = {
  'source-fast':    { max: 100, duration: 60000 }, // generous API
  'source-medium':  { max: 50,  duration: 60000 }, // moderate
  'source-strict':  { max: 20,  duration: 60000 }, // strict rate limits
  'source-heavy':   { max: 5,   duration: 60000 }, // resource-intensive
};

Combined with queue isolation, a rate limit hit on one source never affects the others.
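As a sketch, the profile lookup can fall back to a conservative default for unknown sources; the limiterFor helper and DEFAULT_LIMIT are illustrative names.

```typescript
// Sketch: pick a rate-limit profile for a source, with a conservative
// default for unknown sources. Profile names are illustrative.
const RATE_LIMITS: Record<string, { max: number; duration: number }> = {
  'source-fast':   { max: 100, duration: 60000 },
  'source-strict': { max: 20,  duration: 60000 },
};

const DEFAULT_LIMIT = { max: 10, duration: 60000 };

function limiterFor(source: string): { max: number; duration: number } {
  return RATE_LIMITS[source] ?? DEFAULT_LIMIT;
}

// With Bull, the profile plugs into the queue's limiter option, roughly:
// new Queue(`import-${source}`, redisUrl, { limiter: limiterFor(source) });
```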

Mistake #5: The 2 AM Core Dump Disaster

Our VM has 30 GB of disk. One morning we woke up to everything dead - disk 100% full. PM2 couldn’t restart, logs couldn’t write, even ls was slow.

The culprit? Node.js core dumps. During heavy scheduled jobs (2-3 AM), occasional OOM crashes generated core dump files of up to 2.5 GB each. After a week: 18 core dumps, 11.5 GB gone.

# The permanent fix
echo 'kernel.core_pattern=/dev/null' | sudo tee /etc/sysctl.d/50-no-coredumps.conf
sudo sysctl -p /etc/sysctl.d/50-no-coredumps.conf

echo 'ulimit -c 0' | sudo tee /etc/profile.d/no-coredumps.sh

Lesson: If you’re running heavy Node.js processes on a small VM, disable core dumps. You’re never going to debug a 2.5 GB binary dump anyway.

Mistake #6: Trusting posted_date From External Sources

Job listings come with a posted_date field. Simple, right? Just store it.

Except some ATS platforms return garbage in that field. We’ve seen:

  • "Reportar empleo" (Spanish for “Report job” — parsed from the wrong field)
  • "Apply Now"
  • "3 days ago" (relative dates that need parsing)
  • Empty strings, nulls, Unix timestamps in seconds vs milliseconds

Our archive job command was crashing because it tried to insert these as dates:

-- The fix: Validate before inserting. A bare "starts with a digit" check
-- isn't enough ("3 days ago" starts with a digit too), so require an
-- ISO-style YYYY-MM-DD prefix before casting.
INSERT INTO archived_jobs (posted_date, ...)
VALUES (
  CASE WHEN posted_date ~ '^\d{4}-\d{2}-\d{2}' THEN posted_date::date
       ELSE NULL
  END,
  ...
);
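The same validation is worth doing application-side before the data ever reaches SQL. A sketch follows; the normalizePostedDate helper and its heuristics are illustrative.

```typescript
// Sketch: normalize a raw posted_date value before it reaches SQL.
// Returns null for garbage. The heuristics are illustrative.
function normalizePostedDate(raw: string | null | undefined): Date | null {
  if (!raw) return null;
  const value = raw.trim();

  // Relative dates like "3 days ago"
  const rel = value.match(/^(\d+)\s+day(s)?\s+ago$/i);
  if (rel) return new Date(Date.now() - Number(rel[1]) * 86400000);

  // Unix timestamps: seconds (10 digits) vs milliseconds (13 digits)
  if (/^\d{10}$/.test(value)) return new Date(Number(value) * 1000);
  if (/^\d{13}$/.test(value)) return new Date(Number(value));

  // ISO-ish dates ("2024-01-15", "2024-01-15T09:00:00Z")
  if (/^\d{4}-\d{2}-\d{2}/.test(value)) {
    const d = new Date(value);
    return isNaN(d.getTime()) ? null : d;
  }

  // "Apply Now", "Reportar empleo", empty strings: garbage
  return null;
}
```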

Lesson: Never trust external data. Validate everything, especially dates.

The Numbers

After 6 months of iteration:

| Metric | Value |
| --- | --- |
| Active job listings | 1,000,000+ |
| ATS platforms integrated | 50+ |
| Daily new jobs processed | 3,000 - 10,000 |
| Dedup accuracy | ~95% |
| Average import time (API-based) | 2-5 minutes |
| Average import time (page-based) | 15-45 minutes |
| Bandwidth reduction | 30% after optimization |
| PostgreSQL indexes | 55+ (yes, really) |

What I’d Do Differently

  1. Start with queue isolation from day one. The “one queue” approach seems simpler but creates cascading failures the moment one source gets slow.

  2. Build a dedup service, not dedup logic. Having dedup scattered across importers leads to inconsistencies. Centralize it.

  3. Monitor bandwidth per source. We didn’t track which importer was consuming the most resources until costs spiked.

  4. Never use OFFSET pagination. On a 1M-row table, OFFSET 50000 means PostgreSQL scans and discards 50,000 rows before returning anything. Use keyset pagination instead:

-- Bad: Gets slower as offset increases
SELECT * FROM jobs ORDER BY id LIMIT 50 OFFSET 50000;

-- Good: Constant time regardless of position
SELECT * FROM jobs WHERE id > $lastSeenId ORDER BY id LIMIT 50;
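Wrapped in code, keyset pagination is a simple loop: remember the last id you saw and ask for rows after it. The sketch below injects the page-fetching function so it stays self-contained; in production it would run the "WHERE id > $1 ORDER BY id LIMIT $2" query above. Names are illustrative.

```typescript
// Sketch: keyset pagination as an async generator. fetchPage is injected
// so this stays storage-agnostic and testable without a database.
interface Row {
  id: number;
}

async function* paginate<T extends Row>(
  fetchPage: (lastSeenId: number, limit: number) => Promise<T[]>,
  limit = 50,
): AsyncGenerator<T> {
  let lastSeenId = 0;
  while (true) {
    const rows = await fetchPage(lastSeenId, limit);
    for (const row of rows) yield row;
    if (rows.length < limit) break; // short page means we hit the end
    lastSeenId = rows[rows.length - 1].id;
  }
}
```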

  5. Validate every field from external sources. Dates, URLs, salary ranges - everything can and will contain garbage.

Try It

If you want to see the result of all this pain, check out MisuJob. It’s free to use - upload your CV and the AI matches you to relevant positions across 1M+ listings.

The category pages like Remote Jobs, Python Jobs, or Jobs in Berlin are all server-side rendered with full SEO - another rabbit hole I’ll write about next time.


Building something that ingests data from dozens of external sources? We’d love to hear your war stories in the comments.

Tags: Node.js, Web Development, Architecture, TypeScript