Last year we built MisuJob, an AI-powered job matching platform. The idea was simple: aggregate tech jobs from 50+ ATS platforms and match them to candidates using AI.
The reality? We spent 80% of our time figuring out how each platform actually works, fixing broken importers at 3 AM, and learning that deduplicating job listings is an unsolved problem.
Here’s everything we learned.
The Architecture (Before It All Went Wrong)
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│ Schedulers  │────>│ Bull Queues  │────>│   Workers    │
│ (node-cron) │     │   (Redis)    │     │  (3 pools)   │
└─────────────┘     └──────────────┘     └──────┬───────┘
                                                │
                          ┌─────────────────────┤
                          v                     v
                   ┌──────────────┐     ┌──────────────┐
                   │  PostgreSQL  │     │    Google    │
                   │   1M+ jobs   │     │ Indexing API │
                   └──────────────┘     └──────────────┘
Each ATS platform has its own importer. Some have public APIs, others require parsing their public career pages. Every importer feeds into Bull queues backed by Redis.
Simple, right? Let me tell you what went wrong.
Mistake #1: One Queue To Rule Them All
Our first design had a single import queue with 3 workers. It worked great until we onboarded a large ATS with 1,275 companies.
One of our largest ATS sources has a public API, but it only returns basic listing info — no full descriptions, no structured skills data. To get complete listings, we needed to call the API for each company and enrich the data. We had 1,275 companies to process. Each one takes 5-30 minutes (fetch listings, paginate through all jobs, extract and normalize structured data).
So at 3 AM, the scheduler enqueued 1,275 company-level import jobs. They filled the entire queue. All other ATS imports were blocked behind 1,275 bulk jobs that would take days to clear.
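A quick back-of-envelope check shows why "days to clear" is no exaggeration. The job count, duration range, and worker count come from above; using the midpoint of 5-30 minutes is my own assumption:

```typescript
// Rough math: 1,275 bulk jobs, ~17.5 min each (midpoint of 5-30), 3 workers
const jobs = 1275;
const avgMinutesPerJob = (5 + 30) / 2; // 17.5
const workers = 3;

const totalMinutes = (jobs * avgMinutesPerJob) / workers;
const days = totalMinutes / (60 * 24);

console.log(days.toFixed(1)); // ~5.2 days with every worker saturated
```

Five-plus days of solid processing, during which every other import source sits behind the bulk sweep.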
The fix: Queue isolation.
// QueueManager.ts - The fix that saved my sanity
import Queue from 'bull';

export class QueueManager {
  private queues: Map<string, Queue.Queue> = new Map();

  constructor() {
    // Core imports - NEVER blocked by bulk operations
    this.createQueue('import', { concurrency: 3 });
    // Heavy bulk operations - isolated
    this.createQueue('import-bulk', { concurrency: 2 });
    // Rate-limited sources get their own queue
    this.createQueue('rate-limited', { concurrency: 2 });
  }

  private createQueue(name: string, opts: { concurrency: number }): void {
    const queue = new Queue(name, process.env.REDIS_URL ?? 'redis://localhost:6379');
    this.queues.set(name, queue);
    // Each queue registers its own processor with `opts.concurrency` workers
    // via queue.process(opts.concurrency, handler) - omitted here
  }
}
Now bulk operations run in import-bulk (2 workers) while API-based imports run in import (3 workers), completely isolated. Rate-limited sources get their own queue so a single 429 doesn’t block everything else.
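On the scheduler side, routing can be a one-liner. This is a minimal sketch: the queue names match the QueueManager above, but the `queueFor` helper and its routing rules are illustrative:

```typescript
// Sketch: route each scheduled import to the right isolated queue.
type ImportKind = 'api' | 'bulk' | 'rate-limited';

function queueFor(kind: ImportKind): string {
  switch (kind) {
    case 'bulk':         return 'import-bulk';  // 1,275-company sweeps go here
    case 'rate-limited': return 'rate-limited'; // strict sources, throttled separately
    default:             return 'import';       // fast API imports, never blocked
  }
}

// The 3 AM bulk sweep no longer lands in front of time-sensitive imports:
console.log(queueFor('bulk')); // "import-bulk"
console.log(queueFor('api'));  // "import"
```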
Lesson: Never put bulk operations in the same queue as time-sensitive imports.
Mistake #2: Fetching More Data Than You Need
Some ATS platforms expose rich career pages with images, custom fonts, and tracking scripts. When you’re importing from thousands of companies, all that unnecessary data adds up fast.
Our first implementation fetched everything — full responses with all embedded resources.
Day one: 3 GB of bandwidth consumed just on imports. When you’re paying per-GB for infrastructure, that’s not sustainable.
The fix: Strip responses down to only the structured data you actually need. Ignore images, stylesheets, fonts, and media assets — you only need the job title, description, skills, and metadata.
Result: ~30% less bandwidth. We only need the text data — not the company logos, hero images, or Google Fonts.
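Here's a sketch of what that strip-down step can look like when a career page embeds schema.org JSON-LD (a common pattern, though not every ATS provides it). The `extractJobPosting` helper and `SlimJob` shape are illustrative, not our actual importer code:

```typescript
// Keep only the structured JobPosting data; drop markup, images, fonts, scripts.
interface SlimJob {
  title: string;
  description: string;
  company?: string;
}

function extractJobPosting(html: string): SlimJob | null {
  // Career pages commonly embed a JSON-LD block; grab the first one
  const match = html.match(
    /<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/i
  );
  if (!match) return null;

  try {
    const data = JSON.parse(match[1]);
    if (data['@type'] !== 'JobPosting') return null;
    return {
      title: data.title,
      description: data.description,
      company: data.hiringOrganization?.name,
    };
  } catch {
    return null; // malformed JSON-LD - skip rather than crash the importer
  }
}

const page = `<html><head>
  <script type="application/ld+json">{"@type":"JobPosting","title":"Senior React Developer","description":"...","hiringOrganization":{"name":"Spotify"}}</script>
</head><body><img src="hero.png"></body></html>`;

console.log(extractJobPosting(page)?.title); // "Senior React Developer"
```

A few hundred bytes of JSON instead of a multi-megabyte rendered page, per listing.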
Mistake #3: The Deduplication Nightmare
When you import from 50+ sources, the same job appears everywhere. “Senior React Developer at Spotify” might be posted on one ATS, show up on LinkedIn, appear on another platform, and get syndicated to three aggregators.
Our first approach: match on title + company.
-- Naive dedup - doesn't work
SELECT id FROM jobs
WHERE lower(title) = lower($1)
AND lower(company) = lower($2)
AND is_active = true;
Problem 1: Same company, different names. “Deutsche Telekom” vs “T-Systems” vs “Telekom Deutschland” - all the same parent company.
Problem 2: Same job, different titles. “Sr. React Dev” vs “Senior React Developer” vs “React.js Engineer (Senior)”.
Problem 3: Reposted jobs. Company archives a listing and reposts it with a new ID but identical content.
The solution that actually works is a multi-layer approach:
import crypto from 'crypto';
import { Pool } from 'pg';

const pool = new Pool(); // connection config via PG* env vars

async function isDuplicate(job: JobData): Promise<boolean> {
  // Layer 1: Exact URL match (fastest)
  if (job.url) {
    const urlMatch = await pool.query(
      'SELECT id FROM jobs WHERE url = $1 AND is_active = true',
      [job.url]
    );
    if (urlMatch.rows.length > 0) return true;
  }

  // Layer 2: Title + Company exact match
  const contentMatch = await pool.query(`
    SELECT id FROM jobs
    WHERE lower(title) = lower($1)
      AND lower(company) = lower($2)
      AND is_active = true
  `, [job.title, job.company]);
  if (contentMatch.rows.length > 0) return true;

  // Layer 3: Content hash for reposted jobs
  // (more expensive, only runs when layers 1-2 find no match)
  if (job.description) {
    const descHash = crypto
      .createHash('md5')
      .update(job.description.substring(0, 500).toLowerCase())
      .digest('hex');
    const hashMatch = await pool.query(
      'SELECT id FROM jobs WHERE content_hash = $1 AND is_active = true',
      [descHash]
    );
    if (hashMatch.rows.length > 0) return true;
  }

  return false;
}
This catches ~95% of duplicates. The remaining 5% are edge cases like jobs reposted with slightly reworded descriptions.
Critical index for performance (without this, dedup checks take 2+ seconds each on 1M rows):
CREATE INDEX idx_jobs_content_dedup
ON jobs (lower(title), lower(company))
WHERE is_active = true;
Mistake #4: Not Handling ATS Rate Limits Gracefully
Each ATS platform has different tolerance for request volume. Some APIs are generous, others return 429s after moderate traffic, and a few provide helpful Retry-After headers.
Our first implementation: exponential backoff on everything.
The problem? A single 429 would cause the worker to sleep, which blocked the entire queue for that source.
Better approach: Per-source rate limiting with Bull’s built-in limiter:
// Each source gets its own rate limit profile
const RATE_LIMITS: Record<string, { max: number; duration: number }> = {
'source-fast': { max: 100, duration: 60000 }, // generous API
'source-medium': { max: 50, duration: 60000 }, // moderate
'source-strict': { max: 20, duration: 60000 }, // strict rate limits
'source-heavy': { max: 5, duration: 60000 }, // resource-intensive
};
Combined with queue isolation, a rate limit hit on one source never affects the others.
Mistake #5: The 2 AM Core Dump Disaster
Our VM has 30 GB of disk. One morning we woke up to everything dead - disk 100% full. PM2 couldn’t restart, logs couldn’t write, even ls was slow.
The culprit? Node.js core dumps. During heavy scheduled jobs (2-3 AM), occasional OOM crashes generated 2.5 GB core dump files. After a week: 18 core dumps, 11.5 GB gone.
# The permanent fix
echo 'kernel.core_pattern=/dev/null' | sudo tee /etc/sysctl.d/50-no-coredumps.conf
sudo sysctl -p /etc/sysctl.d/50-no-coredumps.conf
echo 'ulimit -c 0' | sudo tee /etc/profile.d/no-coredumps.sh
Lesson: If you’re running heavy Node.js processes on a small VM, disable core dumps. You’re never going to debug a 2.5 GB binary dump anyway.
Mistake #6: Trusting posted_date From External Sources
Job listings come with a posted_date field. Simple, right? Just store it.
Except some ATS platforms return garbage in that field. We’ve seen:
"Reportar empleo"(Spanish for “Report job” — parsed from the wrong field)"Apply Now""3 days ago"(relative dates that need parsing)- Empty strings, nulls, Unix timestamps in seconds vs milliseconds
Our archive job command was crashing because it tried to insert these as dates:
-- The fix: Validate before inserting
INSERT INTO archived_jobs (posted_date, ...)
VALUES (
CASE WHEN posted_date ~ '^\d' THEN posted_date::date
ELSE NULL
END,
...
);
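The same validation belongs at the application layer too, before anything reaches SQL. This is a minimal sketch, assuming a hypothetical `normalizePostedDate` helper; the accepted shapes mirror the garbage listed above, and the seconds-vs-milliseconds cutoff is a heuristic:

```typescript
// Normalize whatever the source sends into a Date, or null if it's junk
function normalizePostedDate(raw: unknown, now = new Date()): Date | null {
  if (raw == null || raw === '') return null;

  if (typeof raw === 'number') {
    // Heuristic: values below ~1e12 are Unix seconds, not milliseconds
    const ms = raw < 1e12 ? raw * 1000 : raw;
    const d = new Date(ms);
    return isNaN(d.getTime()) ? null : d;
  }

  if (typeof raw !== 'string') return null;

  // Relative dates like "3 days ago"
  const rel = raw.match(/^(\d+)\s+day(s)?\s+ago$/i);
  if (rel) {
    return new Date(now.getTime() - Number(rel[1]) * 24 * 60 * 60 * 1000);
  }

  // Anything not starting with a digit ("Apply Now", "Reportar empleo") is junk
  if (!/^\d/.test(raw)) return null;

  const d = new Date(raw);
  return isNaN(d.getTime()) ? null : d;
}

console.log(normalizePostedDate('Reportar empleo')); // null
console.log(normalizePostedDate('2024-05-01')?.getFullYear()); // 2024
```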
Lesson: Never trust external data. Validate everything, especially dates.
The Numbers
After 6 months of iteration:
| Metric | Value |
|---|---|
| Active job listings | 1,000,000+ |
| ATS platforms integrated | 50+ |
| Daily new jobs processed | 3,000 - 10,000 |
| Dedup accuracy | ~95% |
| Average import time (API-based) | 2-5 minutes |
| Average import time (page-based) | 15-45 minutes |
| Bandwidth reduction | 30% after optimization |
| PostgreSQL indexes | 55+ (yes, really) |
What I’d Do Differently
Start with queue isolation from day one. The “one queue” approach seems simpler but creates cascading failures the moment one source gets slow.
Build a dedup service, not dedup logic. Having dedup scattered across importers leads to inconsistencies. Centralize it.
Monitor bandwidth per source. We didn’t track which importer was consuming the most resources until costs spiked.
Never use OFFSET pagination. On a 1M-row table, OFFSET 50000 means PostgreSQL scans and discards 50,000 rows before returning anything. Always use keyset pagination:
-- Bad: Gets slower as offset increases
SELECT * FROM jobs ORDER BY id LIMIT 50 OFFSET 50000;
-- Good: Constant time regardless of position
SELECT * FROM jobs WHERE id > $lastSeenId ORDER BY id LIMIT 50;
Validate every field from external sources. Dates, URLs, salary ranges - everything can and will contain garbage.
Try It
If you want to see the result of all this pain, check out MisuJob. It’s free to use - upload your CV and the AI matches you to relevant positions across 1M+ listings.
The category pages like Remote Jobs, Python Jobs, or Jobs in Berlin are all server-side rendered with full SEO - another rabbit hole I’ll write about next time.
Building something that ingests data from dozens of external sources? We’d love to hear your war stories in the comments.