We import job listings from 50+ sources. Here’s a museum of the garbage we’ve encountered in production data — and the defensive coding patterns that save us.
The Hall of Shame
Dates That Aren’t Dates
Expected: "2026-02-15"
Received:
"Reportar empleo" (Spanish for “Report this job”)
"Apply Now"
"3 days ago"
"Posted: Recently"
1708000000 (Unix timestamp in seconds, not milliseconds)
"" (empty string)
"null" (the string “null”, not null)
The fix:
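-- The pattern checks shape only: a value like '2026-13-45' still matches
-- and makes the cast raise an error, so range-check or pre-validate upstream.
CASE WHEN posted_date ~ '^\d{4}-\d{2}-\d{2}' THEN posted_date::date
     ELSE NULL
END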
Locations That Aren’t Locations
Expected: "Berlin, Germany"
Received:
"Remote" (that’s a remote_type, not a location)
"Multiple Locations" (thanks, very helpful)
"TBD"
"See job description"
"London, London, London, United Kingdom, London" (yes, London appears 4 times)
"N/A" / "Not specified" / "-"
Salaries That Aren’t Salaries
Expected: "$60,000 - $80,000"
Received:
"Competitive" (every recruiter’s favorite word)
"DOE" (Depends On Experience)
"0" (thanks for the exposure)
"1" (minimum legal salary?)
"999999" (placeholder that made it to production)
"$60,000 - $80,000 - $100,000" (three numbers, zero clarity)
HTML in Plain Text Fields
Expected: "Senior React Developer"
Received:
"Senior React Developer&nbsp;"
"<strong>Lead Engineer</strong>"
"Software Engineer\n\n\n\n\n\n\n" (7 newlines)
"" (zero-width spaces, invisible but they break string matching)
Defensive Patterns
1. The Sanitizer Chain
function sanitizeText(input: string | null | undefined): string {
  if (!input) return '';
  return input
    .replace(/<[^>]*>/g, '') // Strip HTML tags
    .replace(/&(?:#\d+|#x[0-9a-f]+|\w+);/gi, ' ') // Replace named and numeric HTML entities with a space
    .replace(/[\u200B-\u200D\uFEFF]/g, '') // Strip zero-width chars
    .replace(/\s+/g, ' ') // Collapse whitespace
    .trim();
}
2. The Date Validator
function parseDate(input: string | null): Date | null {
  if (!input) return null;
  // Must start with YYYY-MM-DD. A bare leading digit would let "3 days ago"
  // and "1708000000" reach the Date constructor, whose handling of
  // non-ISO strings varies between engines.
  if (!/^\d{4}-\d{2}-\d{2}/.test(input)) return null;
  const parsed = new Date(input);
  if (isNaN(parsed.getTime())) return null;
  // Reject dates in the future or too far in the past
  const now = new Date();
  if (parsed > now) return null;
  if (parsed < new Date('2020-01-01')) return null;
  return parsed;
}
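The Unix timestamp from the Hall of Shame deserves its own branch, because the Date constructor interprets a number as milliseconds: 1708000000 passed as a number lands in January 1970. A minimal sketch of one way to handle it follows. The `parseEpoch` helper and the seconds-versus-milliseconds cutoff are assumptions for illustration, not part of the original validator.

```typescript
// Assumption: all-digit values below 1e11 are seconds, larger ones are milliseconds.
// As seconds, 1e11 is thousands of years in the future; as milliseconds it is 1973,
// so current timestamps fall clearly on one side or the other.
const SECONDS_CUTOFF = 1e11;

// Hypothetical helper: interpret an all-digit string as a Unix epoch value.
function parseEpoch(input: string): Date | null {
  // 9 to 13 digits covers second and millisecond timestamps for recent decades.
  if (!/^\d{9,13}$/.test(input)) return null;
  const n = Number(input);
  const ms = n < SECONDS_CUTOFF ? n * 1000 : n;
  const parsed = new Date(ms);
  return isNaN(parsed.getTime()) ? null : parsed;
}
```

Under this heuristic, `parseEpoch("1708000000")` and `parseEpoch("1708000000000")` resolve to the same instant, and the result can then go through the same future and past range checks as any other date.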
3. The “Is This Actually a Location?” Check
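const NOT_A_LOCATION = [
  'remote', 'multiple', 'tbd', 'n/a', 'not specified',
  'see description', 'see job description', 'various', 'anywhere', 'flexible'
];

function isRealLocation(loc: string): boolean {
  const normalized = loc.trim().toLowerCase();
  // Empty strings and a bare "-" carry no location either
  if (normalized === '' || normalized === '-') return false;
  return !NOT_A_LOCATION.some(bad => normalized.includes(bad));
}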
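The check above filters out placeholders, but it does nothing for a real location that repeats itself, like the London example from the Hall of Shame. A small sketch of one possible cleanup follows. The `dedupeLocationParts` helper is hypothetical and not part of the original code.

```typescript
// Hypothetical helper: drop repeated comma-separated parts, keeping first occurrences.
function dedupeLocationParts(loc: string): string {
  const seen = new Set<string>();
  const parts: string[] = [];
  for (const raw of loc.split(',')) {
    const part = raw.trim();
    const key = part.toLowerCase(); // Compare case-insensitively
    if (part === '' || seen.has(key)) continue;
    seen.add(key);
    parts.push(part);
  }
  return parts.join(', ');
}
```

This turns "London, London, London, United Kingdom, London" into "London, United Kingdom". The trade-off is that a legitimate repeat such as "New York, New York" also collapses to a single part, so whether to apply it depends on how the location is used downstream.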
The Lesson
Every field from an external source can and will contain garbage. Validate at the boundary, never in the middle of your business logic.
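Validating at the boundary can be sketched as a single conversion function that turns an untrusted record into a typed, clean one, so nothing downstream has to re-check fields. Everything in this sketch is an assumption made for illustration: the `RawListing` and `CleanListing` shapes, the placeholder list, and the rule that a listing without a title is dropped.

```typescript
// Hypothetical shapes; field names are illustrative, not an actual schema.
interface RawListing {
  title?: unknown;
  location?: unknown;
}

interface CleanListing {
  title: string;
  location: string | null;
}

// Assumed set of values that mean "no data".
const PLACEHOLDERS = new Set(['', 'n/a', 'tbd', '-', 'null', 'not specified']);

// Returns a trimmed string, or null for non-strings and placeholder values.
function cleanField(value: unknown): string | null {
  if (typeof value !== 'string') return null;
  const trimmed = value.replace(/\s+/g, ' ').trim();
  return PLACEHOLDERS.has(trimmed.toLowerCase()) ? null : trimmed;
}

// The only place raw data is inspected; downstream code sees CleanListing only.
function toCleanListing(raw: RawListing): CleanListing | null {
  const title = cleanField(raw.title);
  if (title === null) return null; // Assumption: a listing without a title is unusable
  return { title, location: cleanField(raw.location) };
}
```

Because the return type is `CleanListing | null`, the compiler forces callers to handle rejected records at the import step, and business logic can treat `title` as a real string without further checks.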
This is the data hygiene layer at MisuJob, cleaning 10,000+ new job listings daily from 50+ sources.
What’s the worst data you’ve received from an external API? Comments below.

