Originally published on dev.to

Never Trust External Data: A Collection of Garbage We've Found in Production

A museum of garbage data from 50+ job sources and the defensive coding patterns that save us daily.

Pablo Inigo · Founder & Engineer

We import job listings from 50+ sources. Here’s a museum of the garbage we’ve encountered in production data — and the defensive coding patterns that save us.

The Hall of Shame

Dates That Aren’t Dates

Expected: "2026-02-15"

Received:

  • "Reportar empleo" (Spanish for “Report this job”)
  • "Apply Now"
  • "3 days ago"
  • "Posted: Recently"
  • 1708000000 (Unix timestamp in seconds, not milliseconds)
  • "" (empty string)
  • "null" (the string “null”, not null)

The fix:

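-- PostgreSQL: cast only values that start with an ISO date; everything else becomes NULL.
-- The regex checks shape only, so an impossible date such as '2026-13-45' would still make the cast fail.
CASE WHEN posted_date ~ '^\d{4}-\d{2}-\d{2}' THEN left(posted_date, 10)::date
     ELSE NULL
END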

Locations That Aren’t Locations

Expected: "Berlin, Germany"

Received:

  • "Remote" (that’s a remote_type, not a location)
  • "Multiple Locations" (thanks, very helpful)
  • "TBD"
  • "See job description"
  • "London, London, London, United Kingdom, London" (yes, London appears 4 times)
  • "N/A" / "Not specified" / "-"
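
The repeated-city case can be cleaned up mechanically. Here is a minimal sketch, assuming parts are comma-separated and that a case-insensitive repeat is always noise; `dedupeLocationParts` is a hypothetical helper name, not something from our pipeline.

```typescript
// Hypothetical helper: collapse repeated parts of a comma-separated location.
function dedupeLocationParts(loc: string): string {
  const seen = new Set<string>();
  const parts: string[] = [];
  for (const raw of loc.split(',')) {
    const part = raw.trim();
    const key = part.toLowerCase();
    if (part === '' || seen.has(key)) continue; // skip blanks and repeats
    seen.add(key);
    parts.push(part);
  }
  return parts.join(', ');
}

console.log(dedupeLocationParts('London, London, London, United Kingdom, London'));
// prints "London, United Kingdom"
```

One trade-off to keep in mind: a legitimate value like "New York, New York" would also collapse to a single part, so this probably belongs behind a per-source flag.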

Salaries That Aren’t Salaries

Expected: "$60,000 - $80,000"

Received:

  • "Competitive" (every recruiter’s favorite word)
  • "DOE" (Depending on Experience)
  • "0" (thanks for the exposure)
  • "1" (minimum legal salary?)
  • "999999" (placeholder that made it to production)
  • "$60,000 - $80,000 - $100,000" (three numbers, zero clarity)
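
A salary check along these lines would reject every value in that list. This is a sketch under assumptions: the `SalaryRange` shape, the `parseSalary` name, and the 1000 plausibility floor are all illustrative, and the floor would need tuning per currency and pay period.

```typescript
interface SalaryRange { min: number; max: number; }

// Assumed threshold, not from the original pipeline.
const MIN_PLAUSIBLE = 1000;
// Known placeholder value seen in production data.
const PLACEHOLDERS = new Set<number>([999999]);

function parseSalary(input: string | null | undefined): SalaryRange | null {
  if (!input) return null;
  // Pull out digit groups such as "60,000" or "80000"
  const matches = input.match(/\d[\d,]*/g) ?? [];
  const values = matches.map(m => Number(m.replace(/,/g, '')));
  // Accept a single figure or a min-max pair; three or more numbers are ambiguous
  if (values.length < 1 || values.length > 2) return null;
  // Reject implausibly small figures and known placeholders
  if (values.some(v => v < MIN_PLAUSIBLE || PLACEHOLDERS.has(v))) return null;
  return { min: Math.min(...values), max: Math.max(...values) };
}

console.log(parseSalary('$60,000 - $80,000')); // { min: 60000, max: 80000 }
console.log(parseSalary('Competitive'));       // null
```

It deliberately ignores currency, pay period, and shorthand like "60k"; those would each need their own rule.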

HTML in Plain Text Fields

Expected: "Senior React Developer"

Received:

  • "Senior React Developer "
  • "<strong>Lead Engineer</strong>"
  • "Software Engineer\n\n\n\n\n\n\n" (7 newlines)
  • "​​​​" (zero-width spaces — invisible but break string matching)

Defensive Patterns

1. The Sanitizer Chain

function sanitizeText(input: string | null | undefined): string {
  if (!input) return '';
  return input
    .replace(/<[^>]*>/g, '')                   // Strip HTML tags
    .replace(/&(#\d+|#x[\da-f]+|\w+);/gi, ' ') // Replace named and numeric HTML entities with a space
    .replace(/[\u200B-\u200D\uFEFF]/g, '')     // Strip zero-width chars
    .replace(/\s+/g, ' ')                      // Collapse whitespace, including newlines
    .trim();
}

2. The Date Validator

function parseDate(input: string | null): Date | null {
  if (!input) return null;
  // Require an ISO-style prefix: how Date parses other formats is implementation-defined
  if (!/^\d{4}-\d{2}-\d{2}/.test(input)) return null;

  const parsed = new Date(input);
  if (isNaN(parsed.getTime())) return null;

  // Reject dates in the future or too far in the past
  const now = new Date();
  if (parsed > now) return null;
  if (parsed < new Date('2020-01-01')) return null;

  return parsed;
}
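
parseDate hands a string to the Date constructor, so a Unix timestamp in seconds such as 1708000000 is never interpreted as seconds. A separate step can handle that case. The sketch below uses a hypothetical `parseUnixTimestamp` helper and assumes that any value below 1e11 is in seconds and anything larger is in milliseconds.

```typescript
// Hypothetical helper: turn a numeric Unix timestamp into a Date.
// Assumption: values below 1e11 are seconds, larger values are milliseconds.
function parseUnixTimestamp(input: number | string): Date | null {
  const text = String(input).trim();
  if (!/^\d{9,13}$/.test(text)) return null; // digits only, plausible length
  const value = Number(text);
  const millis = value < 1e11 ? value * 1000 : value;
  return new Date(millis);
}

console.log(parseUnixTimestamp(1708000000)?.toISOString());
// prints "2024-02-15T12:26:40.000Z"
```

The result should still go through the same future and too-old checks as any other date.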

3. The “Is This Actually a Location?” Check

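const NOT_A_LOCATION = [
  'remote', 'multiple', 'tbd', 'n/a', 'not specified',
  'see description', 'see job description', 'various', 'anywhere', 'flexible'
];

function isRealLocation(loc: string): boolean {
  const normalized = loc.trim().toLowerCase();
  // Empty strings and bare punctuation such as "-" contain no letters
  if (!/\p{L}/u.test(normalized)) return false;
  return !NOT_A_LOCATION.some(bad =>
    normalized.includes(bad)
  );
}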

The Lesson

Every field from an external source can and will contain garbage. Validate at the boundary, never in the middle of your business logic.
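
In practice, "validate at the boundary" can mean one function that turns an untrusted record into a typed one, so nothing downstream ever sees the raw shape. The following is a minimal sketch, not our actual ingestion code: the `RawJob` and `CleanJob` types, the `title` and `location` field names, and `parseJobAtBoundary` are assumptions made for illustration.

```typescript
// Raw shape as received: every field is untrusted.
type RawJob = Record<string, unknown>;

// Clean shape that business logic is allowed to see.
interface CleanJob {
  title: string;
  location: string | null;
  postedDate: Date | null;
}

type ParseResult =
  | { ok: true; job: CleanJob }
  | { ok: false; reason: string };

// Treat non-strings, empty strings, and the literal string "null" as missing.
function asTrimmedString(value: unknown): string | null {
  if (typeof value !== 'string') return null;
  const trimmed = value.trim();
  return trimmed === '' || trimmed.toLowerCase() === 'null' ? null : trimmed;
}

function parseJobAtBoundary(raw: RawJob): ParseResult {
  // A listing without a title is rejected outright
  const title = asTrimmedString(raw['title']);
  if (title === null) return { ok: false, reason: 'missing title' };

  // Optional fields degrade to null instead of carrying garbage forward
  const dateText = asTrimmedString(raw['posted_date']);
  const candidate = dateText !== null && /^\d{4}-\d{2}-\d{2}$/.test(dateText)
    ? new Date(dateText)
    : null;
  const postedDate = candidate !== null && !isNaN(candidate.getTime()) ? candidate : null;

  return {
    ok: true,
    job: { title, location: asTrimmedString(raw['location']), postedDate },
  };
}

const result = parseJobAtBoundary({ title: 'Lead Engineer', posted_date: 'Apply Now' });
console.log(result.ok ? result.job.postedDate : result.reason); // null
```

Business logic then accepts only `CleanJob`, which makes it a compile-time error to pass a raw record past the boundary.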

This is the data hygiene layer at MisuJob, cleaning 10,000+ new job listings daily from 50+ sources.


What’s the worst data you’ve received from an external API? Comments below.

