Engineering Originally on dev.to

Building a Production Health Monitor in Node.js (Without Datadog or New Relic)

Datadog costs $15/host/month. We built a health monitoring system with daily reports for $0. Here's how.

Pablo Inigo · Founder & Engineer

Datadog costs $15/host/month. New Relic is $25/month. For a solo dev on a single VM, that’s overkill. Here’s how we built a health monitoring system that sends a daily email report without any third-party monitoring service.

What We Monitor

Every morning at 7 AM, we get an email report:

=== MisuJob Health Report ===

Database:
  Active jobs: 512,847
  New today: 3,247
  Queue backlog: 142 jobs

Performance:
  Avg response time: 89ms
  Slow queries (>5s): 2
  Error rate: 0.01%

Scheduled Jobs:
  Last 24h: 47 executed, 0 failed
  Cancelled (PM2 restart): 3

Disk: 14.2 GB / 29 GB (49%)
Memory: 1.1 GB / 2 GB (55%)
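The disk and memory lines in the report can be collected with Node's built-in modules alone. The following is a minimal sketch, not the code behind the report above: `UsageStat`, `toUsage`, `formatUsage`, and the `/` mount path are illustrative names and defaults, and `statfsSync` assumes Node 18.15 or newer.

```typescript
import { freemem, totalmem } from "node:os";
import { statfsSync } from "node:fs";

// Hypothetical shape for one resource line in the report.
interface UsageStat {
  usedBytes: number;
  totalBytes: number;
  percent: number;
}

function toUsage(usedBytes: number, totalBytes: number): UsageStat {
  const percent = totalBytes > 0 ? Math.round((usedBytes / totalBytes) * 100) : 0;
  return { usedBytes, totalBytes, percent };
}

// Reads memory from the OS and disk usage for one mount point.
// Memory figures are approximate: how page cache is counted varies by platform.
function checkSystemResources(mountPath = "/"): { disk: UsageStat; memory: UsageStat } {
  const memTotal = totalmem();
  const memory = toUsage(memTotal - freemem(), memTotal);

  const fsStats = statfsSync(mountPath);
  const diskTotal = fsStats.blocks * fsStats.bsize;
  const diskFree = fsStats.bavail * fsStats.bsize; // space available to unprivileged users
  const disk = toUsage(diskTotal - diskFree, diskTotal);

  return { disk, memory };
}

// Renders one line in the style of the report, e.g. "Memory: <used> GB / <total> GB (<pct>%)".
function formatUsage(label: string, stat: UsageStat): string {
  const gb = (bytes: number) => (bytes / 1024 ** 3).toFixed(1);
  return `${label}: ${gb(stat.usedBytes)} GB / ${gb(stat.totalBytes)} GB (${stat.percent}%)`;
}
```

Calling `formatUsage("Disk", checkSystemResources().disk)` would produce one line of the report without shelling out to `df` or `free`.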

The Architecture

class SystemHealthMonitor {
  async generateReport(): Promise<HealthReport> {
    const [db, perf, jobs, system] = await Promise.all([
      this.checkDatabase(),
      this.checkPerformance(),
      this.checkScheduledJobs(),
      this.checkSystemResources(),
    ]);

    return { db, perf, jobs, system };
  }
}
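One thing to watch with `Promise.all` is that a single rejected check, for example the database being unreachable, rejects the whole report, which is exactly when you want the email most. A possible variation, sketched here under assumed names (`CheckOutcome`, `runChecks`, and `hasIssues` are hypothetical, not part of the class above), isolates each check and records its error instead. It also shows one way a flag like `report.hasIssues` could be derived.

```typescript
// Hypothetical result wrapper: either the check's data or the error message.
type CheckOutcome = { ok: true; data: unknown } | { ok: false; error: string };

// Runs all checks in parallel. A check that throws or rejects is recorded
// as a failed outcome instead of rejecting the whole report.
async function runChecks(
  checks: Record<string, () => Promise<unknown>>
): Promise<Record<string, CheckOutcome>> {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, run]): Promise<[string, CheckOutcome]> => {
      try {
        return [name, { ok: true, data: await run() }];
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return [name, { ok: false, error: message }];
      }
    })
  );

  const report: Record<string, CheckOutcome> = {};
  for (const [name, outcome] of entries) {
    report[name] = outcome;
  }
  return report;
}

// True when at least one check could not complete.
function hasIssues(report: Record<string, CheckOutcome>): boolean {
  return Object.values(report).some((outcome) => !outcome.ok);
}
```

With this shape, a dead database shows up in the email as a failed `db` section while the disk and memory numbers still arrive.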

Avoiding False Positives

1. Materialized View Refresh Is NOT a Slow Query

async checkSlowQueries(): Promise<SlowQuery[]> {
  const result = await pool.query(`
    SELECT query, calls, mean_exec_time
    FROM pg_stat_statements
    WHERE mean_exec_time > 5000 -- > 5 seconds
    ORDER BY mean_exec_time DESC
  `);

  // Filter out known slow-but-expected queries
  return result.rows.filter(q =>
    !q.query.toLowerCase().includes('refresh materialized view')
  );
}

Our materialized view (MV) refresh takes 138 seconds. That’s expected behavior, not something to alert on.
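As more expected-slow statements accumulate, the single `includes` check can grow into a small allowlist. The sketch below is one way to do that, assuming hypothetical names (`SlowQueryRow`, `EXPECTED_SLOW_PATTERNS`, `filterUnexpectedSlowQueries`). It anchors the pattern to the start of the statement so that a query which merely mentions the phrase in a comment or string literal is still reported.

```typescript
// Hypothetical row shape for the pg_stat_statements query.
// node-postgres returns bigint columns such as `calls` as strings by default.
interface SlowQueryRow {
  query: string;
  calls: string;
  mean_exec_time: number; // milliseconds
}

// Statements that are known to be slow and should not raise an alert.
const EXPECTED_SLOW_PATTERNS: RegExp[] = [
  /^\s*refresh\s+materialized\s+view\b/i,
];

// Keeps only the slow queries that match none of the expected patterns.
function filterUnexpectedSlowQueries(
  rows: SlowQueryRow[],
  expected: RegExp[] = EXPECTED_SLOW_PATTERNS
): SlowQueryRow[] {
  return rows.filter((row) => !expected.some((pattern) => pattern.test(row.query)));
}
```

Adding a new exception is then a one-line change to the pattern list rather than another condition in the filter.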

2. Auto-Cancelled Jobs Are NOT Failures

async checkFailedJobs(): Promise<FailedJob[]> {
  const result = await pool.query(`
    SELECT job_name, error_message
    FROM scheduled_job_executions
    WHERE status = 'failed'
      AND started_at > NOW() - INTERVAL '24 hours'
  `);

  // PM2 restarts cancel in-progress jobs - not real failures
  return result.rows.filter(j =>
    !j.error_message?.includes('Auto-cancelled')
  );
}

When PM2 restarts, running scheduled jobs get interrupted. They’re recorded with status “failed”, but the interruption isn’t a real failure, so the report counts them separately as cancelled.
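The report shows failed and cancelled counts side by side, so both numbers can come from the same query result. Here is a small sketch of that split, assuming a hypothetical `JobExecutionRow` type and `splitFailedJobs` helper, and that cancelled jobs carry the `Auto-cancelled` marker in `error_message` as in the filter above.

```typescript
// Hypothetical row shape for scheduled_job_executions.
interface JobExecutionRow {
  job_name: string;
  error_message: string | null;
}

interface FailedJobSummary {
  failed: JobExecutionRow[];
  cancelled: JobExecutionRow[];
}

// Separates genuine failures from jobs interrupted by a process restart.
// Rows with a null error_message are treated as genuine failures.
function splitFailedJobs(rows: JobExecutionRow[], marker = "Auto-cancelled"): FailedJobSummary {
  const summary: FailedJobSummary = { failed: [], cancelled: [] };
  for (const row of rows) {
    if (row.error_message?.includes(marker)) {
      summary.cancelled.push(row);
    } else {
      summary.failed.push(row);
    }
  }
  return summary;
}
```

Keeping the cancelled rows rather than discarding them makes a sudden jump in restarts visible without turning each one into an alert.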

Sending the Report

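// Run daily at 7 AM UTC. Assumes `cron` is node-cron, which evaluates the
// expression in the server's local time zone unless `timezone` is set.
cron.schedule('0 7 * * *', async () => {
  try {
    const report = await healthMonitor.generateReport();
    const html = formatReportAsHtml(report);
    await ses.sendEmail({
      to: '[email protected]',
      subject: `Health Report: ${report.hasIssues ? 'ISSUES' : 'OK'}`,
      html
    });
  } catch (err) {
    // Log instead of leaving an unhandled rejection inside the cron callback
    console.error('Health report failed:', err);
  }
}, { timezone: 'UTC' });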
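The `ses.sendEmail({ to, subject, html })` call reads like a thin wrapper around SES. If that wrapper sits on AWS SDK v3, it might build the `SendEmailCommand` input roughly as follows. This is a sketch under that assumption: `ReportEmail`, `buildSesInput`, and the sender field are hypothetical, and only the input shape comes from `@aws-sdk/client-ses`.

```typescript
// Hypothetical wrapper-level message type.
interface ReportEmail {
  from: string; // must be an identity verified in SES
  to: string;
  subject: string;
  html: string;
}

// Subset of the input shape accepted by SendEmailCommand in @aws-sdk/client-ses.
interface SesSendEmailInput {
  Source: string;
  Destination: { ToAddresses: string[] };
  Message: {
    Subject: { Data: string; Charset: string };
    Body: { Html: { Data: string; Charset: string } };
  };
}

// Maps the simple wrapper arguments onto the SES request structure.
function buildSesInput(email: ReportEmail): SesSendEmailInput {
  return {
    Source: email.from,
    Destination: { ToAddresses: [email.to] },
    Message: {
      Subject: { Data: email.subject, Charset: "UTF-8" },
      Body: { Html: { Data: email.html, Charset: "UTF-8" } },
    },
  };
}

// Sending would then look like this (requires the SDK and AWS credentials):
//   import { SESClient, SendEmailCommand } from "@aws-sdk/client-ses";
//   await new SESClient({}).send(new SendEmailCommand(buildSesInput(email)));
```

Keeping the mapping in a pure function makes it easy to unit test the subject and recipient logic without touching AWS.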

Total added cost: $0, since we already pay for SES.

This monitors MisuJob — 1M+ listings, 50+ importers, all on a single VM.


How do you monitor your side projects? Uptime Robot, custom scripts, or just hope?
