Engineering

Building a Document Classification System with OpenAI and Node.js

Learn how to build a powerful document classification system using OpenAI's API and Node.js. Automate document categorization with high accuracy!

Founder & Engineer · 7 min read
Conceptual image of documents being sorted and categorized using AI and code.

Imagine being able to automatically categorize thousands of documents with near-human accuracy. At MisuJob, we’re constantly exploring ways to improve how job seekers find their ideal roles, and document classification plays a crucial role in achieving that goal.


Document classification is a fundamental task in many applications, from routing customer support tickets to automatically organizing legal documents. In our context, accurately classifying job descriptions is critical to MisuJob’s AI-powered job matching, which processes 1M+ job listings to connect professionals across Europe with relevant opportunities. A well-built classification system allows us to understand the content of a document without manual intervention, enabling more efficient search and personalized recommendations.

This blog post details how we built a document classification system using OpenAI’s powerful language models and Node.js, providing a scalable and accurate solution for categorizing large volumes of text data.

The Challenge: Scalable and Accurate Document Classification

Traditional methods of document classification, such as keyword-based approaches or simple machine learning models, often struggle with nuanced language and the sheer scale of data. We needed a system that could:

  • Handle complex language: Understand the context and meaning of text, not just keywords.
  • Scale to millions of documents: Efficiently process large volumes of data without performance bottlenecks.
  • Adapt to new categories: Easily incorporate new job categories and skills as the job market evolves.
  • Be cost-effective: Minimize the cost of processing each document.

Our Solution: OpenAI Embeddings and Node.js

We opted for a solution that leverages the power of OpenAI’s text embeddings and Node.js for backend processing. Here’s a breakdown of the key components:

  1. OpenAI Embeddings: We use OpenAI’s text-embedding-ada-002 model to generate embeddings for each document. These embeddings are high-dimensional vectors that capture the semantic meaning of the text.
  2. Vector Database: We store the embeddings in a vector database (e.g., Pinecone, Weaviate) to enable efficient similarity searches.
  3. Node.js Backend: A Node.js API handles the document processing, embedding generation, and classification logic.
  4. Classification Logic: We use cosine similarity to compare the embeddings of new documents with the embeddings of pre-defined categories.
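To make step 4 concrete, here is a minimal sketch of cosine similarity in plain JavaScript. In practice the vector database computes this for us, but the underlying comparison is just this:

```javascript
// Cosine similarity between two equal-length vectors: the dot product
// divided by the product of the vector magnitudes. Scores range from
// -1 (opposite) through 0 (unrelated) to 1 (identical direction).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

Because embeddings that encode similar meanings point in similar directions, a job description about backend development will score higher against the "Software Engineer" category embedding than against "Marketing Manager".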

Step-by-Step Implementation

Let’s walk through the key steps involved in building the document classification system:

1. Setting up the Node.js Environment

First, we need to set up a Node.js environment and install the necessary dependencies:

mkdir document-classifier
cd document-classifier
npm init -y
npm install openai dotenv pg
  • openai: The OpenAI Node.js library for interacting with the OpenAI API.
  • dotenv: For managing environment variables.
  • pg: A PostgreSQL client for Node.js (we use PostgreSQL to store our category data, though you could substitute another database).

Next, create a .env file to store your OpenAI API key and database connection details:

OPENAI_API_KEY=YOUR_OPENAI_API_KEY
DATABASE_URL=YOUR_DATABASE_CONNECTION_STRING

2. Creating Category Embeddings

We need to create embeddings for our pre-defined job categories. These embeddings will serve as the basis for classifying new documents. We store the categories in a database.

-- The VECTOR type requires the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table to store job categories
CREATE TABLE job_categories (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    embedding VECTOR(1536) -- OpenAI text-embedding-ada-002 produces 1536-dimensional vectors
);

A Node.js script then generates the embeddings and stores them in the database.

// index.js
require('dotenv').config();
const { OpenAI } = require('openai');
const { Pool } = require('pg');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function generateEmbedding(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: text,
  });
  return response.data[0].embedding;
}

async function storeCategoryEmbedding(categoryName, embedding) {
  const client = await pool.connect();
  try {
    // pgvector expects the vector as a string literal like '[0.1,0.2,...]',
    // so serialize the array instead of passing it directly (a plain JS
    // array would be sent as a Postgres array literal, which fails).
    await client.query(
      'INSERT INTO job_categories (name, embedding) VALUES ($1, $2)',
      [categoryName, JSON.stringify(embedding)]
    );
    console.log(`Embedding stored for category: ${categoryName}`);
  } finally {
    client.release();
  }
}

async function main() {
  const categories = [
    'Software Engineer',
    'Data Scientist',
    'Product Manager',
    'Marketing Manager',
    'Financial Analyst',
  ];

  for (const category of categories) {
    const embedding = await generateEmbedding(category);
    await storeCategoryEmbedding(category, embedding);
  }

  await pool.end();
}

main().catch(console.error);

3. Classifying New Documents

Now, let’s implement the logic to classify new documents based on their similarity to the category embeddings.

// Function to classify a document
async function classifyDocument(documentText) {
  const documentEmbedding = await generateEmbedding(documentText);
  const client = await pool.connect();

  try {
    // <=> is pgvector's cosine distance operator, so 1 - distance gives
    // cosine similarity. The embedding is serialized for the vector type.
    const result = await client.query(`
      SELECT id, name, 1 - (embedding <=> $1) AS similarity
      FROM job_categories
      ORDER BY similarity DESC
      LIMIT 1
    `, [JSON.stringify(documentEmbedding)]);

    if (result.rows.length > 0) {
      const { name, similarity } = result.rows[0];
      console.log(`Document classified as: ${name} (Similarity: ${similarity})`);
      return { category: name, similarity };
    } else {
      console.log('No categories found.');
      return null;
    }
  } finally {
    client.release();
  }
}

// Example usage
async function testClassification() {
  const sampleDocument = 'We are looking for a skilled software engineer to develop and maintain our web applications.';
  await classifyDocument(sampleDocument);
  await pool.end(); // close the connection pool when done
}

testClassification().catch(console.error);

This query uses pgvector's cosine distance operator (`<=>`) to compare the document embedding against every category embedding, returning the category with the highest cosine similarity (1 minus the distance).
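One refinement worth adding: the query always returns the nearest category, even when the match is weak, so it helps to guard with a minimum similarity threshold. A minimal sketch (the 0.8 cutoff is illustrative, not a tuned value; calibrate it against a labeled test set):

```javascript
// Reject weak matches instead of forcing every document into a category.
// The threshold is illustrative; tune it against labeled data.
const SIMILARITY_THRESHOLD = 0.8;

function applyThreshold(match, threshold = SIMILARITY_THRESHOLD) {
  if (!match || match.similarity < threshold) {
    return { category: 'uncategorized', similarity: match ? match.similarity : 0 };
  }
  return match;
}

console.log(applyThreshold({ category: 'Software Engineer', similarity: 0.91 }));
// → { category: 'Software Engineer', similarity: 0.91 }
console.log(applyThreshold({ category: 'Data Scientist', similarity: 0.42 }));
// → { category: 'uncategorized', similarity: 0.42 }
```

Documents that fall below the threshold can be routed to a manual review queue rather than mislabeled.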

4. Scaling the System

To handle large volumes of documents, we can leverage several techniques:

  • Asynchronous Processing: Use message queues (e.g., RabbitMQ, Kafka) to decouple document ingestion from classification.
  • Parallel Processing: Distribute the classification workload across multiple worker nodes.
  • Caching: Cache frequently accessed category embeddings to reduce database load.
  • Batch Processing: Send multiple documents in a single API request to OpenAI to reduce the number of API calls.
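The batching idea is straightforward to apply because the embeddings endpoint accepts an array of inputs. A sketch, assuming the same openai client as above (the batch size of 100 is an arbitrary choice for illustration, not a verified API limit):

```javascript
// Split an array into fixed-size batches.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed many documents with one API call per batch instead of one per document.
async function generateEmbeddingsInBatches(openai, texts, batchSize = 100) {
  const embeddings = [];
  for (const batch of chunk(texts, batchSize)) {
    const response = await openai.embeddings.create({
      model: 'text-embedding-ada-002',
      input: batch, // the embeddings endpoint accepts an array of inputs
    });
    // Each result carries an index, so restore the original input order.
    const sorted = [...response.data].sort((a, b) => a.index - b.index);
    embeddings.push(...sorted.map((item) => item.embedding));
  }
  return embeddings;
}
```

Batching cuts per-request overhead substantially when backfilling embeddings for an existing corpus.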

Performance and Cost Considerations

We observed significant improvements in both accuracy and scalability compared to our previous keyword-based approach. Here are some key performance metrics:

  • Accuracy: The OpenAI-based system achieves an average accuracy of 92% on our test dataset, compared to 75% for the keyword-based system.
  • Latency: The average classification latency is 200ms per document, including embedding generation and similarity search.
  • Cost: The cost of generating embeddings depends on the number of tokens in the document. We estimate the cost to be around $0.0001 per document for an average job description (approximately 500 tokens).

To further optimize costs, we explored techniques like:

  • Truncating Long Documents: Limiting the number of tokens sent to the OpenAI API.
  • Using Cheaper Embedding Models: Experimenting with different embedding models to find a balance between accuracy and cost.
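Truncation can be as simple as capping the input at an approximate token budget before calling the API. A rough sketch using the common heuristic of roughly 4 characters per English token (an approximation; a real tokenizer such as tiktoken gives exact counts):

```javascript
// Roughly cap text at a token budget using the ~4 characters/token heuristic.
// This is an approximation; use a real tokenizer for exact counts.
function truncateToApproxTokens(text, maxTokens) {
  const maxChars = maxTokens * 4;
  if (text.length <= maxChars) return text;
  // Cut at the last word boundary within the budget to avoid splitting a word.
  const slice = text.slice(0, maxChars);
  const lastSpace = slice.lastIndexOf(' ');
  return lastSpace > 0 ? slice.slice(0, lastSpace) : slice;
}

console.log(truncateToApproxTokens('a short description', 500)); // unchanged
```

For job descriptions, the opening paragraphs usually carry the strongest category signal, so truncating the tail rarely hurts accuracy in our experience.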

Real-World Impact on MisuJob’s Data

The improved document classification system has had a significant impact on MisuJob’s AI-powered job matching. By accurately categorizing job descriptions, we can:

  • Improve Job Search Relevance: Provide more relevant search results to job seekers.
  • Enhance Personalized Recommendations: Recommend jobs that align with a user’s skills and experience.
  • Automate Job Categorization: Reduce the manual effort required to categorize new job listings.

For example, consider a job seeker in Berlin searching for “Data Scientist” roles. With the improved classification system, we can ensure that they see relevant listings from companies across Germany, as well as opportunities in other European cities like Amsterdam, Paris, and London.

Furthermore, the improved categorization directly impacts the salary insights we are able to provide. By having more accurate categorization of job titles, we can provide better salary ranges across Europe.

| Job Title | Germany (Avg. €) | Netherlands (Avg. €) | UK (Avg. £) | France (Avg. €) |
|---|---|---|---|---|
| Software Engineer | 75,000 | 70,000 | 65,000 | 60,000 |
| Data Scientist | 80,000 | 75,000 | 70,000 | 65,000 |
| Product Manager | 90,000 | 85,000 | 80,000 | 75,000 |
| Marketing Manager | 70,000 | 65,000 | 60,000 | 55,000 |
| Financial Analyst | 65,000 | 60,000 | 55,000 | 50,000 |

These are average figures, and actual salaries can vary depending on experience, location within the country, and company size. For example, senior software engineers in London may command salaries exceeding £90,000, while entry-level positions in smaller German cities might be closer to €60,000.

Future Enhancements

We plan to further enhance the document classification system by:

  • Fine-tuning the Embedding Model: Fine-tuning the OpenAI embedding model on our own dataset of job descriptions to improve accuracy for specific job categories.
  • Implementing Active Learning: Using active learning techniques to identify and label the most informative documents for training.
  • Adding Support for Multiple Languages: Expanding the system to support multiple European languages.

Key Takeaways

Building a document classification system with OpenAI and Node.js offers a powerful and scalable solution for categorizing large volumes of text data. By leveraging the semantic understanding of OpenAI embeddings and the efficiency of Node.js, we were able to significantly improve the accuracy and scalability of MisuJob’s AI-powered job matching.

  • OpenAI embeddings provide superior accuracy compared to traditional keyword-based approaches.
  • Node.js offers a scalable and cost-effective platform for building the classification system.
  • Vector databases enable efficient similarity searches for classifying new documents.
  • Continuous monitoring and optimization are crucial for maintaining the accuracy and performance of the system.

By implementing these techniques, your team can build a robust document classification system that unlocks valuable insights from your data and enhances the user experience.

Tags: openai · nodejs · document classification · machine learning · api
Pablo Inigo

Founder & Engineer

Building MisuJob - an AI-powered job matching platform processing 1M+ job listings daily.
