
Sitemap Generation at Scale: Dynamic Sitemaps for 100K+ Pages

Learn how MisuJob tackled sitemap generation for 100K+ dynamic pages, ensuring discoverability of European job opportunities at scale. SEO best practices included!

Pablo Inigo · Founder & Engineer · 6 min read

Sitemaps are essential for search engine optimization, but generating them at scale for dynamic websites with hundreds of thousands of pages presents unique engineering challenges. At MisuJob, we tackled this problem head-on to ensure our comprehensive index of European job opportunities is easily discoverable.

The Challenge: Dynamic Content at Scale

MisuJob processes 1M+ job listings, aggregating from multiple sources across Europe. This means our website’s content is constantly evolving. Manually managing a sitemap just wasn’t feasible. We needed an automated solution that could:

  • Dynamically generate sitemaps reflecting the latest job postings.
  • Handle a large number of URLs (100K+).
  • Minimize the impact on website performance.
  • Ensure search engines could efficiently navigate our extensive job database.

Our initial approach involved a simple cron job that would rebuild the entire sitemap daily. However, as our data grew, this process became increasingly slow and resource-intensive, impacting our database performance. We quickly realized we needed a more sophisticated strategy.

Our Solution: Incremental Sitemap Generation

We implemented an incremental sitemap generation approach that regenerates only the sitemap segments whose job listings have changed since the last run. This significantly reduced the processing time and minimized the load on our servers. Here’s a breakdown of the key steps:

  1. Change Tracking: We implemented a system to track changes to our job listings. Every time a job is created, updated, or deleted, we record this event in a dedicated “change log” table in our database.

    CREATE TABLE job_change_log (
        id SERIAL PRIMARY KEY,
        job_id UUID NOT NULL,
        change_type VARCHAR(10) NOT NULL, -- 'CREATE', 'UPDATE', 'DELETE'
        created_at TIMESTAMP WITHOUT TIME ZONE DEFAULT (NOW() AT TIME ZONE 'utc')
    );
    
  2. Sitemap Segmentation: We divided our sitemap into smaller, more manageable files based on job categories and locations. This allows us to update only the specific sitemap files that contain the changed job listings. For example, we might have separate sitemaps for “Software Engineering Jobs in Berlin,” “Marketing Jobs in London,” and so on.

  3. Incremental Update Process: A scheduled task runs periodically (every hour in our case) to process the change log. It identifies the affected sitemap segments and regenerates only those segments.

    import datetime
    import uuid
    import xml.etree.ElementTree as ET
    
    MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemaps.org protocol
    
    def generate_sitemap(job_ids: list[uuid.UUID], filename: str) -> None:
        """Generates a sitemap XML file for the given job IDs."""
        if len(job_ids) > MAX_URLS_PER_FILE:
            raise ValueError(f"A sitemap file can list at most {MAX_URLS_PER_FILE} URLs")
    
        root = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
        # Simplification: every URL gets the generation time as its lastmod.
        # Use each job's real last-modified timestamp if you have it.
        generated_at = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    
        for job_id in job_ids:
            url = ET.SubElement(root, "url")
            loc = ET.SubElement(url, "loc")
            loc.text = f"https://misujob.com/job/{job_id}"  # Replace with your actual URL structure
            lastmod = ET.SubElement(url, "lastmod")
            lastmod.text = generated_at
            changefreq = ET.SubElement(url, "changefreq")
            changefreq.text = "daily"
            priority = ET.SubElement(url, "priority")
            priority.text = "0.8"
    
        ET.indent(root, space="   ")  # Pretty-print in place (Python 3.9+)
        # Write UTF-8 with an XML declaration that states the encoding
        ET.ElementTree(root).write(filename, encoding="utf-8", xml_declaration=True)
    
    # Example Usage (replace with your actual data)
    # job_ids = [uuid.uuid4() for _ in range(10)]
    # generate_sitemap(job_ids, "sitemap_example.xml")
    
  4. Sitemap Index: We maintain a sitemap index file that lists all the individual sitemap files. This index file is submitted to search engines. The incremental update process also updates this index file whenever a new sitemap segment is generated or an existing one is updated.

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>https://misujob.com/sitemaps/sitemap_software_berlin.xml</loc>
        <lastmod>2024-01-26T10:00:00+00:00</lastmod>
      </sitemap>
      <sitemap>
        <loc>https://misujob.com/sitemaps/sitemap_marketing_london.xml</loc>
        <lastmod>2024-01-26T09:30:00+00:00</lastmod>
      </sitemap>
    </sitemapindex>
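
The change-log processing described in steps 1 to 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not MisuJob's actual code: the `ChangeEvent` record, the `segment_filename` helper, and the category and location fields are hypothetical names chosen to mirror the `job_change_log` table and the example segment names above.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change-log row joined with the job's category and location."""
    job_id: str
    change_type: str  # 'CREATE', 'UPDATE', or 'DELETE'
    category: str
    location: str
    created_at: datetime


def segment_filename(category: str, location: str) -> str:
    """Builds a segment file name such as sitemap_software_berlin.xml (assumed scheme)."""
    def slug(value: str) -> str:
        return re.sub(r"[^a-z0-9]+", "_", value.lower()).strip("_")
    return f"sitemap_{slug(category)}_{slug(location)}.xml"


def affected_segments(events: list[ChangeEvent], since: datetime) -> dict[str, set[str]]:
    """Maps each segment file to the job IDs that changed after `since`."""
    segments: dict[str, set[str]] = defaultdict(set)
    for event in events:
        if event.created_at > since:
            segments[segment_filename(event.category, event.location)].add(event.job_id)
    return dict(segments)


# Illustrative data only
events = [
    ChangeEvent("job-1", "UPDATE", "Software", "Berlin",
                datetime(2024, 1, 26, 10, 0, tzinfo=timezone.utc)),
    ChangeEvent("job-2", "CREATE", "Marketing", "London",
                datetime(2024, 1, 26, 9, 30, tzinfo=timezone.utc)),
    ChangeEvent("job-3", "DELETE", "Software", "Berlin",
                datetime(2024, 1, 25, 8, 0, tzinfo=timezone.utc)),
]
last_run = datetime(2024, 1, 26, 0, 0, tzinfo=timezone.utc)
changed = affected_segments(events, last_run)
```

Only the files that appear as keys in `changed` need to be regenerated. In a sketch like this, each regeneration would re-query the segment's currently live jobs, so listings with a `DELETE` event simply drop out of the rebuilt file.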
    
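Rebuilding the index file from step 4 can be sketched in the same spirit. The `build_sitemap_index` function and its filename-to-timestamp input are hypothetical; the base URL and file names simply mirror the example index above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(segments: dict[str, datetime], base_url: str) -> bytes:
    """Returns sitemap index XML listing each segment file and its last-modified time."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for filename, modified in sorted(segments.items()):
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url}/{filename}"
        # W3C datetime format in UTC, e.g. 2024-01-26T10:00:00+00:00
        ET.SubElement(entry, "lastmod").text = (
            modified.astimezone(timezone.utc).isoformat(timespec="seconds")
        )
    ET.indent(root, space="  ")  # Python 3.9+
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


index_xml = build_sitemap_index(
    {
        "sitemap_software_berlin.xml": datetime(2024, 1, 26, 10, 0, tzinfo=timezone.utc),
        "sitemap_marketing_london.xml": datetime(2024, 1, 26, 9, 30, tzinfo=timezone.utc),
    },
    base_url="https://misujob.com/sitemaps",
)
print(index_xml.decode("utf-8"))
```

Passing in each segment's real regeneration time keeps `lastmod` meaningful, so a crawler can skip segments that have not changed.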

Performance Optimization

Beyond the incremental approach, we employed several optimization techniques to further enhance the performance of our sitemap generation process:

  • Database Indexing: We added appropriate indexes to our database tables to speed up the queries used to retrieve job listing data. Specifically, we indexed the job_id column in the job_change_log table and the columns used for filtering jobs by category and location.

  • Caching: We implemented caching at various levels to reduce the load on our database. We cached the results of frequently executed queries, as well as the generated sitemap files themselves. We used Redis for in-memory caching.

  • Asynchronous Processing: We moved the sitemap generation task to a background queue to avoid blocking the main website threads. This ensures that our website remains responsive even during peak traffic. We used Celery for task queue management.

  • Gzip Compression: We enabled Gzip compression for our sitemap files to reduce their size and improve download speeds.
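
Of the techniques above, Gzip compression is the easiest to show in isolation. The sketch below uses Python's standard `gzip` module; the `write_gzipped_sitemap` name, the sample XML, and the file path are illustrative assumptions. Keep in mind that the sitemap protocol's 50 MB per-file limit applies to the uncompressed size.

```python
import gzip
import tempfile
from pathlib import Path


def write_gzipped_sitemap(xml_bytes: bytes, path: Path) -> int:
    """Writes sitemap XML to a .gz file and returns the compressed size in bytes."""
    with gzip.open(path, "wb") as f:
        f.write(xml_bytes)
    return path.stat().st_size


# Illustrative payload: a repetitive sitemap body, which compresses well
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    + b"<url><loc>https://misujob.com/job/example</loc></url>" * 1000
    + b"</urlset>"
)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sitemap_software_berlin.xml.gz"
    compressed_size = write_gzipped_sitemap(sample, target)
    restored = gzip.decompress(target.read_bytes())

print(f"{len(sample)} bytes uncompressed, {compressed_size} bytes compressed")
```

If segments are served as `.xml.gz` files, the `<loc>` entries in the sitemap index need to point at the compressed file names.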

The Impact

Our incremental sitemap generation solution significantly improved the efficiency of our sitemap updates. We observed the following results:

  • Reduced Sitemap Generation Time: The time required to generate the sitemap decreased from several hours to just a few minutes.
  • Minimized Database Load: The load on our database during sitemap generation was reduced by over 80%.
  • Improved Website Performance: The overall performance of our website improved due to the reduced database load and asynchronous processing.
  • Increased Search Engine Visibility: We observed a noticeable increase in the number of job listings indexed by search engines, leading to more organic traffic.

Real-World Considerations: Internationalization and Salary Data

As MisuJob operates across multiple European countries, we also had to consider internationalization when generating our sitemaps. We ensure that the URLs in our sitemaps are properly localized for each country. For example, a job listing in Germany would have a URL that includes the /de/ prefix, while a job listing in France would have a URL that includes the /fr/ prefix.
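
A small helper illustrates the prefix scheme. The `COUNTRY_PREFIXES` mapping, the `localized_job_url` name, and the `/job/{id}` path are assumptions based on the URL examples in this post, not MisuJob's actual routing.

```python
# Hypothetical mapping from a job's country to its URL prefix
COUNTRY_PREFIXES = {"Germany": "de", "France": "fr"}


def localized_job_url(job_id: str, country: str,
                      base_url: str = "https://misujob.com") -> str:
    """Builds the localized URL for a job listing, e.g. .../de/job/<id>."""
    try:
        prefix = COUNTRY_PREFIXES[country]
    except KeyError:
        raise ValueError(f"No URL prefix configured for country: {country}") from None
    return f"{base_url}/{prefix}/job/{job_id}"


print(localized_job_url("1234", "Germany"))
print(localized_job_url("5678", "France"))
```

Failing loudly on an unknown country is a deliberate choice in this sketch: a silent fallback could put URLs into the sitemap that do not match the pages actually served.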

Furthermore, we expose salary data through our API and website. We’ve observed significant differences in salary expectations across different European countries for similar roles. Understanding these nuances is critical for job seekers and employers alike.

Here’s a comparison of average salaries for Software Engineers with 3-5 years of experience in various European cities:

City         Country        Average Salary (€/year)
Zurich       Switzerland    110,000 - 130,000
London       UK             75,000 - 95,000
Amsterdam    Netherlands    65,000 - 85,000
Berlin       Germany        60,000 - 80,000
Paris        France         55,000 - 75,000

This data highlights the importance of considering location when evaluating job opportunities and salary expectations. MisuJob’s AI-powered job matching takes these factors into account to provide personalized recommendations to our users.

Future Enhancements

We are continuously working to improve our sitemap generation process. Some of the future enhancements we are considering include:

  • Real-time Sitemap Updates: Exploring the possibility of updating the sitemap in near real-time using message queues.
  • Dynamic Priority Assignment: Implementing a more sophisticated algorithm for assigning priorities to URLs in the sitemap based on factors such as job relevance and freshness.
  • Integration with Analytics: Integrating our sitemap generation process with our analytics platform to track the performance of individual sitemap segments.
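
As a rough idea of what dynamic priority assignment might look like, the sketch below blends freshness and relevance into a sitemap `priority` value. Everything here is an assumption for illustration: the `url_priority` name, the equal weighting, the 30-day linear decay, and a relevance score normalized to the range 0 to 1. It is not a description of an implemented system.

```python
def url_priority(age_days: float, relevance: float) -> str:
    """Returns a sitemap priority between 0.1 and 1.0 as a one-decimal string.

    age_days:  days since the listing was last updated
    relevance: hypothetical relevance score, clamped to [0, 1]
    """
    # Freshness decays linearly from 1.0 (new) to 0.0 (30 days or older)
    freshness = max(0.0, 1.0 - age_days / 30.0)
    relevance = max(0.0, min(1.0, relevance))
    score = 0.5 * freshness + 0.5 * relevance
    # Keep every listed URL at a minimum priority of 0.1
    return f"{max(0.1, min(1.0, score)):.1f}"


print(url_priority(age_days=0, relevance=1.0))
print(url_priority(age_days=15, relevance=0.5))
print(url_priority(age_days=30, relevance=0.0))
```

The returned string could replace the fixed `"0.8"` used in `generate_sitemap`.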

Key Takeaways

  • Dynamic sitemap generation is crucial for websites with frequently changing content.
  • Incremental updates significantly reduce processing time and minimize database load.
  • Database indexing, caching, and asynchronous processing are essential for performance optimization.
  • Internationalization and salary data considerations are important for European job platforms.
  • Continuous monitoring and optimization are key to maintaining an efficient sitemap generation process.

By implementing these strategies, we’ve successfully managed sitemap generation at scale for MisuJob, ensuring that our users can easily find the job opportunities they’re looking for across Europe.

Pablo Inigo

Founder & Engineer

Building MisuJob - an AI-powered job matching platform processing 1M+ job listings daily.
