Cleaning Up Sitemaps for Google Search Console

Google wasn't indexing my sitemap properly. Turned out I had redirect URLs polluting it.

Noticed Google Search Console wasn’t picking up my sitemap submissions properly. Took a closer look at what was actually in there.

The problem

My sitemap generator was including redirect URLs alongside the canonical ones. For example:

sitemap-tools.xml: /ko/tools/base64, /en/tools/base64, /ja/tools/base64
sitemap-pages.xml: /tools/base64 (redirect page!)

The /tools/base64 page just redirects to /{lang}/tools/base64, but it was sitting in the sitemap. Google crawls it, hits the redirect, and files it under "Page with redirect" instead of indexing anything.

Same issue with /anonymous-chat - it redirects to /{lang}/anonymous-chat.
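Concretely, sitemap-pages.xml was shipping entries like these (placeholder domain) for URLs that only ever answer with a redirect:

<!-- What sitemap-pages.xml effectively looked like; domain is a placeholder -->
<url>
  <loc>https://example.com/tools/base64</loc>      <!-- 3xx to /{lang}/tools/base64 -->
</url>
<url>
  <loc>https://example.com/anonymous-chat</loc>    <!-- 3xx to /{lang}/anonymous-chat -->
</url>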

What I fixed

Updated scripts/generate-sitemaps.mjs to:

  1. Filter out redirect URLs
const REDIRECT_PATTERNS = [
  /^\/tools\/[^/]+\/?$/,    // /tools/xxx without lang prefix
  /^\/tools\/?$/,           // /tools/ index
  /^\/anonymous-chat\/?$/,  // /anonymous-chat
];

function isRedirectUrl(pathname) {
  return REDIRECT_PATTERNS.some((pattern) => pattern.test(pathname));
}
  2. Use actual blog post dates for lastmod

Before, every URL had the same lastmod timestamp: the build time. A lastmod that changes on every build carries no real signal, so Google most likely ignored it.

import fs from 'node:fs';
import path from 'node:path';

// CONTENT_DIR points at the blog's .mdx content directory (defined earlier in the script).
function getBlogPostDates() {
  const dates = new Map();
  const files = fs.readdirSync(CONTENT_DIR).filter((f) => f.endsWith('.mdx'));

  for (const file of files) {
    const content = fs.readFileSync(path.join(CONTENT_DIR, file), 'utf-8');
    const dateMatch = content.match(/^date:\s*(\d{4}-\d{2}-\d{2})/m);

    if (dateMatch) {
      const slug = file.replace('.mdx', '');
      dates.set(slug, `${dateMatch[1]}T00:00:00.000Z`);
    }
  }
  return dates;
}

Now blog posts have their actual publish date as lastmod.
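Roughly how the two changes plug into the entry-writing step. This is a sketch only: SITE_URL, the /blog/ prefix, the sample paths, and the variable names are placeholders, not the actual generate-sitemaps.mjs code.

// Sketch of the entry-writing step, building on isRedirectUrl() and
// getBlogPostDates() above. SITE_URL, the /blog/ prefix, and the sample
// paths are placeholders, not the real script.
const SITE_URL = 'https://example.com';
const buildTime = new Date().toISOString();

function urlEntry(pathname, lastmod = buildTime) {
  return `  <url>\n    <loc>${SITE_URL}${pathname}</loc>\n    <lastmod>${lastmod}</lastmod>\n  </url>`;
}

// Pages: drop redirect paths before they reach the sitemap.
const pagePaths = ['/', '/anonymous-chat', '/tools/base64', '/ko/tools/base64'];
const pageEntries = pagePaths.filter((p) => !isRedirectUrl(p)).map((p) => urlEntry(p));

// Blog: use each post's front-matter date as lastmod.
const postDates = getBlogPostDates();
const blogEntries = [...postDates.keys()].map((slug) =>
  urlEntry(`/blog/${slug}`, postDates.get(slug))
);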

Results

Before                          After
179 URLs (with duplicates)      154 URLs (canonical only)
All same lastmod                Blog posts have real dates
sitemap-blog.xml:     17 URLs
sitemap-tools.xml:   123 URLs (ko/en/ja × tools)
sitemap-projects.xml:  9 URLs
sitemap-pages.xml:     5 URLs

41 redirect URLs excluded.

Validation

Quick check that no redirect URLs leaked through:

xmllint --noout dist/sitemap-*.xml
# All valid

grep -E "/tools/[^/]+/" dist/sitemap-pages.xml
# Empty - no legacy tool URLs
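
And a rough total to cross-check against the counts above:

grep -o "<loc>" dist/sitemap-*.xml | wc -l
# should come to 154 - the canonical URL count above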

Should see better indexing now that Google isn’t chasing redirects in the sitemap.