# Cleaning Up Sitemaps for Google Search Console
Google Search Console wasn't picking up my sitemap submissions properly, so I took a closer look at what was actually in there. It turned out I had redirect URLs polluting the sitemaps.
## The problem
My sitemap generator was including redirect URLs alongside the canonical ones. For example:
```
sitemap-tools.xml:  /ko/tools/base64, /en/tools/base64, /ja/tools/base64
sitemap-pages.xml:  /tools/base64   (redirect page!)
```
The `/tools/base64` page just redirects to `/{lang}/tools/base64`, but it was sitting in the sitemap. Google crawls it, follows the redirect, and reports the URL as a redirect rather than indexing it.

Same issue with `/anonymous-chat`, which redirects to `/{lang}/anonymous-chat`.
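For context, the redirect itself is just the usual locale-prefix pattern. A hypothetical sketch of it (the real logic lives in the app's routing layer; the function and variable names here are made up):

```javascript
const SUPPORTED_LANGS = ['ko', 'en', 'ja'];

// Returns the localized target for a bare path, or null if the
// path is already language-prefixed and needs no redirect.
function localeRedirectTarget(pathname, preferredLang = 'en') {
  const [first] = pathname.split('/').filter(Boolean);
  if (SUPPORTED_LANGS.includes(first)) return null; // already localized
  const lang = SUPPORTED_LANGS.includes(preferredLang) ? preferredLang : 'en';
  return `/${lang}${pathname}`;
}

console.log(localeRedirectTarget('/tools/base64'));    // → '/en/tools/base64'
console.log(localeRedirectTarget('/ko/tools/base64')); // → null
```

Every bare URL in the sitemap triggers exactly this hop, which is what the crawler was tripping over.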
## What I fixed
Updated `scripts/generate-sitemaps.mjs` to do two things.

**1. Filter out redirect URLs**

```js
const REDIRECT_PATTERNS = [
  /^\/tools\/[^/]+\/?$/,   // /tools/xxx without lang prefix
  /^\/tools\/?$/,          // /tools/ index
  /^\/anonymous-chat\/?$/, // /anonymous-chat
];

function isRedirectUrl(pathname) {
  return REDIRECT_PATTERNS.some((pattern) => pattern.test(pathname));
}
```
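Wiring the filter into the generator is then a plain `Array.prototype.filter` over the collected pathnames. A self-contained sketch (the sample paths are made up; the real generator collects them from the route tree):

```javascript
const REDIRECT_PATTERNS = [
  /^\/tools\/[^/]+\/?$/,   // /tools/xxx without lang prefix
  /^\/tools\/?$/,          // /tools/ index
  /^\/anonymous-chat\/?$/, // /anonymous-chat
];

const isRedirectUrl = (pathname) =>
  REDIRECT_PATTERNS.some((pattern) => pattern.test(pathname));

// Sample pathnames standing in for the real route list.
const collected = [
  '/en/tools/base64',
  '/tools/base64',   // redirect - dropped
  '/tools/',         // redirect - dropped
  '/anonymous-chat', // redirect - dropped
  '/ko/anonymous-chat',
];

const canonical = collected.filter((p) => !isRedirectUrl(p));
console.log(canonical); // → ['/en/tools/base64', '/ko/anonymous-chat']
```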
**2. Use actual blog post dates for `lastmod`**

Before, every URL had the same `lastmod` timestamp (the build time). A `lastmod` that changes on every deploy carries no information, so Google most likely ignored it.
```js
function getBlogPostDates() {
  const dates = new Map();
  const files = fs.readdirSync(CONTENT_DIR).filter((f) => f.endsWith('.mdx'));
  for (const file of files) {
    const content = fs.readFileSync(path.join(CONTENT_DIR, file), 'utf-8');
    // Pull the `date:` field out of the MDX frontmatter.
    const dateMatch = content.match(/^date:\s*(\d{4}-\d{2}-\d{2})/m);
    if (dateMatch) {
      const slug = file.replace('.mdx', '');
      dates.set(slug, `${dateMatch[1]}T00:00:00.000Z`);
    }
  }
  return dates;
}
```

Now blog posts carry their actual publish date as `lastmod`.
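Plugging those dates into the generated `<url>` entries might look like this. `BASE_URL`, the URL layout, and the hard-coded date map are placeholders for illustration, standing in for `getBlogPostDates()` and the real domain:

```javascript
// Stand-in for getBlogPostDates(); hard-coded for illustration.
const dates = new Map([
  ['hello-world', '2024-01-15T00:00:00.000Z'],
]);

const BASE_URL = 'https://example.com'; // placeholder domain

// Build one <url> entry, falling back to build time when no
// frontmatter date was found.
function urlEntry(slug, lang = 'en') {
  const lastmod = dates.get(slug) ?? new Date().toISOString();
  return [
    '  <url>',
    `    <loc>${BASE_URL}/${lang}/blog/${slug}</loc>`,
    `    <lastmod>${lastmod}</lastmod>`,
    '  </url>',
  ].join('\n');
}

console.log(urlEntry('hello-world'));
```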
## Results
| Before | After |
|---|---|
| 179 URLs (with duplicates) | 154 URLs (canonical only) |
| All same lastmod | Blog posts have real dates |
```
sitemap-blog.xml:      17 URLs
sitemap-tools.xml:    123 URLs (ko/en/ja × tools)
sitemap-projects.xml:   9 URLs
sitemap-pages.xml:      5 URLs
```
41 redirect URLs excluded.
## Validation
Quick check that no redirect URLs leaked through:
```sh
xmllint --noout dist/sitemap-*.xml
# All valid

grep -E "/tools/[^/]+/" dist/sitemap-pages.xml
# Empty - no legacy tool URLs
```
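For a more structural check than grep, the same redirect patterns can be run against the `<loc>` values in Node. A sketch (the inline fixture stands in for reading `dist/sitemap-*.xml` with `fs`):

```javascript
const REDIRECT_PATTERNS = [
  /^\/tools\/[^/]+\/?$/,
  /^\/tools\/?$/,
  /^\/anonymous-chat\/?$/,
];

// Extract every <loc> from a sitemap XML string and return the ones
// whose pathname matches a redirect pattern.
function findRedirectLeaks(xml) {
  const leaks = [];
  for (const [, loc] of xml.matchAll(/<loc>([^<]+)<\/loc>/g)) {
    const { pathname } = new URL(loc);
    if (REDIRECT_PATTERNS.some((p) => p.test(pathname))) leaks.push(loc);
  }
  return leaks;
}

// Tiny inline fixture; the real check would loop over the dist/ files.
const fixture = `<urlset>
  <url><loc>https://example.com/en/tools/base64</loc></url>
  <url><loc>https://example.com/tools/base64</loc></url>
</urlset>`;

console.log(findRedirectLeaks(fixture));
// → ['https://example.com/tools/base64']
```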
Should see better indexing now that Google isn’t chasing redirects in the sitemap.