For a file that aids URL discovery, XML sitemaps are rarely talked about. Luckily, I got the opportunity to talk about them on the Rank Ranger Podcast, and then decided to write an in-depth version of the talk here. You can find a link to the podcast in the third section of this article.
In this post, I’ll explain what XML sitemaps are, common mistakes you’re making in yours, and how to fix them.
What is a Sitemap?
A sitemap is a file where you tell search engines about the important pages and files on your site. Besides the pages themselves, you can provide valuable information such as when each page was last updated and which images appear on it. If you have videos, you’ll also need to provide details such as their title, description, and duration in a video sitemap.
If you don’t have lots of pages or file formats (e.g. videos) on your website, you can create a single sitemap for your pages. This sitemap can contain up to 50,000 URLs and must not exceed 50MB uncompressed. Here’s an example:
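As a minimal sketch, a single sitemap like this could be generated with Python’s standard library. The example.com URLs, the dates, and the build_sitemap helper are my own placeholders for illustration, not output from any particular plugin.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit mentioned above


def build_sitemap(pages):
    """Return sitemap XML (bytes) for (url, lastmod) pairs; lastmod may be None."""
    if len(pages) > MAX_URLS:
        raise ValueError("too many URLs for one sitemap; split it up")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            # lastmod is optional; include it only when you know the date
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical pages, for illustration only
print(build_sitemap([
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/about", None),
]).decode("utf-8"))
```

The output is one urlset element with a url entry per page, each holding a loc and, where known, a lastmod.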
However, if you have a site with several file types or a large website with different sections, you can break your URLs down into individual sitemaps and list those sitemaps in one sitemap index file. Just like the individual sitemaps, an index file can only contain up to 50,000 URLs (of individual sitemaps) and must not exceed 50MB uncompressed. An example of an index sitemap file is below.
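A rough sketch of how an index file could be assembled in Python follows. The sitemap file names and the build_sitemap_index helper are hypothetical and only show the shape of the file.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index (str) that points at the given child sitemaps."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = loc
    return ET.tostring(index, encoding="unicode")


# Hypothetical child sitemaps, one per section of the site
print(build_sitemap_index([
    "https://www.example.com/sitemap-pages.xml",
    "https://www.example.com/sitemap-products.xml",
    "https://www.example.com/sitemap-videos.xml",
]))
```

Note that the index uses sitemapindex and sitemap elements instead of urlset and url, and each loc points at another sitemap rather than at a page.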
Why XML Sitemaps are Important
There are two main benefits of having an XML sitemap:
1. XML Sitemaps Aid Page Discovery
As I mentioned earlier, XML sitemaps help search engines discover important pages and recent content updates on a website. If a website is relatively small, has pages that don’t change frequently, and has a great internal linking structure, crawlers can usually discover all of its important pages by following links. In this scenario, having a sitemap is a nice addition but not a ‘must-have’.
But if the site has poor internal linking, has no external links pointing to its pages, or has lots of images and videos you want indexed, a sitemap can help crawlers find that content.
Keep in mind that an XML sitemap doesn’t guarantee the crawling and indexing of those pages. It’s more like a hint, rather than a directive that crawlers are bound to follow. Also, your pages will not be crawled in the order they appear on the sitemap.
2. XML Sitemaps Can Help You Troubleshoot Indexing Issues
The second benefit of having an XML sitemap is that it can help you troubleshoot indexing issues on your website. I’ll explain more about this when looking at having one large sitemap for separate sections of a website later in this post.
Common XML Sitemap Mistakes and How to Fix Them
As I mentioned earlier, I discussed this topic on the Rank Ranger Podcast. However, this blog post is more in-depth. If you’d like to check out the podcast version, listen to the XML sitemap episode here. Now, let’s look at 4 common XML sitemap mistakes.
1. Listing Ineligible URLs
This is a scenario where 404 pages, duplicates, redirects, and pages blocked with a noindex rule or by robots.txt are found on the sitemap. One of the reasons this could come up is if the sitemap was generated manually and some of these ineligible URLs made it into the file. It may also be because you have a static sitemap that doesn’t update automatically as you make changes to your pages.
Whatever the case, having ineligible URLs on your sitemap is a problem because web crawlers may spend time on these problematic URLs rather than crawling valid pages. Also, by listing these URLs, you’re telling search engines that those pages are important and worth indexing, while the robots.txt rule, noindex tag, or error on the page sends the opposite signal.
How to Diagnose:
To check if you have non-indexable URLs on your XML sitemap, run a crawl using Screaming Frog.
Steps: Once you’re in Screaming Frog, go to Configuration > Spider > Crawl. In the XML Sitemaps section of this tab, tick the Crawl Linked XML Sitemaps and Crawl These Sitemaps options. Paste a link to your sitemap(s) in the box below and save your configuration.
This configuration allows you to see the non-indexable URLs in a sitemap along with their status, response code, and tags. Go back to the main Screaming Frog dashboard and enter the URL you wish to audit. Once the crawl is done, go to the Sitemaps tab and filter to Non-indexable URLs in Sitemap.
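If you’d like a quick scripted spot check alongside the crawl, here is a minimal sketch that covers only one of the cases above: sitemap URLs that your robots.txt blocks. It works on text you have already downloaded, so the sitemap and robots.txt contents below are made-up samples, and the function names are my own. Redirects, 404s, and noindex tags would still need a real crawl.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Extract every <loc> value from a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


def blocked_by_robots(sitemap_xml, robots_txt, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url for url in sitemap_urls(sitemap_xml)
        if not parser.can_fetch(user_agent, url)
    ]


# Made-up sample data, for illustration only
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/private/report</loc></url>
</urlset>"""
sample_robots = "User-agent: *\nDisallow: /private/\n"

for url in blocked_by_robots(sample_sitemap, sample_robots):
    print("Blocked by robots.txt but listed in sitemap:", url)
```

Anything this prints is a URL you’re asking search engines to crawl and forbidding them from crawling at the same time.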
How to Fix:
To remove non-indexable pages from your XML sitemap, check the instructions for the sitemap plugin you’re using. For instance, on Yoast, if you have a redirected URL on the sitemap and the post itself is still available in your Content Management System, you might need to add a noindex tag to remove it from your sitemap. Here are the steps:
- Open up the edit functionality for the post
- Scroll down to the Yoast options, usually at the bottom of the page
- Click on Advanced
- Look for ‘Allow search engines to show this page in search results’ and then click No
Note: On some plugins like Yoast, if a page was redirected through a source other than that specific plugin, the page will still appear on its generated XML sitemap. If you no longer have the page in your CMS, the only way to remove it from your sitemap at the time of writing is to disable the Yoast XML sitemap functionality and generate a new sitemap with another plugin.
2. XML Sitemap File in HTML Format
First, what’s the difference between an HTML sitemap and an XML sitemap?
An HTML sitemap contains a clickable list of pages and sections of your website. Usually, it’s found in the footer of your website. The purpose of an HTML sitemap is to help users easily navigate your website and find the pages they’re looking for. Because these links are listed on your website, search engines can also follow them to discover new pages. However, if you create and submit an actual HTML file in Google Search Console, it won’t be read correctly as a sitemap.
An XML sitemap, on the other hand, is written exclusively for search engines. Site visitors normally won’t come across it because it isn’t linked from your pages. Also, it contains not just a list of your important pages but additional information such as when each page was last updated, its images, videos, etc.
If you intend to tell search engines about existing, new, or updated pages on your website, you need an XML sitemap, not an HTML sitemap. Other supported formats are Atom 1.0, RSS, mRSS, and a text file that has a .txt extension.
If your sitemap has been detected as an HTML file, you’ll see the “Your Sitemap Appears to Be An HTML Page” error in the Sitemaps section of Google Search Console shortly after you submit it.
Possible reasons why this error comes up are:
- You’re submitting an actual HTML file or page; you shouldn’t.
- There are syntax errors on the sitemap.
- Your caching plugin or functionality is caching your sitemap. This is the most common cause, and it often results in the sitemap being detected as an HTML file or not being read correctly.
Any of these reasons can affect the ability of search engines to parse your sitemap properly and discover your pages. In fact, Google has mentioned that if it can’t read a sitemap after several attempts, it will stop trying to read that sitemap. This defeats the purpose of having an XML sitemap and can significantly impact page discovery for websites with a less-than-optimal internal linking structure.
“If a sitemap cannot be read after several attempts, Google will eventually stop trying to read that sitemap. You should fix the errors and resubmit the sitemap.”
– Google Developer Documentation
How to Diagnose:
First, check your sitemap URL to make sure you copied the right link and that the URL is an XML sitemap page.
If your sitemap is truly not an HTML file but keeps generating the “Your Sitemap Appears to Be An HTML Page” error, you’ll need to check whether one of your plugins is caching your sitemap.
Review your installed plugins to see which of them have active caching functionality. If you’re having trouble narrowing it down, a quick way to confirm which plugin is responsible is to inspect the sitemap page through Chrome DevTools.
- Press Option + ⌘ + J (on macOS), or Shift + CTRL + J on Windows or Linux to open the dev tools.
- Go to the Elements tab. If there’s a plugin or any settings on your server caching your sitemap, it’s usually listed there.
How to Fix:
Once you find the caching plugin, follow instructions specific to that tool to exclude your sitemaps. If you use W3 Total Cache, WP Super Cache, or WP Rocket, read this guide on how to prevent sitemap caching on those plugins. Don’t forget to clear your cache after making these changes.
If your caching functionality isn’t the cause, then there’s likely a syntax error on your sitemap. In this case, use an XML sitemap validator to find errors on the sitemap.
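For a quick first pass before reaching for a full validator, a small script can tell you whether a response looks like HTML, has a syntax error, or has the wrong root element. This is only a sketch under my own assumptions: you pass in the response body and Content-Type header yourself, and the diagnose_sitemap name and its messages are made up for illustration. It checks that the XML is well formed, not that it fully follows the sitemap schema.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VALID_ROOTS = {
    f"{{{SITEMAP_NS}}}urlset",
    f"{{{SITEMAP_NS}}}sitemapindex",
}


def diagnose_sitemap(body, content_type=""):
    """Return a short diagnosis for a fetched sitemap response."""
    # A cached or misconfigured sitemap is often served with an HTML type
    if "text/html" in content_type.lower():
        return "served as HTML"
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        return f"XML syntax error: {exc}"
    if root.tag not in VALID_ROOTS:
        return f"unexpected root element: {root.tag}"
    return "ok"


# Made-up responses, for illustration only
good = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://www.example.com/</loc></url></urlset>")
print(diagnose_sitemap(good, "application/xml"))
print(diagnose_sitemap("<html><body>Sitemap</body></html>", "text/html"))
print(diagnose_sitemap("<urlset><url></urlset>", "application/xml"))
```

If the result is “served as HTML” even though the file itself is fine, caching is the first thing I’d look at.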
Also, check your installed plugins to make sure you don’t have multiple tools creating XML sitemaps for your website. This could result in a scenario where the settings from one plugin conflict with another.
3. Not Declaring a Page and Its Alternate Versions Correctly
If you’re implementing hreflang annotations in your XML sitemap, you need to list each URL and then declare all of its alternate versions, including itself.
For instance, let’s say you have three versions of a page: an article in English for English speakers worldwide, another version for German users worldwide, and the last one for German users in Switzerland. You’ll need to list the English page and then specify three alternate versions: the German version for worldwide users, the German version for users in Switzerland, and the English page again. The two German pages each need their own entry listing the same three versions. The Google Developers docs include an example of how your sitemap should look for these pages.
Also, hreflang annotations are bi-directional. Each referenced alternate version must point back to the page that referenced it, just like in that example. If you don’t reference an alternate version properly (including the URL you’re listing), these tags may be ignored or interpreted incorrectly.
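To make the pattern concrete, here is a rough Python sketch that builds such a sitemap for the three versions described above. It is my own illustration, not the Google sample itself: the example.com URLs and the build_hreflang_sitemap helper are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_hreflang_sitemap(alternates):
    """alternates maps an hreflang code to the URL of that version of a page."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in alternates.values():
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        # Every entry lists ALL versions, including the page itself,
        # which also guarantees the return links are in place.
        for lang, href in alternates.items():
            ET.SubElement(
                entry,
                f"{{{XHTML_NS}}}link",
                rel="alternate",
                hreflang=lang,
                href=href,
            )
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URLs for the English, German, and Swiss German versions
print(build_hreflang_sitemap({
    "en": "https://www.example.com/en/page",
    "de": "https://www.example.com/de/page",
    "de-ch": "https://www.example.com/de-ch/page",
}))
```

Because each of the three url entries carries the same three alternate links, every page references itself and is referenced back by the other two.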
How to Diagnose:
Crawl your XML sitemap with Screaming Frog. To do this, go to Configuration > Spider > Crawl > Crawl Linked XML Sitemaps and paste a link to your sitemap.
Look for errors like missing return links, missing self-reference, inconsistent language & region return links, or pages that don’t have an hreflang tag (missing).
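A few of these checks can also be sketched in Python if you want to see the logic behind them. The hreflang_problems function below is a hypothetical helper of mine, not how Screaming Frog works internally. It only looks for missing hreflang annotations, missing self-references, and missing return links, and the sample sitemap is made up.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def hreflang_problems(sitemap_xml):
    """Return (url, problem) pairs for basic hreflang mistakes in a sitemap."""
    root = ET.fromstring(sitemap_xml)
    # Map each listed URL to the set of alternate URLs it declares
    declared = {}
    for entry in root.findall("sm:url", NS):
        loc = entry.find("sm:loc", NS).text.strip()
        declared[loc] = {
            link.get("href") for link in entry.findall("xhtml:link", NS)
        }

    problems = []
    for loc, alternates in declared.items():
        if not alternates:
            problems.append((loc, "missing hreflang"))
            continue
        if loc not in alternates:
            problems.append((loc, "missing self-reference"))
        for alt in sorted(alternates - {loc}):
            # The alternate must have its own entry that points back here
            if loc not in declared.get(alt, set()):
                problems.append((loc, f"missing return link from {alt}"))
    return problems


# Made-up sitemap: the German page forgets to point back to the English one
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/en/page</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en/page"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.example.com/de/page"/>
  </url>
  <url>
    <loc>https://www.example.com/de/page</loc>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.example.com/de/page"/>
  </url>
</urlset>"""

for url, problem in hreflang_problems(sample):
    print(url, "->", problem)
```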
How to Fix:
Once you spot errors, you can use XML sitemap generators that support international SEO to map your URLs and create a valid sitemap. Examples of such tools are Merkle’s XML Sitemap Generator (for hreflang tags) and the Hreflang Builder.
If you created one already, you can use Merkle’s hreflang tag test tool to validate your sitemap.
4. Having One Large Sitemap for Separate Sections of a Website
Google supports up to 50,000 URLs on a single sitemap or a max size of 50 MB uncompressed—whichever one you hit first.
However, this doesn’t mean that you should list all your URLs in one sitemap until you get to the maximum file size. Doing so would make it difficult to spot sections of your website that aren’t getting crawled or indexed as they should.
How to Diagnose:
If you have a single sitemap that lists all URLs on your website, you can only filter the Index Coverage report in Google Search Console down to that one sitemap. This makes it difficult to draw insights on indexing: you have no idea whether an issue is more prevalent in one section of your website or is an isolated case.
How to Fix:
It’s best to have separate XML sitemaps under one index file. Ideally, choose a plugin that segments your sitemaps by section. For example, if you have an eCommerce site, you can have different XML sitemaps for your static content pages (about us, terms & conditions, etc.), product pages, and blog posts.
This segmentation allows you to filter down to each specific sitemap and spot potential issues more effectively. For instance, if a specific section has more URLs that are crawled but not indexed, you might need to check the quality of pages in this section, internal linking, and how these pages are rendered.
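If your plugin doesn’t segment sitemaps for you, the idea can be sketched in a few lines of Python. This is only one possible approach, resting on my own assumptions: sections are inferred from the first folder in each URL’s path, top-level pages go into a catch-all “pages” group, and the file naming scheme is made up.

```python
from collections import defaultdict
from urllib.parse import urlparse

MAX_URLS = 50_000  # per-sitemap limit


def segment_urls(urls, max_urls=MAX_URLS):
    """Group URLs by site section, then split any group that exceeds the limit."""
    sections = defaultdict(list)
    for url in urls:
        parts = [p for p in urlparse(url).path.split("/") if p]
        # Assumption: URLs directly under the root are static content pages
        section = parts[0] if len(parts) > 1 else "pages"
        sections[section].append(url)

    sitemaps = {}
    for section, group in sections.items():
        for start in range(0, len(group), max_urls):
            number = start // max_urls + 1
            name = f"sitemap-{section}-{number}.xml"
            sitemaps[name] = group[start:start + max_urls]
    return sitemaps


# Hypothetical eCommerce URLs, for illustration only
example = segment_urls([
    "https://www.example.com/about-us",
    "https://www.example.com/products/blue-shirt",
    "https://www.example.com/products/red-shirt",
    "https://www.example.com/blog/sitemap-tips",
])
for name, group in example.items():
    print(name, len(group))
```

Each resulting file can then be submitted under one index file, which lets you filter indexing reports down to products, blog posts, or static pages separately.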
Wrapping Up
As I mentioned before, having a URL on the sitemap doesn’t guarantee crawling or indexing. But, it’s a great way to tell search engines about your important pages and improve their chances of finding those pages.
However, listing non-indexable pages, submitting an XML sitemap that’s detected as HTML, or incorrectly implementing your hreflang tags on the sitemap could affect the chances of your pages getting discovered, especially if you don’t have good internal linking. In the case of having one sitemap for all sections of your website, it could affect your ability to spot valuable patterns for indexing. Review your sitemap for any of these mistakes and resolve them to improve page discovery on your website.