Tuesday, 25 June 2013

Why sitemap.xml files are important

Before starting this post I just wanted to say I'm really sorry about not having posted in quite some time; I've just been really, really busy at work. I still am, but will hopefully make time for more regular posts here.


I've done a bit of reading recently about why sitemap.xml files can be useful for getting Google to index your sites, and they can actually make a huge difference. I should point out that sitemap.xml isn't only used by the Google search crawler (AKA Googlebot), but Google is basically the main 'player' among search engines.

Why have sitemap.xml?

Well, for a lot of sites, like a 'standard blog' or other sites where pages link to each other, it isn't a massive issue, because the Googlebot can reach and index every page pretty easily through the links within the site. A problem can occur if you have a site with dynamically generated pages: it might get so large that it's no longer possible to link to every page, and as a result the Googlebot may never find some of them. StackOverflow.com is an example of a site at this scale.

In those situations it takes the Googlebot a lot longer to find every page (if it finds them at all), so a sitemap.xml file really helps get those pages found and indexed by Google. For a good example of this happening in real life, read this post by Jeff Atwood on Coding Horror about Stack Overflow not being indexed well by Google without sitemap.xml. Even if you are running a smaller site where all the pages are discoverable 'naturally' by the Googlebot, a sitemap might still help, just not to the same degree; on a smaller site you might not even notice if you left sitemap.xml out altogether.

Sitemap.xml limits

There are limits to the size of this file: according to a Google webmaster support article, it cannot be bigger than 50 MB and can contain no more than 50,000 links. With limits that generous this generally isn't a problem, though other search crawlers may have their own restrictions.

The other limit (which is really a scope issue) is that links in a sitemap.xml file can only point to files in the same directory as the sitemap.xml file or in its subdirectories (as many levels deep as you like). It won't work for files in a parent directory or anywhere further up the tree. Google's reasoning is that it can be fairly certain all the content in the directory holding the sitemap.xml file, and everything below it, is controlled by you, whereas anything further up or across could be outside your control. For example, on a shared hosting server several sites might sit at the same 'level', and if sitemap.xml could reference any directory that isn't one of its descendants, you could have Google indexing everyone else's sites on that server.

What to do when you reach the sitemap.xml limits

At this point you'll need to split the file into multiple ones, which is fine according to the Google webmaster blog. The best way to split them isn't simply to start a new file each time you hit the 50 MB or 50,000-link limit, but to divide them in an organised fashion. There are various ways of doing this, and it really depends on your site and how its content is structured; for some ideas check out the Google webmaster blog post about having multiple site maps and a post on SEOmoz on having multiple site maps.
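The standard way to point crawlers at several sitemap files is a sitemap index file, which simply lists the location of each individual sitemap. A minimal sketch (the filenames here are just placeholder examples, not a recommended naming scheme):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <sitemap> entry per individual sitemap file -->
  <sitemap>
    <loc>http://www.example.com/sitemap-questions.xml</loc>
    <!-- lastmod is optional: when this sitemap file last changed -->
    <lastmod>2013-06-25</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap-users.xml</loc>
  </sitemap>
</sitemapindex>
```

You then submit the index file rather than each sitemap separately; the same 50,000-entry cap applies to the index itself.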

How to make a sitemap.xml file

Well, if you are making one 'by hand', or writing a script that will generate them for you, check out this support article by Google webmasters, which gives a good overview of how the XML file should be structured.
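To give a flavour of that structure, here's a minimal sitemap.xml following the sitemaps.org protocol (the URLs are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- loc is the only required child element -->
    <loc>http://www.example.com/</loc>
    <!-- the rest are optional hints for crawlers -->
    <lastmod>2013-06-25</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/about</loc>
  </url>
</urlset>
```

Note that changefreq and priority are only hints; search engines are free to ignore them.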

If your website is built on a CMS that supports plugins, there's a good chance one already exists for creating sitemap.xml files; for example, this is one I use for generating sitemap.xml files on WordPress sites.