In our introductory guide to log file analysis, we highlighted that one of the issues that can hinder Google's ability to crawl a site is complex faceted navigation.
The root cause is simple: every additional facet multiplies the number of potential URL combinations, so the total grows exponentially with the number of facets. If, for example, an ecommerce site lets people filter by size, colour, style and gender, you can quickly end up with millions of possible combinations, all with unique URLs.
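To get a feel for the scale, here's a rough count. The facet sizes are made-up assumptions, not figures from any real site. If each facet allows one value at a time, the counts multiply; if shoppers can tick several values per facet, every value is an on/off switch and the total doubles with each one.

```python
from math import prod

# Hypothetical number of values per facet; real sites will differ.
FACET_VALUES = {"size": 12, "colour": 20, "style": 30, "gender": 3}

def count_single_select(facets: dict[str, int]) -> int:
    """Each facet is either unset or set to exactly one of its values."""
    return prod(n + 1 for n in facets.values())

def count_multi_select(facets: dict[str, int]) -> int:
    """Every value of every facet can be toggled on or off independently."""
    return prod(2 ** n for n in facets.values())

print(count_single_select(FACET_VALUES))
print(count_multi_select(FACET_VALUES))
```

Neither figure accounts for the same facets appearing in a different order in the URL, which multiplies the number of addresses again.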
It's this endlessly combinatorial nature of faceted navigation that presents us with issues when it comes to crawling and indexation. How should Google decide what to crawl and what to leave?
Solving the Problem
The first step towards solving faceted navigation is to work out which URL combinations you actually want crawled and indexed. If you're a fashion retailer, this might depend on the season: you'd want women's summer dresses indexed from, say, March to September in the northern hemisphere, and from September to March in southern hemisphere nations. For regions around the equator, you might want those products available to the index all year round. This is why it's the first challenge: we need to understand what's valuable for traffic and what isn't, and under what conditions that's true.
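As a minimal sketch, the seasonal example above could be written as a single condition. The region labels and the function name are our own assumptions for illustration; the month ranges are the ones from the example.

```python
def summer_dresses_indexable(month: int, region: str) -> bool:
    """Return True if summer dresses should be indexable in this month.

    month is 1-12; region is one of the hypothetical labels
    'north', 'south' or 'equatorial'.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    if region == "equatorial":
        return True                       # indexable all year round
    if region == "north":
        return 3 <= month <= 9            # March to September
    if region == "south":
        return month >= 9 or month <= 3   # September to March
    raise ValueError(f"unknown region: {region}")
```

A real rule set would have many conditions like this, which is why it pays to write them down before touching the site.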
Enforced Facet Ordering
Using this sort of logic, we can start to construct a rules system for what should and shouldn't be indexed. For sites with faceted navigation, this tends to get complex quickly, but the best method is to assume everything should be non-indexed unless a defined set of conditions is true.
This is far simpler than trying to state when you want indexation turned off. Odds are, the vast majority of combinations a user can generate are things you don't want indexed. By starting from the assumption that nothing is indexed, and only turning indexation on when the set conditions are met, we can easily test later whether everything works correctly.
We'd also always recommend doing this in conjunction with a keyword research project, to ensure that the facet orderings chosen best match how people think about product categories. You might not get it perfect, but if you can cater for how the majority search, that's going to be a big help.
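The "non-indexed unless proven otherwise" approach can be sketched as an allowlist. The facet names and the allowlisted combinations below are hypothetical; in practice they would come out of your keyword research.

```python
# Hypothetical allowlist: each entry is a set of facet names that,
# used together, produces a page worth indexing.
INDEXABLE_FACET_SETS = [
    frozenset(),                     # plain category page, no filters
    frozenset({"style"}),
    frozenset({"style", "gender"}),
]

def robots_directive(selected: dict[str, str]) -> str:
    """Default to noindex; switch indexing on only for allowlisted sets."""
    if frozenset(selected) in INDEXABLE_FACET_SETS:
        return "index,follow"
    return "noindex,follow"
```

Because the default is noindex, a facet nobody thought about (a new filter added by the CMS, say) fails safe rather than leaking into the index.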
Noindex, Follow & Robots.txt
The next step is to integrate those rules into the robots meta tag. This won't stop these URLs being crawled, but it means that if they ever do get crawled, they at least won't end up being indexed.
The way we do this is by setting noindex,follow on the robots meta tag in the head of the page in question. A correct implementation should look like this:
<meta name="robots" content="noindex,follow" />
This can be combined with robots.txt directives to stop crawling of certain URLs and combinations of URLs. Using robots.txt has the advantage of stopping crawling of those URLs altogether, which helps preserve crawl budget. However, if any links point to those URLs, their weight will be lost, and because a blocked page is never fetched, Google won't see any noindex tag on it. We'd recommend it as a last resort.
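Before deploying robots.txt rules, it's worth checking them against sample URLs. The sketch below uses Python's standard-library robots.txt parser; the /filter/ path and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: block everything under a /filter/ path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /gb-en/womens-fashion/dresses/filter/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_crawlable(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)
```

Note that this parser does simple prefix matching, so wildcard patterns that Google supports may not evaluate the same way here. Treat it as a sanity check rather than a faithful model of Googlebot.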
301 Redirection & Canonicalisation
Now that we've ensured unwanted URLs won't be indexed, we need to create canonical versions of content where necessary. For example, if you had a URL like this:
...that's going to match for the same logic as this:
Certain CMSs are perfectly capable of creating both those URLs. After all, they're technically both valid ways of sub-setting content. However, we only want one of them to work. There are two ways to go about fixing this.
The first is to define a canonical ordering (similar to how the order of adjectives works in English), and to 301 redirect any non-canonical versions to their correct versions. This ensures that only the correct version of the URL will ever be indexed, and any link weight will get passed correctly. A 301 response is also quicker for a crawler to fetch than a full page download carrying a rel=canonical.
However, it can mean creating page refreshes, and for certain systems can be problematic from a development standpoint. If that's an issue you're facing, the solution is still to create an ordering pattern, but to use rel=canonical to define the correct URL for that content. This will tell Google, should it ever end up on the "wrong" version of that content, which the correct one is. That should look like this:
<link rel="canonical" href="/gb-en/womens-fashion/dresses/summer/" />
This way, only that specific version of the URL will ever be indexed. It also passes link weight, but doesn't stop these pages being crawled.
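Both approaches rely on the same thing: one function that turns any set of selected facets into a single canonical path. Here's a minimal sketch. The facet order, the facet names, and the idea that the path is built from a locale prefix plus facet values are all assumptions for illustration, and your CMS's URL scheme will differ.

```python
# Hypothetical canonical order of facet types within the URL path.
FACET_ORDER = ["gender", "category", "style", "colour", "size"]

def canonical_path(locale: str, facets: dict[str, str]) -> str:
    """Build the one canonical path for a set of selected facets."""
    ordered = [facets[name] for name in FACET_ORDER if name in facets]
    return "/" + "/".join([locale, *ordered]) + "/"

def resolve(requested_path: str, locale: str, facets: dict[str, str]) -> tuple[int, str]:
    """Option one: answer 200 on the canonical path, otherwise 301 to it."""
    target = canonical_path(locale, facets)
    status = 200 if requested_path == target else 301
    return status, target

def canonical_link_tag(locale: str, facets: dict[str, str]) -> str:
    """Option two: serve the page, but point rel=canonical at the right URL."""
    return f'<link rel="canonical" href="{canonical_path(locale, facets)}" />'
```

Whichever option you choose, keeping the ordering logic in one place means the redirect rules, the canonical tags and the internal links can't drift apart.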
There are a few things we can do on the front end to help too. The first is to ensure that any time there are no results for a set of facets, further refinement is impossible. After all, there's no benefit to users, either human or bot, in offering further drill-down when there are no products under the current selection.
The second is to ensure that the page has breadcrumbs, which are properly marked up. This further allows a search engine to understand the structure of a site, reinforcing its perception of how the site's internal architecture works.
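One common way to mark up breadcrumbs is schema.org's BreadcrumbList in JSON-LD. This is our assumption about the markup format, and microdata or RDFa would serve the same purpose. A minimal sketch that generates it from a trail of (name, URL) pairs, using hypothetical example.com URLs:

```python
import json

def breadcrumb_jsonld(trail: list[tuple[str, str]]) -> str:
    """Render a breadcrumb trail as schema.org BreadcrumbList JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": position, "name": name, "item": url}
            for position, (name, url) in enumerate(trail, start=1)
        ],
    }
    return json.dumps(data, indent=2)

# Hypothetical trail for a summer dresses page.
TRAIL = [
    ("Women's Fashion", "https://www.example.com/gb-en/womens-fashion/"),
    ("Dresses", "https://www.example.com/gb-en/womens-fashion/dresses/"),
    ("Summer", "https://www.example.com/gb-en/womens-fashion/dresses/summer/"),
]
print(breadcrumb_jsonld(TRAIL))
```

The output would go inside a script tag of type application/ld+json on the page.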
The third is to ensure that links to any facets that shouldn't be crawled have rel=nofollow attached to them. Reducing the crawlable options as far as possible on internal navigation will help improve the usage of your site's crawl budget.
However, it will deliver good UX, so we'd recommend it for that reason.
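The zero-results rule and the nofollow rule can both live in whatever renders the facet links. A minimal sketch, assuming each option arrives as a hypothetical (href, label, crawlable) tuple where the crawlable flag comes from your indexation rules:

```python
from html import escape

def facet_links(result_count: int, options: list[tuple[str, str, bool]]) -> list[str]:
    """Render refinement links for the current selection.

    Returns no links when the selection has no products, and adds
    rel="nofollow" to links whose target shouldn't be crawled.
    """
    if result_count == 0:
        return []  # nothing to refine, so offer no further drill-down
    links = []
    for href, label, crawlable in options:
        rel = "" if crawlable else ' rel="nofollow"'
        links.append(f'<a href="{escape(href, quote=True)}"{rel}>{escape(label)}</a>')
    return links
```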
Whilst this work is being done, we'd strongly suggest having a version of the site up on a staging server, so it can be regularly crawled to test the results of that work. You'll want a long list of URLs that you can check should be indexed, as well as an equally long list of URLs that shouldn't. The aim is to make sure that everything you've defined as uncrawlable and unindexable actually works that way.
Also, have a list of URLs that shouldn't be crawlable, but which should return things like a noindex,follow directive, or which should redirect. Plug those URLs into your crawler directly and make sure they all do what they should too. These are URLs that a normal crawl, assuming everything works, would never reach in the first place. However, thanks to people linking to them, sharing them on social media, or even just visiting via a browser, they may still get crawl attempts from Google and co.
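If your crawler can export the HTML it fetched, the checks above are easy to automate. The sketch below only covers reading the robots and canonical tags out of a page and comparing the robots value to an expectation. Fetching the staging pages and checking redirects is left to your crawler, and the class and function names are our own.

```python
from html.parser import HTMLParser

class HeadDirectives(HTMLParser):
    """Collect the robots meta content and rel=canonical href from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            # Normalise "noindex, follow" and "NOINDEX,FOLLOW" to one form.
            self.robots = (attributes.get("content") or "").replace(" ", "").lower()
        elif tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")

def failed_checks(pages: dict[str, str], expected_robots: dict[str, str]) -> list[str]:
    """Return the URLs whose robots directive doesn't match the expectation.

    pages maps URL to fetched HTML; expected_robots maps URL to a
    directive such as "noindex,follow".
    """
    failures = []
    for url, expected in expected_robots.items():
        directives = HeadDirectives()
        directives.feed(pages.get(url, ""))
        if directives.robots != expected:
            failures.append(url)
    return failures
```

An empty list back from failed_checks means every URL on your list carries the directive you expected.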
When you're happy with the new solution, push it live, and make sure you watch how Google's crawling the new site. Hopefully you'll have caught everything and your rankings will improve due to lowered duplicate content and better crawling. However, if you've missed something, simply switch back to your old version, refine the new version further to solve any bugs found, and return to the new version.