Monday, December 9, 2024
Allow us to cache, pretty please.
As the internet grew over the years, so did the amount of crawling Google does. While Google’s
crawling infrastructure supports heuristic caching mechanisms, and in fact always has, the share of
requests that can be returned from local caches has decreased: 10 years ago about 0.026% of the
total fetches were cacheable, which is already not that impressive; today that number is 0.017%.
Why is caching important?
Caching is a critical piece of the large puzzle that is the internet. Caching allows pages to load
lightning fast on revisits, it saves computing resources and thus also natural resources, and
saves a tremendous amount of expensive bandwidth for both the clients and servers.
Especially if you have a large site with rarely-changing content under individual URLs, allowing
caching locally may help your site be crawled more efficiently. Google’s crawling infrastructure
supports heuristic HTTP caching as defined by the HTTP caching standard, specifically through the
ETag response header and If-None-Match request header, and the Last-Modified response header and
If-Modified-Since request header.
We strongly recommend using ETag because it’s less prone to errors and mistakes (the value is not
structured, unlike the Last-Modified value). And, if you have the option, set them both: the
internet will thank you. Maybe.
As for what you consider a change that requires clients to refresh their caches, that’s up to you.
Our recommendation is that you require a cache refresh on significant changes to your content; if
you only updated the copyright date at the bottom of your page, that’s probably not significant.
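One way to act on this is to derive the validator only from the part of the page you consider significant. The following is a minimal sketch, not a recommended implementation: it assumes the insignificant part lives in a `<footer>` element, and the names `FOOTER_RE` and `significant_etag` are made up for illustration. Because the bytes of the page can differ while the validator stays the same, the sketch emits a weak ETag (the `W/` prefix defined by the HTTP standard).

```python
import hashlib
import re

# Assumption for this sketch: the insignificant part of the page
# (for example the copyright line) is wrapped in <footer>...</footer>.
FOOTER_RE = re.compile(r"<footer\b.*?</footer>", re.DOTALL | re.IGNORECASE)


def significant_etag(html: str) -> str:
    """Return a weak ETag computed from the page without its footer."""
    main_content = FOOTER_RE.sub("", html)
    digest = hashlib.sha256(main_content.encode("utf-8")).hexdigest()[:16]
    # W/ marks a weak validator: the bytes may differ between versions,
    # but the content is considered equivalent.
    return f'W/"{digest}"'
```

With this approach, two versions of a page that differ only in the footer get the same ETag, while a change to the main content produces a new one.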
ETag and If-None-Match
Google’s crawlers support ETag based conditional requests exactly as defined in the
HTTP caching standard. That is, to signal caching preference to Google’s crawlers, set the ETag
value to any arbitrary ASCII string (usually a hash of the content or version number, but it could
also be a piece of π, up to you) unique to the representation of the content hosted by the accessed
URL. For example, if you host different versions of the same content under the same URL (say,
mobile and desktop version), each version could have its own unique ETag value.
Google’s crawlers that support caching will send the ETag value returned for a previous crawl of
that URL in the If-None-Match header. If the ETag value sent by the crawler matches the current
value the server generated, your server should return an HTTP 304 (Not Modified) status code with
no HTTP body. This last bit, no HTTP body, is the important part for a couple of reasons:
- your server doesn’t have to spend compute resources on actually generating content; that is, you
  save money
- your server doesn’t have to transfer the HTTP body; that is, you save money
On the client side, like a user’s browser or Googlebot, the content under that URL is retrieved
from the client’s internal cache. Because there’s no data transfer involved, this happens
lightning fast, making users happy and potentially saving some resources for them, too.
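The server side of this exchange can be sketched roughly as follows. This is an illustration under assumptions, not how any particular server implements it: `handle_get`, `render_body`, and `_opaque` are hypothetical names, the ETag is assumed to be available cheaply (for example from a stored version number), and splitting the header on commas is a simplification that only works for ETag values without commas.

```python
from typing import Callable, Dict, Optional, Tuple


def _opaque(tag: str) -> str:
    """Drop whitespace and the weak prefix; If-None-Match compares weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def handle_get(
    current_etag: str,
    render_body: Callable[[], bytes],
    if_none_match: Optional[str] = None,
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET request."""
    headers = {"ETag": current_etag}
    if if_none_match is not None:
        # Simplification: assumes no ETag value contains a comma.
        candidates = [_opaque(t) for t in if_none_match.split(",")]
        if "*" in candidates or _opaque(current_etag) in candidates:
            # Validator matches: no body is rendered or transferred.
            return 304, headers, b""
    return 200, headers, render_body()
```

Note that `render_body` is only called on the 200 path, which is where the savings in compute and bandwidth come from.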
Last-Modified and If-Modified-Since
Similarly to ETag, Google’s crawlers support Last-Modified based conditional requests, too, exactly
as defined in the HTTP Caching standard. This works the same way as ETag from a semantic
perspective (an identifier is used to decide whether the resource is cacheable), and provides the
same benefits as ETag on the clients’ side.
We have but a couple of recommendations if you’re using Last-Modified as a caching directive:
- The date in the Last-Modified header must be formatted according to the HTTP standard. To avoid
  parsing issues, we recommend using the following date format:
  “Weekday, DD Mon YYYY HH:MM:SS Timezone”. For example, “Fri, 4 Sep 1998 19:15:56 GMT”.
- While not required, consider also setting the max-age field of the Cache-Control header to help
  crawlers determine when to recrawl the specific URL. Set the value of the max-age field to the
  expected number of seconds the content will be unchanged. For example,
  Cache-Control: max-age=94043.
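To make the two recommendations above concrete, here is a small sketch using only Python’s standard library. The function names (`http_date`, `caching_headers`, `not_modified`) are invented for this example, and it assumes the server keeps the modification time as a timezone-aware `datetime`. `email.utils.format_datetime` with `usegmt=True` writes the date with a two-digit day and a `GMT` suffix.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional


def http_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def caching_headers(last_modified: datetime, max_age_seconds: int) -> Dict[str, str]:
    """Build the Last-Modified and Cache-Control response headers."""
    return {
        "Last-Modified": http_date(last_modified),
        "Cache-Control": f"max-age={max_age_seconds}",
    }


def not_modified(last_modified: datetime, if_modified_since: Optional[str]) -> bool:
    """Return True when a 304 response is appropriate."""
    if not if_modified_since:
        return False
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # An unparseable date is ignored, so the full response is sent.
        return False
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution; drop microseconds first.
    current = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    return current <= since
```

For a resource last modified at 19:15:56 GMT on 4 September 1998, `not_modified` returns `True` for an If-Modified-Since value of that same second or later, and `False` for anything earlier or malformed.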
Examples
If you’re like me, wrapping your head around how heuristic caching works is challenging; seeing an
example of the chain of requests and responses seems to help. Here are two chains, one for
ETag/If-None-Match and one for Last-Modified/If-Modified-Since, to visualize how it’s supposed to
work:
| | ETag / If-None-Match | Last-Modified / If-Modified-Since |
|---|---|---|
| A server’s response to a crawl: This is the response from which a crawler can save the precondition header fields ETag and Last-Modified. | HTTP/1.1 200 OK<br>Content-Type: text/plain<br>Date: Fri, 4 Sep 1998 19:15:50 GMT<br>ETag: "34aa387-d-1568eb00"<br>... | HTTP/1.1 200 OK<br>Content-Type: text/plain<br>Date: Fri, 4 Sep 1998 19:15:50 GMT<br>Last-Modified: Fri, 4 Sep 1998 19:15:56 GMT<br>Cache-Control: max-age=94043<br>... |
| Subsequent crawler conditional request: The conditional request is based on the precondition header values saved from a previous request. The values are sent back to the server for validation in the If-None-Match and If-Modified-Since request headers. | GET /hello.world HTTP/1.1<br>Host: www.example.com<br>Accept-Language: en, hu<br>User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)<br>If-None-Match: "34aa387-d-1568eb00"<br>... | GET /hello.world HTTP/1.1<br>Host: www.example.com<br>Accept-Language: en, hu<br>User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)<br>If-Modified-Since: Fri, 4 Sep 1998 19:15:56 GMT<br>... |
| Server response to the conditional request: Since precondition header values sent by the crawler are validated on the server’s side, the server returns a 304 HTTP status code (without an HTTP body) to the crawler. This will happen to every subsequent request until the preconditions fail to validate (the ETag or the Last-Modified date changes on the server’s side). | HTTP/1.1 304 Not Modified<br>Date: Fri, 4 Sep 1998 19:15:50 GMT<br>Expires: Fri, 4 Sep 1998 19:15:52 GMT<br>Vary: Accept-Encoding<br>ETag: "34aa387-d-1568eb00"<br>... | HTTP/1.1 304 Not Modified<br>Date: Fri, 4 Sep 1998 19:15:50 GMT<br>Expires: Fri, 4 Sep 1998 19:15:51 GMT<br>Vary: Accept-Encoding<br>Last-Modified: Fri, 4 Sep 1998 19:15:56 GMT<br>... |
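The client side of the same chain can be modeled in a few lines. This is a toy sketch and makes no claim about how Google’s crawlers are built: `CachedResponse`, `conditional_headers`, and `update_cache` are hypothetical names, header names are matched case-sensitively for brevity, and a real cache would also refresh stored metadata on a 304.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedResponse:
    """What a client keeps from an earlier 200 response."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: Optional[CachedResponse]) -> Dict[str, str]:
    """Turn saved validators into precondition request headers."""
    headers: Dict[str, str] = {}
    if cached is None:
        return headers
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def update_cache(
    cached: Optional[CachedResponse],
    status: int,
    headers: Dict[str, str],
    body: bytes,
) -> CachedResponse:
    """Keep the cached copy on 304, otherwise store the new response."""
    if status == 304 and cached is not None:
        return cached
    return CachedResponse(
        body=body,
        etag=headers.get("ETag"),
        last_modified=headers.get("Last-Modified"),
    )
```

A first fetch stores the body together with its validators, the next request carries them back in `If-None-Match` or `If-Modified-Since`, and a 304 leaves the stored body in place.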
If you’re in the business of making your users happy and perhaps also want to potentially save a
few bucks on your hosting bill, talk to your hosting or CMS provider, or your developers about how
to enable HTTP caching for your site. If nothing else, your users will like you a bit more.
If you wanna chat about caching, head to your nearest Search Central help community, and if you
have comments about how we’re caching, leave feedback on the documentation about caching that we
published together with this blog post.