Does Googlebot Use ETag Headers?
What are ETags?
A 304 Not Modified status code tells the client (your browser, or a crawler like Googlebot) that the resource hasn't changed since the version it holds in cache, so it can use that copy and not bother downloading it again. Naturally this can be a great boost in efficiency: there's no need to transfer the resource again, which saves bandwidth and resources for both the consumer of that content and the site operator.
There are two ways a browser and server work together to determine whether a page is the same as the one held in cache. When a URL is requested, the server sends out response headers, and two of them apply here:
Last-Modified is a date string that simply denotes the last time the URL's content was changed.
ETag is a string that is unique to the page and its version. Just as the Last-Modified timestamp changes when a page is updated, the ETag value changes to a new string.
A returning visitor sends a header of its own with the request for the resource: If-Modified-Since if the page sent a Last-Modified header when it was previously requested and cached, and/or If-None-Match if an ETag was sent in the prior response. If both are sent, If-None-Match takes precedence, with If-Modified-Since acting as a fallback.
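As a rough illustration of that precedence rule, here is a minimal Python sketch of how a server might decide between 200 and 304. The function name conditional_status, the lower-case header dictionary, and the simple comma split of the If-None-Match value are all assumptions made for this example. It is not how nginx or any particular server implements it.

```python
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop the weak-validator prefix so W/"abc" compares equal to "abc"."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers, current_etag=None, last_modified=None):
    """Return 304 if the client's cached copy still looks valid, else 200.

    request_headers: dict keyed by lower-case header name (assumption of this sketch)
    current_etag:    the resource's current ETag including its quotes, e.g. '"abc"'
    last_modified:   timezone-aware datetime of the last change, or None
    """
    if_none_match = request_headers.get("if-none-match")
    if if_none_match is not None:
        # If-None-Match takes precedence: If-Modified-Since is not consulted at all.
        if current_etag is None:
            return 200
        # Naive split; assumes the ETag values themselves contain no commas.
        candidates = [_strip_weak(t) for t in if_none_match.split(",")]
        if "*" in candidates or _strip_weak(current_etag) in candidates:
            return 304
        return 200

    if_modified_since = request_headers.get("if-modified-since")
    if if_modified_since is not None and last_modified is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: fall back to a full response
        # HTTP dates have one-second resolution, so ignore microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return 304
    return 200
```

A site that only sends ETag, as in this test, only ever exercises the first branch.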
In this test, I was interested in whether Googlebot uses ETags alone, and whether it would end up sending conditional requests and getting 304 responses. I ran the test on this site, and if you check the response headers, you can see that an ETag header is sent, but no Last-Modified.
I also used Postman to double-check that when the appropriate If-None-Match header is sent, the server responds correctly with a 304 status.
I also confirmed, using the Network tab in Chrome's DevTools, that a browser gets the 304 status code too.
So then it was a case of waiting and monitoring the log files to see whether any 304 responses were being sent to Googlebot.
I monitored the log files from the 14th of May 2019 up to around midday today, the 10th of June. Initially my conclusion was going to be that, in the case of this one site at least, no, Googlebot doesn't seem to use the ETag at all. To double-check this, I changed the logging of the nginx server to include the If-None-Match header (and, for my own information, the Accept-Encoding header too) by adding this to the nginx config file:
log_format headers '$remote_addr - $remote_user [$time_local] ' '"$request" $status $body_bytes_sent ' '"$http_referer" "$http_user_agent" "$http_if_none_match" "$http_accept_encoding"';
access_log /var/log/nginx/access.log headers;
This way I could be sure that no If-None-Match header was being sent by Googlebot. I decided to let this run for a few more days, and I am glad I did, as the extra time allowed me to catch some actual 304s being served to Googlebot as a result of it sending an If-None-Match header. Here's an example from the log:
220.127.116.11 - - [07/Jun/2019:16:01:46 +0100] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "\x2220e37-qfuPohLacizeFJDundMd0taADDE\x22" "gzip,deflate,br"
(This is the standard nginx log format with two additional fields at the end: the If-None-Match and Accept-Encoding request headers. Googlebot accepts Brotli, nice!)
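For anyone wanting to pull these fields out of their own logs, here is a minimal Python sketch that parses a line in this custom format. The names LOG_PATTERN and parse_line are made up for this example, and the regular expression assumes none of the quoted fields contain a literal double quote. nginx writes those as \x22, which is why the ETag appears that way in the log.

```python
import re

# Mirrors the custom "headers" log_format: the usual combined-style fields
# followed by the quoted If-None-Match and Accept-Encoding request headers.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<if_none_match>[^"]*)" "(?P<accept_encoding>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Undo nginx's \x22 escaping so the ETag gets its real double quotes back.
    entry["if_none_match"] = entry["if_none_match"].replace("\\x22", '"')
    # nginx logs "-" when the request did not carry the header at all.
    entry["sent_if_none_match"] = entry["if_none_match"] not in ("", "-")
    return entry


# The example line from the log, split across raw strings for readability.
SAMPLE = (
    r'220.127.116.11 - - [07/Jun/2019:16:01:46 +0100] "GET / HTTP/1.1" 304 0 "-" '
    r'"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 '
    r'(KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 '
    r'(compatible; Googlebot/2.1; +http://www.google.com/bot.html)" '
    r'"\x2220e37-qfuPohLacizeFJDundMd0taADDE\x22" "gzip,deflate,br"'
)

entry = parse_line(SAMPLE)
print(entry["status"], entry["if_none_match"], entry["accept_encoding"])
```

Filtering the parsed entries on the user agent and on sent_if_none_match is then enough to spot conditional requests from a given crawler.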
So just as I was about to call it as a no, it appears that Googlebot does indeed understand ETags and, under certain circumstances, uses them to send an If-None-Match header.
This site is, as I'm sure you've noticed, pretty tiny, and the crawl rate from Googlebot is pretty minimal. Over the course of the test I recorded 690 verified hits (verified by looking up the IP via reverse DNS), which I broke down by the status code Googlebot received.
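The reverse DNS verification mentioned here can be sketched in Python roughly as follows. This is an illustrative version, not the exact check used for this test: the function name is_verified_googlebot and the injectable lookup parameters are assumptions for the example, the accepted hostname suffixes are limited to googlebot.com and google.com, and the forward lookup only covers IPv4.

```python
import socket


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-confirm the hostname.

    reverse_lookup and forward_lookup are hypothetical hooks so the check can be
    exercised without real DNS; by default the socket module is used.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses).
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # The leading dot stops look-alike domains such as "fakegooglebot.com".
    normalised = hostname.rstrip(".").lower()
    if not normalised.endswith((".googlebot.com", ".google.com")):
        return False

    try:
        # The hostname must resolve back to the IP that made the request.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

The forward lookup matters because anyone who controls the reverse DNS for their own IP range can make it claim to be a googlebot.com host.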
So all in all, the 304 responses don't amount to much and aren't saving a large number of downloads, but scale that up to a larger, more heavily crawled site and it could start to represent significant savings. Of the hits where Googlebot did send the If-None-Match header, it seems this happened when the URL was requested fairly shortly after a prior request, within a few minutes. Quite what the upper limit is, it's hard to say with a limited data set, but if you have the kind of site where Googlebot hits a URL frequently, you might be serving more 304s if your server is correctly set up for ETags / If-None-Match.
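To check the "shortly after a prior request" pattern on your own logs, one option is to measure the gap between each 304 and the previous hit on the same URL. The Python sketch below assumes you already have (time_local, path, status) tuples in log order for a single crawler. The function name gaps_before_304 is hypothetical, and the month abbreviation parsing assumes an English locale.

```python
from datetime import datetime

# nginx's $time_local format, e.g. "07/Jun/2019:16:01:46 +0100".
NGINX_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def gaps_before_304(hits):
    """Return (path, seconds_since_previous_hit) for every 304 in the sequence.

    hits: iterable of (time_local, path, status) tuples in chronological order.
    A 304 with no earlier hit on the same path in the data is skipped.
    """
    last_seen = {}
    gaps = []
    for time_local, path, status in hits:
        when = datetime.strptime(time_local, NGINX_TIME_FORMAT)
        if status == 304 and path in last_seen:
            gaps.append((path, (when - last_seen[path]).total_seconds()))
        # Track every hit, whatever its status, as the "prior request".
        last_seen[path] = when
    return gaps
```

Looking at the largest gap in the result gives a rough lower bound on how long the crawler keeps using a stored ETag for conditional requests.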
In case you were curious, Bingbot seemed to send If-None-Match and receive a 304 much, much more frequently, with most of its requests bearing the If-None-Match header unless it was requesting robots.txt.
It's important to note that this is a limited test on one site, so as always I'd encourage you to experiment and validate in your own environments. Fundamentally, ETags are a good thing for your real human users, so I'd recommend them on that basis alone.