Google Unveils How Googlebot Really Crawls the Web
Google has offered a rare, detailed technical look into how its crawling systems operate, shedding light on how Googlebot works and on the limits that quietly shape how web pages are discovered and indexed.
In a detailed blog post, Gary Illyes explains that Googlebot is not a standalone system but one part of a much larger, shared crawling platform used across several Google products.
A Shared Backbone Behind the Crawlers
One of the most notable revelations is that Googlebot functions as just one client within a centralized crawling infrastructure.
This is the same system that also powers crawlers for services like Google Shopping and AdSense. Each of these clients, Illyes explains, operates with its own configuration, including distinct user-agent strings and crawling rules.
This architecture helps explain why different Google crawlers appear separately in server logs. While Googlebot represents Search, other crawlers identify themselves with their own user-agent strings, depending on the product they serve.
Despite these differences, they all rely on the same underlying system to fetch content from the web.
The 2 MB Rule: Where Crawling Draws the Line
At the heart of the update is a critical technical constraint: Googlebot will fetch only the first 2 MB of a web page’s content.
This limit applies to HTML files, while PDFs are allowed a much larger ceiling of 64 MB. When no specific limit is defined, some crawlers default to 15 MB.
What makes this limit particularly important is how it is enforced. Rather than rejecting oversized pages, Googlebot simply stops fetching once the 2 MB threshold is reached.
This truncated version of the page is then forwarded to Google’s indexing and rendering systems as if it were complete.
This means that any content beyond the cutoff point is effectively invisible to Google Search: it will not be indexed, rendered, or considered in ranking.
What Counts and What Doesn’t
Illyes also clarifies that HTTP headers count toward the 2 MB limit, a detail that may catch some developers off guard.
External resources such as CSS and JavaScript files, however, are treated separately. Each of these files is fetched independently and does not eat into the main page’s byte budget.
Not all resources are processed equally, though. Media files such as images, videos, and certain fonts are not fetched by Google’s rendering systems, which focus primarily on understanding a page’s structure and functionality rather than its visual content.
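To put those numbers in perspective, the short TypeScript sketch below fetches a page and roughly tallies response headers and HTML bytes against a 2 MB budget. It is an illustrative approximation only: the URL is a placeholder, and Google’s internal byte accounting is not public.

```typescript
// Rough page-weight audit against the 2 MB fetch budget described above.
// Illustrative only; not Google's actual accounting logic.

const FETCH_BUDGET = 2 * 1024 * 1024; // 2 MB cap for HTML, per the article

async function auditPage(url: string): Promise<void> {
  const response = await fetch(url);

  // Approximate the size of the response headers, since they count
  // toward the same budget as the HTML body.
  let headerBytes = 0;
  response.headers.forEach((value, name) => {
    headerBytes += Buffer.byteLength(`${name}: ${value}\r\n`);
  });

  const body = Buffer.from(await response.arrayBuffer());
  const total = headerBytes + body.length;
  console.log(`Headers: ~${headerBytes} bytes, body: ${body.length} bytes`);

  if (total <= FETCH_BUDGET) {
    console.log(`Within budget (${total} of ${FETCH_BUDGET} bytes used).`);
    return;
  }

  // Anything past the cutoff would simply never reach indexing or rendering.
  const keptBodyBytes = Math.max(0, FETCH_BUDGET - headerBytes);
  const lostBytes = body.length - keptBodyBytes;
  console.log(`Over budget: roughly the last ${lostBytes} bytes of HTML would be cut off.`);
}

auditPage('https://example.com/'); // hypothetical URL
```

Note that this check only covers the main HTML response; external CSS and JavaScript files, as described above, are fetched separately and do not need to be added to this total.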
Rendering Without Memory
Once the page is fetched, it is handed over to Google’s Web Rendering Service (WRS). This system executes JavaScript and processes dynamic elements to better interpret the page.
It operates in a stateless mode, clearing local storage and session data after each request.
For developers relying heavily on client-side rendering, this behavior can have significant implications. Any content dependent on stored session data may not be interpreted as intended.
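The hypothetical snippet below illustrates the kind of pattern affected. Because the renderer clears storage between requests, a stateless crawler would always land on the fallback branch, so content that only appears for returning visitors is unlikely to be seen. The widget, element IDs, and storage key are invented for illustration.

```typescript
// Hypothetical client-side widget that personalizes content from localStorage.
function renderGreeting(container: HTMLElement): void {
  const savedName = window.localStorage.getItem('visitorName');

  if (savedName) {
    // Only visible to browsers that have previously stored state.
    container.textContent = `Welcome back, ${savedName}!`;
  } else {
    // A stateless renderer always takes this branch; make sure it still
    // contains the content you want indexed.
    container.textContent = 'Welcome! Browse our latest guides below.';
  }
}

const target = document.querySelector<HTMLElement>('#greeting');
if (target) {
  renderGreeting(target);
}
```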
Practical Advice for Publishers
Google’s guidance is straightforward. Keeping critical elements such as meta tags, structured data, and canonical links near the top of the HTML is crucial.
Heavy inline content, including base64-encoded images or large blocks of embedded scripts, needs to be minimized.
Moving CSS and JavaScript into external files is also recommended, as this prevents them from contributing to the 2 MB cap.
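A simple way to act on that advice is to verify that the critical elements actually fall inside the first 2 MB of the served HTML. The TypeScript sketch below does a rough, regex-based check; the tag list and URL are examples, not an official Google checklist or tool.

```typescript
// Quick check that critical head elements land inside the first 2 MB of HTML.
const BUDGET = 2 * 1024 * 1024; // 2 MB, per the limit described above

const CRITICAL_PATTERNS: Record<string, RegExp> = {
  'canonical link': /<link[^>]+rel=["']canonical["']/i,
  'meta description': /<meta[^>]+name=["']description["']/i,
  'structured data (JSON-LD)': /<script[^>]+type=["']application\/ld\+json["']/i,
};

async function checkCriticalTags(url: string): Promise<void> {
  const html = Buffer.from(await (await fetch(url)).arrayBuffer());
  const visibleToCrawler = html.subarray(0, BUDGET).toString('utf8');

  for (const [label, pattern] of Object.entries(CRITICAL_PATTERNS)) {
    const found = pattern.test(visibleToCrawler);
    console.log(`${label}: ${found ? 'within the first 2 MB' : 'missing or beyond the cutoff'}`);
  }
}

checkCriticalTags('https://example.com/'); // hypothetical URL
```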
Why This Update Matters
For most websites, this limit is unlikely to pose a problem. Available data suggests that typical pages fall well within the threshold.
Larger, more complex pages, however, risk losing valuable content if they exceed the limit.
Illyes also hints that these limits may evolve as the web continues to grow. For now, when it comes to Googlebot, size and structure matter more than ever.