Introduction
I keep hearing the same question framed the same way: “Do we have a crawl budget problem?” It usually comes after someone notices that Googlebot spent a lot of time somewhere unimportant and not enough time where the business value actually is.
In practice, that question is almost always misdirected. What fails on real sites is rarely the budget itself. What fails is the crawl path — the structure of URLs and links that defines what the crawler is exposed to first, most often, and at scale.
This confusion is one reason crawl-budget discourse keeps resurfacing as a “trend.” I touched on that pattern earlier in Latest SEO Trends 2025 (AU perspective): the industry prefers a single abstract explanation over uncomfortable structural causes.
What follows is not advice. It’s a description of how crawling behaves when you look at logs long enough.
1. What people mean by “crawl budget” — and what it actually is
Publicly, Google has described crawl budget as a combination of crawl rate limit and crawl demand. That framing is technically correct and operationally misleading.
Crawl rate limit is mostly about infrastructure safety. If your server responds slowly or errors spike, fetch rates drop. On modern CDNs with stable latency, this is rarely the bottleneck.
Crawl demand is where most sites quietly fail. Demand is shaped by how many distinct URLs the system believes are worth revisiting. That belief is inferred from link signals, historical change frequency, duplication patterns, and canonical consolidation.
The important part: Google does not allocate crawl budget as a static quota per site. It continuously rebalances effort across URL families. That means you don’t “run out” of crawl — you dilute it.
Gary Illyes has repeated this in different forms for years. One of the more blunt versions was: “For the most part, crawling is not something you need to worry about unless you have an extremely large site.” What usually gets dropped is the next implication: extremely large isn’t just page count — it’s URL surface.
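To make the dilution point concrete, here is a toy model (the numbers, weights, and URL families are invented, and this is not how Google's scheduler actually works): when a roughly fixed amount of daily crawling is shared across URL families in proportion to how strongly the link graph advertises them, adding a large parameterised family doesn't take anything away from a quota. It shrinks the share left for distinct content.

```python
# Toy model of crawl dilution. Numbers and weights are invented; this
# illustrates proportional sharing, not Google's actual scheduler.
DAILY_FETCHES = 10_000

# URL family -> relative weight inferred from how heavily internal linking
# advertises it (hypothetical values).
url_families = {
    "/products/{slug}": 40,
    "/guides/{slug}": 20,
    "/category/{cat}?{facets}": 120,  # faceted combinations
    "/search?q={query}": 60,          # internal search results
}

total_weight = sum(url_families.values())
for family, weight in sorted(url_families.items(), key=lambda kv: -kv[1]):
    share = weight / total_weight
    print(f"{share:6.1%}  ~{int(share * DAILY_FETCHES):>5} fetches/day  {family}")
```

With these made-up weights, the two distinct-content families end up with roughly a quarter of the daily fetches between them. Nothing was revoked; the URL surface grew.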
2. Crawl path is a graph problem, not a budget problem
From a crawler’s perspective, your site is a directed graph. URLs are nodes. Links are edges. Parameters and pagination multiply nodes without necessarily adding information.
Once you look at it that way, a lot of behaviour stops looking mysterious. Crawlers prioritise URL families that are:
- cheap to fetch,
- frequently linked,
- and, all too often, historically redundant.
If your internal linking constantly advertises faceted combinations or infinite pagination, the crawler will keep sampling them. Not because it wants to — because the graph keeps telling it those paths exist.
This is why pagination and facet handling matter structurally, not philosophically. I go into the mechanics in Pagination / Facets / Crawl Traps, but the short version is simple: unbounded URL generators dominate the frontier.
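To see how quickly that graph inflates, here's a minimal sketch (the category, facet names, and values are made up): one category page with a handful of crawlable filter and sort parameters becomes a three-digit number of distinct nodes, none of which adds a new document.

```python
# How facet parameters multiply nodes in the URL graph without adding
# documents. Facet names and values below are invented.
from itertools import product

BASE = "/category/shoes"
facets = {
    "colour": ["black", "white", "red", "blue"],
    "size": ["7", "8", "9", "10", "11"],
    "sort": ["price_asc", "price_desc", "newest"],
}

keys = list(facets)
urls = set()
# Every combination of "facet set / facet unset" that navigation links to
# becomes a distinct node on the crawl frontier.
for values in product(*[[None] + facets[k] for k in keys]):
    query = "&".join(f"{k}={v}" for k, v in zip(keys, values) if v is not None)
    urls.add(BASE + ("?" + query if query else ""))

print(f"1 category, {len(keys)} facets -> {len(urls)} distinct crawlable URLs")
```

Add pagination to each combination and the number multiplies again. That is what "unbounded URL generator" means in log terms.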
3. What “crawl waste” looks like in real logs
Search Console crawl stats are summaries. They hide the shape of the problem.
When you look at raw access logs over weeks or months, patterns become obvious. The same URL templates absorb the majority of fetches, while large sections of genuinely distinct content are seen late or inconsistently.
Below is a distilled mapping from log analysis. This is not a framework; it’s a description of recurring patterns.
| Observable behaviour | Log-level signal | What it usually means |
|---|---|---|
| High crawl volume, low growth in indexed URLs | Repeated hits to similar parameterised URLs | URL families collapsing into the same canonical cluster |
| Important pages discovered late | Long time-to-first-seen for key templates | Weak internal discovery; often overlaps with Orphan Pages: The Silent Indexing Killer |
| Bursty crawling followed by long gaps | Large fetch spikes around updates | Redirect churn or refresh loops |
| Majority of crawls hit “non-content” URLs | Filters, sorts, internal search URLs | Navigation leaking crawlable parameters |
| High share of rendered fetches | Googlebot repeatedly requesting JS-rendered variants | Discovery depends on rendering, not static links |
On large content sites, it’s common to see 60–80% of crawls concentrated on URL patterns that contribute little or nothing to indexable inventory. The exact percentage varies, but the skew rarely disappears on its own.
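The table above is the output of a fairly mundane exercise: filter the access log to Googlebot, collapse each request path into a rough template, and look at the distribution. Here is a minimal sketch of that grouping step, assuming a standard combined log format; the filename and the template heuristics are placeholders, and verifying Googlebot by reverse DNS is left out.

```python
# Group Googlebot fetches by URL template to see where crawl effort goes.
# Assumes a combined access log; filename and heuristics are placeholders.
import re
from collections import Counter
from urllib.parse import urlsplit

REQUEST = re.compile(r'"(?:GET|HEAD|POST) (?P<path>\S+) HTTP/[^"]*".*"(?P<ua>[^"]*)"')

def template(path: str) -> str:
    """Collapse a path into a rough template: drop the query string and
    replace numeric or very long segments with placeholders."""
    segments = []
    for seg in urlsplit(path).path.strip("/").split("/"):
        if seg.isdigit():
            segments.append("{id}")
        elif len(seg) > 24:
            segments.append("{slug}")
        else:
            segments.append(seg)
    return "/" + "/".join(segments) + ("?{params}" if "?" in path else "")

counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = REQUEST.search(line)
        if m and "Googlebot" in m.group("ua"):
            counts[template(m.group("path"))] += 1

total = sum(counts.values()) or 1
for tmpl, hits in counts.most_common(15):
    print(f"{hits / total:6.1%}  {hits:>8}  {tmpl}")
```

The heuristics don't need to be clever. Even this crude grouping makes the skew visible within a few weeks of logs.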
4. Discovery competition matters more than depth
Depth is an attractive metric because it’s easy to visualise. Unfortunately, it’s also a weak proxy for discovery.
What matters is competition on the crawl frontier. If hundreds of cheap, similar URLs are constantly reintroduced through navigation, they crowd out more expensive or less frequently linked pages.
This is why sites can be technically crawlable and still practically undiscovered. The crawler isn’t blocked; it’s busy.
John Mueller has hinted at this dynamic when discussing large sites, noting that Google will focus on URLs it considers important and representative. Importance, in this sense, is inferred from link structure and historical signals — not intent.
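A toy frontier makes the crowding effect visible (everything here is invented: the priorities, the URL counts, the fetch rate; real schedulers are far more sophisticated). Cheap, heavily re-linked facet URLs are re-enqueued every cycle, and a fixed fetch rate never gets around to the weakly linked articles, even though nothing blocks them.

```python
# Toy crawl-frontier simulation. Priorities, counts, and fetch rates are
# invented; this only illustrates crowding, not any real scheduler.
import heapq
from itertools import count

FETCHES_PER_CYCLE = 100
CYCLES = 10

frontier = []      # heap of (priority, tiebreak, url); lower priority = sooner
tiebreak = count()

def enqueue(url: str, priority: float) -> None:
    heapq.heappush(frontier, (priority, next(tiebreak), url))

# 50 genuinely new articles, weakly linked, so a weak priority signal.
for i in range(50):
    enqueue(f"/articles/{i}", priority=5.0)

first_seen = {}
for cycle in range(CYCLES):
    # Navigation re-advertises 500 facet combinations every cycle.
    for i in range(500):
        enqueue(f"/category/shoes?filter={i}&page={cycle}", priority=1.0)

    for _ in range(FETCHES_PER_CYCLE):
        if not frontier:
            break
        _, _, url = heapq.heappop(frontier)
        if url.startswith("/articles/") and url not in first_seen:
            first_seen[url] = cycle

print(f"articles fetched after {CYCLES} cycles: {len(first_seen)} of 50")
```

The articles are perfectly crawlable; they just never win the frontier. That is the difference between blocked and busy.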
5. Internal linking as a scheduling signal
Internal links are not decoration. They are one of the few explicit signals you give the crawler about priority and relationship.
Over time, link graphs decay. Sections get deprecated, templates change, navigation expands. The crawler continues to follow what you leave exposed. That’s the operational point behind Internal Link Decay in Large Content Sites: decay isn’t abstract, it’s cumulative misrouting.
In logs, this often shows up as:
- persistent crawling of legacy sections,
- delayed discovery of newer templates,
- and repeated revisits to URLs that no longer define the site.
None of this is fixed by publishing faster or submitting more sitemaps.
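One way to put a rough number on that misrouting, under simple assumptions (you already have per-URL Googlebot hit counts from the log step above, plus the set of URLs currently listed in your sitemaps as a stand-in for what the site still claims to be about):

```python
# Flag site sections that still absorb Googlebot fetches even though nothing
# in the current sitemaps points at them. Inputs are assumed to exist:
#   crawl_hits   - {url: googlebot_fetches}, aggregated from access logs
#   sitemap_urls - set of URLs currently listed in your sitemaps
from collections import Counter

def section(url: str) -> str:
    """First path segment, e.g. /promo-2019/summer -> /promo-2019/."""
    return "/" + url.lstrip("/").split("/", 1)[0] + "/"

def legacy_crawl_report(crawl_hits: dict, sitemap_urls: set) -> Counter:
    referenced = {section(u) for u in sitemap_urls}
    legacy = Counter()
    for url, hits in crawl_hits.items():
        if section(url) not in referenced:
            legacy[section(url)] += hits
    return legacy

# Illustrative inputs (made up).
crawl_hits = {"/promo-2019/summer": 1400, "/promo-2019/winter": 900,
              "/guides/crawling": 300}
sitemap_urls = {"/guides/crawling", "/guides/indexing"}

for sec, hits in legacy_crawl_report(crawl_hits, sitemap_urls).most_common():
    print(f"{hits:>6} fetches  {sec}  (crawled, but nothing current points here)")
```

Sitemaps are an imperfect proxy for the live link graph, but the sections this surfaces are usually the same ones the logs complain about.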
6. Where performance and rendering actually fit
Performance affects crawl throughput when it affects fetch cost. Rendering affects crawl behaviour when it gates discovery.
If Google must render a page to find links that lead to new content, that content will compete with every other render-dependent URL on the site. On JavaScript-heavy sites, this can materially slow discovery even when servers are fast.
What’s often missing from the discussion is proportionality. Rendering is a problem when it’s required for discovery at scale. It’s a non-issue when rendered content doesn’t materially expand the graph.
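A rough check for render-dependent discovery, assuming you can save two versions of the same template page (the raw HTML response and a DOM snapshot after rendering, for example from your own headless browser run; the filenames here are placeholders): diff the outgoing links.

```python
# Compare links visible in the raw HTML response against links that only
# appear after rendering. Filenames are placeholders; the regex is a crude
# stand-in for a proper HTML parser.
import re

HREF = re.compile(r'<a\b[^>]*\bhref=["\']([^"\'#]+)["\']', re.IGNORECASE)

def links(path: str) -> set:
    with open(path, encoding="utf-8", errors="replace") as fh:
        return set(HREF.findall(fh.read()))

raw = links("raw.html")            # plain HTTP response body
rendered = links("rendered.html")  # DOM serialised after JS execution

render_only = rendered - raw
print(f"{len(render_only)} outgoing links exist only after rendering")
for link in sorted(render_only)[:20]:
    print("  ", link)
```

If that number is small, rendering is mostly a latency question. If new content is only reachable through those links, it is a discovery question, and it scales with the whole render-dependent surface.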
7. What we still don’t know
Two areas remain opaque despite years of observation:
- How aggressively Google collapses large canonical clusters before scheduling revisits.
- How crawl prioritisation changes across infrastructure classes (edge-cached vs origin-heavy sites).
You can infer behaviour from logs, but you can’t observe the scheduler directly. Anyone claiming certainty here is speculating.
Conclusion
“Crawl budget” survives as a concept because it’s comforting. It suggests a missing allocation instead of an exposed structure.
On real sites, the limiting factor is almost always the crawl path: the URL families you generate, the links you surface, and the duplication you tolerate. Crawlers follow what you show them.
If you keep diagnosing crawl issues as a budget shortage, you’ll keep fixing the wrong layer. The system doesn’t need more allowance. It needs fewer dead ends and clearer paths.