Google recently published a podcast episode discussing what’s known as a crawl budget and what influences whether Google indexes content.
Both Gary Illyes and Martin Splitt shared insights into indexing the web, as understood from Google’s perspective.
Origin of the Crawl Budget Concept
Gary Illyes said that the concept of a crawl budget was something created outside of Google by the search community.
He explained that there wasn’t any one thing inside Google that corresponded to the idea of a crawl budget.
When people talked about a crawl budget, what was happening inside Google involved multiple metrics, not a single thing called a crawl budget.
So, inside Google, they worked out which internal metrics could represent a crawl budget and came up with a way of talking about it.
He said:
“…for the longest time we were saying that we don’t have the concept of crawl budget. And it was true.
We didn’t have something that could mean crawl budget on its own– the same way we don’t have a number for EAT, for example.
And then, because people were talking about it, we tried to come up with something… at least, somehow defined.
And then we worked with two or three or four teams– I don’t remember– where we tried to come up with at least a few internal metrics that could map together into something that people externally define as crawl budget.”
What Crawl Budget Means Within Google
According to Gary, part of the calculation for a crawl budget is based on practical considerations, like how many URLs the server allows Googlebot to crawl without becoming overloaded.
Gary Illyes and Martin Splitt:
“Gary Illyes: …we defined it as the number of URLs Googlebot can and is willing or is instructed to crawl.

Martin Splitt: For a given site.

Gary Illyes: For a given site, yes.

And for us, that’s roughly what crawl budget means because if you think about it, we don’t want to harm websites because Googlebot has enough Chrome capacity to bring down sites…”
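To make that definition concrete, here is a minimal sketch of it as a toy model, assuming a deliberately simplified view in which a site’s effective crawl budget is just the smaller of what the server can tolerate and what the crawler wants to fetch. The function name and figures are hypothetical illustrations, not Google’s actual implementation:

```python
# Toy model of the crawl budget concept as Gary describes it: the number
# of URLs actually crawled is capped both by what the server can tolerate
# (so Googlebot doesn't bring the site down) and by how many URLs the
# crawler is "willing or instructed" to fetch. All names and numbers here
# are invented for illustration.

def effective_crawl_budget(host_capacity: int, crawl_demand: int) -> int:
    """Return the number of URLs a crawler would fetch from a site.

    host_capacity: URLs the server can serve without being overloaded.
    crawl_demand:  URLs the crawler wants (or is instructed) to crawl.
    """
    return min(host_capacity, crawl_demand)

# A site whose server could handle 5,000 fetches, but that the crawler
# only wants 800 URLs from, is limited by demand, not by capacity.
print(effective_crawl_budget(host_capacity=5000, crawl_demand=800))  # 800
```

In this simplified view, most sites sit well below both limits, which foreshadows the point made later in the podcast that over 90% of sites don’t need to worry about crawl budget.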
Balancing Different Considerations
Another interesting point was that crawling involves balancing different considerations. There are limits to what can be stored, so, according to Google, that means using its resources “where it matters.”
“Martin Splitt: Apparently, obviously, everyone wants everything to be indexed as quickly as possible, be it the new website that just came online or be it websites that have plenty of pages, and they want to frequently change those, and they’re worried about things not being crawled as quickly.
I usually describe it as a challenge with the balance between not overwhelming the website and also spending our resources where it matters.”
John Mueller recently tweeted that Google doesn’t index everything and mentioned that not everything is useful.
Mueller’s tweet:
“…it’s important to keep in mind that Google just doesn’t index every page on the web, even if it’s submitted directly. If there’s no error, it might get selected for indexing over time — or Google might just focus on some other pages on your site.”
He followed up with another tweet:
“Well, lots of SEOs & sites (perhaps not you/yours!) produce terrible content that’s not worth indexing. Just because it exists doesn’t mean it’s useful to users.”
- Martin Splitt described crawling as a matter of “spending our resources where it matters.”
- John Mueller raised the question of whether the content is “useful to users.”
Usefulness is an interesting angle for judging content, and in my opinion it can be more helpful for diagnosing content than the sterile advice to make sure the content “targets the user intent” and is “keyword optimized.”
For example, I recently reviewed a YMYL site where the entire site looked like it was created from an SEO to-do checklist.
- Create an Author profile
- Author profile should have a LinkedIn Page
- Keyword optimize the content
- Link out to “authority” sites
The publisher was using an AI-generated image for the author bio photo, and the same image was used on a fake LinkedIn profile.
Many of the site’s webpages linked to thin .gov pages that had the keywords in the title but were not useful at all. It was as if they hadn’t even looked at the government pages to judge whether they were worth linking to.
Outwardly, they were ticking the boxes of an SEO to-do checklist, completing rote SEO activities such as linking to a .gov site, creating an author profile, etc.
They created the outward appearance of quality without really achieving it, because at no step did they consider whether what they were doing was actually useful.
Crawl Budget Is Not Something To Worry About
Gary and Martin began talking about how most sites don’t need to worry about the crawl budget.
Gary pointed the finger at search industry blogs that, in the past, promoted the idea that crawl budget is something to worry about when, according to him, it generally is not.
He said:
“I think it’s partly a fear of something happening that they can’t control, that people can’t control, and the other thing is just misinformation.
…And there were some blogs back in the days where people were talking about crawl budget, and it’s so important, and then people were finding that, and they were getting confused about “Do I have to worry about crawl budget or not?”
Martin Splitt asked:
“But let’s say you were an interesting blog… Do you need to worry about crawl budget?”
And Gary responded:
“I think most people don’t have to worry about it, and when I say most, it’s probably over 90% of sites on the internet don’t have to worry about it.”
A few minutes later in the podcast Martin observed:
“But people are worried about it, and I’m not exactly sure where it comes from.
I think it comes from the fact that a few large-scale websites do have articles and blog posts where they talk about crawl budget being a thing.
It is being discussed in SEO training courses. As far as I’ve seen, it’s being discussed at conferences.
But it’s a problem that is rare to be had. Like it’s not a thing that every website suffers, and yet, people are very nervous about it.”
How Google Determines What to Index
What followed next was a discussion about factors that cause Google to index content.
Of interest is when Gary talks about wanting to index content that might be searched for.
Gary Illyes:
“…Because like we said, we don’t have infinite space, so we want to index stuff that we think– well, not we– but our algorithms determine that it might be searched for at some point, and if we don’t have signals, for example, yet, about a certain site or a certain URL or whatever, then how would we know that we need to crawl that for indexing?”
Gary and Google Search Central tech writer Lizzi Sassman (@okaylizzi) next talked about inferring from the rest of the site whether or not it’s worth indexing new content.
“Gary Illyes: And some things you can infer from– for example, if you launch a new blog on your main site, for example, and you have a new blog subdirectory, for example, then we can sort of infer, based on the whole site, whether we want to crawl a lot from that blog or not.

Lizzi Sassman: But the blog is a new type of content that might be updated more frequently, so how can we tell if that is…? It’s just new. We’re not sure if it’s going to be newsy, like how frequent it’s still to be determined.

Gary Illyes: But we need a starter signal.

Lizzi Sassman: And the starter signal is…

Gary Illyes: Infer from the main site.”
Gary then pivoted to talking about quality signals. The quality signals they discussed, though, related to user interest: are people interested in this product? Are people interested in this site?
He explained:
“But it’s not just update frequency. It’s also the quality signals that the main site has.
So, for example, if we see that a certain pattern is very popular on the Internet, like a slash product is very popular on the Internet, and people on Reddit are talking about it, other sites are linking to URLs in that pattern, then it’s a signal for us that people like the site in general.”
Gary continued talking about popularity and interest signals, but in the context of the conversation, which was about a newly launched section of a site.
In the discussion he calls the new section a directory.
Illyes:
“While if you have something that people are not linking to, and then you are trying to launch a new directory, it’s like, well, people don’t like the site, then why would we crawl this new directory that you just launched?
And eventually, if people just start linking to it–“
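To illustrate the “starter signal” idea Gary and Lizzi described, here is a minimal sketch, assuming a toy scoring model in which site-level signals (links, mentions, past crawl history) set the starting crawl priority for a brand-new directory. Every signal name, weight, and threshold below is my own assumption for illustration; the podcast does not describe Google’s actual scoring:

```python
# Hypothetical sketch of the "starter signal" idea: a brand-new directory
# (e.g. /blog/) has no signals of its own yet, so a crawler could fall
# back on signals for the site as a whole to decide how much to crawl
# from it. All names, weights, and numbers here are invented.

def starter_crawl_priority(site_signals: dict) -> float:
    """Combine site-level signals into a 0..1 starting priority for a new directory."""
    weights = {
        "inbound_links": 0.5,      # other sites linking to URLs on the site
        "external_mentions": 0.3,  # e.g. people discussing the site on Reddit
        "crawl_history": 0.2,      # how useful past crawls of the site were
    }
    return sum(w * site_signals.get(name, 0.0) for name, w in weights.items())

# A well-linked, frequently discussed site lends its new directory a much
# higher starting priority than an unknown site would get.
popular_site = {"inbound_links": 0.9, "external_mentions": 0.8, "crawl_history": 0.7}
unknown_site = {"inbound_links": 0.1, "external_mentions": 0.0, "crawl_history": 0.2}
print(round(starter_crawl_priority(popular_site), 2))  # 0.83
print(round(starter_crawl_priority(unknown_site), 2))  # 0.09
```

The design intent matches Gary’s point: a site nobody links to or talks about gives its new directory almost no starting signal, so there is little reason for a crawler to prioritize it until people start linking to it.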
Crawl Budget and Sites that Get Indexed
To recap some of what was discussed:
- Google doesn’t have infinite capacity and can’t index everything on the web.
- Because Google can’t index everything, it’s important to be selective by indexing only the content that matters.
- Content topics that matter tend to be discussed.
- Sites that are important, which also tend to be useful, tend to be discussed and linked to.
Obviously, that’s not a comprehensive list of everything that influences what gets indexed. Nor is it meant to be an SEO checklist.
It’s just an idea of the kinds of things that Gary Illyes and Martin Splitt considered important enough to discuss.
Featured image by Shutterstock/Trismegist san
Citation
Listen to the podcast here: