Google published a groundbreaking research paper about identifying page quality with AI. The details of the algorithm seem remarkably similar to what the helpful content algorithm is known to do.
Google Doesn’t Identify Algorithm Technologies
Nobody outside of Google can say with certainty that this research paper is the basis of the helpful content signal.
Google generally does not identify the underlying technology of its various algorithms such as the Penguin, Panda or SpamBrain algorithms.
So one can’t say with certainty that this algorithm is the helpful content algorithm; one can only speculate and offer an opinion about it.
But it’s worth a look because the similarities are eye-opening.
The Helpful Content Signal
1. It Improves a Classifier
Google has provided a number of clues about the helpful content signal but there is still a lot of speculation about what it really is.
The first clue is in a December 6, 2022 tweet announcing that month’s helpful content update.
The tweet said:
“It improves our classifier & works across content globally in all languages.”
A classifier, in machine learning, is something that categorizes data (is it this or is it that?).
2. It’s Not a Manual or Spam Action
The Helpful Content algorithm, according to Google’s explainer (What creators should know about Google’s August 2022 helpful content update), is not a spam action or a manual action.
“This classifier process is entirely automated, using a machine-learning model.
It is not a manual action nor a spam action.”
3. It’s a Ranking Related Signal
The helpful content update explainer says that the helpful content algorithm is a signal used to rank content.
“…it’s just a new signal and one of many signals Google evaluates to rank content.”
4. It Checks if Content is By People
The interesting thing is that the helpful content signal (apparently) checks if the content was created by people.
Google’s blog post on the Helpful Content Update (More content by people, for people in Search) stated that it’s a signal to identify content created by people and for people.
Danny Sullivan of Google wrote:
“…we’re rolling out a series of improvements to Search to make it easier for people to find helpful content made by, and for, people.
…We look forward to building on this work to make it even easier to find original content by and for real people in the months ahead.”
The concept of content being “by people” is repeated three times in the announcement, apparently indicating that it’s a quality of the helpful content signal.
And if it’s not written “by people” then it’s machine-generated, which is an important consideration because the algorithm discussed here is related to the detection of machine-generated content.
5. Is the Helpful Content Signal Multiple Things?
Lastly, Google’s blog announcement seems to indicate that the Helpful Content Update isn’t just one thing, like a single algorithm.
Danny Sullivan writes that it’s a “series of improvements,” which, if I’m not reading too much into it, means that it’s not just one algorithm or system but several that together accomplish the task of weeding out unhelpful content.
This is what he wrote:
“…we’re rolling out a series of improvements to Search to make it easier for people to find helpful content made by, and for, people.”
Text Generation Models Can Predict Page Quality
What this research paper found is that large language models (LLMs) like GPT-2 can accurately identify low quality content.
The researchers used classifiers that were trained to identify machine-generated text and discovered that those same classifiers were able to identify low quality text, even though they were not trained to do that.
Large language models can learn how to do new things that they were not trained to do.
A Stanford University article about GPT-3 discusses how it independently learned the ability to translate text from English to French, simply because it was given more data to learn from, something that didn’t occur with GPT-2, which was trained on less data.
The article notes how adding more data causes new behaviors to emerge, a result of what’s called unsupervised training.
Unsupervised training is when a machine learns from data that has not been labeled with the answers it is expected to produce.
That word “emerge” is important because it refers to when the machine learns to do something that it wasn’t trained to do.
The Stanford University article on GPT-3 explains:
“Workshop participants said they were surprised that such behavior emerges from simple scaling of data and computational resources and expressed curiosity about what further capabilities would emerge from further scale.”
A new ability emerging is exactly what the research paper describes. They discovered that a machine-generated text detector could also predict low quality content.
The researchers write:
“Our work is twofold: firstly we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors of ‘page quality’, able to detect low quality content without any training.
This enables fast bootstrapping of quality indicators in a low-resource setting.
Secondly, curious to understand the prevalence and nature of low quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.”
The takeaway here is that they used a text generation model trained to spot machine-generated content and discovered that a new behavior emerged, the ability to identify low quality pages.
OpenAI GPT-2 Detector
The researchers tested two systems to see how well they worked for detecting low quality content.
One of the systems used RoBERTa, which is a pretraining method that is an improved version of BERT.
Of the two systems tested, they discovered that OpenAI’s GPT-2 detector was superior at detecting low quality content.
The description of the test results closely mirrors what we know about the helpful content signal.
AI Detects All Forms of Language Spam
The research paper states that there are many signals of quality but that this approach only focuses on linguistic or language quality.
For the purposes of this algorithm research paper, the phrases “page quality” and “language quality” mean the same thing.
The breakthrough in this research is that they successfully used the OpenAI GPT-2 detector’s prediction of whether something is machine-generated or not as a score for language quality.
They write:
“…documents with high P(machine-written) score tend to have low language quality.
…Machine authorship detection can thus be a powerful proxy for quality assessment.
It requires no labeled examples – only a corpus of text to train on in a self-discriminating fashion.
This is particularly valuable in applications where labeled data is scarce or where the distribution is too complex to sample well.
For example, it is challenging to curate a labeled dataset representative of all forms of low quality web content.”
What that means is that this system does not have to be trained to detect specific kinds of low quality content.
It learns to find all of the variations of low quality by itself.
This is a powerful approach to identifying pages that are not high quality.
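The idea of reusing a detector’s output as a quality score can be sketched in a few lines of Python. This is only an illustration, not the researchers’ implementation: `toy_machine_probability` is a hypothetical stand-in for a real trained detector (it merely measures word repetition), and treating `1 - P(machine-written)` as the quality score is an assumption made for this example.

```python
from typing import Callable, List, Tuple


def toy_machine_probability(text: str) -> float:
    """Hypothetical stand-in for a trained detector's P(machine-written).

    A real detector is a trained model. This toy version only treats
    heavy word repetition as a sign of machine-like text.
    """
    words = text.lower().split()
    if not words:
        return 1.0  # assumption: empty text gets the worst score
    unique_ratio = len(set(words)) / len(words)
    return 1.0 - unique_ratio


def language_quality_proxy(text: str, detector: Callable[[str], float]) -> float:
    """Assumed proxy: the higher P(machine-written), the lower the quality."""
    return 1.0 - detector(text)


def rank_by_quality(
    pages: List[str], detector: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Score every page with the detector and sort best-first."""
    scored = [(page, language_quality_proxy(page, detector)) for page in pages]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


pages = ["the cat sat on a mat", "buy buy buy buy cheap cheap"]
ranked = rank_by_quality(pages, toy_machine_probability)
```

Note that nothing in the sketch is told what “low quality” looks like. The quality ordering falls out of the detector’s score alone, which is the point the researchers make about not needing labeled examples.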
Results Mirror Helpful Content Update
They tested this system on half a billion webpages, analyzing the pages using different attributes such as document length, age of the content and the topic.
The age of the content isn’t about marking new content as low quality.
They simply analyzed web content by time and discovered that there was a huge jump in low quality pages beginning in 2019, coinciding with the growing popularity of the use of machine-generated content.
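This kind of breakdown, whether by year or by topic, can be pictured as a simple grouping exercise. The sketch below is an assumption-laden illustration rather than the paper’s method: the function name, the 0.5 cutoff for calling a page low quality, and the input format of `(group, P(machine-written))` pairs are all hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def low_quality_share(
    pages: Iterable[Tuple[Hashable, float]], threshold: float = 0.5
) -> Dict[Hashable, float]:
    """Return the fraction of pages per group whose detector score
    is at or above the (assumed) low-quality threshold.

    Each item is (group, p_machine), where group could be a year
    or a topic label.
    """
    totals: Counter = Counter()
    flagged: Counter = Counter()
    for group, p_machine in pages:
        totals[group] += 1
        if p_machine >= threshold:
            flagged[group] += 1
    return {group: flagged[group] / totals[group] for group in totals}


# Made-up values for illustration only, not data from the paper.
sample = [(2018, 0.1), (2018, 0.2), (2019, 0.8), (2019, 0.3), (2019, 0.9), (2019, 0.6)]
shares = low_quality_share(sample)
```

Swapping the year for a topic label in each pair gives the per-topic view with no change to the function.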
Analysis by topic revealed that certain topic areas tended to have higher quality pages, like the legal and government topics.
Interestingly, they discovered a huge amount of low quality pages in the education space, which they said corresponded with sites that offered essays to students.
What makes that interesting is that education is a topic specifically mentioned by Google as being affected by the Helpful Content update.
Google’s blog post written by Danny Sullivan shares:
“…our testing has found it will especially improve results related to online education…”
Three Language Quality Scores
Google’s Quality Raters Guidelines (PDF) uses four quality scores: low, medium, high and very high.
The researchers used three quality scores for testing of the new system, plus one more named undefined.
Documents rated as undefined were those that couldn’t be assessed, for whatever reason, and were removed.
The scores are rated 0, 1, and 2, with two being the highest score.
These are the descriptions of the Language Quality (LQ) Scores:
“0: Low LQ. Text is incomprehensible or logically inconsistent.
1: Medium LQ. Text is comprehensible but poorly written (frequent grammatical / syntactical errors).
2: High LQ. Text is comprehensible and reasonably well-written (infrequent grammatical / syntactical errors).”
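One way to picture how a rubric like this can be used to check a detector is to group detector scores by the human rating. The following is a minimal sketch under stated assumptions: the `LanguageQuality` enum and `mean_score_by_rating` function are hypothetical names, documents rated undefined are assumed to have been removed already, and the sample numbers are invented purely for illustration.

```python
from collections import defaultdict
from enum import IntEnum
from statistics import mean
from typing import Dict, Iterable, Tuple


class LanguageQuality(IntEnum):
    """The three Language Quality (LQ) scores described above."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def mean_score_by_rating(
    rated: Iterable[Tuple[int, float]]
) -> Dict[LanguageQuality, float]:
    """Average the detector's P(machine-written) for each human rating.

    Each item is (human_rating, p_machine). If the detector is a useful
    proxy, the average should fall as the human rating rises.
    """
    buckets = defaultdict(list)
    for rating, p_machine in rated:
        buckets[LanguageQuality(rating)].append(p_machine)
    return {rating: mean(scores) for rating, scores in sorted(buckets.items())}


# Made-up values for illustration only, not results from the paper.
sample = [(0, 0.9), (0, 0.7), (1, 0.5), (2, 0.1), (2, 0.3)]
averages = mean_score_by_rating(sample)
```

With these invented numbers the averages decrease from LOW to HIGH, which is the pattern the quoted finding (“documents with high P(machine-written) score tend to have low language quality”) would predict.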
Here are the Quality Raters Guidelines’ definitions of low quality:
Lowest Quality:
“MC is created without adequate effort, originality, talent, or skill necessary to achieve the purpose of the page in a satisfying way.
…little attention to important aspects such as clarity or organization.
…Some Low quality content is created with little effort in order to have content to support monetization rather than creating original or effortful content to help users. “Filler” content may also be added, especially at the top of the page, forcing users to scroll down to reach the MC.
…The writing of this article is unprofessional, including many grammar and punctuation errors.”
The quality raters guidelines have a more detailed description of low quality than the algorithm.
What’s interesting is how the algorithm relies on grammatical and syntactical errors.
Syntax is a reference to the order of words.
Words in the wrong order sound incorrect, similar to how the Yoda character in Star Wars speaks (“Impossible to see the future is”).
Does the Helpful Content algorithm rely on grammar and syntax signals? If this is the algorithm, then those signals may play a role (but not the only role).
But I would like to think that the algorithm was improved with some of what’s in the quality raters guidelines between the publication of the research in 2021 and the rollout of the helpful content signal in 2022.
The Algorithm is “Powerful”
It’s a good practice to read what the conclusions are to get an idea if the algorithm is good enough to use in the search results.
Many research papers end by saying that more research has to be done or conclude that the improvements are marginal.
The most interesting papers are those that claim new state of the art results.
The researchers remark that this algorithm is powerful and outperforms the baselines.
They write this about the new algorithm:
“Machine authorship detection can thus be a powerful proxy for quality assessment.
It requires no labeled examples – only a corpus of text to train on in a self-discriminating fashion.
This is particularly valuable in applications where labeled data is scarce or where the distribution is too complex to sample well.
For example, it is challenging to curate a labeled dataset representative of all forms of low quality web content.”
And in the conclusion they reaffirm the positive results:
“This paper posits that detectors trained to discriminate human vs. machine-written text are effective predictors of webpages’ language quality, outperforming a baseline supervised spam classifier.”
The conclusion of the research paper was positive about the breakthrough and expressed hope that the research will be used by others.
There is no mention of further research being necessary.
This research paper describes a breakthrough in the detection of low quality webpages.
The conclusion indicates that, in my opinion, there is a likelihood that it could make it into Google’s algorithm.
Because it’s described as a “web-scale” algorithm that can be deployed in a “low-resource setting,” this is the kind of algorithm that could go live and run on a continual basis, just like the helpful content signal is said to do.
We don’t know if this is related to the helpful content update, but it’s certainly a breakthrough in the science of detecting low quality content.
Citations
Google Research Page:
Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study
Download the Google Research Paper
Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study (PDF)