How To Use Google Sheets For Web Scraping With AI via @sejournal, @andreaatzori

Scraping data from webpages is a relatively advanced task that, until recently, required a degree of technical skill. The idea of diving into code or scripts for data extraction seemed overwhelming for many, myself included.

Data scraping can power many SEO tasks, such as auditing, competitor analysis, and examining website and data structure.

Google sheets offers simple solutions to help.

One of those solutions is the IMPORTXML function that allows users to scrape webpage data using just a few parameters. It makes data extraction accessible to a wider audience, especially to those who were not well-versed in programming languages.

While this function is impressive, the real breakthrough came with the adoption and integration of generative AI into the mix.

In this guide, we’ll show you how to use Google Sheets and AI, particularly ChatGPT, for web scraping without needing advanced coding skills.

The Tools: AI And Chatbots

We are now all familiar with AI, ChatGPT, and similar chatbots.

In fact, many of us use solutions like ChatGPT to write our own code, scripts, and programs without or with very limited programming knowledge.

It is as simple as providing detailed instructions in the form of prompts and working with the chatbot to build tools that only until recently we believed were way above us.

But most importantly, these are tools that are deeply changing the way we approach our day-to-day work.

For example, if we ask ChatGPT the following question, “What is the IMPORTXML function and how can I use it in Google Sheets to scrape the title of an HTML webpage? Provide the necessary code to do that in Google Sheets,” the response is extremely accurate. In a matter of seconds, we have our formula ready to use in Google Sheets.

But to be honest, that was a very basic and simple task that we could have easily completed without ChatGPT.

The Task

So, how does this work if we want to extract data that is a bit less standard compared to a page title or description?

For example, how does this work if we want to extract the following data from the PPC front page of Search Engine Journal?

List all featured articles, their authors, the link URLs, and the article description for the columns listed on https://www.searchenginejournal.com/category/paid-media/pay-per-click/.

Can we do that directly with ChatGPT?

Executing With ChatGPT

When creating prompts, it took a few attempts to provide instructions that were detailed enough for the chatbot to fully understand the objective of the task and return good results.

In many cases, it felt like the AI was under pressure to return quick results despite their accuracy.

But let me explain.

The task was to analyze the page and list all featured articles, their authors, the link URLs, and the description for each of the 30 articles listed on the page. Then compile the data into a table and finally export it into a CSV file.

Simple right?

At first, ChatGPT returned just a sample of seven articles and only their titles and URLs; after a reworked prompt, it managed to list and export all 30 articles and their links.

Now, that was good. So, to complete the task, we just needed to add the authors and the article descriptions.

But here is where the bot stumbled and was not able to provide an accurate description of each article despite us providing examples of the page element it needed to find and copy.

ChatGPT kept ignoring the instructions and providing its own article descriptions time and time again.

ChatGPT even failed when we tried with a different approach and downloaded and uploaded a copy of the page HTML.

ChatGPT extractScreenshot from ChatGPT, February 2024ChatGPT extract

This time, it was able to provide accurate data for seven articles but couldn’t go past that. The issue reported:

“…the structure and content of the page present significant challenges for comprehensive data extraction in a single session.

The page is quite extensive and complex, and it’s not feasible to extract all 30 articles in the current format of interaction.”

ChatGPT extracting from 30 articlesScreenshot from ChatGPT, February 2024ChatGPT extracting from 30 articles

ChatGPT + Google Sheets

So, going back to IMPORTXML and Google Sheets.

This time, getting ChatGPT to provide the formulas for each field was like a breeze.

 ChatGPT extracting instructionsScreenshot from ChatGPT, February 2024 ChatGPT extracting instructions

Here are some of the formulas, as suggested by the chatbot, that you can easily try yourself in Google Sheets to extract:

Title

=IMPORTXML("https://www.searchenginejournal.com/category/paid-media/pay-per-click/", "//*[@id='archives-wrapper']/article/div/div[2]/h2/a")

Author Name

=IMPORTXML("https://www.searchenginejournal.com/category/paid-media/pay-per-click/", "//*[@id='archives-wrapper']/article/div/div[2]/p[1]/a")

URL Link

=IMPORTXML("https://www.searchenginejournal.com/category/paid-media/pay-per-click/", "//*[@id='archives-wrapper']/article/div/div[2]/h2/a/@href")

Description

=IMPORTXML("https://www.searchenginejournal.com/category/paid-media/pay-per-click/", "//*[@id='archives-wrapper']/article/div/div[2]/p[2]")

In no time, we were able to extract the data into the spreadsheet.

Google SheetsScreenshot from Google Sheets, February 2024Google Sheets

Additionally, by using simply built nested formulas, we can quickly pull the data from multiple pages at the same time.

In the example below, I was able to extract the same data related to each article (title, author, URL link, and description) for the first 10 pages of the PPC section.

The result is a total of 300 articles scraped in less than a minute!

Google Sheets extract resultsScreenshot from Google Sheets, February 2024Google Sheets extract results

Comparing The Two

So, how do ChatGPT vs. ChatGPT + Google Sheets IMPORTXML compare?

In my experience, I could not find an easy and quick way to use ChatGPT to scrape the data I was looking for – mind you, that doesn’t mean that this is not possible, and there might be several ways to do this, but I didn’t find any.

What worked for me was a combination of the different tools, and that served me really well for my intended purpose.

ChatGPT was extremely useful for writing the IMPORTXML formulas I needed to use in Google Sheets, and those formulas did the rest.

An additional bonus of the ChatGPT + Google Sheets option is that you can just use the free 3.5 version of ChatGPT and get the tool to build your IMPORTXML formulas, instead of having version 4 to scan the page and extract the data.

Key Takeaway

This highlights a critical aspect of how AI has transformed how we think and work.

The best tool for the job isn’t merely using AI, Google Sheets, or any specific software alone but rather a combination of tools and skills.

It’s in this integrated approach that we develop workflows that are efficient and effective, thus improving our overall productivity.

More resources: 


Featured Image: Visual Generation/Shutterstock

Leave a Reply

Your email address will not be published. Required fields are marked *