Scraping 'People also asked' on Google Search

💡
Too hard to spin up the whole thing?
I've created a free tool to get People Also Ask questions. Check it out!

Lately, I binge-watched an amazing series on Netflix, “Mad Men”.

From the first episode, I noticed wild cigarette and alcohol consumption.

I asked Google for the total number of cigarettes smoked and, besides the actual answer (974 cigs in 94 episodes over 7 seasons), I had an idea.

“People also asked” (PAA) is a widget in the SERP introduced in 2015: with it, the search engine tries to help users refine their search.

Source: MozCast

Long story short: for each click on a question, Google shows us the page that is most relevant to it. The more questions we click, the more questions are loaded at the bottom of the PAA box, depending on which question we clicked.

Then I asked myself: could I gather exciting insights from how Google shows me questions? This is why I’ve written this script in Python.

You can download it from GitHub.

⚠ DISCLAIMER: This software is not authorized by Google and doesn’t follow Google’s robots.txt. Scraping without Google’s explicit written permission is a violation of their terms and conditions on scraping and can potentially lead to a lawsuit. This software is provided as is, for educational purposes, to show how a crawler can be made to recursively parse Google’s “People also asked”. Use at your own risk.

Gquestions.py

How it is made

Python


Python is a versatile programming language used in many fields (from data analysis to machine learning and web development). Thanks to its gentle learning curve, it is strongly recommended for beginners and non-technical people.

To make this script work we need Python >= 3.7. This is because, from this version onward, dictionaries are guaranteed to “remember” the order in which entries were inserted.
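To see why the version matters, here is a minimal sketch (not taken from gquestions.py) that checks the interpreter version and shows a dictionary keeping its insertion order. The question strings are placeholders.

```python
import sys

# Stop early on interpreters older than 3.7, where dict ordering
# is not guaranteed by the language.
if sys.version_info < (3, 7):
    raise SystemExit("Python 3.7 or newer is required")

# Placeholder questions, inserted in the order they were "discovered".
questions = {}
for text in ["first question", "second question", "third question"]:
    questions[text] = []  # each question will later hold its related questions

# Iterating over the dict yields keys in insertion order.
order = list(questions)
print(order)  # → ['first question', 'second question', 'third question']
```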

You can download the latest version here.

Once it is downloaded, you’ll need to verify that the package manager pip is installed (it should be in the latest versions).

Open a terminal and type:

pip --version

If no errors appear, it means we’re almost ready!

To make the script work correctly, move into the script folder and download all dependencies:

cd <PATH_REPO_FOLDER>
pip install -r requirements.txt

This is the folder structure:

.
├── csv
│   ├── <keyword1>_<date>.csv
│   └── <keyword2>_<date>.csv
├── html
│   ├── <keyword1>_<date>.html
│   └── <keyword2>_<date>.html
├── driver
│   └── yourchromedriver
├── templates
│   └── index.html
├── gquestions.py
└── requirements.txt

Selenium


Selenium is a browser automation library. It controls a browser through a driver: for this script we’ll be using chromedriver. With Selenium we’re also able to interact with pages.

Here you can find the right driver version based on your Chrome version (visit chrome://settings/help to check your version).

http://chromedriver.chromium.org/downloads

Once you have downloaded the zip archive, unzip it into the driver folder.
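If you prefer to script this step, the sketch below unpacks the archive into the driver folder and then starts Chrome through it. It is an illustration, not the code used by gquestions.py: the helper names `unpack_driver` and `start_chrome` are made up, and the Selenium calls assume Selenium 4 (older versions pass the driver path differently).

```python
import stat
import zipfile
from pathlib import Path


def unpack_driver(archive: Path, driver_dir: Path) -> Path:
    """Extract a chromedriver zip into driver_dir and return the binary's path."""
    driver_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(driver_dir)
    # Some archives nest the binary in a subfolder, so search recursively
    # and keep only files named chromedriver or chromedriver.exe.
    candidates = sorted(
        p for p in driver_dir.rglob("chromedriver*")
        if p.is_file() and p.suffix in ("", ".exe")
    )
    if not candidates:
        raise FileNotFoundError(f"no chromedriver binary found in {driver_dir}")
    binary = candidates[0]
    # zipfile does not restore Unix permission bits, so mark it executable.
    binary.chmod(binary.stat().st_mode | stat.S_IXUSR)
    return binary


def start_chrome(binary: Path, headless: bool = False):
    """Start Chrome via Selenium 4 (assumption); needs `pip install selenium`."""
    # Imported lazily so that unpack_driver works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    if headless:
        options.add_argument("--headless")
    service = Service(executable_path=str(binary))
    return webdriver.Chrome(service=service, options=options)
```

A typical call would be `start_chrome(unpack_driver(Path("chromedriver.zip"), Path("driver")))`, where the archive name is whatever file you downloaded.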

d3

d3.js is a powerful JS library used for data visualization. For this dataviz I’ve used a tree graph, or dendrogram.

The base version can be found at this URL.

http://bl.ocks.org/d3noob/8375092

How to use it

The script gquestions.py opens a new browser instance and searches for a query, clicking on each question and generating a tree graph with d3.

Results are stored in the folder html/.
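The recursive idea behind the tree can be sketched without a browser. In the example below, `fetch_related` is a hypothetical stand-in for the Selenium step (click a question, read the new ones), `SAMPLE_PAA` holds placeholder data, and the `name`/`children` nesting is the shape commonly fed to d3 tree layouts. The real script may organise this differently.

```python
import json
from typing import Callable, Dict, List

# Placeholder data standing in for what the browser would return.
SAMPLE_PAA: Dict[str, List[str]] = {
    "keyword": ["question A", "question B"],
    "question A": ["question A1", "question A2"],
    "question B": ["question B1"],
    "question A1": ["question A1a"],
}


def fetch_related(question: str) -> List[str]:
    """Hypothetical replacement for clicking a question in the PAA box."""
    return SAMPLE_PAA.get(question, [])


def build_tree(
    question: str,
    fetch: Callable[[str], List[str]],
    max_depth: int,
    depth: int = 0,
) -> dict:
    """Build a nested name/children dict, expanding nodes up to max_depth.

    Assumed convention: max_depth=0 expands only the root keyword,
    max_depth=1 also expands each question found in the first box.
    """
    node = {"name": question, "children": []}
    if depth > max_depth:
        return node  # too deep: keep the question but do not click it
    for related in fetch(question):
        node["children"].append(build_tree(related, fetch, max_depth, depth + 1))
    return node


tree = build_tree("keyword", fetch_related, max_depth=1)
print(json.dumps(tree, indent=2))
```

With `max_depth=1`, "question A1" appears in the tree but is never expanded, so "question A1a" is not loaded.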

Move into the folder containing gquestions.py and type on terminal:

python gquestions.py query <keyword> (en|es) [depth <depth>] [--csv] [--headless]
  1. query: the keyword to search for. If no PAA box is found, the script exits automatically. This argument is mandatory.
  2. en or es: search in English or Spanish. This argument is mandatory.
  3. depth: Set the click depth. Default 0, max 1. Optional.
  4. --csv: Export all questions to a .csv file. Optional.
  5. --headless: Start Chrome headlessly. Optional.
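As an illustration of what a CSV export could look like, the sketch below flattens a nested question tree into parent/question rows and writes them to a `<keyword>_<date>.csv` file, following the folder structure shown earlier. The column names, the ISO date format and the helper names are assumptions, not the script's actual output format.

```python
import csv
from datetime import date
from pathlib import Path
from typing import List, Tuple

# Placeholder tree in a nested name/children shape.
sample_tree = {
    "name": "keyword",
    "children": [
        {"name": "question A", "children": [{"name": "question A1", "children": []}]},
        {"name": "question B", "children": []},
    ],
}


def flatten(node: dict) -> List[Tuple[str, str]]:
    """Return (parent, question) pairs in depth-first order."""
    rows = []
    for child in node.get("children", []):
        rows.append((node["name"], child["name"]))
        rows.extend(flatten(child))
    return rows


def export_csv(tree: dict, keyword: str, out_dir: Path = Path("csv")) -> Path:
    """Write the flattened tree to out_dir/<keyword>_<date>.csv and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{keyword}_{date.today().isoformat()}.csv"
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["parent", "question"])  # assumed header
        writer.writerows(flatten(tree))
    return path


print(flatten(sample_tree))
```

Calling `export_csv(sample_tree, "keyword")` would then create the file inside the csv folder.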

I decided to limit the depth to only 2 levels (depth 0 and depth 1) because from the third level onward the graph becomes very messy.

Here you can see an example of a graph 2 levels deep:

How to use data

These related questions give us a deeper insight than traditional keywords. Below is a non-exhaustive list of what you can use this data for.

Ideas for new content

Knowing the most common questions users ask when they search for something is a great help if you are creating new content or even a whole editorial plan.

We can also use the data to create FAQ pages.

Improve existing content

We can use these questions to improve existing articles and make them more pertinent to the context.

It could also be a good idea to use them as paragraph headings or to improve title/meta description.

Question answering

Natural Language Processing (NLP) is a subfield of Machine Learning concerned with “understanding” texts and the relations between the topics they express.

Thanks to Deep Learning, it is also possible to create models that can answer questions about a piece of content. All answers are extracted from the text.
