How to build a Web Scraper Bot in Python using Selenium

Build a web scraper bot to extract data from a website, or a bot that performs tasks on a website, using Selenium. In this article, we will scrape Google’s search results and store each result’s header, link, and text in a CSV file.

In this tutorial, we will build a web scraper bot that can search Google for a term and store the results in a CSV file.

Glossary

  • Web Scraping: Extracting relevant data from a website and storing it in a structured format such as CSV or JSON. An example is extracting the name, brand, and price of products from Amazon and storing them in an Excel file.
  • Bot: An application that can run automated tasks, such as clicking, searching, and scrolling, on websites. It can essentially mimic a human’s interaction with a website.
  • Selenium: An open-source library that helps with web scraping and developing bots. It is also used to write scripts that automate common tasks on a web app in order to test it.

Setup

  • Although not necessary, I would recommend setting up a virtual environment. Type the following commands in your terminal.
pip install virtualenv  # Install virtualenv
virtualenv venv         # Create a virtual environment
venv\Scripts\activate   # Activate it (Windows; on macOS/Linux: source venv/bin/activate)
  • Install Selenium and Pandas. We will use Pandas to create a CSV file from the extracted data.
pip install selenium pandas
  • Download ChromeDriver. We will use ChromeDriver to drive the browser and automate the tasks; make sure its version matches your installed Chrome version. After the download is complete, extract chromedriver.exe and copy its absolute path. I personally prefer keeping the file in Program Files, but you can place it anywhere you wish as long as you know the absolute path to it. In my case, it is ‘C:\Program Files (x86)’.
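Before moving on, it can be worth checking that both packages import cleanly. A minimal sanity check, run inside the activated environment:

import selenium
import pandas

print(selenium.__version__)  # this tutorial uses the Selenium 3 find_element_by_* API
print(pandas.__version__)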

Open Google using Selenium

First, we import webdriver from the selenium library. Then we initialize all the necessary variables, i.e.:

  • path (the absolute path to chromedriver.exe)
  • driver (an instance of the Chrome webdriver)
  • url (the URL of the website, in our case ‘https://www.google.com’)
from selenium import webdriver

# absolute path to chromedriver.exe (a raw string, so the backslashes are not treated as escapes)
path = r'C:\Program Files (x86)\chromedriver.exe'

driver = webdriver.Chrome(path)
url = 'https://www.google.com'
driver.get(url)

When you run the Python script above, a ChromeDriver-controlled browser window showing the Google home page should pop up.

As you can see, Chrome displays a message stating that it is being controlled by automated test software.

Accessing Elements

Selenium gives us the option to find an element by its class name, CSS selector, name, tags, link text, etc. We want to be able to access the search bar to provide our search keyword.
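For reference, a few of those locator methods in the Selenium 3 API look like this (the locator values below are hypothetical, purely for illustration):

driver.find_element_by_id('some-id')               # match on the id attribute
driver.find_element_by_class_name('some-class')    # match on a single class name
driver.find_element_by_css_selector('div.card a')  # match with a CSS selector
driver.find_element_by_link_text('Next')           # match a link by its visible text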

Right-click on the search bar and select Inspect to see the HTML used to create the search bar.

The tag representing the search bar has a name attribute with the value ‘q’

The tag associated with the search bar has a class as well as a name. We will use the find_element_by_name() function, since the name is a single letter. However, if you wish, you could try the find_element_by_class_name() function too.

# set the keyword you want to search for
keyword = 'stocks' 

# we find the search bar using its name attribute value
searchBar = driver.find_element_by_name('q')

# send our keyword to the search bar, followed by the Enter key
searchBar.send_keys(keyword)
searchBar.send_keys('\n')

Running the above script should display the search results for the keyword in the ChromeDriver-controlled browser.
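As an aside, Selenium also provides a Keys helper that expresses the Enter key more explicitly. An equivalent alternative to sending ‘\n’:

from selenium.webdriver.common.keys import Keys

# equivalent to send_keys('\n'): submit the query with the Return key
searchBar.send_keys(Keys.RETURN)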

Accessing the search results

Each search result has a class name with value = ‘g’

Inspecting a search result element, we find that each search result has its class attribute set to ‘g’. Since we need multiple results, we will use the find_elements_by_class_name() function.

However, this time we need to do something different. The search results take a second or two to appear, so we do not want find_elements_by_class_name() to execute immediately after we send the Enter (‘\n’) key. If it runs right away, there is a good chance the results have not been rendered yet; no tag with class = ‘g’ would be present, and the lookup would fail. There are a couple of ways to deal with this:

  1. Let the program sleep for a few seconds:
import time
time.sleep(10)  # wait 10 seconds for the results to load

2. Use Selenium’s explicit waits

This is the sample code provided in the Selenium docs:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

We can copy the necessary imports and the try clause. Since an instance of webdriver has already been created, we do not need to create a new one. We also do not need the ‘finally’ clause, since we still need the driver to gather results from other pages.

Since we are searching by class name, we replace By.ID with By.CLASS_NAME and set the value to ‘g’.

def scrape():
    pageInfo = []
    try:
        # wait up to 10 seconds for the search results to be present
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "g"))
        )
    except Exception as e:
        print(e)
        driver.quit()
    # contains the search results
    searchResults = driver.find_elements_by_class_name('g')

Once the try clause succeeds, we can get all the search results. But the function is still not complete; we also need to extract the header, link, and text from each result.

The link is in the href attribute of the ‘a’ tag, the header is the text content of the h3 tag, and the text is inside a span tag with class = ‘st’.

    for result in searchResults:
        # the anchor tag holds the link, h3 the header, and span.st the snippet text
        element = result.find_element_by_css_selector('a')
        link = element.get_attribute('href')
        header = result.find_element_by_css_selector('h3').text
        text = result.find_element_by_class_name('st').text
        pageInfo.append({
            'header': header, 'link': link, 'text': text
        })
    return pageInfo

We iterate over the search results and use the appropriate find_element_by functions to get the data.

  1. First, we use find_element_by_css_selector() to get the ‘a’ tag, and then use the .get_attribute() function to get the value of its href attribute.
  2. We use find_element_by_css_selector() to get the ‘h3’ tag and use .text to get the text inside it, i.e. the header.
  3. To get the text, we use find_element_by_class_name() and .text.

Then a dictionary is created with the appropriate key-value pairs and added to a list.

Now, if you run the scrape() function, you should get the data from the first page of search results as a list of dictionaries.
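A quick way to verify is to print the first entry. The values below only illustrate the shape of the data, not real output:

results = scrape()
print(results[0])
# {'header': '...', 'link': 'https://...', 'text': '...'}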

Result from Next Page

Using the find_element_by_link_text() function with the parameter set to ‘Next’ gives us access to the element that links to the next page. We can use the .click() method to go to the next page and then call the scrape() function again to get more data.

# Number of pages to scrape
numPages = 5
# All the scraped data
infoAll = []
# Scraped data from page 1
infoAll.extend(scrape())
# scrape pages 2 through numPages
for i in range(numPages - 1):
    nextButton = driver.find_element_by_link_text('Next')
    nextButton.click()
    infoAll.extend(scrape())

After the script with the new addition is executed, infoAll will be a list of dictionaries containing the scraped data from the specified number of pages.
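One caveat: if Google runs out of result pages before numPages is reached, find_element_by_link_text('Next') will raise a NoSuchElementException. A defensive variant of the loop (a sketch, optional for the happy path):

from selenium.common.exceptions import NoSuchElementException

for i in range(numPages - 1):
    try:
        driver.find_element_by_link_text('Next').click()
    except NoSuchElementException:
        break  # no more result pages
    infoAll.extend(scrape())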

Convert it to a CSV file

The pandas library can be used to convert the list of dictionaries to a CSV file.

import pandas as pd

# build a DataFrame from the list of dictionaries and write it to a CSV file
df = pd.DataFrame(infoAll)
fileName = keyword + '_' + str(numPages) + '.csv'
df.to_csv(fileName)
Example of the generated CSV file
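To sanity-check the output, you can read the file back with pandas (continuing from the snippet above). If you do not want pandas’ numeric row index written as an extra column, pass index=False to to_csv(). Once you are done, driver.quit() closes the browser.

# read the generated file back and peek at the first few rows
check = pd.read_csv(fileName)
print(check.head())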

Conclusion

Now you should have enough information to get started with web scraping and building bots using Python. If you would like to gain more experience with web scraping and building bots, you can try to build a bot to scrape GitHub or other websites such as MuscleWiki.