
To collect data on the Internet, you can create a Web crawler or Web scraping program with Python. A web crawler is a tool that automatically browses web pages and extracts data from one or more of them.

Configuring the Python environment

We assume that Python3 and pip are installed on your machine. You can also use a virtual environment to keep your project clean and to control the versions of the libraries used by your Python web crawler.
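
For example, to create and activate a virtual environment (the name venv below is arbitrary):

python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate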

First of all, we’re going to install the requests library, which enables HTTP requests to be made to the server to retrieve data.

python -m pip install requests

To parse and navigate Web data, we use the Beautiful Soup library, which lets us work with tag-based documents such as HTML or XML.

python -m pip install beautifulsoup4

Finally, we install the Selenium library, which automates web browser tasks. It can display dynamic web pages and perform actions on the interface. This library can be used on its own for web scraping, as it handles dynamic websites that run JavaScript.

python -m pip install selenium

To run Selenium with Mozilla Firefox, you will need to download Geckodriver.
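
If geckodriver is not on your system PATH, you can point Selenium to the executable explicitly (a minimal sketch for Selenium 4; the path below is a placeholder):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# path to the downloaded geckodriver executable (placeholder)
service = Service(executable_path="/path/to/geckodriver")
browser = webdriver.Firefox(service=service)
browser.quit()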

Retrieving a Web page with requests

If we want to retrieve technical data from an Arduino board, we can load the desired page with requests and bs4:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://docs.arduino.cc/hardware/uno-rev3/")
content = BeautifulSoup(page.text, 'html.parser')

By observing the page structure, you can locate the tags, classes, identifiers or texts that interest you. In this example, we retrieve:

  • the board name
  • the board description

N.B.: You can view the structure of the web page in your browser by right-clicking on the page and selecting “Inspect”.

import requests
from bs4 import BeautifulSoup

print("Starting Web Crawling ...")

#website to crawl
website="https://docs.arduino.cc/hardware/uno-rev3/"

#google search
#keywords = ["arduino","datasheet"]
#googlesearch = "https://www.google.com/search?client=firefox-b-d&q="
#search = "+".join(keywords)
#website = googlesearch+search

# get page
page = requests.get(website)

#extract html data
content = BeautifulSoup(page.text, 'html.parser')

# extract tags
h1_elms = content.find_all('h1')
print("Board : ",h1_elms)

#get element by class
description = content.find(class_="product-features__description").text
print("Description : ",description)

Output:

Starting Web Crawling ...
Board :  [<h1>UNO R3</h1>]
Description :  Arduino UNO is a microcontroller board based on the ATmega328P. It has 14 digital input/output pins (of which 6 can be used as PWM outputs), 6 analog inputs, a 16 MHz ceramic resonator, a USB connection, a power jack, an ICSP header and a reset button. It contains everything needed to support the microcontroller; simply connect it to a computer with a USB cable or power it with a AC-to-DC adapter or battery to get started. You can tinker with your UNO without worrying too much about doing something wrong, worst case scenario you can replace the chip for a few dollars and start over again.

We could loop this operation over a list of URLs to cover several boards, as sketched below.

websites = [
    "https://docs.arduino.cc/hardware/uno-rev3/",
    "https://docs.arduino.cc/hardware/nano/",
    "https://docs.arduino.cc/hardware/mega-2560/",
    "https://docs.arduino.cc/hardware/leonardo/",
]
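
A minimal sketch of such a loop, assuming the other board pages use the same structure as the UNO page:

import requests
from bs4 import BeautifulSoup

websites = [
    "https://docs.arduino.cc/hardware/uno-rev3/",
    "https://docs.arduino.cc/hardware/nano/",
]

for website in websites:
    page = requests.get(website)
    content = BeautifulSoup(page.text, 'html.parser')
    #board name is in the first <h1> tag
    name = content.find('h1').text
    #description uses the class observed on the UNO page
    description = content.find(class_="product-features__description").text
    print(name, " : ", description)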

Unfortunately, this method can’t load the detailed “Tech Specs” list, which is rendered dynamically, so we have to use the browser.

Setting up a Web Crawler with Selenium

Loading a page is easy:

import os
from selenium import webdriver

GECKOPATH = "PATH_TO_GECKO"
#add the geckodriver folder to the system PATH so Selenium can find the executable
os.environ["PATH"] += os.pathsep + GECKOPATH

print("Starting Web Crawling ...")

#website to crawl
website="https://docs.arduino.cc/hardware/uno-rev3/"

#create browser handler
browser = webdriver.Firefox()
browser.get(website)

#browser.quit()

Cookie validation

When the page is displayed, you’re likely to come across the cookie banner, which you’ll need to accept or reject in order to continue browsing. To do this, find and click on the “accept” button.

from selenium.webdriver.common.by import By

def acceptcookies():
    """Click the accept button (class "iubenda-cs-accept-btn iubenda-cs-btn-primary")"""
    browser.find_elements(By.CLASS_NAME, "iubenda-cs-accept-btn")[0].click()

acceptcookies()

Waiting to load

As the page is displayed in the browser, it takes some time for the data to load and for all the tags to be rendered. To wait for loading, you can set an implicit wait, which makes element lookups retry for up to the given number of seconds

browser.implicitly_wait(10)

Or wait until a particular tag is present, such as the cookie acceptance button:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

def waitForElement(locator, timeout):
    elm = WebDriverWait(browser, timeout).until(expected_conditions.presence_of_element_located(locator))
    return elm

myElem = waitForElement((By.CLASS_NAME, 'iubenda-cs-accept-btn'), 30)

N.B.: If the script runs into problems (unknown element, non-clickable button, etc.) even though the web page itself behaves correctly, don’t hesitate to slow the script down with the time.sleep() function.
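
For example (the two-second delay is arbitrary; adjust it to your page):

import time

time.sleep(2)  #give the page time to finish rendering before interacting
acceptcookies()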

Find and click a DOM element

To display the technical specifications, the script must click on the ‘Tech Specs’ tab. This means finding the element from its text. There are two ways to do this: test the text of each element, or use XPath:

#get element by text 
btn_text = 'Tech Specs'
btn_elms = browser.find_elements(By.CLASS_NAME,'tabs')[0].find_elements(By.TAG_NAME,'button')
for btn in btn_elms:
     if btn.text == btn_text:
          btn.click()

#get element by XPath on its text
spec_btn = browser.find_element(By.XPATH, "//*[contains(text(),'Tech Specs')]")
spec_btn.click()

Retrieve the desired data

Once the desired page has been loaded, you can retrieve the data.

Either all the data displayed in table form:

#get all rows and children
print("Tech specs")
print("-------------------------------------")

tr_elms = browser.find_elements(By.TAG_NAME,'tr')

for tr in tr_elms:
     th_elms = tr.find_elements(By.XPATH, '*')
     if len(th_elms)>1:
        print(th_elms[0].text, " : ", th_elms[1].text)

Or specific data only:

#get parent and siblings
print("Specific data")
print("-------------------------------------")
data_row = browser.find_element(By.XPATH, "//*[contains(text(),'Main Processor')]")
data = data_row.find_element(By.XPATH, "following-sibling::*[1]").text
print(data_row.text, " : ", data)

Result of specification crawling

Starting Web Crawling ...
Page is ready!
Tech specs
-------------------------------------
Name  :  Arduino UNO R3
SKU  :  A000066
Built-in LED Pin  :  13
Digital I/O Pins  :  14
Analog input pins  :  6
PWM pins  :  6
UART  :  Yes
I2C  :  Yes
SPI  :  Yes
I/O Voltage  :  5V
Input voltage (nominal)  :  7-12V
DC Current per I/O Pin  :  20 mA
Power Supply Connector  :  Barrel Plug
Main Processor  :  ATmega328P 16 MHz
USB-Serial Processor  :  ATmega16U2 16 MHz
ATmega328P  :  2KB SRAM, 32KB FLASH, 1KB EEPROM
Weight  :  25 g
Width  :  53.4 mm
Length  :  68.6 mm
Specific data
-------------------------------------
Main Processor  :  ATmega328P 16 MHz

Retrieving data from different pages

Once you’ve mastered these tools and have a good idea of the data to be retrieved and the structure of the web pages, you can scrape data from several pages. In this last example, we retrieve technical data from various Arduino boards. To do this, we create a loop that executes the preceding code on a list of URLs:

import os
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

GECKOPATH = "D:\\AranaCorp\\Marketing\\Prospects"
#add the geckodriver folder to the system PATH so Selenium can find the executable
os.environ["PATH"] += os.pathsep + GECKOPATH

print("Starting Web Crawling ...")

websites = [
    "https://docs.arduino.cc/hardware/uno-rev3/",
    "https://docs.arduino.cc/hardware/nano/",
    "https://docs.arduino.cc/hardware/mega-2560/",
    "https://docs.arduino.cc/hardware/leonardo/",
]

#create browser handler
browser = webdriver.Firefox()

def acceptcookies():
    #accept button class: "iubenda-cs-accept-btn iubenda-cs-btn-primary"
    browser.find_elements(By.CLASS_NAME, "iubenda-cs-accept-btn")[0].click()
     
def waitForElement(locator, timeout ):
    elm = WebDriverWait(browser, timeout).until(expected_conditions.presence_of_element_located(locator))
    return elm

cookie_accepted=False
for website in websites:
    browser.get(website)
    time.sleep(2)

    if not cookie_accepted: #accept the cookie banner only once
        myElem = waitForElement((By.CLASS_NAME, 'iubenda-cs-accept-btn'), 30)
        print("Page is ready!")
        acceptcookies()
        cookie_accepted = True
    else:
        myElem = waitForElement((By.CLASS_NAME, 'tabs__item'), 30)
         
    #get board name
    name = browser.find_element(By.TAG_NAME,'h1').text

    #get tab Tech Specs
    btn_text = 'Tech Specs'
    spec_btn = WebDriverWait(browser, 20).until(expected_conditions.element_to_be_clickable((By.XPATH, "//*[contains(text(),'{}')]".format(btn_text))))
    spec_btn.click()
    #browser.execute_script("arguments[0].click();", spec_btn) #use script to click

    #get all rows and children
    print(name+" "+btn_text)
    print("-------------------------------------")

    tr_elms = browser.find_elements(By.TAG_NAME,'tr')

    for tr in tr_elms:
        th_elms = tr.find_elements(By.XPATH, '*')
        if len(th_elms)>1:
            print(th_elms[0].text, " : ", th_elms[1].text)


    #get parent and siblings
    print("Specific data")
    print("-------------------------------------")
    try:
        data_row = browser.find_element(By.XPATH, "//*[contains(text(),'Main Processor')]")
    except NoSuchElementException:
        #some boards label the row 'Processor' instead of 'Main Processor'
        data_row = browser.find_element(By.XPATH, "//*[contains(text(),'Processor')]")
    data = data_row.find_element(By.XPATH, "following-sibling::*[1]").text
    print(data_row.text, " : ", data)

browser.quit()

Output:

Starting Web Crawling ...
Page is ready!
UNO R3 Tech Specs
-------------------------------------
Main Processor  :  ATmega328P 16 MHz
Nano Tech Specs
-------------------------------------
Processor  :  ATmega328 16 MHz
Mega 2560 Rev3 Tech Specs
-------------------------------------
Main Processor  :  ATmega2560 16 MHz
Leonardo Tech Specs
-------------------------------------
Processor  :  ATmega32U4 16 MHz

Combining Selenium and BeautifulSoup

The two libraries can be combined to give you the best of both: Selenium renders the dynamic page in the browser, and BeautifulSoup parses the resulting HTML:

from bs4 import BeautifulSoup
from selenium import webdriver

website = "https://docs.arduino.cc/hardware/uno-rev3/"

browser = webdriver.Firefox()
browser.get(website)

#retrieve the fully rendered HTML and hand it to BeautifulSoup
html = browser.page_source
content = BeautifulSoup(html, 'lxml')  #the lxml parser requires: pip install lxml

browser.quit()
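
For instance (a sketch assuming the spec table is present in the rendered HTML, e.g. after clicking the “Tech Specs” tab as before), the table rows can then be parsed with BeautifulSoup instead of Selenium lookups:

for tr in content.find_all('tr'):
    cells = tr.find_all(['th', 'td'])
    if len(cells) > 1:
        print(cells[0].text, " : ", cells[1].text)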

