To collect data on the Internet, you can write a web crawler or web scraping program in Python. A web crawler is a tool that extracts data from one or more web pages.
Configuring the Python environment
We assume that Python 3 and pip are installed on your machine. You can also use a virtual environment to keep your project self-contained and to control the library versions used by your Python web crawler.
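For example, a project-local environment can be created and activated with the venv module (the folder name venv below is just a convention):

python -m venv venv
venv\Scripts\activate      # Windows
source venv/bin/activate   # Linux / macOS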
First of all, we’re going to install the requests library, which enables HTTP requests to be made to the server to retrieve data.
python -m pip install requests
To parse and navigate the retrieved Web data, we use the Beautiful Soup library, which lets us work with tag-based markup languages such as HTML or XML.
python -m pip install beautifulsoup4
Finally, we install the Selenium library, which automates web browser tasks. It can render dynamic web pages and perform actions on the interface. This library alone can be enough for web scraping, since it can handle dynamic websites that rely on JavaScript.
python -m pip install selenium
To run Selenium with Mozilla Firefox, you will need to download Geckodriver.
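With recent versions of Selenium (4.x), one way to tell Selenium where the driver lives is to pass a Service object; the path below is a placeholder to adapt to your installation (alternatively, you can simply place geckodriver in a folder listed in your system PATH):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

#placeholder path: point it to your own geckodriver executable
service = Service(executable_path="/path/to/geckodriver")
browser = webdriver.Firefox(service=service)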
Retrieving a Web page with requests
If we want to retrieve technical data about an Arduino board, we can load the desired page with requests and BeautifulSoup (bs4):
page = requests.get("https://docs.arduino.cc/hardware/uno-rev3/")
content = BeautifulSoup(page.text, 'html.parser')
By inspecting the page structure, you can locate the tags, classes, identifiers or text that interest you. In this example, we retrieve:
- the board name
- the board description
N.B.: You can view the structure of the web page in your browser by right-clicking on the page and selecting “Inspect”.
import requests
from bs4 import BeautifulSoup

print("Starting Web Crawling ...")

#website to crawl
website = "https://docs.arduino.cc/hardware/uno-rev3/"

#google search
#keywords = ["arduino","datasheet"]
#googlesearch = "https://www.google.com/search?client=firefox-b-d&q="
#search = "+".join(keywords)
#website = googlesearch+search

#get page
page = requests.get(website)

#extract html data
content = BeautifulSoup(page.text, 'html.parser')

#extract tags
h1_elms = content.find_all('h1')
print("Board : ", h1_elms)

#get element by class
description = content.find(class_="product-features__description").text
print("Description : ", description)
Starting Web Crawling ...
Board :  [<h1>UNO R3</h1>]
Description :  Arduino UNO is a microcontroller board based on the ATmega328P. It has 14 digital input/output pins (of which 6 can be used as PWM outputs), 6 analog inputs, a 16 MHz ceramic resonator, a USB connection, a power jack, an ICSP header and a reset button. It contains everything needed to support the microcontroller; simply connect it to a computer with a USB cable or power it with a AC-to-DC adapter or battery to get started. You can tinker with your UNO without worrying too much about doing something wrong, worst case scenario you can replace the chip for a few dollars and start over again.
We could imagine looping this operation over a list of URLs for several boards.
websites = [
    "https://docs.arduino.cc/hardware/uno-rev3/",
    "https://docs.arduino.cc/hardware/nano/",
    "https://docs.arduino.cc/hardware/mega-2560/",
    "https://docs.arduino.cc/hardware/leonardo/",
]
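As a minimal sketch, such a loop could reuse the same requests and BeautifulSoup calls as above for each URL (assuming every board page exposes an h1 title and the product-features__description class):

import requests
from bs4 import BeautifulSoup

for website in websites:
    page = requests.get(website)
    content = BeautifulSoup(page.text, 'html.parser')
    #same extraction as for the UNO page
    name = content.find('h1').text
    description = content.find(class_="product-features__description").text
    print(name, " : ", description)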
With this method, we unfortunately can’t load the detailed list of “Tech Specs”, so we have to use the browser.
Setting up a Web Crawler with Selenium
Loading a page is easy:
import sys
from selenium import webdriver

GECKOPATH = "PATH_TO_GECKO"
sys.path.append(GECKOPATH)

print("Starting Web Crawling ...")

#website to crawl
website = "https://docs.arduino.cc/hardware/uno-rev3/"

#create browser handler
browser = webdriver.Firefox()
browser.get(website)

#browser.quit()
Cookie validation
When the page is displayed, you’re likely to come across the cookie banner, which you’ll need to accept or reject in order to continue browsing. To do this, find and click on the “accept” button.
from selenium.webdriver.common.by import By

def acceptcookies():
    """Click the cookie banner button (class="iubenda-cs-accept-btn iubenda-cs-btn-primary")"""
    browser.find_elements(By.CLASS_NAME, "iubenda-cs-accept-btn")[0].click()

acceptcookies()
Waiting for the page to load
Since the page is rendered in the browser, it takes some time for the data to load and for all the tags to appear. To wait for loading, you can set an implicit wait, which tells Selenium to wait up to the given number of seconds when looking up an element:
browser.implicitly_wait(10)
Or wait explicitly until a particular element is present, such as the cookie acceptance button:
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

def waitForElement(locator, timeout):
    elm = WebDriverWait(browser, timeout).until(expected_conditions.presence_of_element_located(locator))
    return elm

myElem = waitForElement((By.CLASS_NAME, 'iubenda-cs-accept-btn'), 30)
N.B.: If you run into other problems in the script (element not found, button not clickable, etc.) even though the web page itself behaves correctly, don't hesitate to add a pause with the time.sleep() function.
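For example, a short pause before interacting with the page:

import time

time.sleep(2)  #wait 2 seconds for the page to finish rendering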
Find and click a DOM element
To display the technical specifications, the script must click on the 'Tech Specs' tab. This means finding the element from its text. There are two ways to do this: test each element's text, or use an XPath expression.
#get element by text
btn_text = 'Tech Specs'
btn_elms = browser.find_elements(By.CLASS_NAME, 'tabs')[0].find_elements(By.TAG_NAME, 'button')
for btn in btn_elms:
    if btn.text == btn_text:
        btn.click()

#get element by XPath
spec_btn = browser.find_element(By.XPATH, "//*[contains(text(),'Tech Specs')]")
spec_btn.click()
Retrieve the desired data
Once the desired page has been loaded, you can retrieve the data.
Either all the data displayed in table form
#get all rows and children
print("Tech specs")
print("-------------------------------------")
tr_elms = browser.find_elements(By.TAG_NAME, 'tr')
for tr in tr_elms:
    th_elms = tr.find_elements(By.XPATH, '*')
    if len(th_elms) > 1:
        print(th_elms[0].text, " : ", th_elms[1].text)
Or specific data
#get parent and siblings
print("Specific data")
print("-------------------------------------")
data_row = browser.find_element(By.XPATH, "//*[contains(text(),'Main Processor')]")
data = data_row.find_element(By.XPATH, "following-sibling::*[1]").text
print(data_row.text, " : ", data)
Result of specification crawling
Starting Web Crawling ...
Page is ready!
Tech specs
-------------------------------------
Name : Arduino UNO R3
SKU : A000066
Built-in LED Pin : 13
Digital I/O Pins : 14
Analog input pins : 6
PWM pins : 6
UART : Yes
I2C : Yes
SPI : Yes
I/O Voltage : 5V
Input voltage (nominal) : 7-12V
DC Current per I/O Pin : 20 mA
Power Supply Connector : Barrel Plug
Main Processor : ATmega328P 16 MHz
USB-Serial Processor : ATmega16U2 16 MHz
ATmega328P : 2KB SRAM, 32KB FLASH, 1KB EEPROM
Weight : 25 g
Width : 53.4 mm
Length : 68.6 mm
Specific data
-------------------------------------
Main Processor : ATmega328P 16 MHz
PS D:\Formation\Python\WebCrawler>
Retrieving data from different pages
Once you've mastered these tools and have a clear idea of the data to retrieve and of the structure of the web pages, you can scrape data from several pages. In this last example, we retrieve the technical data of various Arduino boards. To do this, we create a loop that runs the preceding code on a list of sites:
websites = [
    "https://docs.arduino.cc/hardware/uno-rev3/",
    "https://docs.arduino.cc/hardware/nano/",
    "https://docs.arduino.cc/hardware/mega-2560/",
    "https://docs.arduino.cc/hardware/leonardo/",
]
import sys
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

GECKOPATH = "D:\\AranaCorp\\Marketing\\Prospects"
sys.path.append(GECKOPATH)

print("Starting Web Crawling ...")

websites = [
    "https://docs.arduino.cc/hardware/uno-rev3/",
    "https://docs.arduino.cc/hardware/nano/",
    "https://docs.arduino.cc/hardware/mega-2560/",
    "https://docs.arduino.cc/hardware/leonardo/",
]

#create browser handler
browser = webdriver.Firefox()  #Firefox(firefox_binary=binary)

def acceptcookies():
    #class="iubenda-cs-accept-btn iubenda-cs-btn-primary"
    browser.find_elements(By.CLASS_NAME, "iubenda-cs-accept-btn")[0].click()

def waitForElement(locator, timeout):
    elm = WebDriverWait(browser, timeout).until(expected_conditions.presence_of_element_located(locator))
    return elm

cookie_accepted = False
for website in websites:
    browser.get(website)
    time.sleep(2)

    if not cookie_accepted:
        #accept cookie once
        myElem = waitForElement((By.CLASS_NAME, 'iubenda-cs-accept-btn'), 30)
        print("Page is ready!")
        acceptcookies()
        cookie_accepted = True
    else:
        myElem = waitForElement((By.CLASS_NAME, 'tabs__item'), 30)

    #get board name
    name = browser.find_element(By.TAG_NAME, 'h1').text

    #get tab Tech Specs
    btn_text = 'Tech Specs'
    spec_btn = WebDriverWait(browser, 20).until(
        expected_conditions.element_to_be_clickable((By.XPATH, "//*[contains(text(),'{}')]".format(btn_text))))
    spec_btn.click()
    #browser.execute_script("arguments[0].click();", spec_btn)  #use script to click

    #get all rows and children
    print(name + " " + btn_text)
    print("-------------------------------------")
    tr_elms = browser.find_elements(By.TAG_NAME, 'tr')
    for tr in tr_elms:
        th_elms = tr.find_elements(By.XPATH, '*')
        if len(th_elms) > 1:
            print(th_elms[0].text, " : ", th_elms[1].text)

    #get parent and siblings
    print("Specific data")
    print("-------------------------------------")
    try:
        data_row = browser.find_element(By.XPATH, "//*[contains(text(),'Main Processor')]")
    except:
        data_row = browser.find_element(By.XPATH, "//*[contains(text(),'Processor')]")
    data = data_row.find_element(By.XPATH, "following-sibling::*[1]").text
    print(data_row.text, " : ", data)

browser.quit()
Starting Web Crawling ...
Page is ready!
UNO R3 Tech Specs
-------------------------------------
Main Processor : ATmega328P 16 MHz
Nano Tech Specs
-------------------------------------
Processor : ATmega328 16 MHz
Mega 2560 Rev3 Tech Specs
-------------------------------------
Main Processor : ATmega2560 16 MHz
Leonardo Tech Specs
-------------------------------------
Processor : ATmega32U4 16 MHz
Combining Selenium and BeautifulSoup
The two libraries can be combined so that you benefit from the features of both: Selenium renders the page in the browser, and BeautifulSoup parses the resulting HTML.
from bs4 import BeautifulSoup
from selenium import webdriver

#load the page in the browser
browser = webdriver.Firefox()
browser.get(website)

#hand the rendered HTML over to BeautifulSoup
html = browser.page_source
content = BeautifulSoup(html, 'lxml')

browser.quit()
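From there, you can reuse the BeautifulSoup calls seen earlier on the HTML rendered by the browser; for instance, a sketch of the spec table extraction (assuming the 'Tech Specs' tab has already been clicked as shown above):

#parse the spec table rendered by the browser with BeautifulSoup
for tr in content.find_all('tr'):
    cells = tr.find_all(['th', 'td'])
    if len(cells) > 1:
        print(cells[0].get_text(strip=True), " : ", cells[1].get_text(strip=True))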
Applications
- Automate web-based data collection tasks
- Create your own image bank for neural network training
- Find prospects
- Market research