Selenium With BeautifulSoup Tutorial [Python]

Aamir Ahmad Ansari
7 min read · Jan 17, 2022
Sherlock Season 4

Sherlock and Watson worked best as a team: put naively, Sherlock was the brains and Watson the muscle. Well, not literally, but Watson was always a great help, whether in The Sign of Four, The Hound of the Baskervilles or many other stories. In the same way, yes, Selenium can work alone, but (in my opinion) it works best with BeautifulSoup. Selenium is a browser automation tool that was built primarily to test web application flows, but our data science community uses it to scrape data. Apologies to team Selenium XD. In this article, we will look at how we can use Selenium and bs4 together to get the data we need from the internet.

Prerequisites

If you know HTML tags and their attributes, that will be a plus; otherwise I’ll explain the ones we use. We will primarily use Python, and familiarity with its idioms is great to have.

Selenium With BeautifulSoup

We have not yet introduced Beautiful Soup, a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, commonly saving programmers hours or days of work.[1] Today’s workflow is straightforward: we will open a browser instance on a Windows machine, visit WhoScored’s website, search for a football team from the search bar (Manchester United, as you may have expected) and get the players’ names and ratings. In a nutshell, we are going to make our code do this for us:

Step 1: Configure Selenium and open the website
Step 2: Find the search box and submit a query
Step 3: Find and follow the first result
Step 4: Get the data with BeautifulSoup

Step 1: Let’s Configure

This is where we will have to do most of the work. To reach the website https://1xbet.whoscored.com we need to configure our Selenium instance, and for that we will require binaries on Linux or an executable driver on Windows. As we are doing it on Windows, download the zip file from here. Extract the zip file and you will find chromedriver.exe (we will use Google Chrome as our browser; you can choose Firefox or any other browser supported by Selenium). Remember the path to the chromedriver.exe file, we will use it. And now it is coding time!!

#Import the libraries
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
import pandas as pd
import numpy as np

These are the libraries we will require. The path to that .exe file will come in handy now, but first, the syntax.

Webdriver Object

This is the class that will give us the browser instance. The two important arguments to remember are service, in which you pass the executable path wrapped in a Service object, and options. Options lets you customize the browser instance; for example, if you want things to run faster you can create an Options object and add the argument "--headless" to it, which makes everything run in the background. More on these later, but for now focus on the Service object. We have already imported it, watch carefully above. We will create a Service object and pass the path in its constructor.

Service Object

Now we will use this service object and pass it to the webdriver to begin our session.

Browser instance

In the first line, we create our browser instance, and in the second we use the get method to open a URL. When you run this, your browser instance will pop up with the URL we entered. The code now looks like this:

Step 1 Completed

We have completed step 1: we have opened the website successfully. Now we will locate the search box, fill in a value and look at the response in step 2.

Step 2: Find the Search Box

It is certainly not as hard as finding Nemo, but a little tricky: we have to inspect the website and look for the attributes of the element that holds the search textbox.

By inspecting, we learn that the search box has the id “search-box”. The ‘id’ attribute is a unique identifier of an HTML tag; the search box uses an input tag since it takes input from the user. In selectors, # stands for an id and a dot (.) stands for a class. The same class can be present on multiple tags, but the same id cannot. Remember that, it is a very useful tool. Now let’s find it in our code.

Finding the search box

find_element(by, value) finds an element by its attributes in the HTML. The parameter by defaults to ‘id’; you can also locate by class or other attributes. value takes the attribute’s value, ‘search-box’ in our case. The next step is to pass our input into the text area.

Pass the text

In the first line, we clear any prior input or placeholder text in the box, and with send_keys we push in our value. It will take a moment, but when you run this code it will put ‘Manchester United’ in the text box. We then similarly find the submit button and click it from our code. The button is inside a div tag with id “search-button”; we find the element the same way and, instead of send_keys, use click() to submit our search.

Submit Search

We will now be redirected to the page at Step 3.

Step 3: Find the first result

Step 3

Note that the browser object has now been updated and contains the HTML of this page. We now have to find the first result and follow the link it contains. How will we find it? Inspecting elements again helps us find the tags: the results are in a div tag with class “search-result”. Let’s find that in our code.

Find the first result

Inside search-result, we take the first tag that wraps a link and read its href property to get the full URL. Then we call browser.get on this URL; the browser object is updated again, and that’s it, that’s the page we wanted. Now we store this page by using .page_source to get its HTML.

Get the HTML

That’s it, Sherlock has what we wanted; now it’s Watson’s (bs4’s) turn to use it. All that remains is extracting the data from the page source, and bs4 comes to our rescue.

Step 4: Getting the Data

To begin the extraction, we first need to parse the data, that is, give it structure, and that is the first thing we do by initialising bs4’s BeautifulSoup class.

The parsed data, a bs4.BeautifulSoup object, is effectively a tree data structure based on the HTML’s structure. We similarly inspect the landing page for the tags in which the data is stored. To get the names of the players we will use:

Names

The names of all the players are in span tags with these classes. If you print names, you will get a list of spans:

We need the text between those tags, that is, the names. We will use the .text property to extract it; similarly, we find the classes holding the ratings and store our data.

The output you receive after this is this clean dataset:

That completes our quest. We have successfully scraped our data, and I hope you found this useful. The code is available here on GitHub.


Aamir Ahmad Ansari

Sharing knowledge is gaining knowledge. Data Science Enthusiast and Master AI & ML fellow @ Univ.Ai