A beginner’s guide to web scraping with Python and Scrapy
Getting Started
To get started, you will need a new Python 3 project with Scrapy (a web scraping and web crawling library for Python) installed. I’m using pipenv for this tutorial, but you can use pip and venv, or conda.
pipenv install scrapy
At this point you have Scrapy installed, but you still need to create a web scraping project, and for that Scrapy provides a command-line tool that does the work for us.
Let’s now create a new project named web_scraper using the Scrapy CLI.
If you are using pipenv like me, use:
pipenv run scrapy startproject web_scraper .
Otherwise, from your virtual environment, use:
scrapy startproject web_scraper .
This will create a basic project in the current directory with the following structure:
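Because we passed `.` to `startproject`, the files land in the current directory rather than in a new subfolder. The layout looks roughly like this (the standard files Scrapy generates):

```
.
├── scrapy.cfg            # deploy/configuration file
└── web_scraper/          # the project’s Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # the folder where your spiders will live
        └── __init__.py
```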
Building our first Spider with XPath queries
We will start our web scraping tutorial with a very simple example: locating the logo of the Live Code Stream website inside its HTML. The logo is plain text rather than an image, so we’ll simply extract that text.
The code
To get started, we need to create a new spider for this project. We can do that either by creating a new file or by using the CLI.
Since we already know the code we need, we will create a new Python file at this path: /web_scraper/spiders/live_code_stream.py
Here are the contents of this file.
Code explanation:
You can also use external libraries like BeautifulSoup and lxml, but for this example we’ve used XPath. A quick way to determine the XPath of any HTML element is to open the page in Chrome DevTools, right-click the HTML code of that element, hover the mouse cursor over “Copy” in the context menu that appears, and finally click the “Copy XPath” menu item.
Have a look at the below screenshot to understand it better.
By the way, I appended /text() to the actual XPath of the element to retrieve only the text inside that element instead of the full element markup.
Note: You’re not allowed to use other names for the variables, lists, or functions mentioned above. These names are pre-defined in the Scrapy library, so you must use them as they are; otherwise, the program will not work as intended.
Run the Spider:
Since we are already inside the web_scraper folder at the command prompt, let’s execute our spider and write the result into a new file, lcs.json, using the command below. Yes, the result we get will be well-structured in JSON format.
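The command itself was lost from the original listing; assuming the spider’s `name` attribute is `live_code_stream`, it would look like this (`-o` tells Scrapy which feed file to write):

```shell
pipenv run scrapy crawl live_code_stream -o lcs.json
```

If you are not using pipenv, drop the `pipenv run` prefix and run it from your activated virtual environment.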
Results:
When the above command finishes, we’ll see a new file, lcs.json, in our project folder.
Here are the contents of this file.
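The original output listing is missing, but Scrapy’s JSON feed export always writes a JSON array of the yielded items. Assuming the spider yields a single item with a `logo` field, the file would look something like this (the field name and text are illustrative, not the actual captured output):

```json
[
  {"logo": "Live Code Stream"}
]
```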
Another Spider with CSS query selectors
Most of us love sports, and when it comes to football, it’s my personal favorite.
Football tournaments are organized frequently throughout the world. There are several websites that provide a live feed of match results while they are being played. But, most of these websites don’t offer any official API.
In turn, this creates an opportunity for us to use our web scraping skills and extract meaningful information by scraping their websites directly.
For example, let’s have a look at the Livescore.cz website.
On their home page, they nicely display the tournaments and matches being played today (the date when you visit the website).
We can retrieve information like:
In our code example, we will be extracting tournament names that have matches today.
The code
Let’s create a new spider in our project to retrieve the tournament names. I’ll name this file livescore_t.py.
Here is the code that you need to enter inside /web_scraper/web_scraper/spiders/livescore_t.py:
Run the newly created spider:
It’s time to see our spider in action. Run the command below to let the spider crawl the home page of the Livescore.cz website. The web scraping results will then be written to a new file called ls_t.json in JSON format.
By now you know the drill.
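Assuming this spider’s `name` attribute is `livescore_t`, the command mirrors the one we used for the first spider:

```shell
pipenv run scrapy crawl livescore_t -o ls_t.json
```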
Results:
This is what our web spider extracted from Livescore.cz on 18 November 2020. Remember that the output may change every day.