Project Review: Scraping
Examples of scraping content from external websites
Features
Objectives
I set out to scrape interesting content from the web in order to build skills with the Requests and Beautiful Soup libraries. I started with a couple of famous speeches, which I selected while browsing goodreads.com.
The Approach & Solution
The main tool for this activity is Firefox Developer Tools, which I used to inspect the DOM and identify the elements to target.
Jupyter Notebooks are also useful for this type of activity, as they often present scraped data more clearly than Python's interactive interpreter, even if you use bpython.
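As an illustration of the approach, here is a minimal sketch of how Requests and Beautiful Soup fit together once a target element has been identified in the developer tools. The function names and the CSS selector "div.speech p" are assumptions made up for this example; the real selector depends on the markup of the page being scraped.

```python
import requests
from bs4 import BeautifulSoup


def extract_paragraphs(html, selector="div.speech p"):
    """Return the stripped text of every element matching a CSS selector.

    The default selector is hypothetical; replace it with whatever the
    browser's developer tools show for the page you are targeting.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [node.get_text(strip=True) for node in soup.select(selector)]


def scrape(url, selector="div.speech p"):
    """Download a page and extract the targeted elements from it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_paragraphs(response.text, selector)


if __name__ == "__main__":
    # Parse an inline sample so the demo needs no network access.
    sample = '<div class="speech"><p>First line.</p><p>Second line.</p></div>'
    print(extract_paragraphs(sample))
```

Keeping the parsing step separate from the download makes it easy to experiment with selectors in a notebook against saved HTML, without sending repeated requests to the site.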
Evaluation
There are a couple of enhancements that I would like to add to this project in order to improve its utility and to demonstrate further skills.
The scraping process takes about 15 seconds because it has to fetch pages from an external website and parse their DOM to extract the data. In any case, the process should be respectful of the website: each page is a separate request, and too many requests in quick succession could result in my site being viewed negatively.
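One simple way to stay respectful is to pause between requests. The sketch below is only an illustration, not the project's actual code: fetch_politely, its fetch argument, and the one-second default delay are all assumptions chosen for the example.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch is any callable that takes a URL and returns its content.
    sleep is injectable so the delay can be replaced when testing.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # A stand-in fetch function keeps the demo off the network.
    pages = fetch_politely(["page-1", "page-2"], str.upper, delay_seconds=0.1)
    print(pages)
```

The delay adds to the total run time, so it trades speed for a lighter load on the external site.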
To inform the user of progress, I would like to add a progress bar using Celery and Django.
I would also like to scrape the content using multiprocessing to speed it up a little. This would also give me an opportunity to delve further into the multiprocessing library.
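A rough sketch of what that might look like, assuming each page can be scraped independently of the others. The names scrape_all and fake_scrape are hypothetical, and fake_scrape merely stands in for a real scraping function.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool


def scrape_all(urls, worker, processes=4, pool_factory=multiprocessing.Pool):
    """Apply worker(url) to every URL in parallel, preserving input order.

    pool_factory defaults to a process pool; ThreadPool offers the same
    interface and can be passed in instead.
    """
    with pool_factory(processes=processes) as pool:
        return pool.map(worker, urls)


def fake_scrape(url):
    """Hypothetical stand-in for a real scraping function."""
    return f"scraped {url}"


if __name__ == "__main__":
    pages = ["page-1", "page-2", "page-3"]
    # Worker processes by default.
    print(scrape_all(pages, fake_scrape, processes=2))
    # The same call backed by threads.
    print(scrape_all(pages, fake_scrape, processes=2, pool_factory=ThreadPool))
```

With a process pool the worker must be a module-level function so that it can be pickled and sent to the worker processes. Since scraping spends most of its time waiting on the network, a thread pool may give a similar speed-up with less overhead. Either way, parallel requests increase the load on the external site, so the pool size should stay small.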