Web scraping with Python - Kishan Pipa[l|r]iya

What is web scraping?

Web scraping involves extracting publicly displayed information from a website.

The difficulty of scraping a website can range from sending simple requests to solving CAPTCHAs. Complex cases may require a combination of several libraries; for simple web pages, however, two libraries are sufficient.

The libraries required are Requests (one of the most downloaded Python packages) and Beautiful Soup 4.

Requests is a feature-rich library that performs a variety of tasks and has been discussed as a candidate for the Python standard library. For web scraping, however, we just need it to get the HTML code for a web page.
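As a minimal sketch of this step, the function below downloads a page and returns its HTML as a string. The function name fetch_html, the timeout value, and the User-Agent string are illustrative assumptions, not part of the original post.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a web page and return its HTML source as text."""
    # The User-Agent value is a hypothetical placeholder; identify your own scraper here.
    headers = {"User-Agent": "example-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://example.com") should return the HTML of that page, ready to be handed to a parser.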

Here comes the creative part. The HTML code is parsed using Beautiful Soup. We then need to cleverly extract the required information from the parsed HTML. There are no hard rules regarding the approach, and we may have to travel several layers deep into the parsed HTML to reach our goal.
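To illustrate what travelling several layers deep can look like, here is a small sketch that parses a made-up HTML snippet and walks from a container div down to the links inside it. The sample markup, the id and class names, and the function name extract_posts are all assumptions chosen for the example; a real page will need its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <div id="articles">
    <div class="post"><h2><a href="/first">First post</a></h2></div>
    <div class="post"><h2><a href="/second">Second post</a></h2></div>
  </div>
</body></html>
"""


def extract_posts(html):
    """Return a list of (title, link) tuples found inside the articles container."""
    soup = BeautifulSoup(html, "html.parser")
    # Layer 1: locate the container that holds all posts.
    container = soup.find("div", id="articles")
    if container is None:
        return []
    posts = []
    # Layer 2: each post block; layer 3: the link nested inside its heading.
    for post in container.find_all("div", class_="post"):
        link = post.find("a")
        if link is not None:
            posts.append((link.get_text(strip=True), link.get("href")))
    return posts


print(extract_posts(SAMPLE_HTML))
# prints [('First post', '/first'), ('Second post', '/second')]
```

Checking for None at each layer keeps the scraper from crashing when a page's layout differs from what was expected.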

For websites that use JavaScript for interactions, a library named Selenium can be used to replicate keyboard and mouse interactions.

Web scraping should be used as a last resort when APIs are not available, as excessive scraping can put unexpected load on the web server.
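One simple way to limit the load a scraper creates is to pause between requests. The sketch below assumes a hypothetical helper, fetch_politely, that takes any fetch function (for example, one built on Requests) and waits a fixed delay between calls; the one-second default is an arbitrary illustrative value, not a recommendation from the original post.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, sleeping between requests to spare the server."""
    results = {}
    for index, url in enumerate(urls):
        # No need to wait before the very first request.
        if index > 0:
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

Passing the fetch function in as an argument keeps the throttling logic separate from the download logic, so the same loop works with any HTTP client.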
