GitHub - achmadbaund/python-scraping: Web scraping is the process of extracting structured and unstructured data from the web with the help of programs and exporting into a useful format

Python Libraries for Web Scraping

Web scraping is the process of extracting structured and unstructured data from the web with the help of programs and exporting into a useful format. If you want to learn more about web scraping, here are a couple of resources to get you started:

Requests (HTTP for Humans) Library for Web Scraping

Let’s start with the most basic Python library for web scraping. ‘Requests’ lets us make HTML requests to the website’s server for retrieving the data on its page. Getting the HTML content of a web page is the first and foremost step of web scraping.

web scraping tools requests

Requests is a Python library used for making various types of HTTP requests like GET, POST, etc. Because of its simplicity and ease of use, it comes with the motto of HTTP for Humans.

I would say this the most basic yet essential library for web scraping. However, the Requests library does not parse the HTML data retrieved. If we want to do that, we require libraries like lxml and Beautiful Soup (we’ll cover them further down in this article).

Let’s take a look at the advantages and disadvantages of the Requests Python library.

Advantages:

Simple
Basic/Digest Authentication
International Domains and URLs
Chunked Requests
HTTP(S) Proxy Support

Disadvantages:

Retrieves only static content of a page
Can’t be used for parsing HTML
Can’t handle websites made purely with JavaScript

Beautiful Soup Library for Web Scraping

BeautifulSoup is perhaps the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

web scraping tools beautiful soup

One of the primary reasons the Beautiful Soup library is so popular is that it is easier to work with and well suited for beginners. We can also combine Beautiful Soup with other parsers like lxml. But all this ease of use comes with a cost – it is slower than lxml. Even while using lxml as a parser, it is slower than pure lxml.

One major advantage of the Beautiful Soup library is that it works very well with poorly designed HTML and has a lot of functions. The combination of Beautiful Soup and Requests is quite common in the industry.

Advantages:

Requires a few lines of code
Great documentation
Easy to learn for beginners
Robust
Automatic encoding detection

Disadvantages:

Slower than lxml

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
README.md		README.md
beautifulsoup-scraping.py		beautifulsoup-scraping.py
garuda-python.py		garuda-python.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages