Skip to content

achmadbaund/python-scraping

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 

Repository files navigation

Python Libraries for Web Scraping

Web scraping is the process of extracting structured and unstructured data from the web with the help of programs and exporting into a useful format. If you want to learn more about web scraping, here are a couple of resources to get you started:

  1. Requests (HTTP for Humans) Library for Web Scraping

Let’s start with the most basic Python library for web scraping. ‘Requests’ lets us make HTML requests to the website’s server for retrieving the data on its page. Getting the HTML content of a web page is the first and foremost step of web scraping.

web scraping tools requests

Requests is a Python library used for making various types of HTTP requests like GET, POST, etc. Because of its simplicity and ease of use, it comes with the motto of HTTP for Humans.

I would say this the most basic yet essential library for web scraping. However, the Requests library does not parse the HTML data retrieved. If we want to do that, we require libraries like lxml and Beautiful Soup (we’ll cover them further down in this article).

Let’s take a look at the advantages and disadvantages of the Requests Python library.

Advantages:

Simple
Basic/Digest Authentication
International Domains and URLs
Chunked Requests
HTTP(S) Proxy Support

Disadvantages:

Retrieves only static content of a page
Can’t be used for parsing HTML
Can’t handle websites made purely with JavaScript
  1. Beautiful Soup Library for Web Scraping

BeautifulSoup is perhaps the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

web scraping tools beautiful soup

One of the primary reasons the Beautiful Soup library is so popular is that it is easier to work with and well suited for beginners. We can also combine Beautiful Soup with other parsers like lxml. But all this ease of use comes with a cost – it is slower than lxml. Even while using lxml as a parser, it is slower than pure lxml.

One major advantage of the Beautiful Soup library is that it works very well with poorly designed HTML and has a lot of functions. The combination of Beautiful Soup and Requests is quite common in the industry.

Advantages:

Requires a few lines of code
Great documentation
Easy to learn for beginners
Robust
Automatic encoding detection

Disadvantages:

Slower than lxml

About

Web scraping is the process of extracting structured and unstructured data from the web with the help of programs and exporting into a useful format

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages