SRCco.de

Scraping HTML Websites with Beautiful Soup


Scraping HTML websites for information is a common task. This blog post shows how to extract information via the Beautiful Soup (bs4) Python library.

Some of the information you want is buried in semi-structured HTML pages and not available via public APIs. In my case, I wanted to extract stock and company information from a public finance website using Python. The Python standard library comes with ElementTree, which can read well-formed XML documents but will choke on the HTML source code of most websites. Luckily, there is Beautiful Soup to the rescue! Beautiful Soup ("bs4") is a Python library that reads HTML in almost any form, i.e. even if it's not well-formed (X)HTML.
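
To make the difference concrete, here is a minimal sketch. The HTML snippet is made up for illustration (an unquoted attribute and an unclosed br tag), so treat it as an assumption rather than markup from a real site:

```python
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# Hypothetical snippet: fine for browsers, but not well-formed XML
# (unquoted attribute value, <br> is never closed)
html = "<html><body><p class=price>200 EUR<br></p><p>Volume: 1000</p></body></html>"

# ElementTree expects well-formed XML and raises ParseError here
try:
    ET.fromstring(html)
    parsed_as_xml = True
except ET.ParseError:
    parsed_as_xml = False

# Beautiful Soup parses the same input without complaint
soup = BeautifulSoup(html, "html.parser")
price = soup.select_one("p.price").get_text()

print(parsed_as_xml)
print(price)
```

Real-world pages tend to contain many such small deviations, which is why a lenient HTML parser is usually the more practical choice for scraping.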

Let's print the latest blog post titles of my personal blog:

import requests
from bs4 import BeautifulSoup

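response = requests.get("https://srcco.de")
# fail early on HTTP errors instead of parsing an error page
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# find all HTML "article" tags
for article in soup.find_all("article"):
    # BeautifulSoup supports CSS selectors
    # via "select" and "select_one"
    link = article.select_one("h1>a")
    # select_one returns None if nothing matches
    if link is not None:
        print(link.get_text())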

Finding the latest blog post titles could also have been achieved by reading the well-formed RSS XML feed. Here I present the code as an example for HTML scraping, not that I would recommend doing exactly this! ;-)
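
For comparison, reading titles from a well-formed RSS feed needs nothing beyond the standard library. The following is only a sketch: the feed content is a made-up RSS 2.0 document embedded as a string, not the actual feed of this blog.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document standing in for a real feed
rss = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First post</title><link>https://example.org/first</link></item>
    <item><title>Second post</title><link>https://example.org/second</link></item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
# every "item" element is one post; findtext returns the text of its "title" child
titles = [item.findtext("title") for item in root.iter("item")]
print(titles)
```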

I mentioned reading stock information earlier. Let's read a table of historical stock prices for Allianz into Python dictionaries (dict):

import requests
from bs4 import BeautifulSoup

url = "https://www.boerse.de/historische-kurse/Allianz-Aktie/DE0008404005"
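response = requests.get(url)
# fail early on HTTP errors instead of parsing an error page
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# the first table inside the element with CSS class "histKurseDay"
table = soup.select_one(".histKurseDay table")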

headers = []
for cell in table.thead.find_all("th"):
    headers.append(cell.get_text().strip())

for row in table.tbody.find_all("tr"):
    data = []
    for cell in row.find_all("td"):
        data.append(cell.get_text().strip())
    print(dict(zip(headers, data)))

Here we first get the table headers from the th tags below thead and then read all data values from the rows in tbody. zip is a built-in Python function which "zips" the list of header names and the list of row values into (header, value) tuples; dict then turns these pairs into one dictionary per row.
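
The zip and dict step can be tried in isolation. The header names and values below are made-up placeholders, not real prices from the website:

```python
# Placeholder header names and cell values, for illustration only
headers = ["Date", "Open", "Close"]
data = ["2020-01-02", "100.00", "101.50"]

# zip pairs up the items by position
pairs = list(zip(headers, data))
# dict turns the (header, value) pairs into a dictionary
row = dict(zip(headers, data))

print(pairs)
print(row)
```

Note that zip stops at the shorter of the two lists, so a row with fewer cells than headers silently yields a smaller dictionary.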

Some notes:

  • The HTML tag hierarchy can be navigated directly via Python attribute access, e.g. element.body.table.thead. Each step returns the first matching tag below the current element.

  • CSS selectors (.select(…) or .select_one(…)) are handy to get specific elements, e.g. div.main-content ul to get ul tags inside the div with the main-content CSS class. select returns all matches, select_one only the first.

  • find_all(…) will return all tags with the given name below the element, e.g. element.find_all("a") lets you iterate over all HTML links.

  • get_text() will return all text below the element including all whitespace. Often you want to strip leading and trailing whitespace with strip().
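
The notes above can be tried out on a small document. This is a sketch with made-up HTML (the class name main-content is borrowed from the selector example, the links are hypothetical):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html = """
<html><body>
  <div class="main-content">
    <ul>
      <li><a href="/a">  First link </a></li>
      <li><a href="/b">Second link</a></li>
    </ul>
  </div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# attribute access: first "ul" tag anywhere below "body"
first_list = soup.body.ul
# CSS selector: first "ul" inside a div with class "main-content"
same_list = soup.select_one("div.main-content ul")
# find_all: every "a" tag below the element
hrefs = [a["href"] for a in soup.find_all("a")]
# get_text keeps surrounding whitespace, strip() removes it
raw = soup.a.get_text()
clean = raw.strip()

print(hrefs)
print(repr(raw), repr(clean))
```

Beautiful Soup also accepts get_text(strip=True), which strips whitespace from each text fragment before joining them.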

More information can be found in the Beautiful Soup Documentation.

This was a rather short blog post on how to scrape HTML websites. Scraping HTML websites can be ugly, but also very useful to get information not available via other means such as proper APIs.