Scraping HTML Websites with Beautiful Soup
Scraping HTML websites for information is a common task. This blog post shows how to extract information via the Beautiful Soup (bs4) Python library.
Some desired information is buried in semi-structured HTML websites and not available via public APIs. In my case I wanted to extract stock and company information from a public finance website using Python. The Python standard library comes with ElementTree, which can read well-formed XML documents, but it will choke on the HTML source code of most websites. Luckily, Beautiful Soup comes to the rescue! Beautiful Soup ("bs4") is a Python library to read HTML in any form imaginable, i.e. even if it's not well-formed (X)HTML.
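To see the difference, here is a minimal sketch feeding the same sloppy (but perfectly common) HTML snippet to both parsers:

from xml.etree import ElementTree
from bs4 import BeautifulSoup

# unclosed <li> tags are valid HTML, but not well-formed XML
html = "<ul><li>first<li>second</ul>"

try:
    ElementTree.fromstring(html)
except ElementTree.ParseError as error:
    print("ElementTree chokes:", error)

# Beautiful Soup repairs the tree and finds both list items
soup = BeautifulSoup(html, "html.parser")
print([li.get_text() for li in soup.find_all("li")])  # ['first', 'second']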
Let's print the latest blog post titles of my personal blog:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://srcco.de")
soup = BeautifulSoup(response.text, 'html.parser')

# find all HTML "article" tags
for article in soup.find_all("article"):
    # BeautifulSoup supports CSS selectors
    # via "select" and "select_one"
    title = article.select_one("h1>a").get_text()
    print(title)
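One caveat not shown above: requests.get() happily returns error pages, too. A small hedged addition is to set a timeout and fail fast on HTTP error codes before parsing:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://srcco.de", timeout=10)
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
soup = BeautifulSoup(response.text, "html.parser")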
Finding the latest blog post titles could also have been achieved by reading the well-formed RSS XML feed. Here I present the code as an example of HTML scraping, not because I would recommend doing exactly this! ;-)
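For completeness: since an RSS feed is well-formed XML, the standard library's ElementTree is enough for that job. A sketch, assuming the feed lives at /rss.xml (the actual feed URL may differ):

import requests
from xml.etree import ElementTree

# the feed path is an assumption; check the page's <link rel="alternate"> for the real URL
response = requests.get("https://srcco.de/rss.xml")
root = ElementTree.fromstring(response.content)
for item in root.findall("./channel/item"):
    print(item.findtext("title"))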
So I mentioned reading stock information. Let's read a table of historic stock prices for Allianz into Python dictionaries (dict):
import requests
from bs4 import BeautifulSoup

url = "https://www.boerse.de/historische-kurse/Allianz-Aktie/DE0008404005"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.select_one(".histKurseDay table")

headers = []
for cell in table.thead.find_all("th"):
    headers.append(cell.get_text().strip())

for row in table.tbody.find_all("tr"):
    data = []
    for cell in row.find_all("td"):
        data.append(cell.get_text().strip())
    print(dict(zip(headers, data)))
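Note that every scraped cell is still a plain string; if you want to compute with the prices, they need conversion first. A minimal sketch, assuming the site uses German number formatting (dot as thousands separator, comma as decimal separator) and hypothetical column names like "Datum" and "Schluss":

def parse_german_decimal(text):
    # "1.234,56" -> 1234.56: drop the thousands dots, turn the decimal comma into a dot
    return float(text.replace(".", "").replace(",", "."))

# hypothetical scraped row; the real column names on boerse.de may differ
row = {"Datum": "04.01.2019", "Schluss": "1.234,56"}
print(parse_german_decimal(row["Schluss"]))  # 1234.56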
Here we first get the table headers from the th tags below thead and then read all data values from the rows in tbody. zip is a standard Python function which "zips" the list of header names and the list of row values into one list of tuples.
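For example, with two hypothetical lists:

headers = ["Datum", "Schluss"]   # hypothetical header names
data = ["04.01.2019", "171,46"]  # hypothetical row values
print(list(zip(headers, data)))  # [('Datum', '04.01.2019'), ('Schluss', '171,46')]
print(dict(zip(headers, data)))  # {'Datum': '04.01.2019', 'Schluss': '171,46'}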
Some notes (a short demo of all four follows below):

- The HTML tag hierarchy can be navigated directly via Python attribute access, e.g. element.body.table.thead.
- CSS selectors (.select(…) or .select_one(…)) are handy to get specific elements, e.g. div.main-content ul to get the ul tag in the div with the .main-content CSS class.
- .find_all(…) returns all tags with the given name below the element, e.g. element.find_all("a") allows iterating over all HTML links.
- .get_text() returns all text below the element, including all whitespace. Often you want to remove leading and trailing whitespace with .strip().
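A minimal sketch (on a made-up HTML snippet) exercising all four of these techniques:

from bs4 import BeautifulSoup

html = """<html><body><div class="main-content">
<ul><li><a href="/a">  First  </a></li><li><a href="/b">Second</a></li></ul>
</div></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# attribute access walks down the tag hierarchy
print(soup.body.div.ul.li.a.get_text())           # '  First  '

# CSS selectors via select_one
ul = soup.select_one("div.main-content ul")

# find_all iterates over all matching tags; get_text + strip cleans up the text
for link in ul.find_all("a"):
    print(link.get_text().strip(), link["href"])  # First /a, then Second /b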
More information can be found in the Beautiful Soup Documentation.
This was a rather short blog post on how to scrape HTML websites. Scraping HTML can be ugly, but it's also very useful for getting information that is not available via other means such as proper APIs.