Python Web Scraping Using BeautifulSoup and lxml


Introduction

Web scraping is the process of extracting data from websites. Python is one of the most popular languages for web scraping, thanks to libraries like:

  • requests – to make HTTP requests
  • BeautifulSoup – to parse and navigate HTML
  • lxml – for faster XML/HTML parsing (used as a parser with BeautifulSoup)

Install Required Modules

pip install requests beautifulsoup4 lxml


Step-by-Step Web Scraping Example

Step 1: Import Modules

import requests
from bs4 import BeautifulSoup

Step 2: Make an HTTP Request

url = 'https://example.com'
response = requests.get(url)
print(response.status_code)  # 200 means OK

Step 3: Parse HTML with BeautifulSoup and lxml

soup = BeautifulSoup(response.content, 'lxml')  # using lxml parser
print(soup.title.text)


Common BeautifulSoup Functions

Task                     Code
Get all <a> tags         soup.find_all('a')
Get all <p> tags         soup.find_all('p')
Get tag by ID            soup.find(id="main")
Get tags by class        soup.find_all(class_="product-title")
Get text only            tag.text
Get attribute value      tag['href']
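To see these functions in action without fetching a live page, you can parse a small inline HTML snippet (the markup below is a made-up example, not from a real site):

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a downloaded page
html_doc = """
<div id="main">
  <a href="/page1" class="product-title">Widget</a>
  <a href="/page2" class="product-title">Gadget</a>
  <p>Two products listed.</p>
</div>
"""

soup = BeautifulSoup(html_doc, "lxml")

links = soup.find_all("a")                      # all <a> tags
main = soup.find(id="main")                     # tag by ID
titles = soup.find_all(class_="product-title")  # tags by class

print(len(links))        # 2
print(links[0]["href"])  # /page1
print(titles[1].text)    # Gadget
```

find_all() always returns a list (possibly empty), while find() returns the first match or None, so check for None before calling .text on a find() result.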


Example: Scraping Quote Text

import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

quotes = soup.find_all("span", class_="text")

for i, quote in enumerate(quotes, 1):
    print(f"{i}. {quote.text}")

Output:

1. "The world as we have created it is a process of our thinking."
2. "It is our choices that show what we truly are..."
...


Parsing Tables from HTML

table = soup.find('table')
rows = table.find_all('tr')

for row in rows:
    cols = row.find_all('td')
    data = [col.text.strip() for col in cols]
    if data:  # header rows use <th>, not <td>, and would yield an empty list
        print(data)
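Often you want the header row as dictionary keys rather than plain lists. A minimal sketch, using a hypothetical inline table in place of a scraped page:

```python
from bs4 import BeautifulSoup

# Hypothetical inline table standing in for a scraped page
table_html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>4.50</td></tr>
</table>
"""

soup = BeautifulSoup(table_html, "lxml")
table = soup.find("table")

# Use the <th> cells as keys, then zip each data row into a dict
headers = [th.text.strip() for th in table.find_all("th")]
records = []
for row in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.text.strip() for td in row.find_all("td")]
    records.append(dict(zip(headers, cells)))

print(records)
# [{'Name': 'Widget', 'Price': '9.99'}, {'Name': 'Gadget', 'Price': '4.50'}]
```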


Using lxml Directly (Advanced)

If you want faster parsing and XPath support:

import requests
from lxml import html

url = "https://example.com"
response = requests.get(url)

tree = html.fromstring(response.content)

# Extract all links
links = tree.xpath('//a/@href')
print(links)
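XPath can select text nodes as well as attributes. The snippet below runs on an inline HTML string (a made-up stand-in for a downloaded page), so no network request is needed:

```python
from lxml import html

# Hypothetical inline page for demonstration
page = """
<html><body>
  <a href="/a">First</a>
  <a href="/b">Second</a>
</body></html>
"""

tree = html.fromstring(page)

hrefs = tree.xpath("//a/@href")   # attribute values
texts = tree.xpath("//a/text()")  # text content of each <a>

print(hrefs)  # ['/a', '/b']
print(texts)  # ['First', 'Second']
```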


Handling Headers & User-Agent (To Avoid Blocks)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

response = requests.get("https://example.com", headers=headers)
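When making several requests, it is common to reuse one requests.Session (so headers and cookies persist) and to pause between requests. A sketch with hypothetical URLs — the actual fetch line is commented out so the snippet runs offline:

```python
import time
import requests

# One Session reuses the connection and applies headers to every request
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})

urls = ["https://example.com/page1", "https://example.com/page2"]  # hypothetical

for url in urls:
    # response = session.get(url)  # uncomment to actually fetch
    time.sleep(1)  # polite delay between requests
```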


Exporting Scraped Data to CSV

import csv

with open("quotes.csv", "w", newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Quote"])
    for quote in quotes:
        writer.writerow([quote.text])


Ethical Note on Web Scraping

  • Always read a website's robots.txt before scraping.
  • Avoid overloading servers.
  • Do not scrape private or sensitive data.
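Python's standard library can check robots.txt rules for you via urllib.robotparser. Below, a sample robots.txt is parsed inline for demonstration; against a real site you would call rp.set_url("https://example.com/robots.txt") followed by rp.read():

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (hypothetical), parsed from a string
rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/page"))       # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```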