Python Web Scraping Using BeautifulSoup and lxml
Introduction
Web scraping is the process of extracting data from websites. Python is one of the best tools for web scraping, thanks to libraries like:
- requests – to make HTTP requests
- BeautifulSoup – to parse and navigate HTML
- lxml – for faster XML/HTML parsing (used as a parser with BeautifulSoup)
Install Required Modules
pip install requests beautifulsoup4 lxml
Step-by-Step Web Scraping Example
Step 1: Import Modules
import requests
from bs4 import BeautifulSoup
Step 2: Make an HTTP Request
url = 'https://example.com'
response = requests.get(url)
print(response.status_code) # 200 means OK
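In practice, requests can hang or come back with an error page. A defensive sketch (the helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the example above):

```python
import requests

def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page, bounding the wait and raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)  # don't wait forever
    response.raise_for_status()                    # 4xx/5xx -> exception
    return response.text
```

`raise_for_status()` turns a failed request into an exception instead of letting you parse an error page by mistake.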
Step 3: Parse HTML with BeautifulSoup and lxml
soup = BeautifulSoup(response.content, 'lxml') # using lxml parser
print(soup.title.text)
Common BeautifulSoup Functions
| Task | Code |
|---|---|
| Get all `<a>` tags | `soup.find_all('a')` |
| Get all `<p>` tags | `soup.find_all('p')` |
| Get tag by ID | `soup.find(id="main")` |
| Get tag by class | `soup.find_all(class_="product-title")` |
| Get text only | `tag.text` |
| Get attribute value | `tag['href']` |
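The calls in the table can be tried out on a small inline document, with no network needed (the HTML snippet below is made up for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="main">
  <p class="product-title"><a href="/item/1">Widget</a></p>
  <p class="product-title"><a href="/item/2">Gadget</a></p>
</div>
"""
soup = BeautifulSoup(html_doc, "lxml")

links = soup.find_all("a")                      # all <a> tags
main = soup.find(id="main")                     # tag by ID
titles = soup.find_all(class_="product-title")  # tags by class
first_href = links[0]["href"]                   # attribute value
first_text = links[0].text                      # text only

print(len(links), first_href, first_text)  # 2 /item/1 Widget
```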
Example: Scraping Quotes from a Page
import requests
from bs4 import BeautifulSoup
url = "https://quotes.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
quotes = soup.find_all("span", class_="text")
for i, quote in enumerate(quotes, 1):
    print(f"{i}. {quote.text}")
Output:
1. "The world as we have created it is a process of our thinking."
2. "It is our choices that show what we truly are..."
...
Parsing Tables from HTML
table = soup.find('table')
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    data = [col.text.strip() for col in cols]
    print(data)
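Header cells usually use `<th>` rather than `<td>`, so the loop above would print an empty list for the header row. A sketch that captures both, using an inline table invented for illustration:

```python
from bs4 import BeautifulSoup

table_html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>4.50</td></tr>
</table>
"""
soup = BeautifulSoup(table_html, "lxml")

table = soup.find("table")
headers = [th.text.strip() for th in table.find_all("th")]
rows = [
    [td.text.strip() for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")  # skip the header-only row
]
print(headers)  # ['Name', 'Price']
print(rows)     # [['Widget', '9.99'], ['Gadget', '4.50']]
```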
Using lxml Directly (Advanced)
If you want faster parsing and XPath support:
from lxml import html
url = "https://example.com"
response = requests.get(url)
tree = html.fromstring(response.content)
# Extract all links
links = tree.xpath('//a/@href')
print(links)
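XPath also works on an inline document, which makes the syntax easier to experiment with (the snippet below is invented for illustration):

```python
from lxml import html

doc = html.fromstring("""
<ul>
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
</ul>
""")

hrefs = doc.xpath("//a/@href")      # attribute values
texts = doc.xpath("//a/text()")     # text nodes
second = doc.xpath("//li[2]/a")[0]  # positional predicate

print(hrefs)        # ['/a', '/b']
print(texts)        # ['First', 'Second']
print(second.text)  # Second
```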
Handling Headers & User-Agent (To Avoid Blocks)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get("https://example.com", headers=headers)
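When you make several requests to the same site, a `requests.Session` keeps the headers (and cookies) across calls, so you set the User-Agent once. A minimal sketch:

```python
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
})

# Every request made through this session now sends the header:
# response = session.get("https://example.com")
print(session.headers['User-Agent'])
```

Pairing a session with a short `time.sleep()` between requests also helps you stay polite toward the server.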
Exporting Scraped Data to CSV
import csv
with open("quotes.csv", "w", newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Quote"])
    for quote in quotes:
        writer.writerow([quote.text])
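A quick way to verify the export is to read the file back with `csv.reader`. A self-contained sketch (the rows and the temporary file name are made up for illustration):

```python
import csv
import os
import tempfile

rows = [["Quote"], ["Life is what happens."], ["Carpe diem."]]

path = os.path.join(tempfile.gettempdir(), "quotes_demo.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

with open(path, newline="", encoding="utf-8") as f:
    back = list(csv.reader(f))

print(back == rows)  # True
```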
Ethical Note on Web Scraping
- Always read a website's robots.txt before scraping.
- Avoid overloading servers.
- Do not scrape private or sensitive data.