Last modified: Jan 12, 2026, by Alexander Williams
Extract Microdata & JSON-LD with BeautifulSoup
Web scraping often targets structured data. This data is embedded directly in a page's HTML.
BeautifulSoup is a key tool for this task. It parses HTML and XML documents.
This guide shows how to find and extract Microdata and JSON-LD. These are common formats.
What is Structured Data?
Structured data is organized information. It uses a standard format.
Search engines love it. It helps them understand webpage content.
Two main types are Microdata and JSON-LD. Both embed data in HTML.
Extracting this data gives clean, reliable results. It beats parsing raw HTML.
Understanding Microdata
Microdata uses HTML tag attributes. It defines item types and properties.
The main attributes are itemscope, itemtype, and itemprop.
They mark up content like products or events. This makes data machine-readable.
Here is a simple HTML example with Microdata.
# Example HTML snippet with Microdata
html_microdata = """
<div itemscope itemtype="https://schema.org/Product">
    <h1 itemprop="name">Awesome Laptop</h1>
    <p itemprop="description">A powerful laptop for developers.</p>
    <span itemprop="price" content="999.99">$999.99</span>
    <div itemprop="brand" itemscope itemtype="https://schema.org/Brand">
        <span itemprop="name">TechCorp</span>
    </div>
</div>
"""
Extracting Microdata with BeautifulSoup
BeautifulSoup can find tags with specific attributes. Use the find and find_all methods.
First, locate elements with the itemscope attribute. This identifies the data container.
Then, find nested elements with itemprop. This gets the property values.
Let's write code to parse the example above.
from bs4 import BeautifulSoup

# Parse the HTML
soup = BeautifulSoup(html_microdata, 'html.parser')

# Find the main product item (itemscope)
product_div = soup.find(attrs={"itemscope": True})

if product_div:
    # Extract properties using itemprop
    product_name = product_div.find(itemprop="name")
    description = product_div.find(itemprop="description")
    price = product_div.find(itemprop="price")
    brand_scope = product_div.find(itemprop="brand")
    brand = brand_scope.find(itemprop="name") if brand_scope else None

    # Print the extracted data
    print("Product Name:", product_name.text if product_name else "Not found")
    print("Description:", description.text if description else "Not found")
    print("Price:", price.get('content') if price else "Not found")  # Use 'content' attribute
    print("Brand:", brand.text if brand else "Not found")
Output:

Product Name: Awesome Laptop
Description: A powerful laptop for developers.
Price: 999.99
Brand: TechCorp
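If you want every property without naming each one, you can walk the itemprop attributes generically. The helper below is a minimal sketch that goes beyond the example above: microdata_to_dict is a hypothetical function name, and it assumes the product_div variable from the previous snippet.

def microdata_to_dict(item):
    # Hypothetical helper: collect every itemprop inside one itemscope element,
    # preferring the 'content' attribute and recursing into nested items.
    result = {}
    for prop in item.find_all(itemprop=True):
        # Skip properties that belong to a nested itemscope, not to this item
        if prop.find_parent(attrs={"itemscope": True}) is not item:
            continue
        name = prop.get("itemprop")
        if prop.has_attr("itemscope"):
            result[name] = microdata_to_dict(prop)
        else:
            result[name] = prop.get("content") or prop.get_text(strip=True)
    return result

if product_div:
    print(microdata_to_dict(product_div))
    # Expected shape: {'name': ..., 'description': ..., 'price': '999.99', 'brand': {'name': 'TechCorp'}}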
Understanding JSON-LD
JSON-LD is another popular format. It uses a script tag in the HTML head or body.
The script type is application/ld+json. The data is pure JSON inside.
This format is easier for developers to parse. It's also favored by Google.
Here is an example of JSON-LD in HTML.
# Example HTML snippet with JSON-LD
html_jsonld = """
<html>
<head>
    <script type="application/ld+json">
    {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": "Wireless Headphones",
        "description": "Noise-cancelling over-ear headphones.",
        "brand": {
            "@type": "Brand",
            "name": "SoundMax"
        },
        "offers": {
            "@type": "Offer",
            "price": "199.99",
            "priceCurrency": "USD"
        }
    }
    </script>
</head>
<body>
    <h1>Wireless Headphones</h1>
</body>
</html>
"""
Extracting JSON-LD with BeautifulSoup
Extracting JSON-LD is a two-step process. First, find the script tag.
Use BeautifulSoup to locate script tags with the correct type.
Then, parse the JSON string into a Python dictionary. Use Python's json module.
This gives you direct access to all structured data fields.
import json
from bs4 import BeautifulSoup

# Parse the HTML
soup = BeautifulSoup(html_jsonld, 'html.parser')

# Find all script tags with type application/ld+json
jsonld_scripts = soup.find_all('script', type='application/ld+json')

for script in jsonld_scripts:
    # Get the string inside the script tag
    json_data = script.string
    if json_data:
        try:
            # Parse the JSON string into a Python dictionary
            data = json.loads(json_data)
            print("Extracted JSON-LD Data:")
            print(f"  Type: {data.get('@type')}")
            print(f"  Name: {data.get('name')}")
            print(f"  Description: {data.get('description')}")
            print(f"  Brand: {data.get('brand', {}).get('name')}")
            print(f"  Price: {data.get('offers', {}).get('price')} {data.get('offers', {}).get('priceCurrency')}")
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")
Output:

Extracted JSON-LD Data:
  Type: Product
  Name: Wireless Headphones
  Description: Noise-cancelling over-ear headphones.
  Brand: SoundMax
  Price: 199.99 USD
Microdata vs JSON-LD: Key Differences
Microdata is embedded within HTML elements. It mixes content and structure.
JSON-LD is separate from the visible content. It lives in its own script block.
JSON-LD is often easier to extract and parse. The data is already in JSON format.
Microdata requires navigating the HTML tree. You must find the right attributes.
Choose your extraction method based on the website's format.
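When a page could use either format, it helps to check which one is present before extracting. Here is a minimal sketch; detect_structured_data is a hypothetical helper name, and it reuses the html_microdata and html_jsonld snippets from above.

from bs4 import BeautifulSoup

def detect_structured_data(html):
    # Hypothetical helper: report which structured data formats a page contains
    soup = BeautifulSoup(html, "html.parser")
    formats = []
    if soup.find("script", type="application/ld+json"):
        formats.append("JSON-LD")
    if soup.find(attrs={"itemscope": True}):
        formats.append("Microdata")
    return formats

print(detect_structured_data(html_microdata))  # ['Microdata']
print(detect_structured_data(html_jsonld))     # ['JSON-LD']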
Handling Multiple Data Items
Pages often have multiple structured data items. A product list page is a good example.
For Microdata, find all elements with itemscope and loop through each one, as in the sketch below.
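The listing HTML here is made up for illustration; only the looping pattern matters.

from bs4 import BeautifulSoup

# Hypothetical listing page with two products marked up with Microdata
html_listing = """
<div itemscope itemtype="https://schema.org/Product">
    <span itemprop="name">Laptop</span> <span itemprop="price" content="999.99">$999.99</span>
</div>
<div itemscope itemtype="https://schema.org/Product">
    <span itemprop="name">Mouse</span> <span itemprop="price" content="29.99">$29.99</span>
</div>
"""

soup = BeautifulSoup(html_listing, "html.parser")

# Find every item container, then read its properties
# (note: nested itemscope elements would also match and may need filtering)
for item in soup.find_all(attrs={"itemscope": True}):
    name = item.find(itemprop="name")
    price = item.find(itemprop="price")
    print(name.text if name else "?", "-", price.get("content") if price else "?")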
For JSON-LD, the script may contain a JSON list. Or there may be multiple script tags.
Always check the structure of the parsed JSON. It could be a list or a single object.
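Continuing the JSON-LD loop from earlier, one way to handle both cases is to normalize whatever json.loads returns into a list before processing. The @graph branch is an extra assumption: some sites wrap their items in an @graph array.

import json

data = json.loads(script.string)  # 'script' comes from the earlier find_all loop

# Normalize to a list so downstream code handles every shape the same way
if isinstance(data, list):
    items = data
elif isinstance(data, dict) and "@graph" in data:
    items = data["@graph"]  # some sites nest items under @graph
else:
    items = [data]

for item in items:
    print(item.get("@type"), item.get("name"))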
For large-scale projects, follow best practices for large-scale scraping with BeautifulSoup.
Common Challenges and Solutions
Data might be missing or malformed. Always add error handling to your code.
Use try-except blocks for JSON parsing. Use checks for None when finding tags.
Some sites load data dynamically with AJAX. You might need to scrape AJAX content with BeautifulSoup.
To avoid being blocked, rotate user agents and use proxies. Learn more in our guide on how to avoid getting blocked while scraping with BeautifulSoup.
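A minimal sketch of that idea with the requests library is below; the user agent strings and the proxy address are placeholders, not real endpoints.

import random
import requests

# Placeholder user agent strings; substitute values appropriate for your project
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

# Placeholder proxy endpoint; replace with a proxy you actually control
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

headers = {"User-Agent": random.choice(USER_AGENTS)}
response = requests.get("https://example.com/product", headers=headers,
                        proxies=proxies, timeout=10)
print(response.status_code)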
Speed is also crucial for big jobs. Consider using asynchronous techniques.
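For example, one asynchronous approach is to fetch pages concurrently with aiohttp and parse each response with BeautifulSoup. This is a sketch under the assumption that aiohttp is installed; the URLs are placeholders.

import asyncio
import json
import aiohttp
from bs4 import BeautifulSoup

# Placeholder URLs; replace with the pages you want to scrape
URLS = ["https://example.com/page1", "https://example.com/page2"]

async def fetch_jsonld(session, url):
    # Download one page and return any JSON-LD blocks it contains
    async with session.get(url) as resp:
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for script in soup.find_all("script", type="application/ld+json"):
        if script.string:
            try:
                blocks.append(json.loads(script.string))
            except json.JSONDecodeError:
                pass  # skip malformed blocks
    return blocks

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch_jsonld(session, u) for u in URLS))
    for url, blocks in zip(URLS, results):
        print(url, "->", len(blocks), "JSON-LD block(s)")

asyncio.run(main())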
Conclusion
Extracting structured data is a powerful scraping skill. Microdata and JSON-LD are key sources.
BeautifulSoup provides the tools to find and parse this data. Combine it with Python's json module for JSON-LD.
This method yields accurate and clean data. It is more reliable than parsing raw text.
Remember to respect website terms and robots.txt. Use scraping responsibly.
Start by inspecting a page's source. Look for itemscope attributes or ld+json scripts.
Then, apply the techniques from this guide to build robust data extractors.