Scrape Websites with Python, Scrapy, and MongoDB

Build powerful, scalable web scraping pipelines that extract, process, and store data efficiently using Python's leading crawling framework and MongoDB's flexible document database.

Why Scrapy and MongoDB for Web Scraping?

Web scraping has become essential for data-driven applications, enabling competitive intelligence, price monitoring, content aggregation, and ML dataset construction. Scrapy is Python's asynchronous crawling framework designed specifically for large-scale scraping. Combined with MongoDB's flexible document model, it gives you data pipelines that extract, process, and store web content at scale.

Key advantages of this approach:

  • Asynchronous architecture handles thousands of concurrent requests efficiently
  • MongoDB's schema-flexible storage accommodates varied scraped data structures
  • Built-in middleware ecosystem handles cookies, redirects, and rate limiting
  • Pipeline architecture cleanly separates extraction from storage logic
  • Proven combination used by data teams at scale

For organizations building custom Python solutions, Scrapy and MongoDB provide the foundation for reliable data acquisition systems.

What You'll Learn

Master the complete scraping workflow:

  • Scrapy Framework Fundamentals: understand Scrapy's architecture, spider creation, and the request-response lifecycle
  • MongoDB Integration: configure pipelines for seamless item storage with proper schema design
  • Spider Development: build robust spiders with CSS/XPath selectors and link following
  • Performance Optimization: tune concurrency, implement throttling, and batch database writes
  • Anti-Detection Tactics: handle user-agent rotation, proxies, and JavaScript rendering
  • Production Deployment: scale with distributed crawling and implement monitoring

Getting Started with Scrapy

Scrapy is a high-level web crawling and scraping framework that provides a complete toolkit for extracting data from websites. Unlike simpler libraries like Beautiful Soup that only handle HTML parsing, Scrapy offers complete crawling solutions including URL scheduling, request handling, response parsing, and item pipeline processing, as covered in the GeeksforGeeks Scrapy guide.

Installing Scrapy

pip install scrapy

The framework uses an asynchronous architecture based on Twisted, allowing it to handle thousands of concurrent requests efficiently without blocking. This non-blocking design means your scraper can request multiple pages simultaneously while processing responses as they arrive.
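The effect of a non-blocking design can be illustrated without Scrapy itself. The sketch below uses the standard library's asyncio rather than Twisted, and `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network call. It only shows that the waiting time of several simulated requests overlaps instead of adding up.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float) -> str:
    # Hypothetical stand-in for a download: sleeping hands control back to
    # the event loop, much like waiting on a socket would.
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl(urls: list[str], delay: float = 0.05) -> list[str]:
    # Start every simulated request at once and wait for all of them.
    return await asyncio.gather(*(fake_fetch(u, delay) for u in urls))


def run_demo() -> tuple[list[str], float]:
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    started = time.perf_counter()
    bodies = asyncio.run(crawl(urls))
    return bodies, time.perf_counter() - started
```

Calling `run_demo()` returns the five simulated bodies in request order, and the elapsed time should be close to a single delay rather than five delays in a row.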

Creating a Project

scrapy startproject myscraper
cd myscraper
scrapy genspider example example.com

The project structure organizes spiders, items, and pipelines systematically. Spiders define extraction logic, items provide data containers, and pipelines handle post-processing. For teams building web development solutions, Scrapy provides the foundation for data acquisition systems that power competitive intelligence and content aggregation platforms.

Complete Product Spider Example

import scrapy
from itemloaders import ItemLoader
from datetime import datetime

# Import from the project package created by "scrapy startproject myscraper"
from myscraper.items import ProductItem


class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Extract product links from listing page
        product_links = response.css('.product-card a::attr(href)').getall()

        for link in product_links:
            yield response.follow(link, self.parse_product)

        # Handle pagination
        next_page = response.css('.pagination .next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        loader = ItemLoader(item=ProductItem(), selector=response)

        loader.add_css('name', '.product-title::text')
        loader.add_css('price', '.price-current::text')
        loader.add_css('description', '.product-description::text')
        loader.add_css('category', '.breadcrumb li:last-child a::text')
        loader.add_css('rating', '.rating-score::attr(data-rating)')
        loader.add_value('url', response.url)
        loader.add_value('scraped_at', datetime.now().isoformat())

        yield loader.load_item()
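
Unless ProductItem declares output processors, an ItemLoader returns each field as a list of raw strings, so values such as the price usually need cleaning before storage. The helpers below are a minimal sketch of that step. The function names and the assumed price format (an optional currency symbol and comma thousands separators) are illustrative assumptions, not part of the spider above.

```python
import re
from typing import Optional


def first_text(values) -> Optional[str]:
    """Return the first non-empty, stripped string from a list of raw values."""
    for value in values or []:
        text = str(value).strip()
        if text:
            return text
    return None


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Parse strings like '$1,299.99' into a float; return None if unparseable."""
    if not raw:
        return None
    # Keep digits and the decimal point; drop currency symbols and separators.
    cleaned = re.sub(r"[^0-9.]", "", raw)
    try:
        return float(cleaned)
    except ValueError:
        return None
```

For example, `parse_price(first_text([' ', ' $1,299.99 ']))` yields a plain number that MongoDB can store and sort on. A float is used here only for simplicity; an integer number of cents is a common alternative when exact arithmetic matters.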

MongoDB Integration Architecture

MongoDB's document-oriented model aligns naturally with Scrapy's item-based architecture. Rather than forcing scraped data into rigid relational schemas, MongoDB allows you to store each scraped item as a JSON-like document with nested structures that match the natural organization of web content.

Why MongoDB for Scraped Data?

  • Schema flexibility accommodates varied page structures without upfront schema design
  • Nested documents naturally represent hierarchical content from web pages
  • Atomic upserts handle both new and updated content in single operations
  • Indexing supports efficient queries on URL, timestamps, and content fields
  • Query language handles complex filtering without joins

The integration happens through Scrapy's item pipelines, where each extracted item is processed and stored in MongoDB without requiring transformation to a different data model. When building enterprise web applications, this approach enables seamless data aggregation from multiple sources into unified data stores for analytics and reporting.

MongoDB Pipeline Implementation

from datetime import datetime, timezone

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError


class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db, collection_name):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGODB_URI'),
            mongo_db=crawler.settings.get('MONGODB_DATABASE'),
            collection_name=crawler.settings.get('MONGODB_COLLECTION')
        )

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.collection = self.db[self.collection_name]
        # A unique index on url backs the upsert-based deduplication
        self.collection.create_index('url', unique=True)
        self.collection.create_index('scraped_at')

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        data = dict(item)
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        now = datetime.now(timezone.utc)
        data['last_updated'] = now

        try:
            self.collection.update_one(
                {'url': item['url']},
                {'$set': data, '$setOnInsert': {'first_scraped': now}},
                upsert=True
            )
        except DuplicateKeyError:
            # Concurrent upserts for the same URL can race on the unique index
            spider.logger.warning(f'Duplicate item: {item["url"]}')

        return item
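
The filter and update documents passed to update_one can also be built by a small pure function, which makes the upsert logic testable without a running MongoDB instance. This is only a sketch: `build_upsert` is a hypothetical helper name, and the field names follow the pipeline above.

```python
from datetime import datetime, timezone


def build_upsert(item: dict, now: datetime) -> tuple[dict, dict]:
    """Return (filter, update) documents for a URL-keyed upsert."""
    data = dict(item)
    data["last_updated"] = now
    query = {"url": data["url"]}
    update = {
        # $set overwrites the scraped fields on every run.
        "$set": data,
        # $setOnInsert only applies when the upsert creates a new document.
        "$setOnInsert": {"first_scraped": now},
    }
    return query, update
```

Inside the pipeline the result would be used as `collection.update_one(query, update, upsert=True)`.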

Configuring MongoDB Settings

Add these settings to your Scrapy project's settings.py:

# MongoDB Configuration
MONGODB_URI = 'mongodb://localhost:27017/'
MONGODB_DATABASE = 'scraped_data'
MONGODB_COLLECTION = 'products'

# Performance Settings
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Pipeline Order
ITEM_PIPELINES = {
    'myscraper.pipelines.MongoDBPipeline': 300,
}

Key settings explained:

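  • CONCURRENT_REQUESTS sets the maximum number of simultaneous requests (16 is Scrapy's default; 16-64 is a typical working range)
  • AUTOTHROTTLE_ENABLED turns on AutoThrottle, which adjusts the download delay based on observed response latency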
  • Pipeline order determines execution sequence (lower values run first)

These configurations form the foundation of robust data pipelines that support custom web development initiatives requiring reliable data collection at scale.

Performance Optimization Strategies

Optimizing scraping performance involves balancing throughput against reliability and target server constraints. For high-volume data collection operations, implementing AI automation workflows can help coordinate complex multi-step extraction pipelines.

Request Optimization

  • Setting concurrent requests between 16 and 64 usually provides good throughput
  • Adaptive throttling reduces request rate when servers slow down
  • Connection pooling reuses TCP connections to targets

Database Write Optimization

Batching database writes significantly improves throughput:

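# Batch insert example
from pymongo import MongoClient

class BatchedMongoPipeline:
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []

    def open_spider(self, spider):
        # Reuse the MongoDB settings defined in settings.py
        settings = spider.settings
        self.client = MongoClient(settings.get('MONGODB_URI'))
        db = self.client[settings.get('MONGODB_DATABASE')]
        self.collection = db[settings.get('MONGODB_COLLECTION')]

    def close_spider(self, spider):
        self._flush(spider)  # write any items still in the buffer
        self.client.close()

    def process_item(self, item, spider):
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.batch_size:
            self._flush(spider)
        return item

    def _flush(self, spider):
        if self.buffer:
            self.collection.insert_many(self.buffer)
            spider.logger.info(f'Inserted batch of {len(self.buffer)} items')
            self.buffer = []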
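
The buffering behavior can be exercised without a database by injecting the write function. The sketch below is a hypothetical `BatchBuffer` helper, not part of Scrapy or PyMongo. It shows the two flush triggers a batched pipeline needs: reaching the batch size, and a final flush for leftover items, typically called when the spider closes.

```python
from typing import Callable


class BatchBuffer:
    """Collect items and hand them to flush_fn in batches."""

    def __init__(self, flush_fn: Callable[[list], None], batch_size: int = 100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.items: list = []

    def add(self, item: dict) -> None:
        self.items.append(item)
        if len(self.items) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.items:
            # Swap in a fresh list first so flush_fn receives a stable batch.
            batch, self.items = self.items, []
            self.flush_fn(batch)
```

In a pipeline, `flush_fn` would be something like `collection.insert_many`. Without the final `flush()` call, up to `batch_size - 1` items would be silently lost at the end of a crawl.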

Common Bottlenecks

Bottleneck | Solution
Network latency | Increase concurrency
Server rate limiting | Enable auto-throttling
Database writes | Batch insertions
Memory pressure | Limit item buffer size
CPU parsing | Consider parallel processing

Handling Anti-Scraping Measures

Modern websites implement sophisticated anti-bot measures that require careful handling. According to Oxylabs' research on advanced web scraping tactics, effective countermeasure implementation balances stealth with performance.

User-Agent Rotation

# Middleware for user-agent rotation
import random


class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Pick a user agent at random for each outgoing request
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)
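
For the middleware to take effect it has to be registered and given a pool of user agents. The settings fragment below is a sketch: USER_AGENTS is the custom setting name read by the middleware above, the two strings are placeholder values, and the module path `myscraper.middleware` is an assumption that should match wherever the class actually lives.

```python
# settings.py (sketch): the user-agent strings are placeholders
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

DOWNLOADER_MIDDLEWARES = {
    # Turn off the built-in User-Agent middleware so only one component
    # is responsible for this header.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Assumed module path; adjust to your project layout.
    "myscraper.middleware.RandomUserAgentMiddleware": 400,
}
```

If other downloader middlewares are registered elsewhere in settings.py, merge them into this single dictionary, since a later assignment replaces an earlier one.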

Proxy Rotation

# settings.py
PROXY_LIST = [
    'http://proxy1:port',
    'http://proxy2:port',
]
DOWNLOADER_MIDDLEWARES = {
    'myscraper.middleware.ProxyMiddleware': 100,
}

Common Countermeasures

Technique | Purpose | Implementation
User-Agent rotation | Appear as different browsers | Middleware with UA pool
Proxy rotation | Distribute requests across IPs | Residential/datacenter proxies
Request throttling | Avoid rate limiting | AutoThrottle or delays
Session management | Handle cookies | CookiesMiddleware (enabled by default)
JavaScript rendering | Handle dynamic content | Splash or Playwright integration

For organizations running large-scale data collection, combining robust anti-detection with proper SEO data collection practices ensures sustainable access to competitive intelligence.

Best Practices and Robustness Patterns

Production scraping requires defensive coding and comprehensive error handling to ensure reliable data collection over time.

Error Handling

  • Retry transient failures with exponential backoff
  • Handle missing elements gracefully using .get() instead of direct access
  • Log unexpected structures for spider maintenance
  • Validate data before database insertion

# Robust parsing example
def parse_product(self, response):
    try:
        loader = ItemLoader(item=ProductItem(), selector=response)
        loader.add_css('name', '.product-title::text')
        loader.add_css('price', '.price-current::text')
        # Optional field: add_css collects nothing if the selector has no match
        loader.add_css('sku', '.product-sku::text')
        yield loader.load_item()
    except Exception as e:
        # Log the failing URL so the spider can be fixed when the layout changes
        self.logger.error(f'Parse error on {response.url}: {e}')
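
Validation before insertion can start as a plain function that returns a list of problems. The sketch below assumes items have already been reduced to scalar values (a string URL, a numeric price); the rules and the name `validate_product` are illustrative assumptions, not requirements of Scrapy or MongoDB.

```python
def validate_product(item: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the item is usable."""
    errors = []
    for field in ("name", "url"):
        if not item.get(field):
            errors.append(f"missing {field}")
    url = item.get("url")
    if url and not (isinstance(url, str) and url.startswith(("http://", "https://"))):
        errors.append("url is not an absolute http(s) URL")
    price = item.get("price")
    if price is not None and (not isinstance(price, (int, float)) or price < 0):
        errors.append("price must be a non-negative number")
    return errors
```

In an item pipeline, a non-empty result would typically lead to raising `scrapy.exceptions.DropItem` with the joined error messages, so invalid records never reach the database.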

Deduplication Strategy

  • URL-based deduplication with unique index on url
  • Content hashing for change detection
  • Upsert operations update existing records
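
Content hashing for change detection can be sketched with the standard library alone. The helper below hashes every field except timestamps, so two scrapes of an unchanged page produce the same digest. The function name and the default list of ignored fields are assumptions based on the field names used earlier in this article.

```python
import hashlib
import json


def content_hash(item: dict,
                 ignore: tuple = ("scraped_at", "last_updated", "first_scraped")) -> str:
    """Return a SHA-256 hex digest of the item's content, ignoring timestamps."""
    stable = {k: v for k, v in item.items() if k not in ignore}
    # sort_keys makes the serialization independent of dict insertion order;
    # default=str covers values json cannot encode natively, such as datetimes.
    payload = json.dumps(stable, sort_keys=True, default=str, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Storing the digest alongside each document lets a pipeline compare the new hash with the stored one and skip the write, or record a change event, only when the content actually differs.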

Monitoring Essentials

Track these metrics for production reliability:

  • Request success/failure rates
  • Item extraction counts
  • Database write latencies
  • Retry queue depth
  • Memory and CPU usage
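
Several of these metrics can be derived from the stats dictionary Scrapy collects during a crawl. The sketch below condenses such a dictionary into a few headline numbers. The key names are the ones Scrapy's stats collector commonly reports, but treat them as assumptions and check them against the stats dump in your own crawl log; `summarize_stats` is a hypothetical helper.

```python
def summarize_stats(stats: dict) -> dict:
    """Condense a Scrapy-style stats dict into a few headline metrics."""
    responses = stats.get("downloader/response_count", 0)
    ok = stats.get("downloader/response_status_count/200", 0)
    return {
        "responses": responses,
        # None when nothing was downloaded, to avoid dividing by zero.
        "success_rate": ok / responses if responses else None,
        "items_scraped": stats.get("item_scraped_count", 0),
        "retries": stats.get("retry/count", 0),
        "download_errors": stats.get("downloader/exception_count", 0),
    }
```

The input would typically come from `crawler.stats.get_stats()` at the end of a run, and the summary can then be shipped to whatever monitoring system is in use.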

For organizations implementing custom software solutions, these patterns ensure data pipelines remain reliable as target sites evolve.

Frequently Asked Questions

Is Scrapy better than Beautiful Soup for large-scale scraping?

Yes. Scrapy's asynchronous architecture handles concurrent requests efficiently, while Beautiful Soup is purely a parsing library requiring manual request management. Scrapy includes built-in support for following links, handling cookies, and respecting robots.txt.

How do I scrape JavaScript-rendered pages with Scrapy?

For simple JavaScript content, use Scrapy with Splash (a JavaScript rendering service). For complex SPAs requiring full browser capabilities, integrate Playwright or Selenium through middleware that executes JavaScript before returning the rendered HTML.

What's the best way to handle rate limiting?

Enable Scrapy's AutoThrottle extension (AUTOTHROTTLE_ENABLED = True), which automatically adjusts the request delay based on server response latency. For stricter rate limits, add a fixed delay between requests (DOWNLOAD_DELAY) and consider using proxy rotation to distribute load across multiple IPs.

How do I prevent getting blocked while scraping?

Use realistic user-agent strings, implement request throttling, rotate through proxy pools (especially residential proxies), and mimic human browsing patterns. For heavily protected sites, consider headless browsers with stealth modifications.

Should I use MongoDB or PostgreSQL for scraped data?

MongoDB excels for scraping due to its flexible schema that accommodates varied page structures without upfront schema design. PostgreSQL works well when you need complex queries, strict schemas, or relational operations on your scraped data.

Sources

  1. LogRocket: Scrape a website with Python, Scrapy, and MongoDB - Comprehensive tutorial covering Scrapy project setup, spider creation, MongoDB pipeline integration, and data storage patterns
  2. Oxylabs: Advanced Web Scraping With Python Tactics in 2025 - Advanced techniques for handling JavaScript, avoiding blocks, CAPTCHAs, async processing, and performance optimization
  3. GeeksforGeeks: Implementing Web Scraping in Python with Scrapy - Framework comparison and Scrapy architecture overview