Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export
This tutorial builds a complete web crawling workflow with Crawlee for Python, from setup to AI-ready output. It generates a local demo website and crawls it with BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler. Along the way it extracts titles, metadata, product fields, and JavaScript-rendered cards, captures full-page screenshots, normalizes the data, builds a link graph, and exports the results as JSON, CSV, and RAG-ready JSONL chunks.
HTTP-first crawling strategy
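We start with the HTTP crawlers (BeautifulSoupCrawler and ParselCrawler) because they fetch and parse raw HTML without launching a browser, which keeps them lightweight and fast. Browser crawling with PlaywrightCrawler is reserved for pages that need JavaScript rendering.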
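One way to decide which pages fall into the second group is a quick check on the HTML that the HTTP crawler already fetched. The sketch below is an illustration only, not part of Crawlee: the helper name needs_browser and the 200-character threshold are assumptions, and it uses just the standard library. It flags a page as a browser candidate when the page loads scripts but carries very little visible text.

```python
from html.parser import HTMLParser


class _TextAndScriptCounter(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would see, not script or style bodies.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def needs_browser(html: str, min_text_chars: int = 200) -> bool:
    """Hypothetical heuristic: scripts present but almost no static text.

    The threshold is an assumed default and should be tuned per site.
    """
    counter = _TextAndScriptCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


static_page = "<html><body><h1>Widget</h1><p>" + "Plain text. " * 30 + "</p></body></html>"
js_page = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(needs_browser(static_page))
print(needs_browser(js_page))
```

A heuristic like this can misclassify pages, so it is best treated as a first filter: pages it flags can be re-queued for the browser crawler, while everything else stays on the cheaper HTTP path.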
Core extraction fields
Each crawler extracts URL, title, page type, text summary, outgoing links, and page-specific metadata.
from crawlee.crawlers import BeautifulSoupCrawler
crawler = BeautifulSoupCrawler(max_requests_per_crawl=20)  # stop after 20 requests
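To make the shared field set concrete, here is a minimal sketch of a record type that holds the six fields listed above, plus a standard-library extractor that fills it from raw HTML. The names PageRecord and extract_record, the "/product" URL rule for the page type, and the 160-character summary length are all assumptions made for illustration; in the tutorial's pipeline the same fields would be populated inside a Crawlee request handler instead.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class PageRecord:
    """Hypothetical container for the fields every crawler extracts."""
    url: str
    title: str = ""
    page_type: str = "generic"
    text_summary: str = ""
    outgoing_links: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


class _FieldParser(HTMLParser):
    """Collects title text, visible body text, link targets, and named meta tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title_parts: list[str] = []
        self.text_parts: list[str] = []
        self.hrefs: list[str] = []
        self.meta: dict[str, str] = {}
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])
        elif tag == "meta" and attr_map.get("name") and attr_map.get("content") is not None:
            self.meta[attr_map["name"]] = attr_map["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_record(url: str, html: str, summary_chars: int = 160) -> PageRecord:
    """Build a PageRecord from raw HTML (assumed rule: '/product' in URL means product page)."""
    parser = _FieldParser()
    parser.feed(html)
    text = " ".join(parser.text_parts)
    return PageRecord(
        url=url,
        title="".join(parser.title_parts).strip(),
        page_type="product" if "/product" in url else "generic",
        text_summary=text[:summary_chars],
        # Resolve relative hrefs against the page URL so links are absolute.
        outgoing_links=[urljoin(url, href) for href in parser.hrefs],
        metadata=parser.meta,
    )


sample_html = (
    '<html><head><title>Blue Widget</title>'
    '<meta name="description" content="A widget"></head>'
    '<body><p>Price: 9.99</p><a href="/about">About</a>'
    '<script>var x = 1;</script></body></html>'
)
record = extract_record("http://localhost:8000/product/1", sample_html)
print(record)
```

Keeping every crawler's output in one record shape is what makes the later normalization, link graph, and export steps straightforward, since each of them can rely on the same field names regardless of which crawler produced the page.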
Next: advanced routing