💻 Coding Prompt
Claude for Full Stack Devs: Build a Web Scraper with Error Handling
Advanced Claude prompts for Full Stack Developers building robust web scrapers with proper error handling
The Prompt
You are an expert full stack developer with 11 years of experience building production-grade web scraping systems for data-driven companies. In those companies, a scraper that fails silently, loses data, or crashes without recovery is more dangerous than no scraper at all, and robust error handling is the difference between a reliable data pipeline and a maintenance burden that consumes more engineering time than it saves. Help me write a web scraper with comprehensive error handling so I can reduce production bugs and ship a scraper that runs unattended in production, handles every failure mode gracefully, and produces structured logs that let the engineering team debug failures without re-running the scraper.
My situation:
- Scraping target and data type: [e.g., "a publicly available e-commerce competitor pricing page — scraping product name, SKU, current price, and stock status for approximately 3,400 products across 68 category pages — the target site uses JavaScript rendering via React"]
- Language and framework: [e.g., "Python 3.11 with Playwright for browser automation and BeautifulSoup4 for HTML parsing — the scraper runs on a schedule in a Docker container on AWS ECS — output is written to a PostgreSQL database"]
- Current error handling state: [e.g., "the current scraper has a bare try-except that catches all exceptions and logs a single 'scrape failed' message with no context — when the scraper fails it exits silently, leaving partial data in the database with no indication of how many products were processed before the failure"]
- Failure modes that must be handled: [e.g., "rate limiting (HTTP 429 responses), JavaScript rendering timeout, element not found (the site changes its HTML structure periodically), network timeout, database write failure, and anti-bot detection (the site returns a CAPTCHA page for requests that exceed a certain rate)"]
- Recovery requirement: [e.g., "the scraper must be resumable — if it fails mid-run, the next scheduled run must skip products already processed in the current run rather than starting from scratch and creating duplicate records"]
- Logging and alerting requirement: [e.g., "structured JSON logs to CloudWatch with a per-product log entry covering the product SKU, the processing result (success, skipped, failed), the failure type if applicable, and the timestamp — a Slack alert must fire if more than 5% of products in a run fail"]
- Performance constraint: [e.g., "the scraper must complete a full run of 3,400 products within 4 hours — the current implementation without rate limiting takes 2.5 hours and occasionally triggers rate limiting that causes it to fail mid-run"]
Deliver:
1. A scraper architecture overview — a module breakdown covering the four components (page fetcher, HTML parser, data writer, and run state manager) with the responsibility of each component, the data flow between them, and the error isolation principle (a failure in one component must not silently affect another)
2. A rate-limiting-aware page fetcher — a Python class with exponential backoff retry logic for HTTP 429 responses (initial wait 30 seconds, doubling to a maximum of 8 minutes), a configurable request delay between product pages (minimum 2 seconds, jittered ±0.5 seconds), and a CAPTCHA detection method that identifies the anti-bot page by a known HTML marker and raises a specific CAPTCHADetectedError rather than a generic exception
3. A structured exception hierarchy — a custom exception class tree for the scraper covering ScraperBaseError, NetworkError, ParseError, DatabaseWriteError, RateLimitError, CAPTCHADetectedError, and ElementNotFoundError — each with a message template, a severity level, and a boolean flag indicating whether the error is retryable
4. A resumable run state manager — a Python class that writes the scrape run state to a PostgreSQL run_log table (run_id, started_at, last_processed_sku, products_processed, products_failed, status), reads the last incomplete run on startup to resume from the last successful product, and marks the run as complete or failed with a summary at the end of the run
5. A structured logging implementation — a Python logging configuration that outputs JSON-formatted log entries to stdout (for CloudWatch capture) covering the per-product log fields specified, plus run-level summary logs at start and end of each run, and a log-level filter that writes DEBUG-level entries only in the development environment
6. A Slack alerting module — a Python function that posts a Slack webhook message when the run failure rate exceeds 5%, covering the run summary (total products, success count, failure count, failure rate), a breakdown of failure types by exception class, and the CloudWatch log link for the failing run — triggered by the run state manager at run completion
7. A test suite for the error handling layer — seven pytest test cases covering the exponential backoff retry behavior (mock HTTP 429 response sequence), the CAPTCHA detection trigger, the resumable run state restoration from a partial previous run, the database write failure isolation (a write failure on one product does not abort the full run), the 5% failure threshold Slack alert trigger, the ElementNotFoundError fallback to a default value rather than a run abort, and the run state completion marking on a successful full run
**Write every code component as production-ready Python with correct type hints, docstrings, and PEP 8 compliance — the scraper will be reviewed by a senior engineer before deployment and must meet the same code quality bar as the rest of the production codebase, not the 'good enough for a script' standard that one-off scrapers are typically written to.**
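For a sense of what deliverable 3 should produce, here is a minimal sketch of the exception hierarchy. The class names, the retryable flag, and the severity requirement come from the prompt itself; the Severity values, the message templates, the keyword-context constructor, and the nesting of RateLimitError under NetworkError and ElementNotFoundError under ParseError are illustrative assumptions, not the one right design.

```python
from enum import Enum


class Severity(Enum):
    """Illustrative severity levels; align these with your alerting tiers."""

    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"


class ScraperBaseError(Exception):
    """Base class for every scraper failure.

    Subclasses override the class attributes so retry loops and loggers
    can act on the exception type alone, with no string matching.
    """

    message_template: str = "scraper error: {detail}"
    severity: Severity = Severity.ERROR
    retryable: bool = False

    def __init__(self, **context: object) -> None:
        self.context = context  # structured fields for the JSON logger
        super().__init__(self.message_template.format(**context))


class NetworkError(ScraperBaseError):
    message_template = "network failure fetching {url}"
    retryable = True


class RateLimitError(NetworkError):
    message_template = "rate limited (HTTP 429) on {url}"
    severity = Severity.WARNING


class CAPTCHADetectedError(ScraperBaseError):
    """Anti-bot page detected. Never retryable: waiting cannot clear it."""

    message_template = "CAPTCHA page served for {url}"
    severity = Severity.CRITICAL


class ParseError(ScraperBaseError):
    message_template = "failed to parse product data for SKU {sku}"


class ElementNotFoundError(ParseError):
    message_template = "selector {selector} not found for SKU {sku}"


class DatabaseWriteError(ScraperBaseError):
    message_template = "database write failed for SKU {sku}"
    retryable = True
```

A retry loop can then branch on the retryable flag of a caught exception (for example, after `raise ElementNotFoundError(selector=".price", sku=sku)`) instead of matching exception classes one by one.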
💡 How to use this prompt
- Build the resumable run state manager from output item 4 before writing the page fetcher or the error handling. The run state manager is the architectural foundation of the entire scraper: without it, every other component has no way to communicate progress or recovery state, and a mid-run failure requires a full restart. Building the state manager first forces every other component to be designed around resumability rather than retrofitted for it (see the first sketch after this list).
- The most common mistake is implementing the exponential backoff retry logic inside the page fetcher without also implementing the CAPTCHA detection. An exponential backoff that retries on a CAPTCHA page will retry the same CAPTCHA response up to the maximum retry limit, burning the full backoff window on a failure type that cannot be resolved by waiting. The CAPTCHADetectedError must be raised immediately on detection and must not trigger the retry logic (see the second sketch after this list).
- Claude outperforms ChatGPT on this task because it maintains the custom exception hierarchy and the structured logging schema consistently across the page fetcher, the run state manager, and the test suite without using generic Exception catches in the generated code. Use Claude for the full scraper implementation, then paste individual modules into ChatGPT if you need faster iteration on the Slack alerting or the logging configuration.
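As a concrete reference for the first point above, a minimal sketch of a resumable run state manager follows. The run_log columns match deliverable 4; the psycopg2 driver, the running/complete/failed status values, and the per-product checkpoint granularity are assumptions to adapt to your own stack.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

import psycopg2  # assumption: psycopg2 driver; swap in your own DB layer


@dataclass
class RunState:
    run_id: str
    last_processed_sku: Optional[str]
    products_processed: int
    products_failed: int


class RunStateManager:
    """Persists scrape progress to the run_log table (schema from the
    prompt) so a failed run can be resumed instead of restarted."""

    def __init__(self, conn: "psycopg2.extensions.connection") -> None:
        self._conn = conn

    def start_or_resume(self) -> RunState:
        """Resume the most recent incomplete run, or open a new one."""
        with self._conn.cursor() as cur:
            cur.execute(
                "SELECT run_id, last_processed_sku, products_processed,"
                " products_failed FROM run_log WHERE status = 'running'"
                " ORDER BY started_at DESC LIMIT 1"
            )
            row = cur.fetchone()
            if row is not None:
                return RunState(*row)
            run_id = str(uuid.uuid4())
            cur.execute(
                "INSERT INTO run_log (run_id, started_at, status)"
                " VALUES (%s, %s, 'running')",
                (run_id, datetime.now(timezone.utc)),
            )
        self._conn.commit()
        return RunState(run_id, None, 0, 0)

    def record_product(self, state: RunState, sku: str, failed: bool) -> None:
        """Checkpoint after every product so a crash loses at most one SKU."""
        state.last_processed_sku = sku
        state.products_processed += 1
        state.products_failed += int(failed)
        with self._conn.cursor() as cur:
            cur.execute(
                "UPDATE run_log SET last_processed_sku = %s,"
                " products_processed = %s, products_failed = %s"
                " WHERE run_id = %s",
                (sku, state.products_processed, state.products_failed,
                 state.run_id),
            )
        self._conn.commit()

    def finish(self, state: RunState, ok: bool) -> None:
        """Mark the run complete or failed; the alerting hook fires here."""
        with self._conn.cursor() as cur:
            cur.execute(
                "UPDATE run_log SET status = %s WHERE run_id = %s",
                ("complete" if ok else "failed", state.run_id),
            )
        self._conn.commit()
```

The main loop then calls start_or_resume() once, skips products already covered by last_processed_sku, calls record_product() after each SKU, and calls finish() in a finally block.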
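And for the second point, here is a sketch of the short-circuit ordering: the CAPTCHA check runs before the 429 retry branch, so the non-retryable failure never enters the backoff loop. The timing constants are the ones specified in deliverable 2; the fetch callable and the captcha_marker default are hypothetical stand-ins (the real fetcher would wrap the Playwright page load), and the exception classes are the ones sketched after the prompt.

```python
import random
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    url: str,
    fetch: Callable[[str], Tuple[int, str]],
    captcha_marker: str = "g-recaptcha",  # hypothetical site-specific marker
    max_wait: float = 480.0,  # 8-minute backoff cap per deliverable 2
) -> str:
    """Fetch one page, retrying HTTP 429 with exponential backoff but
    raising immediately when the anti-bot CAPTCHA page is served.

    `fetch` is a hypothetical raw-fetch callable returning (status, body);
    in the real scraper it would wrap the Playwright page load.
    """
    wait = 30.0  # initial backoff per deliverable 2
    while True:
        # Base delay chosen so the +/-0.5 s jitter never drops below
        # the 2-second minimum between product pages.
        time.sleep(2.5 + random.uniform(-0.5, 0.5))
        status, body = fetch(url)
        if captcha_marker in body:
            # Raise BEFORE the retry branch: a CAPTCHA cannot be cleared
            # by waiting, so retrying it only burns the backoff window.
            raise CAPTCHADetectedError(url=url)
        if status == 429:
            if wait > max_wait:
                raise RateLimitError(url=url)
            time.sleep(wait)
            wait *= 2.0  # 30 s -> 60 s -> ... -> capped at 480 s
            continue
        return body
```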
About This Coding AI Prompt
This free Coding prompt is designed for Claude and works with any modern AI assistant, including ChatGPT, Gemini, and more. Simply copy the prompt above, paste it into your preferred AI tool, and customize the bracketed sections to fit your specific needs.
Coding prompts like this one help you get better, more consistent results from AI tools. Instead of starting from scratch every time, you can use this tested prompt as a foundation and adapt it to your workflow.