💻 Coding Prompt
ChatGPT for Data Engineers: Refactor Legacy ETL Code to Cut Bug Rate
An intermediate ChatGPT prompt for Data Engineers refactoring legacy ETL code to reduce the production bug rate
The Prompt
You are a senior data engineering specialist with 10 years of experience refactoring legacy ETL pipelines for startup data teams, the kind of team where technical debt has accumulated faster than it can be addressed and where a high production bug rate in data pipelines causes downstream analytics errors that erode trust in the data product across the organization. Help me refactor the legacy ETL code so I can build features without starting from scratch. The goal is a refactored pipeline codebase that reduces the production bug rate, is testable without running against live data sources, and can be extended by a mid-level data engineer without requiring senior review of every change.
My situation:
- Pipeline stack and scale: [e.g., "a Python-based ETL pipeline running on Apache Airflow 2.4 — the pipeline ingests data from 6 sources (3 REST APIs, 2 PostgreSQL databases, 1 S3 bucket), transforms and joins the data, and loads to a Snowflake data warehouse — approximately 14,000 lines of DAG and transformation code"]
- Legacy code problems: [e.g., "the transformation logic is written as 800-line DAG files with no function separation, all database credentials are hardcoded in DAG files, there are no unit tests, and the pipeline has no retry logic — failed tasks leave partial data in the staging tables with no cleanup"]
- Production bug rate: [e.g., "an average of 8 production bugs per sprint — the top three categories are: wrong data type coercions that silently convert nulls to zeros (4 per sprint), missing handling for API response schema changes (2 per sprint), and manual SQL queries that are modified incorrectly during updates (2 per sprint)"]
- Team capacity: [e.g., "one senior data engineer and two mid-level data engineers — the senior engineer currently spends 60% of their time debugging production bugs rather than building new pipeline features"]
- Test environment: [e.g., "a staging Airflow environment exists but is configured identically to production — there is no mock data layer, so testing requires running against real data sources and produces real records in the staging Snowflake tables"]
- Refactoring constraint: [e.g., "the pipeline must remain operational during refactoring — the refactoring must be done in incremental, independently deployable stages rather than a full rewrite that requires a freeze period"]
- Priority metric: [e.g., "reducing the null-to-zero silent conversion bug rate from 4 per sprint to 0 is the highest-priority outcome — this bug category has caused two incorrect executive dashboards in the last month"]
Deliver:
1. A refactoring roadmap — a six-stage incremental refactoring plan covering stage 1 (extract credentials to environment variables and a secrets manager), stage 2 (extract transformation logic from DAG files into standalone Python modules), stage 3 (introduce a mock data layer for unit testing without live sources), stage 4 (add type coercion validation with explicit null handling), stage 5 (add retry logic and staging table cleanup on failure), and stage 6 (add API response schema validation with alerting) — each stage independently deployable without breaking the running pipeline
2. A type coercion validation module — a Python module with a validate_and_coerce function that accepts a DataFrame and a schema definition (field name, expected type, null handling rule) and raises a TypeCoercionError with the field name, the received type, and the invalid value rather than silently converting — plus a unit test suite for the five null handling scenarios that currently produce silent zero conversions
3. A mock data layer specification — a Python fixture system for pytest that provides consistent mock DataFrames for each of the six data sources, covering the happy path schema, a schema with one unexpected new field (simulating an API schema change), a schema with null values in required fields, and a schema with type mismatches — enabling all transformation unit tests to run without network access or Snowflake credentials
4. A DAG file refactoring template — a before-and-after example showing an 800-line monolithic DAG file refactored into a DAG file (under 80 lines, containing only task definitions and dependencies) and a separate transformation module (containing all business logic as testable functions with type annotations and docstrings) — with the import structure and the Airflow task decorator pattern to follow
5. An API response schema validation module — a Python module using Pydantic that validates each API response against a defined schema before passing it to the transformation layer, raises an APISchemaChangedError with the unexpected fields and missing fields listed when the schema does not match, and sends an alert to the team's Slack channel with the schema diff — preventing the silent data corruption that occurs when an API adds or renames a field
6. A staging table cleanup Airflow operator — a custom Airflow operator that runs on task failure and deletes any records inserted into staging tables during the failed task's execution window (using the task run ID as a partition key), preventing partial data accumulation in staging tables that currently requires manual cleanup after each failure
7. A refactoring progress dashboard — a set of weekly tracking metrics covering production bugs per sprint by category (type coercion, schema change, SQL error), test coverage percentage by pipeline module, and the percentage of DAG files fully refactored to the new module structure — giving the senior data engineer a weekly measure of whether the refactoring is reducing the bug rate as each stage is deployed
**Write the type coercion module and the API schema validation module as production-ready Python with full type annotations, pytest-compatible test fixtures, and docstrings that explain the business rule behind each validation — the mid-level data engineers who will maintain these modules must be able to extend the schema definitions and add new null handling rules without senior review.**
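Before running the prompt, it helps to know roughly what shape the code deliverables should take so you can judge whether the model's output is on track. The sketches below show one plausible form for output items 2 through 6; every function name, schema field, module path, and table column in them is an illustrative assumption rather than something the prompt mandates.

For item 2, a minimal sketch of the validate_and_coerce idea, assuming pandas DataFrames and a small dataclass-based schema definition:

```python
from dataclasses import dataclass
from typing import Literal, Optional

import pandas as pd


class TypeCoercionError(ValueError):
    """Raised instead of silently coercing a value (e.g. null -> 0)."""


@dataclass(frozen=True)
class FieldSpec:
    """One column's contract: name, target dtype, and what to do with nulls."""
    name: str
    dtype: str                                     # e.g. "int64", "float64", "string"
    null_rule: Literal["forbid", "allow", "fill"] = "forbid"
    fill_value: Optional[object] = None


def validate_and_coerce(df: pd.DataFrame, schema: list[FieldSpec]) -> pd.DataFrame:
    """Validate df against schema, raising instead of silently converting.

    Business rule: a null in a 'forbid' column is a data quality incident,
    never a zero.
    """
    out = df.copy()
    for spec in schema:
        if spec.name not in out.columns:
            raise TypeCoercionError(f"missing field '{spec.name}'")
        col = out[spec.name]
        nulls = col.isna()
        if nulls.any() and spec.null_rule == "forbid":
            raise TypeCoercionError(
                f"field '{spec.name}': {int(nulls.sum())} null value(s) "
                f"(received dtype {col.dtype}); nulls are forbidden here"
            )
        if nulls.any() and spec.null_rule == "fill":
            col = col.fillna(spec.fill_value)
        try:
            out[spec.name] = col.astype(spec.dtype)
        except (TypeError, ValueError) as exc:
            raise TypeCoercionError(
                f"field '{spec.name}': cannot coerce dtype {col.dtype} "
                f"to {spec.dtype}: {exc}"
            ) from exc
    return out
```

The design point to look for in the generated module is the same as here: a null in a forbidden field raises instead of coercing, which is exactly the failure mode behind the null-to-zero dashboard errors.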
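For item 3, a sketch of the pytest fixture layer for one hypothetical REST source; the full system would need equivalent fixtures for all six sources and all four schema scenarios, and the pipeline.validation import path is a placeholder:

```python
import pandas as pd
import pytest

# Placeholder import path; points at wherever the item 2 module lives in your repo.
from pipeline.validation import FieldSpec, TypeCoercionError, validate_and_coerce


@pytest.fixture
def orders_api_happy_path() -> pd.DataFrame:
    """Happy-path schema for one hypothetical REST source."""
    return pd.DataFrame(
        {"order_id": [1, 2], "amount": [19.99, 5.00], "currency": ["USD", "USD"]}
    )


@pytest.fixture
def orders_api_new_field(orders_api_happy_path: pd.DataFrame) -> pd.DataFrame:
    """Same source with one unexpected new column, simulating an API schema change."""
    df = orders_api_happy_path.copy()
    df["loyalty_tier"] = ["gold", "silver"]  # column the pipeline has never seen
    return df


@pytest.fixture
def orders_api_null_required(orders_api_happy_path: pd.DataFrame) -> pd.DataFrame:
    """Nulls in a required field, the scenario behind the null-to-zero bugs."""
    df = orders_api_happy_path.copy()
    df.loc[0, "amount"] = float("nan")
    return df


def test_null_amount_is_rejected(orders_api_null_required: pd.DataFrame) -> None:
    """A transformation test that needs no network access or Snowflake credentials."""
    with pytest.raises(TypeCoercionError):
        validate_and_coerce(
            orders_api_null_required,
            [FieldSpec(name="amount", dtype="float64", null_rule="forbid")],
        )
```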
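For item 4, a compressed before-and-after sketch of the thin-DAG split, using the Airflow 2.x TaskFlow API; both files are shown in one block, and the module paths, task names, and transformation logic are placeholders:

```python
# transformations/orders.py: pure business logic, unit-testable without Airflow
import pandas as pd


def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop cancelled orders and normalise the currency column."""
    kept = raw[raw["status"] != "cancelled"].copy()
    kept["currency"] = kept["currency"].str.upper()
    return kept


# dags/orders_dag.py: only task definitions and dependencies
# (in the real split this file would start with
#  `from transformations.orders import clean_orders`)
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline() -> None:
    @task
    def extract() -> list[dict]:
        # Stub: call the source API and return plain records.
        return []

    @task
    def transform(records: list[dict]) -> list[dict]:
        return clean_orders(pd.DataFrame(records)).to_dict("records")

    transform(extract())


orders_pipeline()
```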
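For item 5, a sketch of the validation boundary, assuming Pydantic v2; the OrderRecord fields and the commented-out send_slack_alert hook are hypothetical:

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class APISchemaChangedError(Exception):
    """Raised when an API response no longer matches the expected contract."""


class OrderRecord(BaseModel):
    """Expected shape of one record from the hypothetical orders API."""
    model_config = ConfigDict(extra="forbid")  # unexpected new fields fail loudly

    order_id: int
    amount: float
    currency: str


def validate_response(payload: list[dict]) -> list[OrderRecord]:
    """Validate every record before it reaches the transformation layer."""
    try:
        return [OrderRecord(**record) for record in payload]
    except ValidationError as exc:
        schema_diff = exc.errors()  # lists extra and missing fields per record
        # send_slack_alert(schema_diff)  # hypothetical alerting hook
        raise APISchemaChangedError(f"API response schema changed: {schema_diff}") from exc
```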
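For item 6, a sketch of the staging cleanup idea, written here as a downstream cleanup task that fires only when an upstream task fails (one of several ways to wire failure handling in Airflow); the SnowflakeHook usage, the table name, and the run_id partition column are assumptions about how staging inserts are stamped:

```python
from airflow.models.baseoperator import BaseOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook
from airflow.utils.trigger_rule import TriggerRule


class StagingCleanupOperator(BaseOperator):
    """Delete rows a failed run left behind in a staging table, keyed by run_id."""

    def __init__(self, staging_table: str, snowflake_conn_id: str, **kwargs):
        # Only run when an upstream task in the same DAG run has failed.
        kwargs.setdefault("trigger_rule", TriggerRule.ONE_FAILED)
        super().__init__(**kwargs)
        self.staging_table = staging_table
        self.snowflake_conn_id = snowflake_conn_id

    def execute(self, context):
        run_id = context["run_id"]
        hook = SnowflakeHook(snowflake_conn_id=self.snowflake_conn_id)
        # Assumes every insert into the staging table stamps a run_id column.
        hook.run(
            f"DELETE FROM {self.staging_table} WHERE run_id = %(run_id)s",
            parameters={"run_id": run_id},
        )
```

If the staging inserts are not already stamped with the DAG run_id, adding that column is a prerequisite for this cleanup pattern.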
💡 How to use this prompt
- Implement the type coercion validation module from output item 2 first, before any other refactoring stage. The null-to-zero silent conversion bug has already produced two incorrect executive dashboards in a single month — this is the highest-trust-damage bug category and the most immediately addressable with a targeted module. The validate_and_coerce function can be integrated into the existing transformation code without restructuring the DAG files, making it deployable in a single sprint before the broader refactoring begins.
- The most common mistake is starting the refactoring with the credential extraction stage (stage 1) because it feels safe. While necessary, moving credentials to environment variables does not reduce the production bug rate at all — it is a security improvement, not a reliability improvement. The team will spend one sprint on a change that does not move the bug rate metric, losing momentum before the stages that actually reduce bugs. Start with the type coercion module, then extract credentials in parallel during the same sprint.
- ChatGPT handles this task well and produces clean Python ETL refactoring code quickly. For the full seven-output system including the Pydantic schema validation module and the Airflow custom operator, switch to Claude — it keeps the type safety and the null handling logic consistent across the validation module, the mock data fixtures, and the unit test suite without introducing implicit type coercions in the generated code.
About This Coding AI Prompt
This free Coding prompt is designed for ChatGPT and works with any modern AI assistant, including Claude, Gemini, and more. Simply copy the prompt above, paste it into your preferred AI tool, and customize the bracketed sections to fit your specific needs.
Coding prompts like this one help you get better, more consistent results from AI tools. Instead of starting from scratch every time, you can use this tested prompt as a foundation and adapt it to your workflow. Browse more Coding prompts →