A new CLI tool helps Python developers separate SQL queries from their code by finding, formatting, and extracting embedded SQL from triple-quoted strings, addressing a common maintenance challenge in data engineering workflows.
Many Python developers working with data know the pain point: long SQL queries embedded directly within Python code as triple-quoted strings. This pattern, while convenient, creates challenges for code maintainability, testing, and version control. A new open-source tool called py-sql-cleaner aims to solve this problem by providing a way to identify, format, and extract these embedded SQL queries into separate files.
py-sql-cleaner is a command-line interface tool designed specifically for Python codebases where SQL queries are written directly inside triple-quoted strings. The project, currently in MVP stage, uses SQLGlot internally for SQL formatting and offers several practical features for improving code organization.
The tool addresses a common pattern in data engineering and data science workflows where SQL queries are embedded directly in Python code. While this approach works for simple scripts, it becomes problematic as projects grow. Maintaining consistent formatting, testing SQL queries independently, and tracking changes to complex queries becomes increasingly difficult when SQL is mixed with Python logic.
"py-sql-cleaner is built for Python codebases where long SQL queries are written directly inside triple-quoted Python strings," explains the project's README. "It can format that SQL in place, or extract it into an external .sql file."
Key capabilities include:
- Formatting SQL embedded in Python triple-quoted strings
- Extracting embedded SQL into external .sql files
- Replacing embedded SQL strings with file references
- Detecting common SQL variable names such as sql, query, *_sql, and *_query
- Skipping unsafe blocks, including f-strings, Jinja-like templates, and runtime placeholders
- Supporting explicit dialect selection with --dialect/-d flags
- Providing check mode for CI integration
- Offering dry-run mode before rewriting files
One of the tool's strengths is its conservative approach to safety. By default, it skips potentially problematic code blocks rather than attempting to rewrite them. This prevents unintended changes to runtime behavior, which is crucial when dealing with parameterized queries or template variables that are resolved at runtime.
"py-sql-cleaner is conservative by default: f-strings, Jinja-like templates, and runtime placeholders are detected but skipped instead of being rewritten," the documentation explains. "Preserving runtime behavior is more important than formatting every SQL-looking string."
The tool currently supports Python triple-quoted strings and uses SQLGlot for formatting, defaulting to SQLGlot's generic dialect. It also supports database-specific dialects that have been explicitly enabled, including MySQL, PostgreSQL, and Redshift. For Redshift, the tool preserves command-style statements such as COPY and UNLOAD to avoid changing load/export options.
Installation is straightforward, with options to install via pip, pipx, or run directly without installation using uvx. Once installed, developers can use commands like list, format, extract, and check to manage embedded SQL in their codebases.
The project represents a practical solution to a common problem in Python data workflows. By separating SQL logic from Python code, teams can improve maintainability, enable better testing practices, and potentially reuse SQL queries across different parts of their codebase.
As data engineering workflows continue to evolve, tools like py-sql-cleaner that help maintain clean separation between different components of a data pipeline will likely become increasingly valuable. The project's conservative approach to safety and its focus on practical functionality make it a promising addition to the Python data ecosystem.
Currently, the project is seeking contributions and feedback as it continues development. The roadmap includes expanding support for additional SQL dialects and potentially adding more sophisticated analysis capabilities for embedded SQL queries.

Comments
Please log in or register to join the discussion