Concepts
Overview
Understanding clgraph comes down to three core concepts:
- How we build the graph - From your SQL files to complete lineage
- What you can do with it - Table dependencies, pipeline execution, and orchestration
- How to document it - Automatic metadata extraction from SQL comments
These concepts work together to give you complete control over your data pipelines.
Core Concepts
From SQL to Lineage Graph
How clgraph works under the hood.
You write SQL files. We parse them into an AST (Abstract Syntax Tree) and extract everything:
- Tables and dependencies
- Columns and how they flow
- Transformations (SUM, JOIN, CASE, etc.)
- Complete lineage graph
from clgraph import Pipeline
# Point to your SQL
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")
# Everything is in the Pipeline
# - pipeline.table_graph # Table dependencies
# - pipeline.columns # Column-level lineage
# - pipeline.edges # Lineage relationships
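To make the idea of a lineage graph concrete, here is a minimal sketch of column-level lineage as a list of edges, plus a helper that walks back from a column to everything it was derived from. It is an illustration only and not clgraph's internal representation: the `Edge` class, the `trace_upstream` helper, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str          # upstream column, written as "table.column"
    target: str          # column derived from the source
    transformation: str  # e.g. "SUM", "JOIN", "CASE", or "direct"


# Hypothetical edges for a small two-step pipeline.
EDGES = [
    Edge("raw_orders.amount", "staging_orders.amount", "direct"),
    Edge("staging_orders.amount", "user_revenue.total", "SUM"),
    Edge("raw_orders.user_id", "staging_orders.user_id", "direct"),
    Edge("staging_orders.user_id", "user_revenue.user_id", "direct"),
]


def trace_upstream(column: str, edges: list[Edge]) -> list[Edge]:
    """Collect every edge that contributes to `column`, nearest first."""
    found = []
    frontier = [column]
    seen = {column}
    while frontier:
        current = frontier.pop(0)
        for edge in edges:
            if edge.target == current:
                found.append(edge)
                if edge.source not in seen:
                    seen.add(edge.source)
                    frontier.append(edge.source)
    return found


for edge in trace_upstream("user_revenue.total", EDGES):
    print(f"{edge.source} -> {edge.target} ({edge.transformation})")
```

Walking the edges in the other direction (from source to target) gives the set of columns affected by a change, which is the basis of impact analysis.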
What you'll learn:
- The three-step process: SQL → AST → Graph
- What gets extracted from your SQL
- How to query the lineage graph
- Why the complete graph matters
Table Lineage & Orchestration
What you can do with the graph.
Once you have the graph, you can:
- Understand dependencies - Which tables depend on which
- Execute pipelines - Run SQL in topologically sorted order
- Generate Airflow DAGs - Automatic task generation with dependencies
- Split large pipelines - Create subpipelines by schedule or team
# Execute locally
results = pipeline.run(executor=my_executor, max_workers=4)
# Or generate Airflow DAG
dag = pipeline.to_airflow_dag(
    executor=my_executor,
    dag_id="my_pipeline",
    schedule="@daily"
)
# Or split by schedule
subpipelines = pipeline.split(
    sinks=[["realtime_tables"], ["daily_tables"]]
)
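"Topologically sorted order" means every table is built only after the tables it reads from. The sketch below shows the idea with Python's standard-library `graphlib`; it is not clgraph's scheduler, and the table names in `dependencies` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical table dependencies: each table maps to the tables it reads from.
dependencies = {
    "raw_orders": set(),
    "raw_users": set(),
    "staging_orders": {"raw_orders"},
    "user_revenue": {"staging_orders", "raw_users"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle

batches = []
while sorter.is_active():
    # Tables whose inputs have all been built already.
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)

print(batches)
# [['raw_orders', 'raw_users'], ['staging_orders'], ['user_revenue']]
```

Tables in the same batch do not depend on each other, so in principle they can run concurrently. The same ordering is what an orchestrator needs in order to wire task dependencies.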
What you'll learn:
- Table lineage and dependency tracking
- Synchronous and asynchronous execution
- Airflow integration
- Pipeline splitting strategies
Metadata from Comments
Document your data where you define it.
Add structured metadata directly in your SQL comments:
SELECT
    user_id,  -- User identifier [pii: false]
    email,    -- Email address [pii: true, owner: data-team]
    SUM(revenue) as total  /* Total revenue [tags: metric finance] */
FROM users
GROUP BY user_id, email
clgraph automatically extracts:
- Column descriptions
- PII flags (with automatic propagation)
- Ownership information
- Tags and custom metadata
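As a rough illustration of what this extraction involves, the sketch below splits a comment into a description and bracketed `key: value` fields. It handles only the comment shapes shown in the SQL example above and is not clgraph's parser; the `parse_comment` function is hypothetical, and the full comment grammar may differ.

```python
import re

# "description [key: value, key: value]" with the bracketed part at the end.
COMMENT_WITH_FIELDS = re.compile(r"(.*?)\s*\[([^\]]*)\]")


def parse_comment(comment: str) -> dict:
    """Split a column comment into a description and metadata fields."""
    text = comment.strip()
    # Strip the delimiters of "-- ..." and "/* ... */" comments.
    if text.startswith("--"):
        text = text[2:]
    elif text.startswith("/*") and text.endswith("*/"):
        text = text[2:-2]
    text = text.strip()

    match = COMMENT_WITH_FIELDS.fullmatch(text)
    if match is None:
        # No bracketed metadata: the whole comment is the description.
        return {"description": text}

    description, fields = match.groups()
    result = {"description": description}
    for field in fields.split(","):
        if ":" not in field:
            continue
        key, value = (part.strip() for part in field.split(":", 1))
        if key == "pii":
            result[key] = value.lower() == "true"
        elif key == "tags":
            result[key] = value.split()  # tags are space-separated
        else:
            result[key] = value
    return result


print(parse_comment("-- Email address [pii: true, owner: data-team]"))
print(parse_comment("/* Total revenue [tags: metric finance] */"))
```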
# Access metadata via Pipeline API
email_col = pipeline.columns["query.email"]
print(email_col.description) # "Email address"
print(email_col.pii) # True
print(email_col.owner) # "data-team"
# Find columns by metadata
pii_columns = pipeline.get_pii_columns()
metrics = pipeline.get_columns_by_tag("metric")
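Propagation means a PII flag declared on one column follows the data into every column derived from it. The sketch below assumes the simplest possible rule (a column is PII if anything upstream of it is PII); clgraph's actual propagation rules may be more nuanced. The `propagate_pii` function and the column names are hypothetical.

```python
# Hypothetical lineage edges: source column -> columns derived from it.
edges = {
    "users.email": ["staging_contacts.email"],
    "staging_contacts.email": ["outreach_report.contact"],
    "users.user_id": ["outreach_report.user_id"],
}

# Columns flagged directly in SQL comments with [pii: true].
declared_pii = {"users.email"}


def propagate_pii(edges: dict[str, list[str]], declared: set[str]) -> set[str]:
    """Mark every column downstream of a declared PII column as PII."""
    flagged = set(declared)
    stack = list(declared)
    while stack:
        column = stack.pop()
        for derived in edges.get(column, []):
            if derived not in flagged:
                flagged.add(derived)
                stack.append(derived)
    return flagged


print(sorted(propagate_pii(edges, declared_pii)))
# ['outreach_report.contact', 'staging_contacts.email', 'users.email']
```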
What you'll learn:
- Comment format and syntax
- Supported metadata fields
- Metadata propagation rules
- Integration with LLM documentation
- Best practices
Limitations
We made opinionated decisions to keep clgraph simple and effective.
clgraph is not designed to replace runtime-based lineage systems. Instead, we provide a lightweight alternative that covers most common use cases without the operational overhead. For teams that need runtime statistics, query profiling, or dynamic execution tracking, runtime-based systems remain the better choice.
Runtime Information is Not Tracked
clgraph performs static analysis on SQL files. This means the following are not captured:
- ❌ Unresolved template variables - SELECT * FROM ${table_name} where the variable isn't resolved before analysis
- ❌ Database-side dynamic SQL - SQL generated during execution (stored procedures, database templating)
- ❌ Query statistics - Execution times, row counts, data volumes
This is an intentional design decision.
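One way to work within the template-variable limitation is to resolve placeholders yourself and check that none remain before handing the SQL to a static analyzer. The sketch below does this for the `${name}` style shown above. The `render` and `unresolved_variables` helpers are hypothetical and not part of clgraph.

```python
import re

# Matches ${name} placeholders.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def unresolved_variables(sql: str) -> list[str]:
    """Return the placeholder names still present in a SQL string."""
    return PLACEHOLDER.findall(sql)


def render(sql: str, values: dict[str, str]) -> str:
    """Substitute known placeholders and leave unknown ones untouched."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), sql)


template = "SELECT * FROM ${table_name} WHERE ds = '${run_date}'"
resolved = render(template, {"table_name": "analytics.users"})

print(resolved)
# SELECT * FROM analytics.users WHERE ds = '${run_date}'
print(unresolved_variables(resolved))
# ['run_date']
```

A check like this can run in CI so that queries with leftover placeholders are caught before lineage is built.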
Static analysis provides significant advantages:
- ✅ Development-time feedback - Catch issues before deployment
- ✅ Fast graph building - No database execution required
- ✅ Version control friendly - Lineage changes tracked in git
- ✅ Pre-deployment validation - Test lineage locally without infrastructure
For most teams, static lineage provides 95% of the value at 5% of the complexity.
Star Notation Requires Database Connection
When your SQL uses SELECT * or SELECT table.*, clgraph needs schema information to expand the columns:
# Requires database connection for star expansion
pipeline = Pipeline.from_sql_files(
    "queries/",
    dialect="bigquery",
    credentials="path/to/credentials.json"
)
Without a connection, each star is tracked as a single * column in the lineage graph. This is the only case where clgraph needs database access.
Why? Star notation doesn't explicitly list columns in SQL, so we query the database schema to resolve them.
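The difference between the two cases can be sketched with a plain schema lookup. This is an illustration of the behavior described above, not clgraph's implementation: the `SCHEMAS` catalog and the `expand_star` function are hypothetical stand-ins for the schema information a database connection would provide.

```python
# Hypothetical schema catalog: table name -> ordered column names.
SCHEMAS = {
    "users": ["user_id", "email", "created_at"],
}


def expand_star(table: str, schemas: dict[str, list[str]]) -> list[str]:
    """Expand `SELECT table.*` into qualified columns when the schema is known."""
    columns = schemas.get(table)
    if columns is None:
        # Without schema information the star stays a single placeholder column.
        return [f"{table}.*"]
    return [f"{table}.{name}" for name in columns]


print(expand_star("users", SCHEMAS))
# ['users.user_id', 'users.email', 'users.created_at']
print(expand_star("orders", SCHEMAS))
# ['orders.*']
```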
Quick Navigation
New to clgraph? Start with From SQL to Lineage Graph to understand how it works.
Ready to use it? Jump to Table Lineage & Orchestration to learn execution and deployment.
Need to document? Read Metadata from Comments to learn automatic metadata extraction.
Want to try it? Head to Quick Start for hands-on examples.
The Big Picture
graph LR
A[Your SQL Files] -->|Parse| B[AST]
B -->|Extract| C[Lineage Graph]
C --> D[Table Dependencies]
C --> E[Column Lineage]
D --> F[Pipeline Execution]
D --> G[Airflow DAGs]
E --> H[Metadata Propagation]
E --> I[Impact Analysis]
E --> J[LLM Documentation]
Your SQL already describes the lineage. We extract it. You use it.
Next Steps
- Why We Built This - Our philosophy and approach
- From SQL to Lineage Graph - Understand the foundation
- Table Lineage & Orchestration - Learn what you can do
- Metadata from Comments - Document your data
- Quick Start - Try it yourself
- API Documentation - Full reference