Concepts

Overview

Understanding clgraph comes down to three core concepts:

How we build the graph - From your SQL files to complete lineage
What you can do with it - Table dependencies, pipeline execution, and orchestration
How to document it - Automatic metadata extraction from SQL comments

These concepts work together to give you complete control over your data pipelines.

Core Concepts

From SQL to Lineage Graph

How clgraph works under the hood.

You write SQL files. We parse them into an AST (Abstract Syntax Tree) and extract everything:

Tables and dependencies
Columns and how they flow
Transformations (SUM, JOIN, CASE, etc.)
Complete lineage graph

from clgraph import Pipeline

# Point to your SQL
pipeline = Pipeline.from_sql_files("examples/sql_files/", dialect="bigquery")

# Everything is in the Pipeline
# - pipeline.table_graph  # Table dependencies
# - pipeline.columns      # Column-level lineage
# - pipeline.edges        # Lineage relationships
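
To make the idea of a lineage graph concrete, here is a minimal sketch in plain Python. It does not use clgraph's own classes; the column names and the `build_index` and `reachable` helpers are assumptions made up for illustration. It shows how a list of column-level edges can answer two common questions: where a column comes from, and what a change to it would affect.

```python
from collections import defaultdict, deque

# Hypothetical column-level edges: (source column, target column).
# These names are illustrative only; they are not clgraph output.
EDGES = [
    ("raw.orders.amount", "staging.orders.amount"),
    ("raw.orders.id", "staging.orders.id"),
    ("staging.orders.amount", "analytics.revenue.total"),
]

def build_index(edges):
    """Index edges in both directions for upstream and downstream lookups."""
    upstream, downstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        upstream[dst].add(src)
        downstream[src].add(dst)
    return upstream, downstream

def reachable(start, neighbors):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

upstream, downstream = build_index(EDGES)
sources = reachable("analytics.revenue.total", upstream)   # where it comes from
impacted = reachable("raw.orders.amount", downstream)      # what a change affects
```

Walking the same edges in the opposite direction is all it takes to switch from tracing sources to impact analysis, which is why a complete graph is useful.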

What you'll learn:

The three-step process: SQL → AST → Graph
What gets extracted from your SQL
How to query the lineage graph
Why the complete graph matters

Table Lineage & Orchestration

What you can do with the graph.

Once you have the graph, you can:

Understand dependencies - Which tables depend on which
Execute pipelines - Run SQL in topologically sorted order
Generate Airflow DAGs - Automatic task generation with dependencies
Split large pipelines - Create subpipelines by schedule or team
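
Running SQL in topologically sorted order means every table is built only after the tables it reads from. The sketch below illustrates the idea with Python's standard `graphlib` module; it is not clgraph's scheduler, and the table names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical table dependencies: each table maps to the tables it reads from.
DEPENDENCIES = {
    "staging.orders": {"raw.orders"},
    "staging.users": {"raw.users"},
    "analytics.revenue": {"staging.orders", "staging.users"},
}

def execution_order(dependencies):
    """Return tables ordered so that every table comes after its inputs."""
    return list(TopologicalSorter(dependencies).static_order())

order = execution_order(DEPENDENCIES)
```

Tables with no path between them, such as `staging.orders` and `staging.users` here, do not constrain each other, so a scheduler is free to run them in parallel. `TopologicalSorter` also exposes `prepare()`, `get_ready()`, and `done()` for that style of incremental scheduling.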

# Execute locally
results = pipeline.run(executor=my_executor, max_workers=4)

# Or generate Airflow DAG
dag = pipeline.to_airflow_dag(
    executor=my_executor,
    dag_id="my_pipeline",
    schedule="@daily"
)

# Or split by schedule
subpipelines = pipeline.split(
    sinks=[["realtime_tables"], ["daily_tables"]]
)
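
One way to think about splitting by sinks is that each subpipeline contains its sink tables plus everything they transitively depend on. The following is a rough sketch of that idea in plain Python. The table names and the `upstream_closure` and `split_by_sinks` helpers are hypothetical, and how clgraph itself treats upstream tables shared by several sink groups is not covered here; in this sketch they simply appear in each group that needs them.

```python
# Hypothetical table dependencies: each table maps to the tables it reads from.
DEPENDENCIES = {
    "staging.events": {"raw.events"},
    "realtime.dashboard": {"staging.events"},
    "staging.orders": {"raw.orders"},
    "daily.report": {"staging.orders", "staging.events"},
}

def upstream_closure(sinks, dependencies):
    """Collect the sinks plus every table they transitively read from."""
    needed, stack = set(), list(sinks)
    while stack:
        table = stack.pop()
        if table in needed:
            continue
        needed.add(table)
        stack.extend(dependencies.get(table, ()))
    return needed

def split_by_sinks(sink_groups, dependencies):
    """Build one table set per sink group; shared upstream tables may repeat."""
    return [upstream_closure(group, dependencies) for group in sink_groups]

subsets = split_by_sinks([["realtime.dashboard"], ["daily.report"]], DEPENDENCIES)
```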

What you'll learn:

Table lineage and dependency tracking
Synchronous and asynchronous execution
Airflow integration
Pipeline splitting strategies

Metadata from Comments

Document your data where you define it.

Add structured metadata directly in your SQL comments:

SELECT
  user_id,  -- User identifier [pii: false]
  email,    -- Email address [pii: true, owner: data-team]
  SUM(revenue) as total  /* Total revenue [tags: metric finance] */
FROM users
GROUP BY user_id, email

clgraph automatically extracts:

Column descriptions
PII flags (with automatic propagation)
Ownership information
Tags and custom metadata
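
To illustrate the comment format, here is a small sketch of how a description and a bracketed `[key: value, ...]` block could be pulled out of a single SQL comment. This is not clgraph's parser: `parse_comment` is a hypothetical helper, and the handling of `pii` as a boolean and `tags` as a space-separated list is an assumption based only on the SQL example above.

```python
import re

def parse_comment(comment):
    """Split one SQL comment into (description, metadata dict)."""
    text = comment.strip()
    # Remove the comment delimiters: "-- ..." or "/* ... */".
    if text.startswith("--"):
        text = text[2:]
    elif text.startswith("/*") and text.endswith("*/"):
        text = text[2:-2]
    text = text.strip()

    metadata = {}
    # Look for a trailing "[key: value, key: value]" block.
    match = re.search(r"\[([^\]]*)\]\s*$", text)
    if match:
        for field in match.group(1).split(","):
            key, sep, value = field.partition(":")
            if sep:
                metadata[key.strip()] = value.strip()
        text = text[: match.start()].strip()

    # Assumed interpretation of the fields shown in the example above.
    if "pii" in metadata:
        metadata["pii"] = metadata["pii"].lower() == "true"
    if "tags" in metadata:
        metadata["tags"] = metadata["tags"].split()
    return text, metadata

email_meta = parse_comment("-- Email address [pii: true, owner: data-team]")
total_meta = parse_comment("/* Total revenue [tags: metric finance] */")
```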

# Access metadata via Pipeline API
email_col = pipeline.columns["query.email"]
print(email_col.description)  # "Email address"
print(email_col.pii)          # True
print(email_col.owner)        # "data-team"

# Find columns by metadata
pii_columns = pipeline.get_pii_columns()
metrics = pipeline.get_columns_by_tag("metric")
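
Automatic PII propagation can be pictured as a walk along the lineage edges. The sketch below assumes a simple rule, namely that any column derived from a PII column is itself PII; clgraph's actual propagation rules may differ. The column names and the `propagate_pii` helper are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical column edges (source -> target) and columns flagged in comments.
EDGES = [
    ("users.email", "staging.users.email"),
    ("staging.users.email", "analytics.contacts.email_domain"),
    ("users.user_id", "staging.users.user_id"),
]
FLAGGED = {"users.email"}

def propagate_pii(edges, flagged):
    """Mark every column downstream of a flagged column as PII (assumed rule)."""
    downstream = defaultdict(list)
    for src, dst in edges:
        downstream[src].append(dst)
    pii, queue = set(flagged), deque(flagged)
    while queue:
        column = queue.popleft()
        for nxt in downstream.get(column, ()):
            if nxt not in pii:
                pii.add(nxt)
                queue.append(nxt)
    return pii

pii_columns = propagate_pii(EDGES, FLAGGED)
```

Under this rule a single flag on `users.email` is enough to mark the derived `email_domain` column, while `user_id` and its descendants stay unflagged.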

What you'll learn:

Comment format and syntax
Supported metadata fields
Metadata propagation rules
Integration with LLM documentation
Best practices

Limitations

We made opinionated decisions to keep clgraph simple and effective.

clgraph is not designed to replace runtime-based lineage systems. Instead, we provide a lightweight alternative that covers most common use cases without the operational overhead. For teams that need runtime statistics, query profiling, or dynamic execution tracking, runtime-based systems remain the better choice.

Runtime Information is Not Tracked

clgraph performs static analysis on SQL files. This means the following are not captured:

❌ Unresolved template variables - SELECT * FROM ${table_name} where the variable isn't resolved before analysis
❌ Database-side dynamic SQL - SQL generated during execution (stored procedures, database templating)
❌ Query statistics - Execution times, row counts, data volumes

This is an intentional design decision.

Static analysis provides significant advantages:

✅ Development-time feedback - Catch issues before deployment
✅ Fast graph building - No database execution required
✅ Version control friendly - Lineage changes tracked in git
✅ Pre-deployment validation - Test lineage locally without infrastructure

As a rule of thumb, static lineage gives most teams the bulk of the value at a small fraction of the complexity.
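
Because unresolved template variables are invisible to static analysis, it can help to resolve them, or at least detect them, before handing SQL to any parser. The sketch below is one possible pre-check for `${name}` placeholders like the `${table_name}` example above. It is not part of clgraph, and `unresolved_variables` and `render` are hypothetical helpers.

```python
import re

# Matches ${name} placeholders such as ${table_name}.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")

def unresolved_variables(sql):
    """Return the placeholder names still present in a SQL string."""
    return PLACEHOLDER.findall(sql)

def render(sql, variables):
    """Substitute known variables; raise if any placeholder is left over."""
    rendered = PLACEHOLDER.sub(
        lambda m: variables.get(m.group(1), m.group(0)), sql
    )
    missing = unresolved_variables(rendered)
    if missing:
        raise ValueError(f"Unresolved template variables: {missing}")
    return rendered

resolved_sql = render("SELECT * FROM ${table_name}", {"table_name": "raw.orders"})
```

Failing loudly on leftover placeholders keeps a half-rendered query from silently producing incomplete lineage.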

Star Notation Handling

When your SQL uses SELECT * or SELECT table.*, clgraph handles it in one of two ways, depending on whether the source table's schema is known:

1. Cross-Query Expansion - When reading from a table created by an earlier query in the pipeline, clgraph automatically expands * to the known columns:

from clgraph import Pipeline

queries = [
    ("create_staging", "CREATE TABLE staging.orders AS SELECT id, amount FROM raw.orders"),
    ("copy_all", "CREATE TABLE analytics.orders AS SELECT * FROM staging.orders"),
]
pipeline = Pipeline(queries, dialect="bigquery")
# Query 2's SELECT * is expanded to: id, amount (from Query 1)

# Verify expansion
output_cols = [c for c in pipeline.columns.values()
               if c.query_id == "copy_all" and c.layer == "output"]
print([c.column_name for c in output_cols])  # ['id', 'amount']

2. Unknown Schema - When reading from external tables with unknown schema, * is tracked as a single node in the lineage graph:

from clgraph import Pipeline

# External table - schema unknown
queries = [("count_all", "SELECT COUNT(*) FROM external.customers")]
pipeline = Pipeline(queries, dialect="bigquery")

# Star node preserved for unknown schema
star_nodes = [c for c in pipeline.columns.values() if c.is_star]
print(f"Star nodes: {len(star_nodes)}")  # Star nodes: 1
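
The two cases above boil down to one decision: expand * when the source table's columns are known, otherwise keep a single placeholder node. Here is a minimal sketch of that decision in plain Python. The `KNOWN_SCHEMAS` mapping, the `expand_star` helper, and the `table.*` naming of the placeholder are assumptions for illustration, not clgraph internals.

```python
# Hypothetical schemas learned from earlier queries in the same pipeline.
KNOWN_SCHEMAS = {"staging.orders": ["id", "amount"]}

def expand_star(table, known_schemas):
    """Expand * to column names when the schema is known, else keep one star node."""
    columns = known_schemas.get(table)
    if columns is None:
        # Unknown schema: track the star itself as a single node.
        return [f"{table}.*"]
    return [f"{table}.{name}" for name in columns]

expanded = expand_star("staging.orders", KNOWN_SCHEMAS)
unknown = expand_star("external.customers", KNOWN_SCHEMAS)
```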

Future Roadmap: Database schema querying to resolve * for external tables is planned but not yet implemented.

New to clgraph? Start with From SQL to Lineage Graph to understand how it works.

Ready to use it? Jump to Table Lineage & Orchestration to learn execution and deployment.

Need to document? Read Metadata from Comments to learn automatic metadata extraction.

Want to try it? Head to Quick Start for hands-on examples.

The Big Picture

graph LR
    A[Your SQL Files] -->|Parse| B[AST]
    B -->|Extract| C[Lineage Graph]
    C --> D[Table Dependencies]
    C --> E[Column Lineage]
    D --> F[Pipeline Execution]
    D --> G[Airflow DAGs]
    E --> H[Metadata Propagation]
    E --> I[Impact Analysis]
    E --> J[LLM Documentation]

Your SQL already describes the lineage. We extract it. You use it.

Next Steps

Why We Built This - Our philosophy and approach
From SQL to Lineage Graph - Understand the foundation
Table Lineage & Orchestration - Learn what you can do
Metadata from Comments - Document your data
Quick Start - Try it yourself
API Documentation - Full reference