
Concepts

Overview

Understanding clgraph comes down to three core concepts:

  1. How we build the graph - From your SQL files to complete lineage
  2. What you can do with it - Table dependencies, pipeline execution, and orchestration
  3. How to document it - Automatic metadata extraction from SQL comments

These concepts work together to give you complete control over your data pipelines.


Core Concepts

From SQL to Lineage Graph

How clgraph works under the hood.

You write SQL files. We parse each one into an AST (Abstract Syntax Tree) and extract everything:

  • Tables and dependencies
  • Columns and how they flow
  • Transformations (SUM, JOIN, CASE, etc.)
  • Complete lineage graph

from clgraph import Pipeline

# Point to your SQL
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# Everything is in the Pipeline
# - pipeline.table_graph  # Table dependencies
# - pipeline.columns      # Column-level lineage
# - pipeline.edges        # Lineage relationships
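
Once the Pipeline is built, you can inspect what was extracted. The attribute names below come from the comments above; treating pipeline.columns as a mapping with keys like "query.email" (as in the metadata example further down) and pipeline.edges as a sized collection are assumptions made for this sketch, not the documented API:

# Sketch: inspect the extracted lineage
for name in pipeline.columns:   # column keys look like "query.email"
    print(name)

print(len(pipeline.edges))      # how many lineage relationships were found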

What you'll learn:

  • The three-step process: SQL → AST → Graph
  • What gets extracted from your SQL
  • How to query the lineage graph
  • Why the complete graph matters

Read more →


Table Lineage & Orchestration

What you can do with the graph.

Once you have the graph, you can:

  • Understand dependencies - Which tables depend on which
  • Execute pipelines - Run SQL in topologically sorted order
  • Generate Airflow DAGs - Automatic task generation with dependencies
  • Split large pipelines - Create subpipelines by schedule or team

# Execute locally
results = pipeline.run(executor=my_executor, max_workers=4)

# Or generate Airflow DAG
dag = pipeline.to_airflow_dag(
    executor=my_executor,
    dag_id="my_pipeline",
    schedule="@daily"
)

# Or split by schedule
subpipelines = pipeline.split(
    sinks=[["realtime_tables"], ["daily_tables"]]
)
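
The my_executor object above is not defined on this page. As a rough sketch, it could be any callable that runs a SQL statement against your warehouse; the (sql) signature here is an assumption for illustration, not the documented executor contract:

# Hypothetical executor: a placeholder that shows the idea only
def my_executor(sql: str):
    # Run `sql` with your warehouse client of choice (BigQuery, Snowflake, ...)
    print(f"executing: {sql[:60]}...")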

What you'll learn:

  • Table lineage and dependency tracking
  • Synchronous and asynchronous execution
  • Airflow integration
  • Pipeline splitting strategies

Read more →


Metadata from Comments

Document your data where you define it.

Add structured metadata directly in your SQL comments:

SELECT
  user_id,  -- User identifier [pii: false]
  email,    -- Email address [pii: true, owner: data-team]
  SUM(revenue) as total  /* Total revenue [tags: metric finance] */
FROM users
GROUP BY user_id, email

clgraph automatically extracts:

  • Column descriptions
  • PII flags (with automatic propagation)
  • Ownership information
  • Tags and custom metadata

# Access metadata via Pipeline API
email_col = pipeline.columns["query.email"]
print(email_col.description)  # "Email address"
print(email_col.pii)          # True
print(email_col.owner)        # "data-team"

# Find columns by metadata
pii_columns = pipeline.get_pii_columns()
metrics = pipeline.get_columns_by_tag("metric")
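
These accessors can drive lightweight reports. The loop below assumes get_pii_columns() returns column objects exposing the same description and owner attributes shown above:

# Sketch: a quick PII inventory built from the documented accessors
for col in pipeline.get_pii_columns():
    print(col.description, "- owner:", col.owner)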

What you'll learn:

  • Comment format and syntax
  • Supported metadata fields
  • Metadata propagation rules
  • Integration with LLM documentation
  • Best practices

Read more →


Limitations

We made opinionated decisions to keep clgraph simple and effective.

clgraph is not designed to replace runtime-based lineage systems. Instead, we provide a lightweight alternative that covers most common use cases without the operational overhead. For teams that need runtime statistics, query profiling, or dynamic execution tracking, runtime-based systems remain the better choice.

Runtime Information is Not Tracked

clgraph performs static analysis on SQL files. This means the following are not tracked:

  • Unresolved template variables - SELECT * FROM ${table_name} where the variable isn't resolved before analysis
  • Database-side dynamic SQL - SQL generated during execution (stored procedures, database templating)
  • Query statistics - Execution times, row counts, data volumes

This is an intentional design decision.
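
The first item, unresolved template variables, is usually handled before clgraph ever sees the SQL: render the templates, then point Pipeline.from_sql_files at the rendered output. A hypothetical pre-processing step (not a clgraph feature; paths and values are illustrative):

from pathlib import Path
from string import Template

# Render ${...} placeholders into concrete SQL before building the pipeline
Path("rendered").mkdir(exist_ok=True)
sql = Template(Path("queries/events.sql").read_text())
rendered = sql.substitute(table_name="analytics.events")
Path("rendered/events.sql").write_text(rendered)

# pipeline = Pipeline.from_sql_files("rendered/", dialect="bigquery")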

Static analysis provides significant advantages:

  • Development-time feedback - Catch issues before deployment
  • Fast graph building - No database execution required
  • Version control friendly - Lineage changes tracked in git
  • Pre-deployment validation - Test lineage locally without infrastructure

For most teams, static lineage provides 95% of the value at 5% of the complexity.

Star Notation Requires Database Connection

When your SQL uses SELECT * or SELECT table.*, clgraph needs schema information to expand the columns:

# Requires database connection for star expansion
pipeline = Pipeline.from_sql_files(
    "queries/",
    dialect="bigquery",
    credentials="path/to/credentials.json"
)

Without a connection, each star expression is tracked as a single * column in the lineage graph. This is the only case where clgraph needs database access.

Why? Star notation doesn't explicitly list columns in SQL, so we query the database schema to resolve them.


Quick Navigation

New to clgraph? Start with From SQL to Lineage Graph to understand how it works.

Ready to use it? Jump to Table Lineage & Orchestration to learn execution and deployment.

Need to document? Read Metadata from Comments to learn automatic metadata extraction.

Want to try it? Head to Quick Start for hands-on examples.


The Big Picture

graph LR
    A[Your SQL Files] -->|Parse| B[AST]
    B -->|Extract| C[Lineage Graph]
    C --> D[Table Dependencies]
    C --> E[Column Lineage]
    D --> F[Pipeline Execution]
    D --> G[Airflow DAGs]
    E --> H[Metadata Propagation]
    E --> I[Impact Analysis]
    E --> J[LLM Documentation]

Your SQL already describes the lineage. We extract it. You use it.


Next Steps