Why We Built clgraph

The Column Lineage Problem Space

Column-level lineage is notoriously difficult to implement correctly. The problem space includes:

  • SQL's complexity: Multiple dialects, non-trivial features (CTEs, window functions, UDFs), and constant evolution
  • Fine-grained tracking: Beyond table-level, you need expression trees, aggregations, and join lineage
  • Multi-system pipelines: Data flows through warehouses, ETL tools, orchestration, and BI layers
  • Data-dependent logic: Some lineage can only be determined at runtime
  • Scale challenges: Thousands of tables, millions of queries, complex visualization needs
  • Social complexity: Different teams need lineage for different purposes (debugging, governance, business understanding)

These challenges have led to a fragmented landscape where:

  • Enterprise tools are expensive, complex, and require deep integration with data platforms
  • Most open-source tools either focus on table-level lineage or are limited to column lineage within a single query, making it impossible to trace data through multi-step pipelines
  • Many teams resort to manual documentation that quickly becomes stale

The Core Insight: Why We Built This

The problem with traditional lineage tools: They try to reverse-engineer lineage from query logs, execution metadata, or runtime introspection. This requires expensive platform integration, database access, and complex infrastructure.

Our insight: When SQL is written with explicit column names and clear transformations, static analysis can provide perfect column lineage—without database access, without runtime integration, and without query logs.

The Two Ingredients

1. Lineage-Friendly SQL

SQL where column lineage is obvious from reading the code itself:

-- Ambiguous SQL (requires database to understand)
SELECT id, COUNT(*), SUM(total) + 10
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY id

-- Lineage-friendly SQL (obvious from code)
SELECT
    u.id as user_id,
    COUNT(*) as order_count,
    SUM(o.total) + 10 as adjusted_total
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id

Key principle: Every expression has an explicit name. Every column's source is clear. No guessing required.

2. Perfect Static Analysis

When SQL is lineage-friendly, a static analyzer can:

  • Parse the SQL without database access
  • Build exact column-level dependency graphs
  • Trace lineage through arbitrarily complex queries
  • Handle CTEs, subqueries, window functions, aggregations perfectly
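
A minimal sketch of what that looks like in practice, using the Pipeline API shown later on this page (the step name user_metrics and the dialect choice are illustrative):

from clgraph import Pipeline

# The lineage-friendly query from above, registered as one named step
sql = """
SELECT
    u.id as user_id,
    COUNT(*) as order_count,
    SUM(o.total) + 10 as adjusted_total
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id
"""

pipeline = Pipeline([("user_metrics", sql)], dialect="postgres")

# Where does adjusted_total come from? Answered without a database connection.
print(pipeline.trace_column_backward("user_metrics", "adjusted_total"))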

The combination: Lineage-friendly SQL + Perfect static analysis = 90% of column lineage value with 10% of the complexity.

This is why we built clgraph.

What This Approach Enables

Traditional Approach (Runtime Integration)

Write SQL → Execute queries → Capture metadata → Parse query logs →
Reverse-engineer lineage → Store in catalog → Hope it's accurate

Requirements:

  • Database access
  • Query log collection
  • Platform-specific integrations
  • Runtime metadata capture
  • Expensive infrastructure

Result: Best-guess lineage that breaks on edge cases

Lineage-First Approach (Static Analysis)

Write lineage-friendly SQL → Run static analyzer → Get perfect lineage

Requirements:

  • SQL files (no database needed)
  • Static analysis tool (clgraph)

Result: Perfect lineage, immediately, for any SQL

The Difference

  • Speed: Immediate vs hours of setup
  • Accuracy: Perfect lineage vs best-guess
  • Cost: Free vs expensive platform integration
  • Reliability: Works for all SQL vs breaks on edge cases
  • Infrastructure: None vs complex data platform integration

What We Built: clgraph

We built clgraph to prove this approach works in production. It provides perfect static analysis for lineage-friendly SQL.

1. Perfect Static SQL Analysis

We chose to do one thing exceptionally well: analyze SQL with mathematical precision.

What we built:

  • Perfect column lineage for any single SQL query, regardless of complexity
  • Recursive query parsing that handles arbitrary nesting of CTEs and subqueries
  • Multi-dialect support via sqlglot (BigQuery, Snowflake, Postgres, etc.)
  • Expression-level tracking through aggregations, window functions, and transformations
  • Cross-query lineage for multi-step pipelines

Example of what this solves:

-- clgraph handles this perfectly, with full column lineage:
WITH RECURSIVE category_hierarchy AS (
  SELECT id, parent_id, name, 1 as level
  FROM categories WHERE parent_id IS NULL
  UNION ALL
  SELECT c.id, c.parent_id, c.name, ch.level + 1
  FROM categories c
  JOIN category_hierarchy ch ON c.parent_id = ch.id
),
monthly_sales AS (
  SELECT
    user_id,
    DATE_TRUNC('month', order_date) as month,
    SUM(amount * COALESCE(fx_rate, 1.0)) as revenue_usd,
    COUNT(DISTINCT order_id) FILTER (WHERE status = 'completed') as completed_orders
  FROM orders o
  LEFT JOIN fx_rates f ON o.currency = f.currency AND o.order_date = f.rate_date
  GROUP BY 1, 2
)
SELECT
  u.email,
  ch.name as category,
  ms.month,
  ms.revenue_usd,
  ms.completed_orders,
  RANK() OVER (PARTITION BY ch.name ORDER BY ms.revenue_usd DESC) as revenue_rank
FROM users u
JOIN monthly_sales ms ON u.id = ms.user_id
JOIN order_items oi ON ms.user_id = oi.user_id
JOIN category_hierarchy ch ON oi.category_id = ch.id
WHERE ch.level = 2

No database required. Just parse the SQL and get perfect lineage.
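
The same workflow in code, as a hedged sketch (the step name category_revenue and the file path are illustrative):

from clgraph import Pipeline

# Load the recursive-CTE query above from a file -- no credentials, no connection
sql = open("category_revenue.sql").read()
pipeline = Pipeline([("category_revenue", sql)], dialect="postgres")

# Trace the window-function output back to its raw source columns
print(pipeline.trace_column_backward("category_revenue", "revenue_rank"))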

2. Pipeline-Level Analysis

Real data pipelines aren't single queries—they're DAGs of dependent transformations.

What we built:

  • Table dependency graphs with topological sorting
  • Cross-query column lineage - trace final_table.metric back through 10 intermediate tables to raw.source_field
  • Forward and backward tracing for impact analysis
  • Template variable support for parameterized pipelines

Example:

from clgraph import Pipeline

queries = [
    ("raw_events", "CREATE TABLE raw_events AS SELECT user_id, event_type, event_timestamp FROM source_events"),
    ("daily_metrics", "CREATE TABLE daily_metrics AS SELECT user_id, DATE(event_timestamp) as date, COUNT(*) as event_count FROM raw_events GROUP BY 1, 2"),
    ("user_summary", "CREATE TABLE user_summary AS SELECT u.name, d.date, d.event_count FROM users u JOIN daily_metrics d ON u.id = d.user_id"),
]

pipeline = Pipeline(queries, dialect="bigquery")

# Trace backward: where does user_summary.event_count ultimately come from?
sources = pipeline.trace_column_backward("user_summary", "event_count")
# Returns: [source_events.event_type]  (via raw_events, daily_metrics)

# Trace forward: what's impacted if I change source_events.user_id?
impacts = pipeline.trace_column_forward("source_events", "user_id")
# Returns: [raw_events.user_id, daily_metrics.user_id, user_summary.event_count]

Still no database required. Pure static analysis across multi-query pipelines.

3. Metadata Propagation

Column lineage is most valuable when paired with business metadata.

What we built:

  • Inline comment parsing - extract descriptions, PII flags, ownership from SQL comments
  • Automatic propagation - metadata flows through lineage graph
  • LLM integration - generate natural language descriptions using Ollama, OpenAI, etc.
  • Diff tracking - detect changes between pipeline versions

Example:

sql = """
SELECT
  user_id,  -- User identifier [pii: false]
  email,    -- Email address [pii: true, owner: data-team]
  SUM(revenue) as total_revenue  -- [tags: metric financial]
FROM orders
GROUP BY user_id, email
"""

pipeline = Pipeline([("metrics", sql)])

# Metadata automatically extracted and propagated
for col in pipeline.columns.values():
    if col.pii:
        print(f"PII: {col.full_name}")

# LLM-powered description generation (using local Ollama - no API costs)
from langchain_ollama import ChatOllama

pipeline.column_graph.llm = ChatOllama(model="qwen3-coder:30b")
pipeline.column_graph.generate_all_descriptions()

What We Intentionally Don't Cover

We're transparent about the limitations of static analysis, and we believe that's a feature, not a bug.

We Don't Try To:

1. Parse non-SQL code

  • No Python/Spark/Java lineage
  • No integration with Informatica, Talend, etc.
  • Why: This requires execution-time integration and is what expensive enterprise tools sell

2. Handle data-dependent logic

  • Dynamic SQL built at runtime
  • Value-based pivoting without explicit column lists
  • Why: This fundamentally requires runtime execution or query log parsing

3. Provide execution-time integration

  • No automatic catalog discovery
  • No streaming pipeline support
  • No "shadow pipeline" detection
  • Why: This requires deep integration with data platforms

Why These Limitations Are Acceptable

The 90/10 rule applies: For most data teams, the core lineage challenge is understanding dependencies in their SQL pipelines—not tracing data across 5 different systems with Python code in between.

Our position:

  • If you write lineage-friendly SQL in your warehouse, clgraph gives you perfect column lineage
  • If you need cross-system lineage (Fivetran → Snowflake → Looker + Python), buy Atlan or Monte Carlo
  • For everyone else, static analysis solves 90% of needs with 10% of the complexity

The 90% We Cover

For most data teams, the core lineage challenge is:

"We have dbt models / SQL pipelines in our warehouse. We need to understand dependencies, track PII, and do impact analysis when schemas change."

This is pure SQL in a single system—exactly what clgraph excels at.

Use cases we solve completely:

  • dbt projects - trace column lineage through your entire DAG
  • Data warehouse pipelines - BigQuery, Snowflake, Postgres transformations
  • Impact analysis - "What breaks if I rename this column?"
  • PII governance - "Where does user email flow through our warehouse?"
  • Pipeline optimization - "Which columns are never used downstream?"
  • Documentation - Auto-generate lineage diagrams and column descriptions

All without database access. All from static SQL analysis.
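
As a concrete sketch of the PII governance use case (the file layout, table names, and column names are illustrative; the API calls are the ones shown earlier):

from clgraph import Pipeline

# Build one pipeline over the warehouse's SQL files (paths are an assumption)
names = ["raw_users", "stg_users", "user_summary"]
queries = [(name, open(f"sql/{name}.sql").read()) for name in names]
pipeline = Pipeline(queries, dialect="snowflake")

# "Where does user email flow through our warehouse?"
print(pipeline.trace_column_forward("raw_users", "email"))

# List every column flagged as PII anywhere in the graph
for col in pipeline.columns.values():
    if col.pii:
        print("PII:", col.full_name)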

What Makes clgraph Different

1. No Database Required

  • Works directly on SQL files
  • No credentials, no network access, no query execution
  • Perfect for CI/CD, local development, code review
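
For example, a CI guard along these lines (a hedged sketch: the models/ layout, the *_public naming convention, and the "table.column" format of full_name are assumptions, not part of clgraph):

import glob
import pathlib
import sys

from clgraph import Pipeline

# Build the pipeline from the SQL files checked into the repo
paths = sorted(glob.glob("models/*.sql"))
queries = [(pathlib.Path(p).stem, open(p).read()) for p in paths]
pipeline = Pipeline(queries, dialect="bigquery")

# Fail the build if a PII-flagged column lands in a table named *_public
# (assumes full_name is "table.column", as printed in the metadata example)
leaks = [col.full_name for col in pipeline.columns.values()
         if col.pii and col.full_name.split(".")[0].endswith("_public")]
if leaks:
    sys.exit(f"PII columns reach public tables: {leaks}")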

2. Correctness First

  • We don't approximate lineage—if we can parse it, we get it right
  • If we can't parse it, we tell you (no silent gaps)
  • Mathematical precision, not heuristics

3. Developer-Friendly

  • Pure Python library, pip installable
  • Works with your existing SQL files
  • No infrastructure to deploy
  • No vendor lock-in

4. Open Source & Free

  • MIT license
  • Local LLM support (Ollama) for description generation
  • No usage limits or quotas

5. Transparent About Limitations

  • We're clear about what static analysis can and can't do
  • No marketing fluff about "AI-powered lineage across all your data"
  • If you need cross-system lineage, we'll be honest and point you to tools that specialize in platform integration

Who This Is For

Perfect fit:

  • Teams using dbt, SQL-first ELT, or warehouse-native transformations
  • Teams willing to write lineage-friendly SQL (explicit aliases, clear transformations; this makes code review easier anyway)
  • Data engineers who need impact analysis during development
  • Governance teams tracking PII in SQL pipelines
  • Anyone who wants column lineage without enterprise tool complexity

Not a fit:

  • Cross-platform lineage (Spark + BigQuery + Tableau)
  • Streaming pipelines (Flink, Kafka Streams)
  • Heavy Python/Java transformation pipelines
  • Teams that can't control SQL style (legacy codebases with poor practices)
  • Organizations that need vendor support and SLAs

The Philosophy

We believe that most data lineage problems are actually SQL lineage problems—and when SQL is written clearly, static analysis is sufficient.

The Lineage-First Approach

Traditional tools say: "Write SQL however you want, we'll figure out lineage from query logs"

We say: "Write SQL where lineage is obvious from the code, and we'll give you perfect lineage instantly"

Why This Works

When you combine:

  • Lineage-friendly SQL (explicit column names, clear transformations)
  • Perfect static analysis (clgraph's recursive parser and lineage builder)

You get:

  • Immediate results (no database setup, no waiting for query logs)
  • Perfect accuracy (no guessing, no approximations)
  • Zero infrastructure (just Python + SQL files)
  • Complete transparency (you can see exactly how lineage is derived)

The Alternative

Trying to be "the one tool that solves all lineage" leads to:

  • Complex architecture and integration requirements
  • Expensive licensing (because platform integration is expensive to build)
  • Vendor lock-in (tied to specific data platforms)
  • Inevitable gaps in coverage (because the full problem is impossible to solve)

We chose a different path: do one thing perfectly (static SQL analysis) and be honest about what we don't cover.

Conclusion

Column lineage is hard because the problem space is enormous. We built clgraph by ruthlessly focusing on the subset of the problem that:

  1. Affects the most people (SQL pipelines in warehouses)
  2. Can be solved perfectly (static analysis of lineage-friendly SQL)
  3. Delivers immediate value (no infrastructure, works on existing code)

The key insight: When SQL is written with explicit column names and clear transformations, you don't need runtime integration or expensive tooling. Static analysis is sufficient—and it's more accurate, faster, and simpler.

That's why we built clgraph: To prove that lineage-friendly SQL + perfect static analysis can solve 90% of column lineage needs without the complexity of enterprise tools.

If you have SQL and you're willing to write it clearly, we give you perfect column lineage. If you need cross-system runtime integration, we'll be honest and point you to tools that specialize in platform integration.


Learn more about lineage-friendly SQL: How to Write SQL for Perfect Column Lineage