Why We Built clgraph

The Column Lineage Problem Space

Column-level lineage is notoriously difficult to implement correctly. The problem space includes:

  • SQL's complexity: Multiple dialects, non-trivial features (CTEs, window functions, UDFs), and constant evolution
  • Fine-grained tracking: Beyond table-level, you need expression trees, aggregations, and join lineage
  • Multi-system pipelines: Data flows through warehouses, ETL tools, orchestration, and BI layers
  • Data-dependent logic: Some lineage can only be determined at runtime
  • Scale challenges: Thousands of tables, millions of queries, complex visualization needs
  • Social complexity: Different teams need lineage for different purposes (debugging, governance, business understanding)

These challenges have led to a fragmented landscape where:

  • Enterprise tools are expensive, complex, and require deep integration with data platforms
  • Most open-source tools either focus on table-level lineage or are limited to column lineage within a single query, making it impossible to trace data through multi-step pipelines
  • Many teams resort to manual documentation that quickly becomes stale

The Core Insight: Why We Built This

The problem with traditional lineage tools: They try to reverse-engineer lineage from query logs, execution metadata, or runtime introspection. This requires expensive platform integration, database access, and complex infrastructure.

Our insight: When SQL is written with explicit column names and clear transformations, static analysis can provide perfect column lineage—without database access, without runtime integration, and without query logs.

The Two Ingredients

1. Lineage-Friendly SQL

SQL where column lineage is obvious from reading the code itself:

-- Ambiguous SQL (requires database to understand)
SELECT id, COUNT(*), SUM(total) + 10
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY id

-- Lineage-friendly SQL (obvious from code)
SELECT
    u.id as user_id,
    COUNT(*) as order_count,
    SUM(o.total) + 10 as adjusted_total
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id

Key principle: Every expression has an explicit name. Every column's source is clear. No guessing required.

2. Perfect Static Analysis

When SQL is lineage-friendly, a static analyzer can:

  • Parse the SQL without database access
  • Build exact column-level dependency graphs
  • Trace lineage through arbitrarily complex queries
  • Handle CTEs, subqueries, window functions, aggregations perfectly
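
A minimal sketch of what that looks like in practice, using the Pipeline API shown later on this page (the step name user_metrics and the dialect choice are illustrative):

from clgraph import Pipeline

# The lineage-friendly query from above, registered as one named step
sql = """
SELECT
    u.id as user_id,
    COUNT(*) as order_count,
    SUM(o.total) + 10 as adjusted_total
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id
"""

pipeline = Pipeline([("user_metrics", sql)], dialect="postgres")

# Where does adjusted_total come from? Answered without a database connection.
print(pipeline.trace_column_backward("user_metrics", "adjusted_total"))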

The combination: Lineage-friendly SQL + Perfect static analysis = 90% of column lineage value with 10% of the complexity.

This is why we built clgraph.

What This Approach Enables

Traditional Approach (Runtime Integration)

Write SQL → Execute queries → Capture metadata → Parse query logs →
Reverse-engineer lineage → Store in catalog → Hope it's accurate

Requirements:

  • Database access
  • Query log collection
  • Platform-specific integrations
  • Runtime metadata capture
  • Expensive infrastructure

Result: Best-guess lineage that breaks on edge cases

Lineage-First Approach (Static Analysis)

Write lineage-friendly SQL → Run static analyzer → Get perfect lineage

Requirements:

  • SQL files (no database needed)
  • Static analysis tool (clgraph)

Result: Perfect lineage, immediately, for any SQL

The Difference

  • Speed: Immediate vs hours of setup
  • Accuracy: Perfect lineage vs best-guess
  • Cost: Free vs expensive platform integration
  • Reliability: Works for all SQL vs breaks on edge cases
  • Infrastructure: None vs complex data platform integration

What We Built: clgraph

We built clgraph to prove this approach works in production. It provides perfect static analysis for lineage-friendly SQL.

1. Perfect Static SQL Analysis

We chose to do one thing exceptionally well: analyze SQL with mathematical precision.

What we built:

  • Perfect column lineage for any single SQL query, regardless of complexity
  • Recursive query parsing that handles arbitrary nesting of CTEs and subqueries
  • Multi-dialect support via sqlglot (BigQuery, Snowflake, Postgres, etc.)
  • Expression-level tracking through aggregations, window functions, and transformations
  • Cross-query lineage for multi-step pipelines

Example of what this solves:

-- clgraph handles this perfectly, with full column lineage:
WITH RECURSIVE category_hierarchy AS (
  SELECT id, parent_id, name, 1 as level
  FROM categories WHERE parent_id IS NULL
  UNION ALL
  SELECT c.id, c.parent_id, c.name, ch.level + 1
  FROM categories c
  JOIN category_hierarchy ch ON c.parent_id = ch.id
),
monthly_sales AS (
  SELECT
    user_id,
    DATE_TRUNC('month', order_date) as month,
    SUM(amount * COALESCE(fx_rate, 1.0)) as revenue_usd,
    COUNT(DISTINCT order_id) FILTER (WHERE status = 'completed') as completed_orders
  FROM orders o
  LEFT JOIN fx_rates f ON o.currency = f.currency AND o.order_date = f.rate_date
  GROUP BY 1, 2
)
SELECT
  u.email,
  ch.name as category,
  ms.month,
  ms.revenue_usd,
  ms.completed_orders,
  RANK() OVER (PARTITION BY ch.name ORDER BY ms.revenue_usd DESC) as revenue_rank
FROM users u
JOIN monthly_sales ms ON u.id = ms.user_id
JOIN order_items oi ON ms.user_id = oi.user_id
JOIN category_hierarchy ch ON oi.category_id = ch.id
WHERE ch.level = 2

No database required. Just parse the SQL and get perfect lineage.
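
The same workflow in code, as a hedged sketch (the step name category_revenue and the file path are illustrative):

from clgraph import Pipeline

# Load the recursive-CTE query above from a file -- no credentials, no connection
sql = open("category_revenue.sql").read()
pipeline = Pipeline([("category_revenue", sql)], dialect="postgres")

# Trace the window-function output back to its raw source columns
print(pipeline.trace_column_backward("category_revenue", "revenue_rank"))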

2. Pipeline-Level Analysis

Real data pipelines aren't single queries—they're DAGs of dependent transformations.

What we built:

  • Table dependency graphs with topological sorting
  • Cross-query column lineage - trace final_table.metric back through 10 intermediate tables to raw.source_field
  • Forward and backward tracing for impact analysis
  • Template variable support for parameterized pipelines

Example:

from clgraph import Pipeline

queries = [
    ("raw_events", "CREATE TABLE raw_events AS SELECT user_id, event_type, event_timestamp FROM source_events"),
    ("daily_metrics", "CREATE TABLE daily_metrics AS SELECT user_id, DATE(event_timestamp) as date, COUNT(*) as event_count FROM raw_events GROUP BY 1, 2"),
    ("user_summary", "CREATE TABLE user_summary AS SELECT u.name, d.date, d.event_count FROM users u JOIN daily_metrics d ON u.id = d.user_id"),
]

pipeline = Pipeline(queries, dialect="bigquery")

# Trace backward: where does user_summary.event_count ultimately come from?
sources = pipeline.trace_column_backward("user_summary", "event_count")
# Returns: [source_events.event_type]  (via raw_events, daily_metrics)

# Trace forward: what's impacted if I change source_events.user_id?
impacts = pipeline.trace_column_forward("source_events", "user_id")
# Returns: [raw_events.user_id, daily_metrics.user_id, user_summary.event_count]

Still no database required. Pure static analysis across multi-query pipelines.

3. Metadata Propagation

Column lineage is most valuable when paired with business metadata.

What we built:

  • Inline comment parsing - extract descriptions, PII flags, ownership from SQL comments
  • Automatic propagation - metadata flows through lineage graph
  • LLM integration - generate natural language descriptions using Ollama, OpenAI, etc.
  • Diff tracking - detect changes between pipeline versions

Example:

sql = """
SELECT
  user_id,  -- User identifier [pii: false]
  email,    -- Email address [pii: true, owner: data-team]
  SUM(revenue) as total_revenue  -- [tags: metric financial]
FROM orders
GROUP BY user_id, email
"""

pipeline = Pipeline([("metrics", sql)])

# Metadata automatically extracted and propagated
for col in pipeline.columns.values():
    if col.pii:
        print(f"PII: {col.full_name}")

# LLM-powered description generation (using local Ollama - no API costs)
from langchain_ollama import ChatOllama

pipeline.column_graph.llm = ChatOllama(model="qwen3-coder:30b")
pipeline.column_graph.generate_all_descriptions()

What We Intentionally Don't Cover

We're transparent about the limitations of static analysis, and we believe that's a feature, not a bug.

We Don't Try To:

1. Parse non-SQL code

  • No Python/Spark/Java lineage
  • No integration with Informatica, Talend, etc.
  • Why: This requires execution-time integration and is what expensive enterprise tools sell

2. Handle data-dependent logic

  • Dynamic SQL built at runtime
  • Value-based pivoting without explicit column lists
  • Why: This fundamentally requires runtime execution or query log parsing

3. Provide execution-time integration

  • No automatic catalog discovery
  • No streaming pipeline support
  • No "shadow pipeline" detection
  • Why: This requires deep integration with data platforms

Why These Limitations Are Acceptable

The 90/10 rule applies: For most data teams, the core lineage challenge is understanding dependencies in their SQL pipelines—not tracing data across 5 different systems with Python code in between.

Our position:

  • If you write lineage-friendly SQL in your warehouse, clgraph gives you perfect column lineage
  • If you need cross-system lineage (Fivetran → Snowflake → Looker + Python), buy Atlan or Monte Carlo
  • For everyone else, static analysis solves 90% of needs with 10% of the complexity

The 90% We Cover

For most data teams, the core lineage challenge is:

"We have dbt models / SQL pipelines in our warehouse. We need to understand dependencies, track PII, and do impact analysis when schemas change."

This is pure SQL in a single system—exactly what clgraph excels at.

Use cases we solve completely:

  • dbt projects - trace column lineage through your entire DAG
  • Data warehouse pipelines - BigQuery, Snowflake, Postgres transformations
  • Impact analysis - "What breaks if I rename this column?"
  • PII governance - "Where does user email flow through our warehouse?"
  • Pipeline optimization - "Which columns are never used downstream?"
  • Documentation - Auto-generate lineage diagrams and column descriptions

All without database access. All from static SQL analysis.
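
As a concrete sketch of the PII governance use case (the file layout, table names, and column names are illustrative; the API calls are the ones shown earlier):

from clgraph import Pipeline

# Build one pipeline over the warehouse's SQL files (paths are an assumption)
names = ["raw_users", "stg_users", "user_summary"]
queries = [(name, open(f"sql/{name}.sql").read()) for name in names]
pipeline = Pipeline(queries, dialect="snowflake")

# "Where does user email flow through our warehouse?"
print(pipeline.trace_column_forward("raw_users", "email"))

# List every column flagged as PII anywhere in the graph
for col in pipeline.columns.values():
    if col.pii:
        print("PII:", col.full_name)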

What Makes clgraph Different

1. No Database Required

  • Works directly on SQL files
  • No credentials, no network access, no query execution
  • Perfect for CI/CD, local development, code review
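
For example, a CI guard along these lines (a hedged sketch: the models/ layout, the *_public naming convention, and the "table.column" format of full_name are assumptions, not part of clgraph):

import glob
import pathlib
import sys

from clgraph import Pipeline

# Build the pipeline from the SQL files checked into the repo
paths = sorted(glob.glob("models/*.sql"))
queries = [(pathlib.Path(p).stem, open(p).read()) for p in paths]
pipeline = Pipeline(queries, dialect="bigquery")

# Fail the build if a PII-flagged column lands in a table named *_public
# (assumes full_name is "table.column", as printed in the metadata example)
leaks = [col.full_name for col in pipeline.columns.values()
         if col.pii and col.full_name.split(".")[0].endswith("_public")]
if leaks:
    sys.exit(f"PII columns reach public tables: {leaks}")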

2. Correctness First

  • We don't approximate lineage—if we can parse it, we get it right
  • If we can't parse it, we tell you (no silent gaps)
  • Mathematical precision, not heuristics

3. Developer-Friendly

  • Pure Python library, pip installable
  • Works with your existing SQL files
  • No infrastructure to deploy
  • No vendor lock-in

4. Open Source & Free

  • MIT license
  • Local LLM support (Ollama) for description generation
  • No usage limits or quotas

5. Transparent About Limitations

  • We're clear about what static analysis can and can't do
  • No marketing fluff about "AI-powered lineage across all your data"
  • If you need cross-system lineage, we'll be honest and point you to tools that specialize in platform integration

Who This Is For

Perfect fit:

  • Teams using dbt, SQL-first ELT, or warehouse-native transformations
  • Teams willing to write lineage-friendly SQL (explicit aliases, clear transformations; this makes code review easier anyway)
  • Data engineers who need impact analysis during development
  • Governance teams tracking PII in SQL pipelines
  • Anyone who wants column lineage without enterprise tool complexity

Not a fit:

  • Cross-platform lineage (Spark + BigQuery + Tableau)
  • Streaming pipelines (Flink, Kafka Streams)
  • Heavy Python/Java transformation pipelines
  • Teams that can't control SQL style (legacy codebases with poor practices)
  • Organizations that need vendor support and SLAs

The Philosophy

We believe that most data lineage problems are actually SQL lineage problems—and when SQL is written clearly, static analysis is sufficient.

The Lineage-First Approach

Traditional tools say: "Write SQL however you want, we'll figure out lineage from query logs"

We say: "Write SQL where lineage is obvious from the code, and we'll give you perfect lineage instantly"

Why This Works

When you combine:

  • Lineage-friendly SQL (explicit column names, clear transformations)
  • Perfect static analysis (clgraph's recursive parser and lineage builder)

You get:

  • Immediate results (no database setup, no waiting for query logs)
  • Perfect accuracy (no guessing, no approximations)
  • Zero infrastructure (just Python + SQL files)
  • Complete transparency (you can see exactly how lineage is derived)

The Alternative

Trying to be "the one tool that solves all lineage" leads to:

  • Complex architecture and integration requirements
  • Expensive licensing (because platform integration is expensive to build)
  • Vendor lock-in (tied to specific data platforms)
  • Inevitable gaps in coverage (because the full problem is impossible to solve)

We chose a different path: do one thing perfectly (static SQL analysis) and be honest about what we don't cover.

Conclusion

Column lineage is hard because the problem space is enormous. We built clgraph by ruthlessly focusing on the subset of the problem that:

  1. Affects the most people (SQL pipelines in warehouses)
  2. Can be solved perfectly (static analysis of lineage-friendly SQL)
  3. Delivers immediate value (no infrastructure, works on existing code)

The key insight: When SQL is written with explicit column names and clear transformations, you don't need runtime integration or expensive tooling. Static analysis is sufficient—and it's more accurate, faster, and simpler.

That's why we built clgraph: To prove that lineage-friendly SQL + perfect static analysis can solve 90% of column lineage needs without the complexity of enterprise tools.

If you have SQL and you're willing to write it clearly, we give you perfect column lineage. If you need cross-system runtime integration, we'll be honest and point you to tools that specialize in platform integration.


Learn more about lineage-friendly SQL: How to Write SQL for Perfect Column Lineage