

SQL Tasks Are Graph Problems

Most SQL tasks I've dealt with in my data engineering career—impact analysis, debugging data quality issues, migration planning, compliance tracking—are actually graph traversal problems.


The Graph Structure

A typical data warehouse includes the following components:

  • Source tables (raw.orders, raw.users, raw.products) from applications or external APIs.
  • Transformation queries (CTEs, views, materialized tables) that reshape the raw tables into canonical form.
  • Output tables (analytics.revenue, reports.user_metrics) for analytics or reporting purposes.

If we draw an edge from every source column to every column it flows into, based on the transformation queries, we get a directed acyclic graph (DAG) with nodes and edges defined as:

  • Nodes = columns at various transformation stages
  • Edges = data flow relationships
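
To make the structure concrete, here is a minimal sketch of such a lineage DAG in Python, assuming networkx as the graph library (my choice, not something a warehouse ships with); the column names are borrowed from the examples later in this post.

```python
import networkx as nx

# Nodes are "table.column" strings; edges point in the direction data flows.
G = nx.DiGraph()
G.add_edges_from([
    ("raw.orders.amount", "staging.orders.amount"),
    ("staging.orders.amount", "analytics.daily_revenue.total"),
    ("analytics.daily_revenue.total", "dashboard.total_revenue"),
    ("raw.users.segment", "staging.user_metrics.segment"),
])

# Column lineage should never loop back on itself.
assert nx.is_directed_acyclic_graph(G)
```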

SQL Tasks as Graph Operations

0. Writing SQL = Graph Construction

Task: Write a query showing customer lifetime value (LTV) by segment.

When I write this query, I'm doing:

  1. Start from the output requirement
  2. Traverse backward through existing schema to find relevant source tables
  3. Identify join paths connecting those tables
  4. Construct new edges (transformations) that produce the desired output

Business question: "LTV by segment"
  └── What do I need? → customer_id, segment, total_spent
       ├── Where is segment? → users.segment (found via schema lookup)
       ├── Where is spending? → orders.amount (found via schema lookup)
       └── How do they connect? → users.id = orders.user_id (join path)

I have done this manually by looking at table schemas, source code, and query logs, then visualizing the relationships on paper or in my head. In graph terminology, though, writing SQL is graph traversal, and the query is just the language for constructing the subgraph.
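
As a rough sketch of step 3, the "how do they connect?" question can itself be framed as path finding over a table-level graph. Everything below is assumed for illustration; the join metadata would normally come from a catalog or from foreign-key conventions.

```python
import networkx as nx

# Table-level graph: nodes are tables, edges carry the known join condition.
schema = nx.Graph()
schema.add_edge("users", "orders", on="users.id = orders.user_id")

# "How do users and orders connect?" becomes a path query.
path = nx.shortest_path(schema, "users", "orders")
print(path)                                   # ['users', 'orders']
print(schema.edges["users", "orders"]["on"])  # users.id = orders.user_id
```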

1. Impact Analysis = Forward Traversal

Example Question: If I change customer_segment logic in users, what breaks downstream?

Graph operation: Start at users.customer_segment, traverse all outgoing edges, collect every reachable node.

users.customer_segment
  └── staging.user_metrics.segment
       ├── analytics.revenue_by_segment.segment
       │    └── dashboard.executive_summary.segment_revenue
       └── ml.churn_prediction.user_segment

This is equivalent to forward-reachable subgraph extraction.
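
A minimal sketch of that extraction, assuming networkx and using the edges from the tree above:

```python
import networkx as nx

# In practice these edges would come from parsed SQL, not a hand-written list.
G = nx.DiGraph([
    ("users.customer_segment", "staging.user_metrics.segment"),
    ("staging.user_metrics.segment", "analytics.revenue_by_segment.segment"),
    ("analytics.revenue_by_segment.segment", "dashboard.executive_summary.segment_revenue"),
    ("staging.user_metrics.segment", "ml.churn_prediction.user_segment"),
])

# Everything reachable from the changed column is potentially broken.
affected = nx.descendants(G, "users.customer_segment")
impact_subgraph = G.subgraph(affected | {"users.customer_segment"})
print(sorted(affected))
```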

2. Root Cause Analysis = Backward Traversal

Example Question: The total_revenue number in the dashboard is wrong. Where did it come from?

Graph operation: Start at dashboard.total_revenue, traverse all incoming edges backward, collect the complete ancestry.

dashboard.total_revenue
  └── analytics.daily_revenue.total
       └── staging.orders.amount
            └── raw.orders.amount (SOURCE)

This is backward-reachable subgraph extraction.
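
The backward direction is symmetric. A sketch, again assuming networkx, with the edges from the tree above:

```python
import networkx as nx

G = nx.DiGraph([
    ("raw.orders.amount", "staging.orders.amount"),
    ("staging.orders.amount", "analytics.daily_revenue.total"),
    ("analytics.daily_revenue.total", "dashboard.total_revenue"),
])

# Every column the bad number could have come from, all the way back to raw sources.
ancestry = nx.ancestors(G, "dashboard.total_revenue")
print(sorted(ancestry))
```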

3. Data Quality Debugging = Path Finding

Example Question: How does raw.orders.amount flow into reports.margin_analysis.profit?

Graph operation: Find all paths between two nodes.

Path 1: raw.orders.amount → staging.order_metrics.revenue → reports.margin_analysis.profit
Path 2: raw.orders.amount → analytics.daily_totals.amount → reports.margin_analysis.profit

This scenario is often challenging for me because I might miss one or two paths, especially when the queries are generated from templates.
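
Enumerating every simple path between the two columns is one way to avoid missing a branch. A sketch assuming networkx, with the two paths above encoded as edges:

```python
import networkx as nx

G = nx.DiGraph([
    ("raw.orders.amount", "staging.order_metrics.revenue"),
    ("staging.order_metrics.revenue", "reports.margin_analysis.profit"),
    ("raw.orders.amount", "analytics.daily_totals.amount"),
    ("analytics.daily_totals.amount", "reports.margin_analysis.profit"),
])

# Print every distinct route between the two columns.
for path in nx.all_simple_paths(G, "raw.orders.amount", "reports.margin_analysis.profit"):
    print(" → ".join(path))
```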

4. Compliance & PII Tracking = Labeled Subgraph Extraction

Sample Question: Where does PII (email, SSN, phone) flow in the pipeline?

Graph operation: Find all nodes labeled as PII, then compute the union of their forward-reachable subgraphs.

PII Sources: [users.email, users.ssn, users.phone]
  └── Forward traversal reveals:
       - users.email → staging.user_profiles.contact_email → reports.user_directory.email
       - users.ssn → (nowhere - properly isolated)
       - users.phone → staging.user_profiles.phone → analytics.contact_list.phone
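
A sketch of the union-of-reachability computation, assuming networkx and the flows shown above:

```python
import networkx as nx

G = nx.DiGraph([
    ("users.email", "staging.user_profiles.contact_email"),
    ("staging.user_profiles.contact_email", "reports.user_directory.email"),
    ("users.phone", "staging.user_profiles.phone"),
    ("staging.user_profiles.phone", "analytics.contact_list.phone"),
])
G.add_node("users.ssn")  # present in the graph but properly isolated

pii_sources = ["users.email", "users.ssn", "users.phone"]

# Union of forward-reachable sets: every column PII can flow into.
pii_reach = set()
for source in pii_sources:
    pii_reach |= nx.descendants(G, source)

print(sorted(pii_reach))
```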

5. Migration Planning = Connected Component Analysis

Question: We're deprecating legacy_orders. What's affected?

Graph operation: Find all nodes connected to legacy_orders.* in either direction.

legacy_orders.* connects to:
  Forward: [staging.combined_orders.*, analytics.historical_revenue.*]
  Backward: [external_feed.orders_dump.*]

Total affected tables: 3
Total affected columns: 47

As with the path-finding scenario, a manual process usually captures only the most obviously impacted nodes and misses the one or two that few people are aware of.
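
A sketch of the component lookup, assuming networkx; taking the undirected view of the lineage graph captures upstream and downstream neighbours in one pass:

```python
import networkx as nx

G = nx.DiGraph([
    ("external_feed.orders_dump.id", "legacy_orders.id"),
    ("legacy_orders.id", "staging.combined_orders.id"),
    ("legacy_orders.id", "analytics.historical_revenue.order_id"),
])

# Everything connected to the deprecated column in either direction.
affected = nx.node_connected_component(G.to_undirected(as_view=True), "legacy_orders.id")
print(sorted(affected - {"legacy_orders.id"}))
```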

6. Unused Column Detection = Reachability Analysis

Typical Question: Which source columns are never used downstream?

Graph operation: Find all source nodes with no outgoing edges.

Unused source columns:
  - raw.users.legacy_id (no downstream dependencies)
  - raw.products.deprecated_sku (no downstream dependencies)
  - raw.orders.internal_notes (no downstream dependencies)
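
A sketch of that check, assuming networkx and that source columns can be recognized by a raw.* prefix:

```python
import networkx as nx

G = nx.DiGraph([("raw.orders.amount", "staging.orders.amount")])
G.add_nodes_from(["raw.users.legacy_id", "raw.products.deprecated_sku"])  # never read downstream

# Source columns with no outgoing edges are never used by any transformation.
unused = [n for n in G.nodes if n.startswith("raw.") and G.out_degree(n) == 0]
print(unused)
```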

7. Critical Path Analysis = Edge Centrality

Typical Question: Which transformations are bottlenecks?

Graph operation: Compute betweenness centrality or count paths through each node.

High-centrality nodes:
  1. staging.user_metrics.user_id (flows into 47 downstream columns)
  2. staging.order_facts.order_date (flows into 38 downstream columns)
  3. analytics.revenue.total (flows into 23 downstream columns)
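
A sketch of the centrality computation, assuming networkx; the edges are illustrative:

```python
import networkx as nx

G = nx.DiGraph([
    ("raw.users.id", "staging.user_metrics.user_id"),
    ("staging.user_metrics.user_id", "analytics.revenue.total"),
    ("staging.user_metrics.user_id", "ml.churn_prediction.user_segment"),
    ("analytics.revenue.total", "dashboard.total_revenue"),
])

# Columns that sit on many source-to-report paths score highest.
centrality = nx.betweenness_centrality(G)
for node, score in sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{score:.2f}  {node}")
```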

Why This Framing Matters

Manual Tracing Does Not Scale

For 10 queries, I can trace dependencies manually. For 100 queries, it's slow—probably a full-day job, and I don't expect to do anything else that day. For 1,000+ queries, I would probably just give up.

However, graph algorithms can scale. BFS/DFS traversal is O(V + E). Path finding is well-studied. Subgraph extraction is solved.

Different Questions, Same Algorithm

If we treat the table columns and queries as a graph, we can answer all these questions with the same data structure.

SQL Task                     Graph Operation
--------------------------   --------------------------------
Writing queries              Graph construction via traversal
Impact analysis              Forward BFS
Root cause analysis          Backward BFS
"Where does X come from?"    Backward path finding
"What does X affect?"        Forward path finding
PII tracking                 Property-based subgraph
Migration planning           Connected component
Dead column detection        Degree analysis
Bottleneck detection         Centrality metrics

If you want to learn more about the graph operations mentioned above, check out their Wikipedia pages.

The Graph Is the Documentation

Manual documentation always decays. Someone changes a query, forgets to update the wiki, and the lineage docs go stale. Once the docs no longer reflect reality, people stop trusting and updating them. It is a vicious cycle.

If we can derive the graph directly from SQL, the graph is the documentation. When we change the SQL, the graph updates automatically. No manual process.


Unfortunately, Building the Graph Is Hard

Extracting column lineage requires:

  1. SQL parsing into an AST across dialects (BigQuery, Snowflake, Postgres, Spark SQL)
  2. Alias resolution (u.name → users.name) through multiple levels
  3. CTE tracking (each CTE is a virtual table with its own column namespace)
  4. Subquery handling (nested scopes with their own column mappings)
  5. Star expansion (SELECT * → which columns?) requires schema knowledge
  6. Expression tracking (aggregations, window functions, CASE expressions)
  7. Cross-query connection (output of query A → input of query B)

Plus all the SQL edge cases I cannot imagine yet.
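
As a starting point, a SQL parser such as sqlglot can at least surface the raw column references inside a single query; alias resolution, CTE scoping, star expansion, and cross-query stitching all still have to be built on top. A hedged sketch:

```python
import sqlglot
from sqlglot import exp

sql = """
WITH user_orders AS (
    SELECT u.id, u.segment, o.amount
    FROM users AS u
    JOIN orders AS o ON u.id = o.user_id
)
SELECT segment, SUM(amount) AS ltv
FROM user_orders
GROUP BY segment
"""

# Raw column references only; aliases (u, o) and the CTE scope are NOT resolved here.
tree = sqlglot.parse_one(sql, read="bigquery")
for column in tree.find_all(exp.Column):
    print(column.table, column.name)
```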

Practical Challenges

In practice, building the graph also means stitching together the following components:

  • Code repositories for SQL files
  • Database catalogs for table schemas
  • Execution logs for runtime information
  • Custom glue code to connect these pieces

Scale Challenges

Real data warehouses have thousands of tables, millions of historical queries, and evolving schemas. The graph changes with every SQL execution.

Data-Dependent Logic

Some lineage can only be determined at runtime:

  • Dynamic SQL built from variables
  • Value-based pivoting without explicit column lists
  • Conditional logic that varies by data content

Common Approaches

Most teams do one of the following:

  • Don't track lineage and answer graph questions manually each time; it's not a top priority for most engineers, and they fall into the vicious cycle described above.
  • Use third-party tools that build lineage from query logs.
  • Build custom solutions that only handle simple cases.
  • Resort to manual documentation that goes stale (the vicious cycle again).

Summary

In this post, I've framed SQL tasks as graph traversal problems. There is nothing controversial here. What I've described is exactly the problem that column lineage tries to solve: the lineage edges are the graph edges that connect the nodes (columns).

Debugging a data quality issue or planning a migration is graph traversal. The question is how to build a graph that reflects reality and is easy to use.

This is the third post in a series on treating SQL queries as graph traversal problems.