

SQL Tasks Are Graph Problems

Most SQL tasks I've dealt with in my data engineering career—impact analysis, debugging data quality issues, migration planning, compliance tracking—are actually graph traversal problems.


The Graph Structure

A typical data warehouse includes the following components:

  • Source tables (raw.orders, raw.users, raw.products) from applications or external APIs.
  • Transformation queries (CTEs, views, materialized tables) that reshape the raw tables into canonical form.
  • Output tables (analytics.revenue, reports.user_metrics) for analytics or reporting purposes.

If we draw an edge from every source column to every column it flows into, as defined by the transformation queries, the result is a directed acyclic graph (DAG) with:

  • Nodes = columns at various transformation stages
  • Edges = data flow relationships
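
To make this concrete, here is a minimal Python sketch of that structure. The edge list and column names are invented for illustration; in a real warehouse they would be extracted from the transformation SQL.

from collections import defaultdict

# Hypothetical lineage edges: (source column, destination column).
EDGES = [
    ("raw.orders.amount", "staging.orders.amount"),
    ("staging.orders.amount", "analytics.revenue.total"),
    ("raw.users.segment", "analytics.revenue.segment"),
]

children = defaultdict(set)  # column -> columns it flows into
parents = defaultdict(set)   # column -> columns it is derived from

for src, dst in EDGES:
    children[src].add(dst)
    parents[dst].add(src)

print(children["raw.orders.amount"])       # {'staging.orders.amount'}
print(parents["analytics.revenue.total"])  # {'staging.orders.amount'}

Two maps are enough: the forward map answers "what does this affect?" and the backward map answers "where did this come from?"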

SQL Tasks as Graph Operations

0. Writing SQL = Graph Construction

Task: Write a query showing customer lifetime value (LTV) by segment.

When I write this query, I'm doing:

  1. Start from the output requirement
  2. Traverse backward through existing schema to find relevant source tables
  3. Identify join paths connecting those tables
  4. Construct new edges (transformations) that produce the desired output

Business question: "LTV by segment"
  └── What do I need? → customer_id, segment, total_spent
       ├── Where is segment? → users.segment (found via schema lookup)
       ├── Where is spending? → orders.amount (found via schema lookup)
       └── How do they connect? → users.id = orders.user_id (join path)

I have done this manually by looking at table schemas, source code, and query logs, then visualizing the relationships on paper or in my head. In graph terms, though, writing SQL is itself graph traversal, and the query is just the language for constructing the resulting subgraph.

1. Impact Analysis = Forward Traversal

Example Question: If I change customer_segment logic in users, what breaks downstream?

Graph operation: Start at users.customer_segment, traverse all outgoing edges, collect every reachable node.

users.customer_segment
  └── staging.user_metrics.segment
       ├── analytics.revenue_by_segment.segment
       │    └── dashboard.executive_summary.segment_revenue
       └── ml.churn_prediction.user_segment

This is equivalent to forward-reachable subgraph extraction.
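
In code, this is a plain breadth-first search over the forward adjacency map. The sketch below uses a toy edge list mirroring the diagram above; nothing about it is specific to my tooling.

from collections import defaultdict, deque

# Hypothetical edges mirroring the diagram above.
EDGES = [
    ("users.customer_segment", "staging.user_metrics.segment"),
    ("staging.user_metrics.segment", "analytics.revenue_by_segment.segment"),
    ("analytics.revenue_by_segment.segment", "dashboard.executive_summary.segment_revenue"),
    ("staging.user_metrics.segment", "ml.churn_prediction.user_segment"),
]

children = defaultdict(set)
for src, dst in EDGES:
    children[src].add(dst)

def downstream(column):
    """Every column reachable from `column` (forward BFS)."""
    seen, queue = set(), deque([column])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(downstream("users.customer_segment"))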

2. Root Cause Analysis = Backward Traversal

Example Question: The total_revenue number in the dashboard is wrong. Where did it come from?

Graph operation: Start at dashboard.total_revenue, traverse all incoming edges backward, collect the complete ancestry.

dashboard.total_revenue
  └── analytics.daily_revenue.total
       └── staging.orders.amount
            └── raw.orders.amount (SOURCE)

This is backward-reachable subgraph extraction.
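
The backward direction is the same BFS run over the reversed edges. A minimal sketch, again with made-up edges:

from collections import defaultdict, deque

# Hypothetical edges mirroring the ancestry diagram above.
EDGES = [
    ("raw.orders.amount", "staging.orders.amount"),
    ("staging.orders.amount", "analytics.daily_revenue.total"),
    ("analytics.daily_revenue.total", "dashboard.total_revenue"),
]

parents = defaultdict(set)
for src, dst in EDGES:
    parents[dst].add(src)

def ancestry(column):
    """Every column that `column` is ultimately derived from (backward BFS)."""
    seen, queue = set(), deque([column])
    while queue:
        for parent in parents[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(ancestry("dashboard.total_revenue"))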

3. Data Quality Debugging = Path Finding

Example Question: How does raw.orders.amount flow into reports.margin_analysis.profit?

Graph operation: Find all paths between two nodes.

Path 1: raw.orders.amount → staging.order_metrics.revenue → reports.margin_analysis.profit
Path 2: raw.orders.amount → analytics.daily_totals.amount → reports.margin_analysis.profit

This scenario is often challenging for me because it's easy to miss a path or two, especially when the queries are generated from templates.
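
A depth-first enumeration of all paths between two columns is short to write, which is exactly why I'd rather let code do it than trace paths by hand. A sketch over the two-path example above; since the lineage graph is a DAG, no cycle check is needed.

from collections import defaultdict

# Hypothetical edges containing the two paths from the example above.
EDGES = [
    ("raw.orders.amount", "staging.order_metrics.revenue"),
    ("staging.order_metrics.revenue", "reports.margin_analysis.profit"),
    ("raw.orders.amount", "analytics.daily_totals.amount"),
    ("analytics.daily_totals.amount", "reports.margin_analysis.profit"),
]

children = defaultdict(set)
for src, dst in EDGES:
    children[src].add(dst)

def all_paths(start, target, path=None):
    """Yield every path from `start` to `target` (DFS over a DAG)."""
    path = (path or []) + [start]
    if start == target:
        yield path
        return
    for child in children[start]:
        yield from all_paths(child, target, path)

for p in all_paths("raw.orders.amount", "reports.margin_analysis.profit"):
    print(" -> ".join(p))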

4. Compliance & PII Tracking = Labeled Subgraph Extraction

Sample Question: Where does PII (email, SSN, phone) flow in the pipeline?

Graph operation: Find all nodes labeled as PII, then compute the union of their forward-reachable subgraphs.

PII Sources: [users.email, users.ssn, users.phone]
  └── Forward traversal reveals:
       - users.email → staging.user_profiles.contact_email → reports.user_directory.email
       - users.ssn → (nowhere - properly isolated)
       - users.phone → staging.user_profiles.phone → analytics.contact_list.phone
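
In code this is just the union of forward-reachable sets, one per PII-labeled column. A sketch with made-up edges and labels:

from collections import defaultdict, deque

# Hypothetical edges and PII labels matching the example above.
EDGES = [
    ("users.email", "staging.user_profiles.contact_email"),
    ("staging.user_profiles.contact_email", "reports.user_directory.email"),
    ("users.phone", "staging.user_profiles.phone"),
    ("staging.user_profiles.phone", "analytics.contact_list.phone"),
]
PII_COLUMNS = {"users.email", "users.ssn", "users.phone"}

children = defaultdict(set)
for src, dst in EDGES:
    children[src].add(dst)

def downstream(column):
    seen, queue = set(), deque([column])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Every column any PII source flows into; users.ssn contributes nothing.
pii_exposure = set().union(*(downstream(col) for col in PII_COLUMNS))
print(sorted(pii_exposure))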

5. Migration Planning = Connected Component Analysis

Question: We're deprecating legacy_orders. What's affected?

Graph operation: Find all nodes connected to legacy_orders.* in either direction.

legacy_orders.* connects to:
  Forward: [staging.combined_orders.*, analytics.historical_revenue.*]
  Backward: [external_feed.orders_dump.*]

Total affected tables: 3
Total affected columns: 47

As in the path-finding scenario, manual analysis tends to capture only the most visible nodes and to miss the one or two that few people are aware of.
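
A sketch of the graph operation: ignore edge direction and collect everything connected to the deprecated table. The edges below are invented for illustration.

from collections import defaultdict, deque

# Hypothetical edges around the deprecated table.
EDGES = [
    ("external_feed.orders_dump.id", "legacy_orders.id"),
    ("legacy_orders.id", "staging.combined_orders.order_id"),
    ("staging.combined_orders.order_id", "analytics.historical_revenue.order_id"),
]

# Undirected adjacency: direction doesn't matter for "what is connected at all".
neighbors = defaultdict(set)
for src, dst in EDGES:
    neighbors[src].add(dst)
    neighbors[dst].add(src)

def connected_component(start):
    """Every column connected to `start` in either direction."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbors[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

affected_columns = connected_component("legacy_orders.id")
affected_tables = {col.rsplit(".", 1)[0] for col in affected_columns}
print(affected_tables)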

6. Unused Column Detection = Reachability Analysis

Typical Question: Which source columns are never used downstream?

Graph operation: Find all source nodes with no outgoing edges.

Unused source columns:
  - raw.users.legacy_id (no downstream dependencies)
  - raw.products.deprecated_sku (no downstream dependencies)
  - raw.orders.internal_notes (no downstream dependencies)
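
This one doesn't even need a traversal: given the source columns (from the catalog) and the edge list, it's a set difference. A sketch with hypothetical inputs:

# Hypothetical inputs: source columns from the catalog, edges from parsed SQL.
SOURCE_COLUMNS = {
    "raw.users.id", "raw.users.legacy_id",
    "raw.orders.amount", "raw.orders.internal_notes",
}
EDGES = [
    ("raw.users.id", "staging.user_metrics.user_id"),
    ("raw.orders.amount", "staging.orders.amount"),
]

# A source column with no outgoing edge is never read by any transformation.
columns_with_consumers = {src for src, _ in EDGES}
unused = SOURCE_COLUMNS - columns_with_consumers
print(sorted(unused))  # ['raw.orders.internal_notes', 'raw.users.legacy_id']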

7. Critical Path Analysis = Edge Centrality

Typical Question: Which transformations are bottlenecks?

Graph operation: Compute betweenness centrality or count paths through each node.

High-centrality nodes:
  1. staging.user_metrics.user_id (flows into 47 downstream columns)
  2. staging.order_facts.order_date (flows into 38 downstream columns)
  3. analytics.revenue.total (flows into 23 downstream columns)
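
True betweenness centrality takes more code (or a graph library such as networkx), so the sketch below uses the simpler proxy from the example above: rank each column by how many downstream columns depend on it. Edges are again made up.

from collections import defaultdict, deque

# Hypothetical edges; a real graph would have thousands.
EDGES = [
    ("staging.user_metrics.user_id", "analytics.revenue.user_id"),
    ("staging.user_metrics.user_id", "ml.churn_prediction.user_id"),
    ("analytics.revenue.user_id", "dashboard.executive_summary.user_id"),
    ("staging.order_facts.order_date", "analytics.revenue.order_date"),
]

children = defaultdict(set)
for src, dst in EDGES:
    children[src].add(dst)

def downstream_count(column):
    """How many downstream columns depend on `column` (forward BFS)."""
    seen, queue = set(), deque([column])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return len(seen)

# Rank columns by how many downstream columns they feed.
for col in sorted(children, key=downstream_count, reverse=True):
    print(col, downstream_count(col))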

Why This Framing Matters

Manual Tracing Does Not Scale

For 10 queries, I can trace dependencies manually. For 100 queries, it's slow—probably a full-day job, and I don't expect to do anything else that day. For 1,000+ queries, I would probably just give up.

However, graph algorithms can scale. BFS/DFS traversal is O(V + E). Path finding is well-studied. Subgraph extraction is solved.

Different Questions, Same Algorithm

If we treat the table columns and queries as a graph, we can answer all these questions with the same data structure.

SQL task → graph operation:

  • Writing queries → graph construction via traversal
  • Impact analysis → forward BFS
  • Root cause analysis → backward BFS
  • "Where does X come from?" → backward path finding
  • "What does X affect?" → forward path finding
  • PII tracking → property-based subgraph extraction
  • Migration planning → connected component analysis
  • Dead column detection → degree analysis
  • Bottleneck detection → centrality metrics

If you want to learn more about the graph operations mentioned above, check out their Wikipedia pages.

The Graph Is the Documentation

Manual documentation always decays. Someone changes a query, forgets to update the wiki, and the lineage docs go stale. Once the docs no longer reflect reality, people stop updating them, which makes them even more stale. It's a vicious cycle.

If we can derive the graph directly from SQL, the graph is the documentation. When we change the SQL, the graph updates automatically. No manual process.


Unfortunately, Building the Graph Is Hard

Extracting column lineage requires:

  1. SQL parsing into an AST across dialects (BigQuery, Snowflake, Postgres, Spark SQL)
  2. Alias resolution (u.name → users.name) through multiple levels
  3. CTE tracking (each CTE is a virtual table with its own column namespace)
  4. Subquery handling (nested scopes with their own column mappings)
  5. Star expansion (SELECT * → which columns?) requires schema knowledge
  6. Expression tracking (aggregations, window functions, CASE expressions)
  7. Cross-query connection (output of query A → input of query B)

Plus all the SQL edge cases I cannot imagine yet.
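
To give a feel for steps 1 and 2, here is a minimal sketch using the open-source sqlglot parser (one parser among several; this is not the tooling behind any particular product, and the query is made up). It only lists the columns a query reads and the output names it produces; full lineage still needs all the other steps above.

import sqlglot
from sqlglot import exp

# A made-up query with aliases and a CTE.
sql = """
WITH o AS (SELECT user_id, amount AS total FROM raw.orders)
SELECT u.id AS user_id, SUM(o.total) AS total_spent
FROM raw.users AS u JOIN o ON u.id = o.user_id
GROUP BY u.id
"""

tree = sqlglot.parse_one(sql)

# Every column reference, with whatever qualifier (table or alias) it carries.
for col in tree.find_all(exp.Column):
    print("reads:", f"{col.table}.{col.name}" if col.table else col.name)

# The names of the top-level output columns.
for projection in tree.expressions:
    print("produces:", projection.alias_or_name)

Even in this tiny example the qualifiers are aliases (u, o) rather than real tables, and o is a CTE rather than a physical table; resolving those is exactly what steps 2 and 3 are about.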

Practical Challenges

Extracting column lineage requires stitching together the following components:

  • Code repositories for SQL files
  • Database catalogs for table schemas
  • Execution logs for runtime information
  • Custom glue code to connect these pieces

Scale Challenges

Real data warehouses have thousands of tables, millions of historical queries, and evolving schemas. The graph changes with every SQL execution.

Data-Dependent Logic

Some lineage can only be determined at runtime:

  • Dynamic SQL built from variables
  • Value-based pivoting without explicit column lists
  • Conditional logic that varies by data content

Common Approaches

Most teams either:

  • Don't track lineage and manually answer graph questions each time—it's not a top priority for most engineers, and they fall into the vicious cycle described above.
  • Use third-party tools that build lineage from query logs.
  • Build custom solutions that only handle simple cases.
  • Resort to manual documentation that becomes stale (the vicious cycle again).

Summary

In this post, I've framed SQL tasks as graph traversal problems. There is nothing controversial here: what I've described is exactly what column lineage tries to capture. The lineage edges are the graph edges connecting the column nodes.

Debugging a data quality issue or planning a migration is graph traversal. The question is how to build a graph that reflects reality and is easy to use.

This is the third post in a series on treating SQL queries as graph traversal problems.

How to Write SQL for Perfect Column Lineage

I've learned that column lineage should be obvious from reading the SQL code itself—without checking table schemas or guessing auto-generated column names.

When I write lineage-friendly SQL, anyone (including tools) can trace data flow without database access. Code reviews are faster, debugging is easier, and static analysis becomes possible.


The Problem with Ambiguous SQL

Consider this query:

SELECT
    user_id,
    COUNT(*),
    total,
    id + 1
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY user_id

What columns are in the output?

  • user_id - From which table? users or orders?
  • COUNT(*) - What's the column name? count, COUNT(*), or something database-dependent?
  • total - From orders? Or computed somewhere?
  • id + 1 - What's this column called?

To understand this query, I need to:

  1. Check if both tables have a user_id column
  2. Know what my database auto-names COUNT(*) as
  3. Figure out if total is ambiguous
  4. Know how my database names expressions

This is inefficient for code review and impossible for static analysis. I've wasted hours on queries like this.


Rules I Follow

These are rules I follow to make column lineage explicit and traceable without database access.

These are opinionated rules based on my experience building lineage tools. Not all teams will agree with every rule. The goal is to enable lineage analysis and improve code readability.

1: No Duplicate Column Names

Consider the following query:

SELECT u.id, o.id
FROM users u JOIN orders o ON u.id = o.user_id

During code review, I will ask: which id is which? Some databases reject duplicate output column names outright (for example, when the query defines a view or CTE), and even where the query runs, the result is ambiguous.

If we rewrite the query in the following way:

SELECT u.id as user_id, o.id as order_id
FROM users u JOIN orders o ON u.id = o.user_id

The query now clearly shows which id is the user_id and which is the order_id, and we know the output columns precisely.


2: Explicit Aliases for Aggregates

Bad:

SELECT user_id, COUNT(*), SUM(total), AVG(amount)
FROM orders GROUP BY user_id

Aggregate column names are database-dependent (count, sum, avg, or ?column?).

Good:

SELECT
    user_id,
    COUNT(*) as order_count,
    SUM(total) as total_revenue,
    AVG(amount) as avg_order_amount
FROM orders GROUP BY user_id


3: Explicit Aliases for Window Functions

Bad:

SELECT employee_id, salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)
FROM employees

Good:

SELECT employee_id, salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees


4: Qualified Columns in JOINs

Bad:

SELECT id, name, total
FROM users u JOIN orders o ON u.id = o.user_id

The source table of each column is ambiguous.

Good:

SELECT u.id as user_id, u.name as user_name, o.total as order_total
FROM users u JOIN orders o ON u.id = o.user_id

Each column traces to its source table without database access.


5: Explicit Aliases for CASE Expressions

Bad:

SELECT product_id,
    CASE WHEN price > 100 THEN 'expensive'
         WHEN price > 50 THEN 'moderate'
         ELSE 'cheap' END
FROM products

Good:

SELECT product_id,
    CASE WHEN price > 100 THEN 'expensive'
         WHEN price > 50 THEN 'moderate'
         ELSE 'cheap' END as price_category
FROM products


6: Explicit Aliases for Scalar Subqueries

Bad:

SELECT u.id, u.name,
    (SELECT COUNT(*) FROM orders WHERE user_id = u.id)
FROM users u

Good:

SELECT u.id, u.name,
    (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as order_count
FROM users u


7: Explicit Aliases for Literals

Bad:

SELECT 'v1.0', CURRENT_TIMESTAMP(), 1 FROM table1

Good:

SELECT 'v1.0' as version, CURRENT_TIMESTAMP() as query_timestamp, 1 as partition_id
FROM table1


8: Explicit Aliases for Arithmetic

Bad:

SELECT price * 1.1, quantity * price, revenue - cost FROM products

Good:

SELECT
    price * 1.1 as price_with_tax,
    quantity * price as line_total,
    revenue - cost as profit
FROM products


9: Explicit Aliases for Type Casts (Optional)

Bad:

SELECT CAST(created_at AS DATE), SAFE_CAST(amount AS INT64) FROM orders

Good:

SELECT CAST(created_at AS DATE) as order_date, SAFE_CAST(amount AS INT64) as amount_int
FROM orders


10: The PIVOT/UNPIVOT Problem

Dynamic PIVOT breaks static lineage completely.

This is bad:

SELECT *
FROM (SELECT product_id, month, revenue FROM sales)
PIVOT (
    SUM(revenue)
    FOR month IN (SELECT DISTINCT month FROM sales)  -- Dynamic!
)

Output columns depend on data values. Impossible to trace lineage statically.

Better - Explicit PIVOT:

SELECT product_id, jan_revenue, feb_revenue, mar_revenue
FROM (SELECT product_id, month, revenue FROM sales)
PIVOT (
    SUM(revenue)
    FOR month IN ('Jan', 'Feb', 'Mar')
) AS pivoted (product_id, jan_revenue, feb_revenue, mar_revenue)

Best - Explicit CASE Statements:

SELECT
    product_id,
    SUM(CASE WHEN month = 'Jan' THEN revenue ELSE 0 END) as jan_revenue,
    SUM(CASE WHEN month = 'Feb' THEN revenue ELSE 0 END) as feb_revenue,
    SUM(CASE WHEN month = 'Mar' THEN revenue ELSE 0 END) as mar_revenue
FROM sales
GROUP BY product_id

Each output column shows exactly what it computes. Works across all SQL dialects. No special syntax.

When dynamic PIVOT is unavoidable (user-configurable reports), we should isolate it and document that lineage is dynamic for that section.


Why This Matters to Me

I spent years drawing column lineage diagrams on paper while debugging SQL pipelines. Eventually I discovered that SQL parsers can construct these diagrams automatically—but only if the SQL is written clearly. I wrote about the full journey in Why Column Lineage Should Be Built-In, Not Bolted-On.

The rules in this post are the practical foundation for that approach. When every expression has an explicit alias and every column traces to its source, lineage becomes a property of the code itself rather than something reverse-engineered from query logs after deployment.

Why Column Lineage Should Be Built-In, Not Bolted-On

In my first few data engineering jobs, I spent hours, sometimes days, debugging data quality issues. My workflow was always the same: dig through SQL queries, check information schema, scan query logs, and try to piece together what went wrong. For impact analysis, I'd do the same thing in reverse: trace forward through the code to see what might break.

I found myself drawing diagrams on paper to visualize the relationships between columns. Sometimes within a single query with complex CTEs, sometimes across multiple queries in a pipeline. These hand-drawn graphs were the only way I could keep track of how data flowed.