Building a Sankey Diagram from SQL Results

How do I build a Sankey diagram from SQL results?

Transforming relational query outputs into the node-link structure required to visualize proportional flows with a Sankey diagram.

Welcome to the Galaxy, Guardian!
You'll be receiving a confirmation email

Follow us on twitter :)

Oops! Something went wrong while submitting the form.

Description

Example H2

Example H3

Sankey diagrams are one of the most effective ways to show how quantities flow from one set of categories to another.

This guide walks through every step—writing an aggregation query, reshaping results, assigning node IDs, and rendering the final visualization in Python or with BI tools—so you can go from raw SQL rows to an interactive Sankey in minutes.

Why Use a Sankey Diagram?

Unlike simple bar or line charts, a Sankey diagram emphasizes flow proportionality. The width of each link encodes the magnitude of movement between two stages—ideal for mapping user journeys, energy transfers, or money moving through cost centers.

Prerequisites

Access to a relational database (e.g., PostgreSQL, MySQL, Snowflake)
A SQL editor such as Galaxy for running and iterating on queries
A visualization environment—Python (Plotly, matplotlib-sankey), JavaScript (D3.js), or a BI tool that supports Sankey visuals (Power BI, Tableau, Looker Studio)

Step 1 – Model the Flow in SQL

At minimum, your query must output three columns:

source – the starting node label
target – the ending node label
value – a numeric measure of magnitude (COUNT, SUM, etc.)

Example for a user-funnel table called session_events:

SELECT prev_step AS source, next_step AS target, COUNT(*) AS value FROM (SELECT user_id, LAG(step) OVER (PARTITION BY user_id ORDER BY occurred_at) AS prev_step, step AS next_step FROM session_events) t WHERE prev_step IS NOT NULL GROUP BY 1, 2;

Step 2 – Export or Pull Results into Code

With Galaxy, you can:

Run the query, inspect the grid for correctness
Save it to a Collection so teammates can reuse or endorse it
Click Export → CSV/Parquet or use galaxy-cli query <id> --output json in CI

Step 3 – Construct Node and Link Arrays

Most libraries expect:

nodes: an array of unique names
links: sourceID, targetID, value

Python example:

import pandas as pd, plotly.graph_objects as go # df = pd.read_csv('flow.csv') # or use Galaxy API result nodes = pd.Series(pd.concat([df.source, df.target])).unique() node_map = {name: i for i, name in enumerate(nodes)} links = dict( source=df.source.map(node_map), target=df.target.map(node_map), value=df.value ) fig = go.Figure(go.Sankey( node=dict(label=nodes), link=links )) fig.show()

Best Practices

Aggregate in SQL, Not in Code

Do heavy grouping and filtering within your database to keep the dataset small and consistent.

Keep Node Labels Human-Readable

Down-case, title-case, or otherwise normalize text so identical concepts aren’t split in the visualization.

Use Meaningful Sorting

Plotly and D3 respect the node order supplied. Ordering nodes logically (e.g., funnel stage) improves readability.

Common Mistakes and How to Fix Them

Over-granular Rows

Problem: You leave raw event rows ungrouped, flooding the sankey with duplicates.
Fix: Ensure a GROUP BY source, target with an aggregate measure.

Inconsistent Node Naming

Problem: ‘signup’ vs ‘Sign-Up’ vs ‘SignUp’ are treated as separate nodes.
Fix: Normalize in SQL: LOWER(TRIM(step)) or a CASE statement mapping synonyms.

Missing Stages

Problem: Users who skip a step cause NULL nodes and fail the join.
Fix: Use COALESCE to label missing steps (‘None’, ‘Drop-off’) so the flow is captured.

Galaxy & Sankey Diagrams

While Galaxy focuses on lightning-fast querying, its upcoming lightweight visualization layer will let you preview Sankeys without leaving the editor. Today you can:

Write and optimize the SQL with Galaxy’s AI copilot
Version-control the query in a Collection
Export results to Python, Power BI, or another front-end to build the final diagram

End-to-End Example

-- 1. Build edges in PostgreSQL WITH ordered AS ( SELECT user_id, occurred_at, step, LAG(step) OVER (PARTITION BY user_id ORDER BY occurred_at) AS prev_step FROM session_events ), edges AS ( SELECT prev_step AS source, step AS target, COUNT(*) AS value FROM ordered WHERE prev_step IS NOT NULL GROUP BY 1, 2 ) SELECT * FROM edges; # 2. Visualize in Plotly import pandas as pd, plotly.graph_objects as go df = pd.read_sql("SELECT * FROM edges", conn) # or Galaxy export nodes = pd.Series(pd.concat([df.source, df.target])).unique() node_id = {v: k for k, v in enumerate(nodes)} fig = go.Figure(go.Sankey( node=dict(label=nodes, pad=15, thickness=20), link=dict( source=df.source.map(node_id), target=df.target.map(node_id), value=df.value, ) )) fig.update_layout(title_text="User Journey Sankey", font_size=12) fig.show()

Next Steps

Add more dimensions (e.g., color links by user segment)
Schedule the SQL in Galaxy and push the output to an S3 bucket daily
Embed the Plotly chart in an internal dashboard

Why Building a Sankey Diagram from SQL Results is important

Relational databases excel at counting and summing but fall short when you need to communicate complex flows to stakeholders. Translating SQL aggregates into a Sankey diagram bridges that gap, letting analysts surface actionable insights—where users drop off, which cost centers drain budget, or how data moves between systems. Mastering this workflow means you can stay in SQL (your single source of truth) while delivering rich, visually compelling stories that drive decisions.

Building a Sankey Diagram from SQL Results Example Usage


SELECT source, target, SUM(amount) AS value FROM payments_flow GROUP BY 1,2;

Building a Sankey Diagram from SQL Results Syntax

Common Mistakes

Relying on raw event rows without aggregation leads to millions of duplicate edges. Always GROUP BY and pre-aggregate in SQL to keep the dataset lean and meaningful.
Using inconsistent labels (e.g., case or spacing differences) creates duplicate nodes. Normalize or map them in the SELECT clause or a dimension table.
Failing to assign numeric node IDs causes library errors. Build a mapping in code (<code>enumerate()</code>) or generate IDs in SQL with <code>DENSE_RANK()</code>.

Frequently Asked Questions (FAQs)

Can I build the entire Sankey diagram inside SQL?

You can generate the node and link tables (including numeric IDs) in SQL, but drawing the diagram still requires a visualization layer such as Plotly, D3, or your BI tool.

How does Galaxy help with Sankey workflows?

Galaxy speeds up query authoring with AI autocomplete, lets you share and endorse the SQL in Collections, and offers one-click exports your code can ingest for visualization. Upcoming features will preview Sankeys directly in the editor.

Which SQL dialects support window functions needed for funnels?

PostgreSQL, Snowflake, Redshift, BigQuery, and most modern warehouses support LAG()/LEAD() and other window functions used to derive sequential steps.

What’s the best library for interactive Sankey diagrams?

For Python users, Plotly provides a high-level API; in JavaScript, D3.js offers maximum flexibility. BI users can leverage built-in Sankey visuals in Power BI or Tableau.