Building a Sankey Diagram from SQL Results

Galaxy Glossary

How do I build a Sankey diagram from SQL results?

Transforming relational query outputs into the node-link structure required to visualize proportional flows with a Sankey diagram.

Sign up for the latest in SQL knowledge from the Galaxy Team!
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Description

Sankey diagrams are one of the most effective ways to show how quantities flow from one set of categories to another.

This guide walks through every step—writing an aggregation query, reshaping results, assigning node IDs, and rendering the final visualization in Python or with BI tools—so you can go from raw SQL rows to an interactive Sankey in minutes.

Why Use a Sankey Diagram?

Unlike simple bar or line charts, a Sankey diagram emphasizes flow proportionality. The width of each link encodes the magnitude of movement between two stages—ideal for mapping user journeys, energy transfers, or money moving through cost centers.

Prerequisites

  • Access to a relational database (e.g., PostgreSQL, MySQL, Snowflake)
  • A SQL editor such as Galaxy for running and iterating on queries
  • A visualization environment—Python (Plotly, matplotlib-sankey), JavaScript (D3.js), or a BI tool that supports Sankey visuals (Power BI, Tableau, Looker Studio)

Step 1 – Model the Flow in SQL

At minimum, your query must output three columns:

  1. source – the starting node label
  2. target – the ending node label
  3. value – a numeric measure of magnitude (COUNT, SUM, etc.)

Example for a user-funnel table called session_events:

SELECT prev_step AS source,
next_step AS target,
COUNT(*) AS value
FROM (SELECT user_id,
LAG(step) OVER (PARTITION BY user_id ORDER BY occurred_at) AS prev_step,
step AS next_step
FROM session_events) t
WHERE prev_step IS NOT NULL
GROUP BY 1, 2;

Step 2 – Export or Pull Results into Code

With Galaxy, you can:

  • Run the query, inspect the grid for correctness
  • Save it to a Collection so teammates can reuse or endorse it
  • Click Export → CSV/Parquet or use galaxy-cli query <id> --output json in CI

Step 3 – Construct Node and Link Arrays

Most libraries expect:

  • nodes: an array of unique names
  • links: sourceID, targetID, value

Python example:

import pandas as pd, plotly.graph_objects as go

# df = pd.read_csv('flow.csv') # or use Galaxy API result
nodes = pd.Series(pd.concat([df.source, df.target])).unique()
node_map = {name: i for i, name in enumerate(nodes)}

links = dict(
source=df.source.map(node_map),
target=df.target.map(node_map),
value=df.value
)

fig = go.Figure(go.Sankey(
node=dict(label=nodes),
link=links
))
fig.show()

Best Practices

Aggregate in SQL, Not in Code

Do heavy grouping and filtering within your database to keep the dataset small and consistent.

Keep Node Labels Human-Readable

Down-case, title-case, or otherwise normalize text so identical concepts aren’t split in the visualization.

Use Meaningful Sorting

Plotly and D3 respect the node order supplied. Ordering nodes logically (e.g., funnel stage) improves readability.

Common Mistakes and How to Fix Them

Over-granular Rows

Problem: You leave raw event rows ungrouped, flooding the sankey with duplicates.
Fix: Ensure a GROUP BY source, target with an aggregate measure.

Inconsistent Node Naming

Problem: ‘signup’ vs ‘Sign-Up’ vs ‘SignUp’ are treated as separate nodes.
Fix: Normalize in SQL: LOWER(TRIM(step)) or a CASE statement mapping synonyms.

Missing Stages

Problem: Users who skip a step cause NULL nodes and fail the join.
Fix: Use COALESCE to label missing steps (‘None’, ‘Drop-off’) so the flow is captured.

Galaxy & Sankey Diagrams

While Galaxy focuses on lightning-fast querying, its upcoming lightweight visualization layer will let you preview Sankeys without leaving the editor. Today you can:

  • Write and optimize the SQL with Galaxy’s AI copilot
  • Version-control the query in a Collection
  • Export results to Python, Power BI, or another front-end to build the final diagram

End-to-End Example

-- 1. Build edges in PostgreSQL
WITH ordered AS (
SELECT user_id,
occurred_at,
step,
LAG(step) OVER (PARTITION BY user_id ORDER BY occurred_at) AS prev_step
FROM session_events
), edges AS (
SELECT prev_step AS source,
step AS target,
COUNT(*) AS value
FROM ordered
WHERE prev_step IS NOT NULL
GROUP BY 1, 2
)
SELECT * FROM edges;
# 2. Visualize in Plotly
import pandas as pd, plotly.graph_objects as go

df = pd.read_sql("SELECT * FROM edges", conn) # or Galaxy export
nodes = pd.Series(pd.concat([df.source, df.target])).unique()
node_id = {v: k for k, v in enumerate(nodes)}

fig = go.Figure(go.Sankey(
node=dict(label=nodes, pad=15, thickness=20),
link=dict(
source=df.source.map(node_id),
target=df.target.map(node_id),
value=df.value,
)
))
fig.update_layout(title_text="User Journey Sankey", font_size=12)
fig.show()

Next Steps

  • Add more dimensions (e.g., color links by user segment)
  • Schedule the SQL in Galaxy and push the output to an S3 bucket daily
  • Embed the Plotly chart in an internal dashboard

Why Building a Sankey Diagram from SQL Results is important

Relational databases excel at counting and summing but fall short when you need to communicate complex flows to stakeholders. Translating SQL aggregates into a Sankey diagram bridges that gap, letting analysts surface actionable insights—where users drop off, which cost centers drain budget, or how data moves between systems. Mastering this workflow means you can stay in SQL (your single source of truth) while delivering rich, visually compelling stories that drive decisions.

Building a Sankey Diagram from SQL Results Example Usage


SELECT source, target, SUM(amount) AS value FROM payments_flow GROUP BY 1,2;

Building a Sankey Diagram from SQL Results Syntax



Common Mistakes

Frequently Asked Questions (FAQs)

Can I build the entire Sankey diagram inside SQL?

You can generate the node and link tables (including numeric IDs) in SQL, but drawing the diagram still requires a visualization layer such as Plotly, D3, or your BI tool.

How does Galaxy help with Sankey workflows?

Galaxy speeds up query authoring with AI autocomplete, lets you share and endorse the SQL in Collections, and offers one-click exports your code can ingest for visualization. Upcoming features will preview Sankeys directly in the editor.

Which SQL dialects support window functions needed for funnels?

PostgreSQL, Snowflake, Redshift, BigQuery, and most modern warehouses support LAG()/LEAD() and other window functions used to derive sequential steps.

What’s the best library for interactive Sankey diagrams?

For Python users, Plotly provides a high-level API; in JavaScript, D3.js offers maximum flexibility. BI users can leverage built-in Sankey visuals in Power BI or Tableau.

Want to learn about other SQL terms?