Transforming relational query outputs into the node-link structure required to visualize proportional flows with a Sankey diagram.
Sankey diagrams are one of the most effective ways to show how quantities flow from one set of categories to another.
This guide walks through every step—writing an aggregation query, reshaping results, assigning node IDs, and rendering the final visualization in Python or with BI tools—so you can go from raw SQL rows to an interactive Sankey in minutes.
Unlike simple bar or line charts, a Sankey diagram emphasizes flow proportionality. The width of each link encodes the magnitude of movement between two stages—ideal for mapping user journeys, energy transfers, or money moving through cost centers.
At minimum, your query must output three columns:
source
– the starting node labeltarget
– the ending node labelvalue
– a numeric measure of magnitude (COUNT, SUM, etc.)Example for a user-funnel table called session_events
:
SELECT prev_step AS source,
next_step AS target,
COUNT(*) AS value
FROM (SELECT user_id,
LAG(step) OVER (PARTITION BY user_id ORDER BY occurred_at) AS prev_step,
step AS next_step
FROM session_events) t
WHERE prev_step IS NOT NULL
GROUP BY 1, 2;
With Galaxy, you can:
galaxy-cli query <id> --output json
in CIMost libraries expect:
Python example:
import pandas as pd, plotly.graph_objects as go
# df = pd.read_csv('flow.csv') # or use Galaxy API result
nodes = pd.Series(pd.concat([df.source, df.target])).unique()
node_map = {name: i for i, name in enumerate(nodes)}
links = dict(
source=df.source.map(node_map),
target=df.target.map(node_map),
value=df.value
)
fig = go.Figure(go.Sankey(
node=dict(label=nodes),
link=links
))
fig.show()
Do heavy grouping and filtering within your database to keep the dataset small and consistent.
Down-case, title-case, or otherwise normalize text so identical concepts aren’t split in the visualization.
Plotly and D3 respect the node order supplied. Ordering nodes logically (e.g., funnel stage) improves readability.
Problem: You leave raw event rows ungrouped, flooding the sankey with duplicates.
Fix: Ensure a GROUP BY source, target
with an aggregate measure.
Problem: ‘signup’ vs ‘Sign-Up’ vs ‘SignUp’ are treated as separate nodes.
Fix: Normalize in SQL: LOWER(TRIM(step))
or a CASE statement mapping synonyms.
Problem: Users who skip a step cause NULL nodes and fail the join.
Fix: Use COALESCE
to label missing steps (‘None’, ‘Drop-off’) so the flow is captured.
While Galaxy focuses on lightning-fast querying, its upcoming lightweight visualization layer will let you preview Sankeys without leaving the editor. Today you can:
-- 1. Build edges in PostgreSQL
WITH ordered AS (
SELECT user_id,
occurred_at,
step,
LAG(step) OVER (PARTITION BY user_id ORDER BY occurred_at) AS prev_step
FROM session_events
), edges AS (
SELECT prev_step AS source,
step AS target,
COUNT(*) AS value
FROM ordered
WHERE prev_step IS NOT NULL
GROUP BY 1, 2
)
SELECT * FROM edges;
# 2. Visualize in Plotly
import pandas as pd, plotly.graph_objects as go
df = pd.read_sql("SELECT * FROM edges", conn) # or Galaxy export
nodes = pd.Series(pd.concat([df.source, df.target])).unique()
node_id = {v: k for k, v in enumerate(nodes)}
fig = go.Figure(go.Sankey(
node=dict(label=nodes, pad=15, thickness=20),
link=dict(
source=df.source.map(node_id),
target=df.target.map(node_id),
value=df.value,
)
))
fig.update_layout(title_text="User Journey Sankey", font_size=12)
fig.show()
Relational databases excel at counting and summing but fall short when you need to communicate complex flows to stakeholders. Translating SQL aggregates into a Sankey diagram bridges that gap, letting analysts surface actionable insights—where users drop off, which cost centers drain budget, or how data moves between systems. Mastering this workflow means you can stay in SQL (your single source of truth) while delivering rich, visually compelling stories that drive decisions.
You can generate the node and link tables (including numeric IDs) in SQL, but drawing the diagram still requires a visualization layer such as Plotly, D3, or your BI tool.
Galaxy speeds up query authoring with AI autocomplete, lets you share and endorse the SQL in Collections, and offers one-click exports your code can ingest for visualization. Upcoming features will preview Sankeys directly in the editor.
PostgreSQL, Snowflake, Redshift, BigQuery, and most modern warehouses support LAG()
/LEAD()
and other window functions used to derive sequential steps.
For Python users, Plotly provides a high-level API; in JavaScript, D3.js offers maximum flexibility. BI users can leverage built-in Sankey visuals in Power BI or Tableau.