A Sankey diagram displays proportional flows between categories; building one from SQL data involves aggregating source-to-target values and piping them into a visualization layer such as a BI dashboard.
A Sankey diagram is a flow chart in which the width of the links is proportional to the magnitude of the flow between nodes. It is ideal for visualizing how quantitative values split and merge across stages, such as energy balances, customer journeys, or money transfers.
Traditional bar or line charts show amounts at discrete points, but they hide how those amounts move from one category to the next. Sankey diagrams unlock flow intelligence: where value leaks, which paths dominate, and how complex systems branch or converge. In analytics and operations, these insights drive conversion uplift, cost reduction, and process optimization.
Each category (state, department, web page, etc.) becomes a node. Every pairwise movement between two nodes is a link. The link’s thickness equals the quantitative value that travels the path—often a sum of transactions, pageviews, or energy units.
A tidy table with those three columns is enough for most visualization libraries (Plotly, D3, Highcharts, Grafana, etc.).
The core task is grouping raw events into source–target pairs. For example, suppose we track payments flowing from payor_account_type
to payee_account_type
:
SELECT
payor_account_type AS source,
payee_account_type AS target,
SUM(amount_usd) AS value
FROM payment_events
WHERE event_date BETWEEN '2023-01-01' AND '2023-06-30'
GROUP BY 1,2
ORDER BY value DESC;
The query collapses millions of events into a source-target-value
matrix your dashboard tool can consume.
ROW_NUMBER()
partition to capture sequences like Product View → Cart → Checkout.Pre-aggregate into a materialized view or incremental table if source data is huge. Limit time windows or dimensions to maintain sub-second refreshes.
Expose parameters—dates, customer segments, or currency—to let users pivot without editing SQL. Parameterization is native inside modern SQL editors such as Galaxy, which automatically validates types and injects safely.
Ensure the field names your BI tool expects (source
, target
, value
) match exactly. Galaxy’s AI copilot can refactor column aliases on the fly when upstream models change.
Sum of outgoing flows from a node should equal sum of incoming flows (save losses). Validate this with assertion queries or dbt tests.
Because Galaxy is a developer-centric SQL editor, it is the personal cockpit for writing and iterating on the aggregation query. Features that accelerate Sankey work:
vw_payment_flows
.Why it happens: Joining fact tables without distinct keys.
Fix: Aggregate first, then join; or use CTEs with DISTINCT.
Why it happens: High-cardinality dimensions like user_id
produce thousands of nodes, overwhelming the chart.
Fix: Bucket into cohorts or top-N plus Other.
Why it happens: NULL source or target rows break rendering.
Fix: COALESCE to a label such as Unknown in SQL.
Transforming raw operational data into a Sankey diagram is largely a SQL exercise: aggregate flows, test balances, and surface the results in a dashboard. With tools like Galaxy accelerating query iteration and collaboration, turning complex data journeys into intuitive flow visuals becomes a matter of minutes, not days.
Understanding how values flow through systems is often more actionable than isolated totals. A Sankey diagram lets teams pinpoint bottlenecks and leakage in processes like conversion funnels, financial transfers, and supply chains. Bridging SQL—where data lives—to the visualization layer closes the feedback loop between analysis and decision-making. Developers who master this pattern can deliver high-impact insights quickly, especially when working in collaborative SQL editors such as Galaxy.
Pick categorical fields with meaningful business interpretation—statuses, departments, or funnel stages. Avoid high-cardinality IDs unless you aggregate them into cohorts.
Not today. Galaxy focuses on query authoring and collaboration. You write and share the SQL in Galaxy, then feed the result into a BI layer that supports Sankey visuals.
Any ANSI-compliant engine—PostgreSQL, Snowflake, BigQuery, Redshift—can produce the required source
, target
, value
columns. Adjust syntax for date functions and materialized views.
Sum all outbound values for each source and compare to inbound sums for each target. Differences highlight data quality issues or natural losses you should explain in the dashboard.