Sankey Diagram: Example SQL to Dashboard

How do I build a Sankey diagram from SQL data and embed it in a dashboard?

A Sankey diagram displays proportional flows between categories; building one from SQL data involves aggregating source-to-target values and piping them into a visualization layer such as a BI dashboard.

Welcome to the Galaxy, Guardian!
You'll be receiving a confirmation email

Follow us on twitter :)

Oops! Something went wrong while submitting the form.

Description

Example H2

Example H3

What Is a Sankey Diagram?

A Sankey diagram is a flow chart in which the width of the links is proportional to the magnitude of the flow between nodes. It is ideal for visualizing how quantitative values split and merge across stages, such as energy balances, customer journeys, or money transfers.

Why Should Data Teams Care?

Traditional bar or line charts show amounts at discrete points, but they hide how those amounts move from one category to the next. Sankey diagrams unlock flow intelligence: where value leaks, which paths dominate, and how complex systems branch or converge. In analytics and operations, these insights drive conversion uplift, cost reduction, and process optimization.

How a Sankey Diagram Works

Nodes and Links

Each category (state, department, web page, etc.) becomes a node. Every pairwise movement between two nodes is a link. The link’s thickness equals the quantitative value that travels the path—often a sum of transactions, pageviews, or energy units.

Data Requirements

source – starting node
target – ending node
value – numeric magnitude

A tidy table with those three columns is enough for most visualization libraries (Plotly, D3, Highcharts, Grafana, etc.).

Deriving Sankey Data with SQL

The core task is grouping raw events into source–target pairs. For example, suppose we track payments flowing from payor_account_type to payee_account_type:

SELECT payor_account_type AS source, payee_account_type AS target, SUM(amount_usd) AS value FROM payment_events WHERE event_date BETWEEN '2023-01-01' AND '2023-06-30' GROUP BY 1,2 ORDER BY value DESC;

The query collapses millions of events into a source-target-value matrix your dashboard tool can consume.

Typical Transformations

Stage labeling: map URLs or statuses to friendly names using CASE expressions or join tables.
Funnel position: build a ROW_NUMBER() partition to capture sequences like Product View → Cart → Checkout.
Thresholding: filter out micro-flows below 0.1 % to declutter the visual.

Best Practices for Building Dashboards

1. Keep the Query Fast

Pre-aggregate into a materialized view or incremental table if source data is huge. Limit time windows or dimensions to maintain sub-second refreshes.

2. Design for Iteration

Expose parameters—dates, customer segments, or currency—to let users pivot without editing SQL. Parameterization is native inside modern SQL editors such as Galaxy, which automatically validates types and injects safely.

3. Sync Schema and Visual

Ensure the field names your BI tool expects (source, target, value) match exactly. Galaxy’s AI copilot can refactor column aliases on the fly when upstream models change.

4. Test for Imbalance

Sum of outgoing flows from a node should equal sum of incoming flows (save losses). Validate this with assertion queries or dbt tests.

Galaxy in the Workflow

Because Galaxy is a developer-centric SQL editor, it is the personal cockpit for writing and iterating on the aggregation query. Features that accelerate Sankey work:

AI Copilot – suggest the correct GROUP BY, detect missing time filters, and propose index hints.
Collections – share the final source-target query with analytics and product teams; endorse it as the single source of truth.
Desktop Speed – render millions of preview rows without freezing, letting you validate path outliers rapidly.

End-to-End Example

Write aggregation SQL in Galaxy (see full code below).
Save to a Flows collection and endorse it.
Connect your dashboard tool (e.g., Grafana, Metabase) to the database view vw_payment_flows.
Choose a Sankey visualization and map fields.
Publish the dashboard and schedule refresh every hour.

Common Mistakes and How to Fix Them

1. Double Counting Flows

Why it happens: Joining fact tables without distinct keys.
Fix: Aggregate first, then join; or use CTEs with DISTINCT.

2. Exceeding Node Limits

Why it happens: High-cardinality dimensions like user_id produce thousands of nodes, overwhelming the chart.
Fix: Bucket into cohorts or top-N plus Other.

3. Ignoring Null Paths

Why it happens: NULL source or target rows break rendering.
Fix: COALESCE to a label such as Unknown in SQL.

Conclusion

Transforming raw operational data into a Sankey diagram is largely a SQL exercise: aggregate flows, test balances, and surface the results in a dashboard. With tools like Galaxy accelerating query iteration and collaboration, turning complex data journeys into intuitive flow visuals becomes a matter of minutes, not days.

Why Sankey Diagram: Example SQL to Dashboard is important

Understanding how values flow through systems is often more actionable than isolated totals. A Sankey diagram lets teams pinpoint bottlenecks and leakage in processes like conversion funnels, financial transfers, and supply chains. Bridging SQL—where data lives—to the visualization layer closes the feedback loop between analysis and decision-making. Developers who master this pattern can deliver high-impact insights quickly, especially when working in collaborative SQL editors such as Galaxy.

Sankey Diagram: Example SQL to Dashboard Example Usage


SELECT stage_from AS source, stage_to AS target, COUNT(*) AS value FROM user_journey_events WHERE event_ts >= CURRENT_DATE - INTERVAL '30 days' GROUP BY 1,2;

Sankey Diagram: Example SQL to Dashboard Syntax

Common Mistakes

Aggregating after joining multiple fact tables causes bloated counts. The correct approach is to aggregate each fact table separately into source-target pairs, then UNION or join on conformed dimensions.
Feeding more than ~100 nodes to most Sankey libraries leads to unreadable hairballs. Trim to top contributors or bucket long tails into an 'Other' category before visualization.
Leaving NULL values in the source or target fields breaks rendering or hides flows. Always COALESCE NULLs to an explicit label like 'Unknown' in your SQL.

Frequently Asked Questions (FAQs)

How do I decide which dimensions become nodes?

Pick categorical fields with meaningful business interpretation—statuses, departments, or funnel stages. Avoid high-cardinality IDs unless you aggregate them into cohorts.

Can I build a Sankey diagram directly inside Galaxy?

Not today. Galaxy focuses on query authoring and collaboration. You write and share the SQL in Galaxy, then feed the result into a BI layer that supports Sankey visuals.

What SQL dialects work with this pattern?

Any ANSI-compliant engine—PostgreSQL, Snowflake, BigQuery, Redshift—can produce the required source, target, value columns. Adjust syntax for date functions and materialized views.

How can I validate that totals balance across nodes?

Sum all outbound values for each source and compare to inbound sums for each target. Differences highlight data quality issues or natural losses you should explain in the dashboard.