Use SELECT statements on BigQuery’s public GitHub dataset to inspect, filter, and aggregate open-source license data.
Quickly audit codebases, track license trends, and ensure compliance by scanning millions of GitHub repositories already mirrored in BigQuery’s public datasets.
The bigquery-public-data.github_repos.licenses
table stores one row per repository license. Pair it with sample_contents
or your own tables for deeper insights.
Use a standard SELECT
listing the columns you need, then FROM
the licenses table.Apply WHERE
filters to narrow results and GROUP BY
for roll-ups.
SELECT license, COUNT(*) AS repo_count FROM `bigquery-public-data.github_repos.licenses` WHERE license = 'MIT' GROUP BY license;
Store repository URLs for each product in a Products
table.Join that list against the public dataset to pull license info for each item you sell.
SELECT p.id, p.name, l.license FROM Products p JOIN `bigquery-public-data.github_repos.licenses` l ON p.repo_url = l.repo_name;
Always select only needed columns, use date or license filters, and preview query bytes.Partition or materialize results when repeatedly accessed.
• Reference tables with full project IDs.
• Use LIMIT
during development.
• Persist heavy joins to temporary tables.
• Keep API keys and secrets out of shared queries.
1. Build a list of repositories tied to your products.
2. Query licenses.
3. Store flagged GPL or AGPL repos in a review table.
4. Share dashboards with legal teams.
Forgetting a WHERE
clause can read 100+ GB.Always filter.
Some repositories have multiple licenses. Aggregate correctly or deduplicate first.
.
No. Repositories without detected licenses won’t appear in github_repos.licenses
. Always LEFT JOIN if you need missing-license rows.
The public snapshot refreshes roughly once a week. For the latest changes, pull directly via GitHub’s API or mirror your own data.
Yes. Train models to predict license types or repository compliance risks using CREATE MODEL
on aggregated license and metadata features.