TABLESAMPLE lets you scan only a percentage or fixed number of rows instead of the entire table, reducing I/O and giving a quick approximation of large data sets. The sampling happens at the storage level, so rows are chosen without evaluating predicates first. Because of this, the subset is not ordered and may not be perfectly uniform; distribution depends on the underlying storage engine and the chosen sampling method (BERNOULLI or SYSTEM). The optional REPEATABLE clause ensures the same sample when the same seed is supplied. TABLESAMPLE is applied immediately after the FROM clause and before WHERE, GROUP BY, or JOIN processing. It is non-deterministic unless REPEATABLE is used and does not guarantee inclusion of specific rows, so use it only for exploratory analysis, statistics, or approximate aggregates—never for transactional queries that require completeness.
BERNOULLI
(keyword) - Row-level sampling; each row is checked independently.SYSTEM
(keyword) - Block-level sampling; entire data pages are selected, faster but less uniform.sample_size
(numeric) - Percentage (0.000001-100) or number of rows* depending on dialect.REPEATABLE
(keyword) - Optional clause that makes sampling deterministic.seed
(integer) - Seed value for the pseudo-random generator.SAMPLE (Oracle), LIMIT, FETCH FIRST, TOP, ORDER BY RANDOM, APPROXIMATE aggregation
SQL:2003 standard (feature T491)
BERNOULLI samples rows individually, providing a more uniform distribution but with higher overhead. SYSTEM samples entire data pages, offering speed at the cost of potential bias.
In SQL Server you can specify TABLESAMPLE (n ROWS). In PostgreSQL and BigQuery, only percentage sampling is supported; combine TABLESAMPLE with LIMIT for an approximate row count.
Add REPEATABLE (seed) after the TABLESAMPLE clause. Using the same seed recreates the identical subset on each execution.
Usually yes for very large tables, but on small tables the overhead of initiating the sampling engine may negate benefits.