Using Python’s asyncio to fire multiple HTTP requests concurrently and merge the JSON responses into a single pandas DataFrame for faster data ingestion.
Asynchronous API calls allow a Python program to initiate many HTTP requests at once without blocking on each individual response. When combined with pandas, the returned payloads can be transformed and concatenated into a single tabular DataFrame, dramatically reducing wall-clock time for data collection tasks.
Modern data products often depend on third-party APIs—marketing platforms, payment processors, IoT sensors, public open-data endpoints, and more. A typical ingestion job might need to hit dozens or hundreds of REST URLs for different customers, dates, or partitions. If you request each URL sequentially, overall runtime grows linearly with the number of calls, because every request pays its full network latency and server processing time before the next one starts. By switching to asynchronous I/O you:
• Maximize throughput by overlapping waiting time for I/O.
• Respect API rate limits by scheduling bursts intelligently.
• Reduce cloud compute costs because CPU sits idle less often.
• Simplify pipelines: gather, parse, deduplicate, and load to a warehouse in the same Python script before handing off to downstream SQL.
Python’s asyncio library implements an event loop that manages coroutines: functions defined with async def that can suspend execution with await. While a coroutine awaits a network response, the loop can switch to another coroutine. Popular async HTTP clients such as aiohttp and httpx expose awaitable methods, so they fit naturally into this model; libraries for other protocols, such as asyncssh for SSH, follow the same pattern.
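As a minimal, self-contained illustration of that switching behavior (asyncio.sleep stands in for network latency here, and names like wait_and_return are purely illustrative):

import asyncio

async def wait_and_return(label: str, delay: float) -> str:
    # await suspends this coroutine; the event loop is free to run others meanwhile.
    await asyncio.sleep(delay)
    return label

async def run_demo() -> None:
    # Both coroutines wait concurrently, so this finishes in ~1 s rather than ~2 s.
    results = await asyncio.gather(
        wait_and_return("first", 1.0),
        wait_and_return("second", 1.0),
    )
    print(results)  # ['first', 'second']

asyncio.run(run_demo())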
aiohttp is battle-tested, streaming-friendly, and well-documented. httpx offers an API similar to requests plus sync/async parity. Either works; the examples below use aiohttp.
Wrap the GET call inside an async def function. Always reuse a single aiohttp.ClientSession to keep TCP connections alive and reduce TLS handshake overhead.
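In isolation, that step might look like the sketch below; it previews the fetch_json helper used in the full example later on, and the 15-second total timeout is an arbitrary choice:

import aiohttp

async def fetch_json(session: aiohttp.ClientSession, url: str) -> dict:
    # The caller creates one ClientSession and passes it in, so every request
    # reuses the same pooled TCP/TLS connections.
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
        resp.raise_for_status()
        return await resp.json()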
For n URLs, call asyncio.gather(*tasks, return_exceptions=True). This runs the tasks concurrently, and when gather returns you have a list of JSON payloads (or exceptions) in the original order.
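A sketch of the scheduling step, assuming the fetch_json helper from above (the gather_payloads name is illustrative):

import asyncio
import aiohttp

async def gather_payloads(urls: list[str]) -> list:
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(fetch_json(session, url)) for url in urls]
        # return_exceptions=True keeps one failed URL from aborting the batch;
        # results come back in the same order as the input URLs.
        return await asyncio.gather(*tasks, return_exceptions=True)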
Use pandas.json_normalize, list comprehensions, or custom parsing to convert each dictionary into a row, then concatenate all rows into a single DataFrame with pd.concat.
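One way to express that step (the to_dataframe name is illustrative, and the isinstance filter assumes gather was called with return_exceptions=True):

import pandas as pd

def to_dataframe(payloads: list) -> pd.DataFrame:
    # Keep only successful dict payloads; exceptions returned by gather are skipped.
    records = [p for p in payloads if isinstance(p, dict)]
    frames = [pd.json_normalize(record) for record in records]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()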
In tests against a mock API returning 200 ms responses, 100 sequential calls finished in ~20 s. The asynchronous version completed in ~2.3 s on the same laptop—a 9× speed-up with under 30 lines of code.
Most SaaS APIs allow only a certain number of requests per second (RPS). Implement a semaphore (e.g., asyncio.Semaphore) or a token bucket to throttle concurrency; this averts HTTP 429 (Too Many Requests) errors.
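A simple throttle might look like the sketch below; the semaphore size and the 0.05 s spacing are illustrative values rather than limits from any particular API, and fetch_json is the helper sketched earlier:

import asyncio

async def throttled_fetch(session, semaphore: asyncio.Semaphore, url: str) -> dict:
    # The semaphore caps how many requests are in flight at once, e.g. asyncio.Semaphore(20).
    async with semaphore:
        payload = await fetch_json(session, url)
        await asyncio.sleep(0.05)  # crude spacing to stay under a per-second quota
        return payload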
Network hiccups happen. Encapsulate fetch logic in a retry wrapper such as tenacity, or a custom loop that doubles the wait time up to a maximum before failing.
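If you prefer not to add a dependency, a hand-rolled backoff loop along these lines works; the attempt count and delay cap are arbitrary, and fetch_json is the helper from the earlier sketch:

import asyncio
import aiohttp

async def fetch_with_retry(session, url: str, attempts: int = 5) -> dict:
    delay = 1.0
    for attempt in range(1, attempts + 1):
        try:
            return await fetch_json(session, url)
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == attempts:
                raise                      # give up after the final attempt
            await asyncio.sleep(delay)     # back off before retrying
            delay = min(delay * 2, 10.0)   # double the wait, capped at 10 s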
Before adding to a DataFrame, confirm each payload’s schema and log warnings for mismatches. This keeps downstream transformations stable.
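A lightweight check could look like this; EXPECTED_KEYS is a hypothetical set you would tailor to your API's schema:

import logging

logger = logging.getLogger(__name__)
EXPECTED_KEYS = {"userId", "id", "title", "body"}  # hypothetical schema for one payload

def validate_payload(payload: dict) -> bool:
    missing = EXPECTED_KEYS - payload.keys()
    if missing:
        logger.warning("Payload %s is missing keys: %s", payload.get("id"), missing)
        return False
    return True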
Convert numeric strings to int/float and timestamps to UTC-aware datetime64[ns, UTC]. Typed columns improve memory usage and query-engine performance once loaded to a warehouse.
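A sketch of that conversion step; the amount and created_at column names are placeholders for whatever your payloads contain:

import pandas as pd

def coerce_types(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical columns; rename to match your payloads.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["created_at"] = pd.to_datetime(df["created_at"], utc=True, errors="coerce")
    return df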
Calling an async function without await returns a coroutine object but doesn't execute it, leading to silent bugs. Always await the call or schedule it with asyncio.create_task.
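The pitfall and its fixes, sketched with the fetch_json helper assumed from earlier:

import asyncio

async def fetch_correctly(session, url: str):
    # Wrong: fetch_json(session, url) on its own only creates a coroutine object;
    # the request never runs and Python emits a "coroutine was never awaited" warning.
    data = await fetch_json(session, url)                   # right: await it directly
    task = asyncio.create_task(fetch_json(session, url))    # right: or schedule it as a task
    return data, await task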
Creating a new ClientSession for every request spawns dozens of extra TCP handshakes and kills speed. Reuse a single ClientSession.
CPU-bound operations, heavy pandas transformations, or legacy blocking I/O should run in a thread or process pool (loop.run_in_executor) to keep the loop responsive.
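One way to keep the loop responsive, assuming a hypothetical blocking transform called heavy_transform:

import asyncio
import pandas as pd

def heavy_transform(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for a CPU-heavy pandas step that would otherwise block the loop.
    return df.groupby("userId").count()

async def transform_async(df: pd.DataFrame) -> pd.DataFrame:
    loop = asyncio.get_running_loop()
    # Runs in the default ThreadPoolExecutor, leaving the event loop free for I/O.
    return await loop.run_in_executor(None, heavy_transform, df)

The full end-to-end example below pulls these pieces together.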
import asyncio

import aiohttp
import pandas as pd
from tenacity import retry, wait_exponential, stop_after_attempt

URLS = [f"https://jsonplaceholder.typicode.com/posts/{i}" for i in range(1, 101)]


@retry(wait=wait_exponential(multiplier=1, min=1, max=10), stop=stop_after_attempt(5))
async def fetch_json(session: aiohttp.ClientSession, url: str) -> dict:
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
        resp.raise_for_status()
        return await resp.json()


async def gather_all(urls):
    semaphore = asyncio.Semaphore(20)  # cap at 20 concurrent requests
    async with aiohttp.ClientSession() as session:

        async def bound_fetch(u):
            async with semaphore:
                return await fetch_json(session, u)

        tasks = [asyncio.create_task(bound_fetch(u)) for u in urls]
        # return_exceptions=True so one failed URL doesn't cancel the whole batch
        return await asyncio.gather(*tasks, return_exceptions=True)


def main(urls):
    results = asyncio.run(gather_all(urls))
    # Normalize successful payloads and concatenate; exceptions are skipped
    dfs = [pd.json_normalize(r) for r in results if isinstance(r, dict)]
    df = pd.concat(dfs, ignore_index=True)
    return df


if __name__ == "__main__":
    df_posts = main(URLS)
    print(df_posts.head())
Combining asynchronous HTTP with pandas is a low-effort, high-impact upgrade to any Python data pipeline. By parallelizing API calls while preserving a tidy tabular output, you can shave hours off ingestion jobs, reduce infrastructure costs, and deliver fresher insights to stakeholders.
APIs are a primary data source for analytics pipelines. Sequentially pulling thousands of endpoints turns into a major performance bottleneck, leading to stale dashboards and frustrated teams. Asynchronous HTTP removes that latency without extra infrastructure, while pandas offers an immediate path from raw JSON to a structured table ready for SQL or ML. Mastering both is foundational for building scalable, low-latency data ingestion services.
Asyncio is not the same as multithreading. It uses a single thread and cooperatively schedules coroutines during I/O waits, whereas threads run in parallel but add context-switch overhead and require locks for shared data.
You can still transform with pandas: fetch the data asynchronously, then perform transformation steps after await asyncio.gather completes. Heavy CPU transforms can be off-loaded to a concurrent.futures.ThreadPoolExecutor if they block the event loop.
For paginated APIs, first request the initial page to learn the total number of pages, build a list of page URLs, then schedule them concurrently. Combine all pages into one DataFrame after collection.
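A hedged sketch of that pattern; the page query parameter and the total_pages field are hypothetical and depend on the API, and fetch_json is the helper from the main example:

import asyncio
import aiohttp

async def fetch_all_pages(base_url: str) -> list:
    async with aiohttp.ClientSession() as session:
        first = await fetch_json(session, f"{base_url}?page=1")
        total_pages = first.get("total_pages", 1)  # hypothetical field name
        urls = [f"{base_url}?page={p}" for p in range(2, total_pages + 1)]
        tasks = [asyncio.create_task(fetch_json(session, u)) for u in urls]
        rest = await asyncio.gather(*tasks, return_exceptions=True)
        return [first, *rest]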
Beyond aiohttp, popular choices include httpx (async and sync compatibility), asks (built on trio), and gql[aiohttp] for GraphQL endpoints.