Bounded Concurrency First
When fanning out async work, default to bounded concurrency. Unbounded spawn is the lazy answer that turns "many tasks" into "many simultaneous failures." The fix is small: a Semaphore plus a JoinSet, with the permit released as the task completes.
The pattern
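use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task::JoinSet;

// Shared limit: at most 16 tasks hold a permit at once.
let limit = Arc::new(Semaphore::new(16));
let mut set = JoinSet::new();

for item in items {
    // Waits here while all 16 permits are taken; errors only if the semaphore is closed.
    let permit = limit.clone().acquire_owned().await?;
    set.spawn(async move {
        let _permit = permit; // held for the task's lifetime, dropped on completion
        process_item(item).await
    });
}

// Drain results as tasks finish; `result` is Err if a task panicked or was cancelled.
while let Some(result) = set.join_next().await {
    handle(result)?;
}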
At most 16 items are in flight at any moment. Once all 16 permits are taken, the acquire_owned call suspends the spawning loop, so no new task is spawned until a running one finishes. The permit is moved into the task and drops when the task completes, freeing the slot.
What unbounded fan-out actually breaks
Spawning one task per item from a 5,000-item list looks innocent until you run it on a real workload:
- Provider throttling. Most external APIs rate-limit. Five thousand concurrent calls produce five thousand simultaneous 429s, and now you've got a retry storm on top of the original load.
- Lock contention. If any task touches a shared resource (database, search index, log writer), all five thousand tasks are now competing for the same lock. Throughput goes down, not up.
- Queue blow-up. Channels, response queues, and result accumulators that worked fine for the typical case grow without limit when you flood them.
- Latency on the interactive surface. The user's keypress competes with the bulk work. Suddenly the UI is laggy because a background batch decided everyone is equally important.
A semaphore caps every one of these failure modes by capping the input.
Picking the bound
The right limit depends on the bottleneck. Some heuristics:
- Provider sync fan-out: match the provider's rate budget multiplied by typical latency. If the API allows 60 req/min (1 req/s) and each request takes ~1s, about 1 request in flight at a time uses the whole budget.
- Attachment extraction / decode: match the CPU core count, or a small multiple. CPU work doesn't benefit from more parallelism than cores.
- Semantic chunking / embedding prep: match the underlying executor's capacity. If the model serves N requests at a time, the queue depth should match.
- Expensive transforms (reader-mode, HTML clean): match available memory / per-task footprint. Twenty concurrent transforms that each allocate a megabyte are fine; twenty thousand aren't.
The bound is a knob, not a magic number. The point is having a knob.
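The heuristics above can be turned into a few small helpers. This is a minimal sketch under stated assumptions, not a prescribed API: the function names (rate_limited_bound, cpu_bound, memory_bound) and their parameters are hypothetical, and the rate-limited case treats the bound as requests per second times seconds per request. It uses only the standard library.

```rust
use std::num::NonZeroUsize;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: concurrency needed to use a rate budget,
/// computed as rate (req/s) times typical latency (s).
/// Rounds down and never returns 0, so some work always proceeds.
fn rate_limited_bound(requests_per_minute: u32, typical_latency: Duration) -> usize {
    let per_second = f64::from(requests_per_minute) / 60.0;
    let in_flight = per_second * typical_latency.as_secs_f64();
    (in_flight.floor() as usize).max(1)
}

/// Hypothetical helper: CPU-bound work gets the core count times a small multiple.
/// Falls back to 1 core if the platform cannot report its parallelism.
fn cpu_bound(multiple: usize) -> usize {
    let cores = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    cores * multiple.max(1)
}

/// Hypothetical helper: memory-bound work gets as many slots as
/// per-task footprints fit in the memory budget (at least 1).
fn memory_bound(budget_bytes: u64, per_task_bytes: u64) -> usize {
    let fits = budget_bytes / per_task_bytes.max(1);
    usize::try_from(fits).unwrap_or(usize::MAX).max(1)
}
```

Whichever helper applies, its result is the number passed to Semaphore::new in the pattern. Keeping the calculation in one place makes the knob easy to find and retune when the bottleneck changes.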
The complement: back-pressure all the way down
A semaphore on spawn alone isn't enough if the spawned work feeds into an unbounded channel downstream. Back-pressure works only when every step in the pipeline has a limit. Otherwise the bottleneck just moves.
If you have an mpsc::channel between stages, give it a finite capacity. If you have a result accumulator, cap its size. If you have a retry queue, bound it. The semaphore is the entry gate; the rest of the pipeline needs gates too.
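A bounded channel between two stages can be sketched as follows. To keep the example dependency-free, it swaps in the standard library's std::sync::mpsc::sync_channel and threads for Tokio's channel and tasks. In Tokio the analogous piece is tokio::sync::mpsc::channel(capacity), whose send(...).await waits for free capacity. The function names run_pipeline and channel_refuses_when_full are hypothetical, and the doubling step is only a stand-in for real downstream work.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Pushes `items` through a two-stage pipeline joined by a bounded channel.
/// The producer blocks in `send` whenever `capacity` items are already queued,
/// so a slow consumer slows the producer instead of growing the queue.
fn run_pipeline(items: Vec<u32>, capacity: usize) -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(capacity);

    let producer = thread::spawn(move || {
        for item in items {
            // Blocks while the channel is full; fails only if the receiver is gone.
            if tx.send(item).is_err() {
                break;
            }
        }
        // `tx` is dropped when this closure returns, which ends the consumer's loop.
    });

    // Consumer stage: a stand-in for real downstream work.
    let results: Vec<u32> = rx.iter().map(|item| item * 2).collect();
    producer.join().expect("producer thread panicked");
    results
}

/// Shows the limit directly: once `capacity` items are queued,
/// a non-blocking send is refused instead of growing the queue.
fn channel_refuses_when_full(capacity: usize) -> bool {
    let (tx, _rx) = sync_channel::<u32>(capacity);
    for i in 0..capacity as u32 {
        tx.try_send(i).expect("channel should have room");
    }
    matches!(tx.try_send(0), Err(TrySendError::Full(_)))
}
```

The capacity plays the same role downstream that the semaphore plays at the entry gate: when the consumer falls behind, the producer waits rather than accumulating an ever-growing backlog.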
When unbounded is acceptable
There are real cases where unbounded is fine:
- The input list is small and known (e.g., 4 accounts to sync, not 5,000 messages)
- Each task is genuinely independent and short
- The downstream consumers can absorb arbitrary fan-in
For those, JoinSet without a semaphore is the right call. The rule is "bounded first," not "bounded always" — but the default should bias toward the bound, because the cases that need it are common and the cases that need the bound removed are rare.
See also
- Tokio — the synthesis hub
- Classify Async Work Before Refactoring — bounded concurrency is the default once you're in bucket 5 (heavy CPU) or any fan-out
- Concurrent Is Not Parallel — picking spawn over join! is the upstream decision