Common Questions

Autoscaling FAQs


1. When do I actually need autoscaling? (Architecture)

This is the most common starting point. Developers want to know if autoscaling is worth the operational complexity — or if a few fixed servers will do just fine.

Autoscaling is easy to enable but hard to design well. The real challenge isn't turning it on; it's everything that comes after.

— Common sentiment on r/devops

The honest answer is: it depends almost entirely on your traffic pattern and reliability requirements.

Use autoscaling when…
  • Traffic fluctuates significantly
  • Reliability matters and downtime is expensive
  • Infrastructure costs are meaningful
  • Demand is unpredictable or bursty
Skip autoscaling when…
  • Workloads are stable and predictable
  • The system is small (internal tools, prototypes)
  • Downtime risk is genuinely low
  • Scaling complexity outweighs benefits

If your app always needs exactly 3 servers, autoscaling probably won't add value. But if you're running a SaaS product, an e-commerce store, or an API with bursty traffic — it's almost always worth it. The real question isn't "should I autoscale?" but "is my architecture ready to autoscale?"

The engineering tradeoff

Autoscaling trades operational simplicity for elasticity. Fixed servers are easier to reason about. Autoscaling gives you cost efficiency and resilience — but it requires you to think carefully about statelessness, metrics, and policies. Think of it as elastic infrastructure rather than over-provisioning servers to handle worst-case load.

2. What metrics should trigger autoscaling? (Metrics & Signals)

Developers quickly discover that CPU is not always the best signal. This is one of the most debated topics in autoscaling — and getting it right is what separates reliable systems from ones that jitter and thrash.

How the common metrics rank as scaling signals (metric, rating, best for, and why):
  • Queue depth: Best. Suited to background workers, job queues, async processing. Extremely clear signal: if 200 jobs are waiting, you need more workers.
  • Request latency (p95/p99): Best. Suited to web APIs and user-facing services. Directly tied to user experience; catches I/O-bound overload that CPU misses.
  • Request rate (RPS): Best. Suited to web services, APIs, microservices. Predictable scaling curves that match real user demand.
  • Queue time: Best. Suited to any workload with a request queue. Detects saturation early and correlates strongly with latency.
  • Concurrency: Good. Suited to serverless platforms (Cloud Run, Lambda). Handles burst traffic well and is easy to reason about.
  • CPU utilization: Use carefully. Compute-bound workloads only. Noisy; many apps are I/O bound, so high CPU doesn't always mean high demand.
  • Memory usage: Avoid. Rarely appropriate as a scaling signal; memory grows slowly and rarely reflects live workload demand.

The mental model shift

Experienced engineers stop asking "how stressed is the machine?" and start asking "how much work is waiting?" That shift changes everything. Demand metrics almost always produce better scaling behavior than machine stress metrics.
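The demand-first view translates directly into a scaling rule: size the fleet from the backlog, not from machine stress. A minimal sketch of that calculation (the throughput numbers and limits are illustrative, not Scalar's defaults):

```python
import math

def desired_workers(queue_depth, jobs_per_worker_per_interval,
                    min_workers=1, max_workers=20):
    """Translate 'how much work is waiting' into capacity.

    jobs_per_worker_per_interval is the throughput you expect one worker
    to clear per scaling interval (an assumed figure for this sketch).
    """
    needed = math.ceil(queue_depth / jobs_per_worker_per_interval)
    # Clamp to a floor and ceiling so the fleet never empties or runs away.
    return max(min_workers, min(max_workers, needed))

print(desired_workers(200, 50))  # 200 waiting jobs, 50 jobs/worker -> 4 workers
```

Note that the same formula works for requests in flight or queue time; the only thing that changes is which "work waiting" number you feed it.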

Scalar uses queue depth and request signals — not raw CPU — because these metrics map directly to what users are actually experiencing.

3. How do I prevent scaling loops or jitter? (Reliability)

Autoscaling oscillation is one of the most common real-world problems teams run into. The system scales up, then back down, then up again — creating a thrashing cycle that's expensive and destabilizing.

How the loop happens

  • t=0s, metric spikes: a traffic surge pushes the metric sharply above the threshold.
  • t=30–90s, new instances launch: the autoscaler fires and new instances begin warming up, but this takes time.
  • t=90–120s, metric drops as new instances come online: traffic has spread across too many instances and the metric falls below the threshold.
  • t=120s, scale-down fires too soon: the autoscaler terminates instances, traffic spikes again, and the loop repeats.

Mitigation strategies

  • Cooldown periods
  • Smoothed / rolling averages
  • Scale on queue depth (not CPU)
  • Minimum instance count
  • Step scaling (not aggressive thresholds)

Cooldown periods are the most common fix — prevent scale-down for a set time window after a scale-up event. Smoothing metrics (using rolling averages instead of instantaneous readings) prevents a single CPU spike from triggering a cascade. And switching from CPU to queue depth or request latency reduces jitter significantly because these signals are more stable and more meaningful.
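Cooldowns and smoothing fit together in a few lines of logic. A sketch of a jitter-safe decision function (illustrative only, not Scalar's actual algorithm; window sizes and thresholds are assumed):

```python
from collections import deque
import math

class Autoscaler:
    """Toy autoscaler combining a rolling average with a scale-down cooldown."""

    def __init__(self, jobs_per_worker=50, window=6,
                 cooldown_ticks=12, min_instances=2):
        self.jobs_per_worker = jobs_per_worker
        self.samples = deque(maxlen=window)    # rolling window of queue-depth readings
        self.cooldown_ticks = cooldown_ticks   # ticks to wait after a scale-up
        self.ticks_since_scale_up = cooldown_ticks
        self.min_instances = min_instances

    def decide(self, queue_depth, current):
        self.samples.append(queue_depth)
        self.ticks_since_scale_up += 1
        # Smooth: average over the window, so one spike can't trigger a cascade.
        smoothed = sum(self.samples) / len(self.samples)
        desired = max(self.min_instances,
                      math.ceil(smoothed / self.jobs_per_worker))
        if desired > current:
            self.ticks_since_scale_up = 0      # scale up immediately, start cooldown
            return desired
        if desired < current and self.ticks_since_scale_up < self.cooldown_ticks:
            return current                     # cooldown: hold capacity, don't flap
        return desired
```

The asymmetry is deliberate: scale-ups take effect at once, while scale-downs are delayed until the cooldown expires, which breaks the up-down-up cycle described above.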

Scaling can cause oscillation or jitter if the CPU is jumpy — use queue length or latency instead.

— Developer on r/aws
4. Can stateful applications autoscale? (Architecture)

This is one of the biggest architectural discussions in the autoscaling world. The short answer: autoscaling works best with stateless services. If your app stores session data in memory or on local disk, spinning instances up and down becomes risky.

Why stateful apps are harder

If an instance is terminated while holding a user's session in memory, that session is lost. If a worker holds a job's progress locally, terminating it mid-process means that work disappears. The more state your app instance holds, the more dangerous dynamic scaling becomes.

The standard solution: externalize state

The architectural pattern that makes autoscaling safe is moving all state out of your application instances and into external systems:

  • User sessions → Redis / ElastiCache
  • Job progress / queues → Redis, SQS, RabbitMQ
  • Application data → Postgres, DynamoDB, MySQL
  • File uploads / assets → S3 / object storage
  • In-memory caches → Redis / Memcached

Once instances are stateless, they can start and stop freely without breaking user sessions or losing data. This is the foundation of cloud-native architecture — and it's what allows Scalar to safely scale your Heroku dynos, Render services, and Fly.io machines without any risk of data loss.

Anything stateful is harder to autoscale. The solution is to move state out of your instances entirely.

— Developer on r/devops
5. How fast does autoscaling actually work? (Performance)

Developers often expect autoscaling to react instantly. In reality, there is always lag — and understanding where that lag comes from is key to designing a system that scales before users notice problems.

Where the delay comes from

  • 0–10s, metric collection window: the autoscaler polls metrics. Most systems sample every 10–60 seconds; Scalar polls every 10 seconds.
  • 10–30s, decision and API call: the autoscaler evaluates the metric against the policy and calls the hosting API to add capacity.
  • 30–120s, instance / container launch: EC2: 30–120s. Heroku dyno: ~30s. Fly.io machine: ~5–15s. Container scheduling: seconds.
  • Plus warm-up: your app starts, connects to the database, and loads caches. Typically 5–30s depending on the stack.
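Summing the stages makes the total lag concrete. With pessimistic (illustrative) numbers for each stage:

```python
# Rough worst-case reaction lag, summing the stages above (numbers illustrative).
stages = {
    "metric collection": 10,    # seconds between polls
    "decision + API call": 20,
    "instance launch": 120,     # e.g. a slow EC2 boot
    "app warm-up": 30,
}
total_lag = sum(stages.values())
print(f"worst-case lag: {total_lag}s (~{total_lag / 60:.1f} min)")  # 180s, ~3 min
```

Three minutes is a long time during a traffic spike, which is why the compensation strategies below exist.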

How teams compensate for lag

  • Buffer capacity (run N+1 instances)
  • Predictive / scheduled scaling
  • Pre-warming before known traffic spikes
  • Minimum instance floor

Scheduled scaling is the most common practical solution: if you know traffic spikes every Monday at 9am, scale up at 8:45am. No reaction lag at all. Scalar supports schedule-based scaling exactly for this reason.
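A schedule rule can be as simple as a weekday plus a time window that raises the instance floor. A sketch (the rule format and numbers are made up for illustration; Scalar's actual schedule configuration may differ):

```python
from datetime import datetime, time

# Illustrative rule: scale to at least 6 instances before the Monday 9am spike.
SCHEDULE = [
    {"weekday": 0, "start": time(8, 45), "end": time(18, 0), "min_instances": 6},
]
DEFAULT_MIN = 2

def scheduled_floor(now: datetime) -> int:
    """Return the minimum instance count the schedule demands right now."""
    for rule in SCHEDULE:
        if (now.weekday() == rule["weekday"]
                and rule["start"] <= now.time() < rule["end"]):
            return rule["min_instances"]
    return DEFAULT_MIN
```

The autoscaler then treats this floor as a lower bound: reactive scaling can still add capacity on top, but the schedule guarantees the fleet is already warm when the spike arrives.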

For reactive scaling, faster polling = faster response. Scalar's algorithm runs every 10 seconds — fast enough to add capacity before most users notice a slowdown. Autoscaling is reactive by default unless you configure it to be predictive.

Autoscaling is reactive unless configured otherwise. For predictable spikes, use a schedule.

— Developer on r/sysadmin

Ready to put this into practice?

Scalar handles all of this for you — queue-depth scaling, schedule-based scaling, and safe guardrails. Works with Heroku, Render, Fly.io, and AWS. 5-minute setup, no Kubernetes.

Get Started Free