Holding 99% Uptime With 100k Users on a Single VPS

A lot of "scaling" advice optimizes for problems most teams will never have. Yuva Portal taught me the opposite — keep it boring, watch the right things, and a single VPS will cheerfully serve six figures of users.

The shape of the load

Bursty. Most of the day was idle. Logins came in waves around scholarship deadlines. A naive read of CPU graphs would say the box was lightly used; a 5-second p99 would tell you the truth.

We tracked p99 latency per route from day one. If you only watch averages, the long tail will eat you alive without warning.

Indexes solve more problems than caches do

The 30% drop in data-processing time mentioned in the resume came almost entirely from one diagnostic: turning on MongoDB's slow query log and indexing the three queries that showed up most often. No Redis, no read replicas — just compound indexes that matched the actual access patterns.

A useful rule: before you reach for a cache, look at the query plan. Caches add a coherence problem. A correct index doesn't.

PM2 is fine, actually

The reflexive answer to "Node.js in production" is now Kubernetes. For a single VPS with two services, that is a 50-fold complexity increase for zero new capability. PM2 with a cluster mode of max and a pm2 startup systemd integration ran for 24 months without an incident that wasn't self-inflicted.

What I'd do differently

Two things:

Structured logs from day zero. We had freeform console.log for the first year. When a real bug landed, grepping logs across PM2 instances was painful.
A single dashboard, not three. I had one for Mongo, one for the host, one for app metrics. Inevitably you only check the one that's currently green.

Boring infra is a feature. The work I'm proudest of is the work that didn't page anyone.