Featured Project

Silver Jems — E-Commerce Infrastructure on AWS

Designed and owns the production AWS infrastructure for a family jewellery business — containerised multi-service architecture, automated CI/CD pipeline, and full-stack observability with dynamic alerting across every component.

AWS ECS · AWS ECR · AWS EC2 · AWS Lambda · CloudFront · Route 53 · VPC · Cloudflare R2 · Docker · GitHub Actions · Grafana · Zabbix · PostgreSQL · Redis

The Problem

A family jewellery business needed an online store that could handle real traffic, real orders, and real consequences if it went down. The ask wasn't just to deploy an application — it was to own the infrastructure so the business never had to think about it. That meant designing for isolation, automating every release, and building observability before the first customer ever landed on the site.

Infrastructure Design

Service Isolation by Default

Every component of the application runs as an independent container on AWS ECS. The storefront, backend API, database, and email service are fully decoupled. This was a deliberate reliability decision. A crashed or restarting database container does not crash the storefront container, and deploying a new API image does not touch the email service. Blast radius is contained at the container boundary.
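Below is a minimal sketch of what "one service per container" can look like as ECS task definitions. It uses Python dicts shaped like an ECS RegisterTaskDefinition request. The account ID, region, image names, ports, sizes, family prefix, and health-check commands are hypothetical placeholders, because the write-up does not publish them. Each service gets its own task definition family with a single essential container. As a result, a failing health check causes ECS to replace only that service's task.

```python
import json

# Hypothetical placeholders: the real account, region, image names, ports,
# and sizes are not given in the write-up.
REGISTRY = "123456789012.dkr.ecr.ap-south-1.amazonaws.com"

# service name -> (container port, health-check shell command)
SERVICES = {
    "storefront": (3000, "curl -f http://localhost:3000/ || exit 1"),
    "api": (8000, "curl -f http://localhost:8000/health || exit 1"),
    "postgres": (5432, "pg_isready -U postgres || exit 1"),
    "email": (8025, "curl -f http://localhost:8025/health || exit 1"),
}


def task_definition(name: str, port: int, health_cmd: str, tag: str) -> dict:
    """One task definition family per service, each with a single container,
    shaped like an ECS RegisterTaskDefinition request."""
    return {
        "family": f"silverjems-{name}",
        "networkMode": "awsvpc",  # each task gets its own network interface
        "requiresCompatibilities": ["EC2"],
        "cpu": "256",
        "memory": "512",
        "containerDefinitions": [
            {
                "name": name,
                "image": f"{REGISTRY}/{name}:{tag}",
                "essential": True,
                "portMappings": [{"containerPort": port, "protocol": "tcp"}],
                "healthCheck": {
                    "command": ["CMD-SHELL", health_cmd],
                    "interval": 30,
                    "timeout": 5,
                    "retries": 3,
                },
            }
        ],
    }


if __name__ == "__main__":
    defs = [task_definition(n, p, c, "abc1234") for n, (p, c) in SERVICES.items()]
    print(json.dumps(defs[0], indent=2))
```

The `curl`-based health checks assume `curl` is present in those images. Swap in whatever probe each image actually ships with.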

Network Architecture

All services are deployed inside a private VPC with controlled ingress. Public traffic enters only through CloudFront, which handles SSL termination, edge caching, and DDoS mitigation. Route 53 manages DNS routing with health checks — if an origin becomes unhealthy, traffic fails over automatically. No service is directly exposed to the public internet.
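The failover behaviour can be pictured with a simplified model of Route 53 failover routing, written in plain Python. It does not call the Route 53 API. A health check flips state only after a run of consecutive results in the other direction (Route 53's default failure threshold is 3). A failover record answers with the primary target while that target is healthy. The model ignores details such as the multiple health-checker locations Route 53 uses, and the `primary`/`secondary` target names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Simplified health check: state changes only after `failure_threshold`
    consecutive results that disagree with the current state."""
    failure_threshold: int = 3
    healthy: bool = True
    _streak: int = 0

    def observe(self, ok: bool) -> None:
        if ok == self.healthy:
            self._streak = 0  # result agrees with current state; reset the run
            return
        self._streak += 1
        if self._streak >= self.failure_threshold:
            self.healthy = ok
            self._streak = 0


def resolve(primary: str, secondary: str, check: HealthCheck) -> str:
    """Failover routing: answer with the primary while it is healthy."""
    return primary if check.healthy else secondary


if __name__ == "__main__":
    hc = HealthCheck()
    for ok in [False, False, False]:
        hc.observe(ok)
    print(resolve("primary", "secondary", hc))
```

The consecutive-result rule is what keeps a single dropped probe from flapping DNS answers.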

Storage Decoupled from Compute

Product images and static assets are stored in AWS S3 and served through CloudFront. Storage scales independently of the application layer, so no infrastructure changes are needed as asset volume grows. Cached assets get sub-100ms delivery from CloudFront's global edge network. Because the frontend addresses assets through the CDN rather than the bucket, replacing the origin bucket never requires a frontend change.
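One way to get the "replace the bucket without a frontend change" property is to build every asset URL from a configured CDN base rather than from the bucket name. The sketch below assumes a hypothetical `ASSET_CDN_BASE` environment variable and a placeholder `cdn.example.com` domain. The `Cache-Control` value is a common choice for content-hashed file names; the write-up does not specify it.

```python
import os
from urllib.parse import quote

# Hypothetical configuration: the CDN base comes from the environment, so the
# bucket (or other origin) behind CloudFront never appears in frontend code.
CDN_BASE = os.environ.get("ASSET_CDN_BASE", "https://cdn.example.com")

# Common choice for content-hashed file names: they can be cached for a year
# because a changed file is published under a new name.
IMMUTABLE_CACHE_CONTROL = "public, max-age=31536000, immutable"


def asset_url(key: str, base: str = CDN_BASE) -> str:
    """Public URL for a stored object key such as a product image."""
    return f"{base.rstrip('/')}/{quote(key.lstrip('/'))}"


if __name__ == "__main__":
    print(asset_url("products/ring 01.jpg"))
```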

CI/CD Pipeline — Zero Manual Deploys

The Deployment Problem

Manual deployments introduce human error into every release. For a business where a failed deployment means lost orders, that risk is unacceptable. The goal was to make deployment a process owned by the pipeline, not a person.

Pipeline Design

Developer pushes to dev branch
Promotion to test branch triggers deployment to a dedicated test server
Test server validates the build in an environment identical to production
Merge to main triggers GitHub Actions automatically
Actions builds the Docker image and pushes it to AWS ECR
ECS pulls the new image and rolls out the update — zero SSH, zero manual steps
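The promotion flow above can be modelled as a small lookup table plus an image-tagging rule. Branch names come from the pipeline description. The target names, account ID, region, and repository are hypothetical. Tagging images with the short commit SHA is an assumed convention, though it is one common way to make every running image traceable to a single commit. The write-up does not say that pushes to `dev` deploy anywhere, so `dev` maps to no target here.

```python
from typing import Optional

# Branch -> deployment target. Target names are hypothetical.
PROMOTION = {
    "dev": None,              # no deployment stated for dev pushes
    "test": "test-server",    # dedicated test server
    "main": "production-ecs", # GitHub Actions -> ECR -> ECS
}


def deploy_target(branch: str) -> Optional[str]:
    """Return where a push to `branch` deploys, or None if it does not deploy."""
    return PROMOTION.get(branch)


def ecr_image_uri(account: str, region: str, repo: str, git_sha: str) -> str:
    """Private ECR image URI tagged with the short commit SHA (assumed
    convention), so each running image maps back to one commit."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{git_sha[:7]}"


if __name__ == "__main__":
    print(deploy_target("main"))
    print(ecr_image_uri("123456789012", "ap-south-1", "storefront", "0123456789abcdef"))
```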

Outcome

Every production release is traceable, repeatable, and reversible. The deployment history lives in Git. Rollback is a revert and a push.

Security Posture

Admin panel protected by Multi-Factor Authentication — privileged access requires a second factor, always
End users authenticate via Google OAuth or standard login — managed at the API layer, not the UI
Every API route enforces server-side session validation — the API rejects unauthorised requests regardless of origin
All services run inside a private VPC — no direct public internet exposure
CloudFront acts as the sole public entry point — origin is shielded behind the CDN layer
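Below is a minimal sketch of route-level, server-side session validation. A decorator looks up the bearer token in a server-side session store and rejects the request before the handler runs. The in-memory `SESSIONS` dict, the `Authorization: Bearer` header convention, the one-hour TTL, and the `get_orders` handler are all hypothetical. A real deployment would keep sessions in a shared store and return HTTP 401 instead of raising.

```python
import secrets
import time
from functools import wraps
from typing import Optional

# Hypothetical in-memory session store: token -> {"user_id", "expires"}.
SESSIONS: dict = {}
SESSION_TTL_SECONDS = 3600


class Unauthorized(Exception):
    """Raised when a request has no valid server-side session."""


def create_session(user_id: str, now: Optional[float] = None) -> str:
    issued = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    SESSIONS[token] = {"user_id": user_id, "expires": issued + SESSION_TTL_SECONDS}
    return token


def require_session(handler):
    """Validate the session on the server before the route handler runs."""

    @wraps(handler)
    def wrapper(request: dict, *args, **kwargs):
        header = request.get("headers", {}).get("Authorization", "")
        token = header[len("Bearer "):] if header.startswith("Bearer ") else ""
        session = SESSIONS.get(token)
        if session is None or session["expires"] <= time.time():
            SESSIONS.pop(token, None)  # drop expired sessions eagerly
            raise Unauthorized("missing, unknown, or expired session")
        return handler(request, session["user_id"], *args, **kwargs)

    return wrapper


@require_session
def get_orders(request: dict, user_id: str) -> list:
    # Hypothetical handler: only reached with a validated user_id.
    return [f"order-for-{user_id}"]
```

Because the check lives in the decorator rather than in the UI, a request crafted outside the storefront is rejected the same way as one from the browser.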

Observability Stack

Philosophy

The platform was instrumented for observability before it went live. The first incident was caught by an alert, not a customer complaint. Every monitored component has a meaningful signal tied to it, not a generic CPU threshold.

What Is Monitored

ECS container state — all services watched for restarts, crashes, and unhealthy states
API response time and error rate, both tracked in Grafana
PostgreSQL slow queries, active connections, and replication lag — Zabbix agent on DB host
Redis memory usage and cache hit ratio — early warning on eviction pressure
Email service delivery events — failed sends trigger alerts before users notice
SSL certificate expiry — automated watch, alert fires 30 days before expiry
CloudFront cache hit ratio — drop in ratio signals origin pressure or misconfiguration
Host-level metrics on EC2 — CPU, memory, disk I/O, network via Zabbix
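Two of the checks above reduce to simple arithmetic. The sketch below computes the Redis cache hit ratio from the `keyspace_hits` and `keyspace_misses` fields of `INFO stats`. It also decides whether a certificate falls inside the 30-day warning window by passing the certificate's `notAfter` string to Python's `ssl.cert_time_to_seconds`. The write-up does not say how the INFO text and the certificate are fetched (Zabbix items, a script, an exporter), so that part is left out.

```python
import ssl
from typing import Optional


def redis_hit_ratio(info_text: str) -> Optional[float]:
    """keyspace_hits / (keyspace_hits + keyspace_misses) from `INFO stats`
    output; None when there have been no lookups yet."""
    fields = {}
    for line in info_text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


def days_until_expiry(not_after: str, now: float) -> float:
    """Days left on a certificate, given its notAfter string in the format
    returned by getpeercert(), e.g. 'Jan  5 09:34:43 2018 GMT'."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def cert_alert(not_after: str, now: float, warn_days: int = 30) -> bool:
    """True once the certificate is inside the warning window."""
    return days_until_expiry(not_after, now) <= warn_days
```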

Alerting Design

All alerts are dynamic: thresholds are calculated relative to baseline behaviour, not fixed numbers. A spike in API error rate that lasts 30 seconds is noise; one that persists for 3 minutes is an incident. The alerting layer knows the difference.
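Below is a minimal sketch of the "baseline-relative threshold plus duration" idea. It assumes one sample every 10 seconds, so a 30-second spike is 3 samples and 3 minutes is 18. The threshold is the rolling mean plus `k` population standard deviations of recent normal samples. The window size, `k`, warm-up length, and sample interval are hypothetical, and the real alerting rules may use a different baseline model.

```python
from collections import deque
from statistics import mean, pstdev


class SustainedAnomalyDetector:
    """Fires only when a metric stays above mean + k * stddev of its recent
    baseline for `sustain` consecutive samples."""

    def __init__(self, window: int = 60, k: float = 3.0,
                 sustain: int = 18, min_baseline: int = 10):
        self.baseline = deque(maxlen=window)
        self.k = k
        self.sustain = sustain
        self.min_baseline = min_baseline
        self.breaches = 0

    def update(self, value: float) -> bool:
        # Warm-up: collect a baseline before judging anything.
        if len(self.baseline) < self.min_baseline:
            self.baseline.append(value)
            return False
        threshold = mean(self.baseline) + self.k * pstdev(self.baseline)
        if value > threshold:
            self.breaches += 1  # breaching samples stay out of the baseline
        else:
            self.breaches = 0   # any normal sample ends the streak
            self.baseline.append(value)
        return self.breaches >= self.sustain
```

Keeping breaching samples out of the baseline stops a long incident from quietly raising its own threshold. The trade-off is that a permanent level shift keeps alerting until the baseline is reset.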

SRE Principles Applied

Toil Elimination

Manual deployments, manual image builds, and manual server configuration were all eliminated. GitHub Actions owns the build. ECS owns the rollout. The engineer owns the pipeline design — not the execution.

Blast Radius Containment

Service isolation at the container level means failures stay local. An email service crash does not affect order processing. A cache eviction event does not bring down the API. Each component fails independently and recovers independently.
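The cache-eviction claim depends on treating the cache as optional. Below is a minimal cache-aside sketch in which an evicted key is just a miss and a cache outage falls back to the source of truth. The `Cache` protocol and the `load` callback are hypothetical stand-ins for a Redis client and a PostgreSQL query.

```python
import logging
from typing import Callable, Optional, Protocol

log = logging.getLogger("api")


class Cache(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


def cached_read(cache: Cache, key: str, load: Callable[[str], str]) -> str:
    """Cache-aside read: a miss or a cache error falls back to `load`
    instead of failing the request."""
    try:
        value = cache.get(key)
    except Exception:  # e.g. connection refused while the cache restarts
        log.warning("cache unavailable, reading %s from the database", key)
        return load(key)
    if value is not None:
        return value
    value = load(key)  # miss: never cached, expired, or evicted
    try:
        cache.set(key, value)
    except Exception:
        log.warning("cache write failed for %s", key)
    return value
```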

Observability Before Incidents

Monitoring was not added after something broke. Every component was instrumented before go-live. The system was observable before it was public.

Infrastructure as a Reliability Contract

The business does not think about infrastructure. That is the goal. Uptime, deployments, and alerts are owned by the SRE layer — not delegated back to the product team.

Let's Connect

Ahmedabad, India · Open to Remote & Global Opportunities
