Silver Jems — E-Commerce Infrastructure on AWS
Designed and now owns the production AWS infrastructure for a family jewellery business: a containerised multi-service architecture, an automated CI/CD pipeline, and full-stack observability with dynamic alerting across every component.
The Problem
A family jewellery business needed an online store that could handle real traffic, real orders, and real consequences if it went down. The ask wasn't just to deploy an application — it was to own the infrastructure so the business never had to think about it. That meant designing for isolation, automating every release, and building observability before the first customer ever landed on the site.
Infrastructure Design
Service Isolation by Default
Every component of the application runs as an independent container on AWS ECS — storefront, backend API, database, and email service are fully decoupled. This was a deliberate reliability decision: a database issue cannot take down the storefront, and an API deployment cannot affect the email service. Blast radius is contained at the container boundary.
Network Architecture
All services are deployed inside a private VPC with controlled ingress. Public traffic enters only through CloudFront, which handles SSL termination, edge caching, and DDoS mitigation. Route 53 manages DNS routing with health checks — if an origin becomes unhealthy, traffic fails over automatically. No service is directly exposed to the public internet.
Storage Decoupled from Compute
Product images and static assets are stored in Amazon S3 and served through CloudFront. Storage scales independently of the application layer, so no infrastructure changes are needed as asset volume grows. CloudFront's global CDN delivers assets from the edge in under 100 ms. Because assets are requested through CloudFront rather than from the bucket directly, replacing the origin bucket never requires a frontend change.
CI/CD Pipeline — Zero Manual Deploys
The Deployment Problem
Manual deployments introduce human error into every release. For a business where a failed deployment means lost orders, that risk is unacceptable. The goal was to make deployment a process owned by the pipeline, not a person.
Pipeline Design
Developer pushes to dev branch
Promotion to test branch triggers deployment to a dedicated test server
Test server validates the build in an environment identical to production
Merge to main triggers GitHub Actions automatically
Actions builds the Docker image and pushes it to AWS ECR
ECS pulls the new image and rolls out the update — zero SSH, zero manual steps
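The branch flow above can be sketched as a small routing function. This is an illustration only, not the actual workflow, which runs in GitHub Actions. The registry host, repository name, and the choice of a short commit SHA as the image tag are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration; the real registry and repository
# names are not given in this write-up.
ECR_REGISTRY = "123456789012.dkr.ecr.eu-west-1.amazonaws.com"
REPOSITORY = "storefront"


@dataclass(frozen=True)
class DeployPlan:
    environment: str          # "test" or "production"
    image_uri: Optional[str]  # set only when an image is built and pushed to ECR


def plan_for_push(branch: str, commit_sha: str) -> Optional[DeployPlan]:
    """Map a pushed branch to a deployment plan, mirroring the listed steps."""
    if branch == "main":
        # Merge to main: build the image, push it to ECR, let ECS roll it out.
        tag = commit_sha[:12]  # assumption: images are tagged by short commit SHA
        return DeployPlan("production", f"{ECR_REGISTRY}/{REPOSITORY}:{tag}")
    if branch == "test":
        # Promotion to test: deploy to the dedicated test server.
        return DeployPlan("test", None)
    # In this sketch, dev and any other branch do not deploy anywhere.
    return None
```

Tagging each image with the commit that produced it is one way to keep releases traceable back to Git; whether the real pipeline tags images this way is not stated here.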
Outcome
Every production release is traceable, repeatable, and reversible. The deployment history lives in Git. Rollback is a revert and a push.
Security Posture
Admin panel protected by Multi-Factor Authentication — privileged access requires a second factor, always
End users authenticate via Google OAuth or standard login — managed at the API layer, not the UI
Every API route enforces server-side session validation — the API rejects unauthorised requests regardless of origin
All services run inside a private VPC — no direct public internet exposure
CloudFront acts as the sole public entry point — origin is shielded behind the CDN layer
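As a rough sketch of what server-side session validation on every API route means in practice, the example below rejects any request whose token is missing, unknown, or expired. The in-memory store, the one-hour lifetime, and the function names are hypothetical and chosen for the example; the real API's session mechanism is not described here.

```python
import secrets
import time
from typing import Dict, Optional, Tuple

SESSION_TTL_SECONDS = 3600  # assumption: one-hour sessions

# In-memory stand-in for a server-side session store (hypothetical; a real
# deployment would normally use a store shared between API instances).
_sessions: Dict[str, Tuple[str, float]] = {}


def create_session(user_id: str, now: Optional[float] = None) -> str:
    """Issue an unguessable token and record it server-side with an expiry."""
    token = secrets.token_urlsafe(32)
    issued = time.time() if now is None else now
    _sessions[token] = (user_id, issued + SESSION_TTL_SECONDS)
    return token


def validate_session(token: Optional[str], now: Optional[float] = None) -> Optional[str]:
    """Return the user id for a live session, or None if it must be rejected."""
    if not token:
        return None
    entry = _sessions.get(token)
    if entry is None:
        return None  # token was never issued by this server
    user_id, expires_at = entry
    current = time.time() if now is None else now
    if current >= expires_at:
        _sessions.pop(token, None)  # drop expired sessions on sight
        return None
    return user_id


def handle_request(token: Optional[str]) -> int:
    """Return an HTTP-style status: 401 unless the session is valid."""
    return 200 if validate_session(token) is not None else 401
```

Because the decision is made from server-side state alone, a request crafted outside the UI gets the same 401 as any other unauthenticated call.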
Observability Stack
Philosophy
Observability was instrumented before the platform went live. The first incident was caught by an alert, not a customer complaint. Every monitored component has a meaningful signal tied to it, not a generic CPU threshold.
What Is Monitored
ECS container state — all services watched for restarts, crashes, and unhealthy states
API response time and error rate — latency tracked via Grafana
PostgreSQL slow queries, active connections, and replication lag — Zabbix agent on DB host
Redis memory usage and cache hit ratio — early warning on eviction pressure
Email service delivery events — failed sends trigger alerts before users notice
SSL certificate expiry — automated watch, alert fires 30 days before expiry
CloudFront cache hit ratio — drop in ratio signals origin pressure or misconfiguration
Host-level metrics on EC2 — CPU, memory, disk I/O, network via Zabbix
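The SSL certificate expiry item in the list above can be illustrated with a standalone check built on Python's standard library. This is a sketch of the idea, not how the monitoring stack itself implements the watch. The 30-day window comes from the list; the function names are made up for the example.

```python
import socket
import ssl
from datetime import datetime, timezone

ALERT_WINDOW_DAYS = 30  # from the list above: alert 30 days before expiry


def days_until(not_after: datetime, now: datetime) -> int:
    """Whole days remaining before the certificate expires (negative if expired)."""
    return (not_after - now).days


def should_alert(not_after: datetime, now: datetime) -> bool:
    """True once the certificate is inside the alert window or already expired."""
    return days_until(not_after, now) <= ALERT_WINDOW_DAYS


def fetch_not_after(hostname: str, port: int = 443) -> datetime:
    """Read the expiry date from the certificate a host presents over TLS."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

A scheduled job could call `fetch_not_after` for the public hostname and raise an alert when `should_alert` returns true. Keeping the date arithmetic separate from the network call makes the rule easy to test without a live endpoint.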
Alerting Design
All alerts are dynamic: thresholds are calculated relative to baseline behaviour, not fixed numbers. A spike in API error rate that lasts 30 seconds is noise; one that persists for 3 minutes is an incident. The alerting layer knows the difference.
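One way to express this rule in code is a detector that derives its threshold from a rolling baseline and fires only after a sustained run of breaches. The sketch below is an assumption-laden illustration, not the production alerting configuration. It assumes one sample every 10 seconds, so 18 consecutive breaches is about 3 minutes, and it assumes a threshold of the baseline mean plus three standard deviations with a small minimum margin.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque


class SustainedBaselineAlert:
    """Fire only when a metric stays above its own baseline for long enough."""

    def __init__(self, baseline_size: int = 60, sigmas: float = 3.0,
                 min_margin: float = 0.01, sustain_samples: int = 18):
        self.baseline: Deque[float] = deque(maxlen=baseline_size)
        self.sigmas = sigmas
        self.min_margin = min_margin            # floor so a flat baseline is not hair-trigger
        self.sustain_samples = sustain_samples  # 18 samples x 10 s = 3 minutes (assumed interval)
        self.breach_streak = 0

    def threshold(self) -> float:
        """Threshold relative to recent normal behaviour, not a fixed number."""
        spread = max(self.sigmas * pstdev(self.baseline), self.min_margin)
        return mean(self.baseline) + spread

    def observe(self, value: float) -> bool:
        """Feed one sample; return True once the breach has been sustained."""
        if len(self.baseline) < 2:
            self.baseline.append(value)  # still learning what normal looks like
            return False
        if value > self.threshold():
            self.breach_streak += 1      # breaching samples stay out of the baseline
        else:
            self.breach_streak = 0
            self.baseline.append(value)
        return self.breach_streak >= self.sustain_samples
```

With these settings a three-sample spike (30 seconds) resets quietly when the metric recovers, while the eighteenth consecutive breaching sample is the first to return true. Breaching samples are excluded from the baseline so a long incident does not gradually redefine normal.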
SRE Principles Applied
Toil Elimination
Manual deployments, manual image builds, and manual server configuration were all eliminated. GitHub Actions owns the build. ECS owns the rollout. The engineer owns the pipeline design — not the execution.
Blast Radius Containment
Service isolation at the container level means failures stay local. An email service crash does not affect order processing. A cache eviction event does not bring down the API. Each component fails independently and recovers independently.
Observability Before Incidents
Monitoring was not added after something broke. Every component was instrumented before go-live. The system was observable before it was public.
Infrastructure as a Reliability Contract
The business does not think about infrastructure. That is the goal. Uptime, deployments, and alerts are owned by the SRE layer — not delegated back to the product team.