What kind of teams do you work best with?

Teams that take reliability seriously as a product discipline — not teams that treat it as a firefighting rotation. I thrive in environments where on-call is structured, post-mortems are blameless, and there's appetite to reduce toil systematically.

You have just over a year of experience. Why should we consider you for an SRE role?

Fair question. In that year I moved from intern to engineer in four months, reduced incident detection time by 40%, cut unplanned downtime by 35%, and eliminated 80% of manual deployment operations through automation. I don't measure experience in years — I measure it in systems improved and outcomes delivered.

What's your approach to on-call?

On-call should be boring. If it isn't, the monitoring stack has too much noise, the runbooks are incomplete, or the automation hasn't caught up to the failure modes yet. My goal in any role is to make on-call progressively quieter.

Do you work better as an individual contributor or within a team?

Both — at different phases. I operate independently when building systems, architecting monitoring stacks, or writing automation. I work closely with development teams when reliability practices need to be embedded into deployment workflows.

What does your ideal next role look like?

A platform or SRE role where reliability is treated as an engineering discipline, not a support function. A team that measures MTTD, tracks error budgets, and is actively working to reduce toil — not just manage it.

Site Reliability Engineer

SRE & Platform Roles

Systems That
Heal
Themselves

Are you open to remote roles?

I'm open to all working arrangements - remote, hybrid, or in-office. My focus is on contributing meaningfully to the team and the systems we're responsible for, wherever that work happens best.

Core Disciplines

Reliability Engineering.

Practiced with Precision.

Observability, automation, and incident engineering - applied to systems that businesses stake their reputation on. This is the work that prevents headlines, not the work that makes them.

Observability Architecture

Grafana · Zabbix · Custom Alert Pipelines

I don't build dashboards. I build intelligence layers that surface anomalies before they become incidents. Every alert is designed to be actionable. Every metric is chosen to eliminate noise, not add to it.

Incident Engineering

Runbooks · SLA Compliance · Blameless Post-Mortems

From first signal to post-mortem — I own the full incident lifecycle. Structured runbooks, escalation trees, and blameless reviews that extract the failure pattern permanently. Every incident leaves the system stronger than it found it.

Toil Elimination

Ansible · CI/CD Pipelines · Configuration-as-Code

Manual processes are reliability debt that compounds silently. I audit operational workflows for toil density, then automate them out of existence — converting engineer hours into system intelligence.


How I Work

Reliability Engineered

by Design.

Four principles that govern how I approach every system, every incident, and every automation decision. Not methodology for its own sake - engineering with measurable outcomes.

OBSERVABILITY STACK · LIVE

metrics ingested: 14,200/min

active alerts: 3 of 847 rules

noise ratio: < 2%

Observe Everything. Alert on What Matters.

Most monitoring systems drown engineers in noise. I build observability stacks that distinguish signal from symptom — so when an alert fires, it demands action, not investigation.

Grafana · Zabbix · Custom Threshold Engineering

Manual → Automated

Manual Task → Ansible Playbook

4 hours → 12 minutes

Error-prone → Version-controlled

Undocumented → Runbook + CI/CD

Automate the Toil. Engineer the Exception.

I audit workflows for toil density and replace repetition with automation, freeing the team to focus on problems machines can't solve yet.

Ansible · CI/CD · Infrastructure-as-Code

DETECT: 4m 12s

TRIAGE: 3m 21s

RESOLVE: 17m 44s

POST-MORTEM: ✓ Closed

Blameless post-mortem · loop permanently closed

Every Incident Closes a Loop.

Incidents aren't failures — they're the system revealing its own blind spots. I treat every outage as a structured learning event that permanently eliminates the failure mode.

Incident Runbooks · SLA Compliance · Blameless Reviews
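As a rough illustration of how a detection metric such as MTTD and its improvement can be derived from incident records, here is a minimal Python sketch. The function names and the `(started_at, detected_at)` record shape are hypothetical, chosen only for this example; they do not describe any particular incident tool.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average gap between when an incident started and when it was detected.

    `incidents` is a list of (started_at, detected_at) pairs (hypothetical shape).
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTD")
    # Sum the per-incident detection delays, starting from a zero timedelta.
    total = sum((detected - started for started, detected in incidents), timedelta())
    return total / len(incidents)


def percent_change(before: timedelta, after: timedelta) -> float:
    """Relative change from `before` to `after`, in percent (negative = faster)."""
    if before.total_seconds() == 0:
        raise ValueError("baseline duration must be non-zero")
    delta = after.total_seconds() - before.total_seconds()
    return delta / before.total_seconds() * 100.0
```

Comparing the mean over a baseline period with the mean over a later period using `percent_change` gives a figure of the "-40% MTTD" kind; the result is only as trustworthy as the recorded start and detection timestamps.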

SLO Dashboard

Active

Availability SLO: 99.9% ✔ Within budget

Latency P95: 97.8% ✔ Within budget

Error Rate: 94.1% ⚠ Approaching limit

Error Budget Left: 68% · Burn rate: normal

SLOs Are Contracts. I Honor Them.

SLOs aren't internal metrics — they're promises to the business. I define error budgets, track burn rates in real time, and make the case to engineering leadership when reliability is at risk.

SLO Frameworks · Error Budget Tracking · Reliability Reporting
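For readers less familiar with the terms, the arithmetic behind an error budget and a burn rate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: a request-based availability SLO, and hypothetical names (`SloWindow`, `error_budget_remaining`, `burn_rate`) that are not taken from any specific SLO tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO window (hypothetical structure)."""
    slo_target: float     # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int   # requests served during the period measured
    failed_requests: int  # requests that counted against the SLO

    def __post_init__(self) -> None:
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_requests <= 0:
            raise ValueError("total_requests must be positive")


def error_budget_remaining(w: SloWindow) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = (1.0 - w.slo_target) * w.total_requests
    return 1.0 - w.failed_requests / allowed_failures


def burn_rate(w: SloWindow) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being consumed exactly as fast as the
    SLO permits; above 1.0 it will run out before the window ends.
    """
    observed_error_rate = w.failed_requests / w.total_requests
    return observed_error_rate / (1.0 - w.slo_target)
```

With a 99.9% target, 1,000,000 requests, and 320 failures, `error_budget_remaining` comes out at roughly 0.68 and `burn_rate` at roughly 0.32. Real setups usually evaluate burn rate over several shorter lookback periods rather than a single window; that refinement is left out here.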

Reliability Stack

Built on Tools That

Run Production.

Every tool in this stack has been used in live environments - not tutorials, not sandboxes. This is the ecosystem I operate in daily to keep infrastructure observable, automated, and resilient.

Reliability Maturity

Where Is Your Infrastructure

Right Now?

Most teams don't have a reliability problem. They have a visibility problem. Here's how I diagnose where you are - and exactly what changes when I'm involved.

Stage 01

Reactive

Your team finds out about incidents when users do.

Monitoring exists but alerts are ignored

No defined SLOs or error budgets

Runbooks are undocumented or missing

Every incident starts from zero

High toil · High stress · Low trust in systems

Stage 02 · Most Common

Fragile

Your systems work - until they don't. And no one knows why.

Deployments are manual and anxiety-inducing

Configuration drift across environments

Incidents are resolved but never closed

Toil is normalized, not measured

Unpredictable · Unscalable · Unsustainable

Stage 03

Scaling

Your infrastructure is growing faster than your reliability practices.

SLOs defined but not enforced

Error budgets exist on paper, not in decisions

No reliability roadmap aligned to business goals

Dev velocity and ops stability in constant tension

Growing fast · Breaking often · Cost of failure rising

Credentials

Validated by the Industry.

Not Just Claims.

Every certification here was earned through hands-on practice, not passive study. These represent the technical foundations I apply daily in production environments.

AZ-104: Microsoft Azure Administrator · Microsoft

Key Skills

Identity Management · Resource Governance · High Availability

AWS Solutions Architect Associate · Amazon Web Services

System Design · High Availability · Disaster Recovery


AWS Certified CloudOps Engineer - Associate · Amazon Web Services

Key Skills

Monitoring & Observability · Incident Response · Infrastructure Automation

Experience

The Work Behind

The Metrics.

Two roles. One company. A clear trajectory - from building the observability foundation as an intern to owning reliability outcomes across production cloud environments as an engineer.

Feb 2024 – Present

CURRENT

Parkar

Platform Operations Engineer

SRE · AWS · Ansible · Incident Engineering · Grafana · Zabbix · CI/CD · Runbook Design

Full ownership of monitoring architecture, incident lifecycle management, and automation strategy across production cloud environments. Not a support function — a reliability engineering role with measurable outcomes and direct impact on SLA compliance and operational efficiency.

Impact

-40% · Incident Detection · MTTD improvement across all environments

-35% · Unplanned Downtime · Post observability overhaul

+50% · Operational Efficiency · Automated deployment and config workflows

-80% · Toil Eliminated · Ansible automation framework

Jan 2024 – Apr 2024

Parkar

Platform Operations Intern

Grafana · Zabbix · AWS · Linux Administration · Technical Documentation

Started with zero production access. Left with Zabbix monitoring deployed across Linux server fleets, Grafana dashboards live for development teams, and an AWS Cloud Practitioner certification earned mid-internship. Built the observability foundation the team still operates on. Promoted to full-time in four months.

Sep 2020 – Jun 2024

Gujarat Technological University

B.E. Computer Engineering

Computer Engineering · Ansible · Python · Linux · Infrastructure Automation

The foundation. Final year project: an Ansible-based Linux server automation framework that cut manual deployment time by 80%. Not a student project — a production principle, prototyped early. Everything that followed was built on this thinking.

Cases

Work That Actually

Ships.

Real infrastructure problems, real solutions. Each project here has run in production and solved a problem that mattered.

Featured

CloudFordge - Free Cloud Certification Platform

Free, scenario-based practice platform for AWS, Azure, and GCP certifications. Features 411+ questions, instant explanations, and secure user progress tracking at zero cost.


Featured

Silver Jems — E-Commerce Infrastructure on AWS

I designed and own the production AWS infrastructure for a family jewellery business: a containerised multi-service architecture, an automated CI/CD pipeline, and full-stack observability with dynamic alerting across every component.

AWS ECS · AWS ECR · AWS EC2 · AWS Lambda · +10



Quick Answers

Things You're Probably

Wondering.

The questions hiring managers and engineering leads ask most. Answered directly, without the interview performance.