Site Reliability Engineer
SRE & Platform Roles

Systems That
Heal
Themselves

Core Disciplines

Reliability Engineering.

Practiced with Precision.

Observability, automation, and incident engineering - applied to systems that businesses stake their reputation on. This is the work that prevents headlines, not the work that makes them.

Observability Architecture

Grafana · Zabbix · Custom Alert Pipelines

I don't build dashboards. I build intelligence layers that surface anomalies before they become incidents. Every alert is designed to be actionable. Every metric is chosen to eliminate noise, not add to it.

Incident Engineering

Runbooks · SLA Compliance · Blameless Post-Mortems

From first signal to post-mortem — I own the full incident lifecycle. Structured runbooks, escalation trees, and blameless reviews that extract the failure pattern permanently. Every incident leaves the system stronger than it found it.

Toil Elimination

Ansible · CI/CD Pipelines · Configuration-as-Code

Manual processes are reliability debt that compounds silently. I audit operational workflows for toil density, then automate them out of existence — converting engineer hours into system intelligence.

View Reliability Cases
How I Work

Reliability Engineered

by Design.

Four principles that govern how I approach every system, every incident, and every automation decision. Not methodology for its own sake - engineering with measurable outcomes.

OBSERVABILITY STACK · LIVE
metrics ingested: 14,200/min
active alerts: 3 of 847 rules
noise ratio: < 2%

Observe Everything. Alert on What Matters.

Most monitoring systems drown engineers in noise. I build observability stacks that distinguish signal from symptom — so when an alert fires, it demands action, not investigation.

Grafana · Zabbix · Custom Threshold Engineering
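One common way to keep an alert actionable is to fire only when a threshold is breached for several consecutive samples, so a single spike does not page anyone. The sketch below is illustrative only: the class name `SustainedThreshold`, the 90.0 threshold, and the three-sample window are assumptions for the example, not the rules running in the stack described here.

```python
from collections import deque


class SustainedThreshold:
    """Fire only after `required_breaches` consecutive samples exceed `threshold`.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, threshold: float, required_breaches: int) -> None:
        if required_breaches < 1:
            raise ValueError("required_breaches must be at least 1")
        self.threshold = threshold
        # Keep only the most recent breach/no-breach results.
        self._recent = deque(maxlen=required_breaches)

    def observe(self, value: float) -> bool:
        """Record one sample and return True if the alert should fire."""
        self._recent.append(value > self.threshold)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and all(self._recent)


if __name__ == "__main__":
    rule = SustainedThreshold(threshold=90.0, required_breaches=3)
    for sample in [95.0, 50.0, 95.0, 95.0, 95.0]:
        print(sample, rule.observe(sample))
```

A single spike followed by recovery never fills the window with breaches, so only the sustained run at the end of the sample list triggers the alert.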
Manual: Manual Task · 4 hours · Error-prone · Undocumented
Automated: Ansible Playbook · 12 minutes · Version-controlled · Runbook + CI/CD

Automate the Toil. Engineer the Exception.

I audit workflows for toil density and replace repetition with automation, freeing the team to focus on problems machines can't solve yet.

Ansible · CI/CD · Infrastructure-as-Code
DETECT: 4m 12s
TRIAGE: 3m 21s
RESOLVE: 17m 44s
POST-MORTEM: ✓ Closed
Blameless post-mortem · loop permanently closed

Every Incident Closes a Loop.

Incidents aren't failures — they're the system revealing its own blind spots. I treat every outage as a structured learning event that permanently eliminates the failure mode.

Incident Runbooks · SLA Compliance · Blameless Reviews
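The phase timings in the incident timeline can be rolled up into a total time to resolve and a per-phase share, which is the kind of figure a post-mortem usually starts from. This is a minimal sketch: treating the three phases as sequential and non-overlapping is an assumption, and the function names are hypothetical.

```python
from datetime import timedelta

# Phase durations copied from the incident timeline; treating them as
# sequential, non-overlapping phases is an assumption of this sketch.
PHASES = {
    "detect": timedelta(minutes=4, seconds=12),
    "triage": timedelta(minutes=3, seconds=21),
    "resolve": timedelta(minutes=17, seconds=44),
}


def time_to_resolve(phases: dict) -> timedelta:
    """Total elapsed time across all phases."""
    return sum(phases.values(), timedelta())


def phase_share(phases: dict) -> dict:
    """Fraction of the total time spent in each phase."""
    total = time_to_resolve(phases)
    return {name: duration / total for name, duration in phases.items()}


if __name__ == "__main__":
    print("total:", time_to_resolve(PHASES))
    for name, share in phase_share(PHASES).items():
        print(f"{name}: {share:.1%}")
```

Seeing which phase dominates the total points the review at where automation or better runbooks would help most.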
SLO Dashboard · Active
Availability SLO: 99.9% · ✔ Within budget
Latency P95: 97.8% · ✔ Within budget
Error Rate: 94.1% · ⚠ Approaching limit
Error Budget Left: 68% · Burn rate: normal

SLOs Are Contracts. I Honor Them.

SLOs aren't internal metrics — they're promises to the business. I define error budgets, track burn rates in real time, and make the case to engineering leadership when reliability is at risk.

SLO Frameworks · Error Budget Tracking · Reliability Reporting
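For readers new to error budgets, the arithmetic behind "budget left" and "burn rate" is short. The sketch below uses the standard event-based definitions (budget = allowed bad events under the SLO; burn rate = observed error rate divided by allowed error rate). The function names and the request counts in the usage section are made-up illustrations, not production data.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("SLO target and event count leave no error budget")
    return 1.0 - bad_events / allowed_bad


def burn_rate(slo_target: float, window_events: int, window_bad: int) -> float:
    """Observed error rate in a window divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly by the end of the SLO period;
    values above 1.0 spend it early.
    """
    if window_events == 0:
        return 0.0
    return (window_bad / window_events) / (1.0 - slo_target)


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% SLO, 1,000,000 requests, 320 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 320)
    print(f"budget left: {remaining:.0%}")
    # Hypothetical last-hour window: 10,000 requests, 10 failures.
    print(f"burn rate: {burn_rate(0.999, 10_000, 10):.2f}")
```

Tracking the burn rate over a short window alongside the remaining budget is what lets a reliability concern be raised before the budget is actually exhausted.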
Reliability Stack

Built on Tools That

Run Production.

Every tool in this stack has been used in live environments - not tutorials, not sandboxes. This is the ecosystem I operate in daily to keep infrastructure observable, automated, and resilient.

AWS
Ansible
Prometheus
GitHub
Zabbix
Terraform
Grafana
Reliability Maturity

Where Is Your Infrastructure

Right Now?

Most teams don't have a reliability problem. They have a visibility problem. Here's how I diagnose where you are - and exactly what changes when I'm involved.

Stage 01

Reactive

Your team finds out about incidents when users do.

Monitoring exists but alerts are ignored
No defined SLOs or error budgets
Runbooks are undocumented or missing
Every incident starts from zero
High toil · High stress · Low trust in systems
Stage 02 · Most Common

Fragile

Your systems work - until they don't. And no one knows why.

Deployments are manual and anxiety-inducing
Configuration drift across environments
Incidents are resolved but never closed
Toil is normalized, not measured
Unpredictable · Unscalable · Unsustainable
Stage 03

Scaling

Your infrastructure is growing faster than your reliability practices.

SLOs defined but not enforced
Error budgets exist on paper, not in decisions
No reliability roadmap aligned to business goals
Dev velocity and ops stability in constant tension
Growing fast · Breaking often · Cost of failure rising
Credentials

Validated by the Industry.

Not Just Claims.

Every certification here was earned through hands-on practice, not passive study. These represent the technical foundations I apply daily in production environments.

AZ-104: Microsoft Azure Administrator · Microsoft
Key Skills
Identity Management · Resource Governance · High Availability
AWS Solutions Architect Associate · Amazon Web Services
System Design · High Availability · Disaster Recovery
Verify Credential
AWS Certified CloudOps Engineer - Associate · Amazon Web Services
Key Skills
Monitoring & Observability · Incident Response · Infrastructure Automation
Experience

The Work Behind

The Metrics.

Two roles. One company. A clear trajectory - from building the observability foundation as an intern to owning reliability outcomes across production cloud environments as an engineer.

Feb 2024 – Present
CURRENT

Parkar

Platform Operations Engineer

SRE · AWS · Ansible · Incident Engineering · Grafana · Zabbix · CI/CD · Runbook Design

Full ownership of monitoring architecture, incident lifecycle management, and automation strategy across production cloud environments. Not a support function — a reliability engineering role with measurable outcomes and direct impact on SLA compliance and operational efficiency.

Impact

-40% · Incident Detection · MTTD improvement across all environments
-35% · Unplanned Downtime · Post observability overhaul
+50% · Operational Efficiency · Automated deployment and config workflows
-80% · Toil Eliminated · Ansible automation framework
Jan 2024 – Apr 2024

Parkar

Platform Operations Intern

Grafana · Zabbix · AWS · Linux Administration · Technical Documentation

Started with zero production access. Left with Zabbix monitoring deployed across Linux server fleets, Grafana dashboards live for development teams, and an AWS Cloud Practitioner certification earned mid-internship. Built the observability foundation the team still operates on. Promoted to full-time in four months.

Sep 2020 – Jun 2024

Gujarat Technological University

B.E. Computer Engineering

Computer Engineering · Ansible · Python · Linux · Infrastructure Automation

The foundation. Final year project: an Ansible-based Linux server automation framework that cut manual deployment time by 80%. Not a student project — a production principle, prototyped early. Everything that followed was built on this thinking.

Cases

Work That Actually

Ships.

Real infrastructure problems, real solutions. Each project here has run in production and solved a problem that mattered.

CloudFordge - Free Cloud Certification Platform

Free, scenario-based practice platform for AWS, Azure, and GCP certifications. Features 411+ questions, instant explanations, and secure user progress tracking at zero cost.


View all cases
Quick Answers

Things You're Probably

Wondering.

The questions hiring managers and engineering leads ask most. Answered directly, without the interview performance.

I'm open to all working arrangements - remote, hybrid, or in-office. My focus is on contributing meaningfully to the team and the systems we're responsible for, wherever that work happens best.

Blog

Insights and

updates

Thoughts on reliability engineering, infrastructure automation, and building systems that last.

Why n8n is the Game-Changer Your Workflow Automation Needs in 2025
Kartik Patel · 19 May 2025 · OSS

Discover how n8n’s open-source, AI-powered, and self-hosted workflows are transforming business automation in 2025, giving you full control, flexibility, and cost-efficiency.

Read more

Learn more about SRE practices by reading my blog

View all blogs
Kartik Patel
Let's Connect

Ahmedabad, India · Open to Remote & Global Opportunities
© 2026 Kartik Patel · Built with intention, not just code.