Role: SRE/Triage Engineer
Locations: Plano, TX and Bothell, WA (local candidates needed for in-person interview)
Duration: 6+ months
Mode of Interview: In-Person
Visa: H-1B, GC, GC-EAD, US Citizen
Job Description:
- Monitor production commerce applications to proactively identify issues and ensure high availability.
- Perform first-level triage and validation of production incidents, assessing impact and urgency.
- Analyze and interpret application and infrastructure logs (ELK, Dynatrace, Kubernetes) to isolate and diagnose problems.
- Collaborate closely with development and platform teams to escalate and resolve issues efficiently.
- Maintain observability dashboards and alerts; fine-tune thresholds for optimal signal-to-noise ratio.
- Contribute to root cause analysis (RCA) and post-incident reviews to improve system resiliency.
- Document triage runbooks, known issues, and SOPs for faster recovery cycles.
- Support performance tuning, service availability metrics, and reliability improvement initiatives.
Required Skills and Experience:
- Experience in system reliability, production support, or application monitoring for large-scale enterprise systems.
- Familiarity with microservices and API-driven ecosystems.
- Strong proficiency with ELK Stack, Dynatrace, and Kubernetes observability tools.
- Working knowledge of Java-based application architectures and Cassandra database operations.
- Experience with Azure monitoring tools and Kafka monitoring for distributed systems.
- MuleSoft monitoring experience is a plus (optional).
- Familiarity with CI/CD pipelines, automated alerting, and reliability testing frameworks.
- Demonstrated experience with production triaging, log analysis, and root cause identification.
- Excellent communication skills and ability to collaborate across teams.