Job Role : AI Ops Engineer
Location : Fremont, CA (Local profiles only) || Onsite
Experience Requirements:
• 5+ years in IT operations or L1 support roles.
• Exposure to AIOps environments or automated monitoring solutions is a plus.
Qualifications:
• Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Key Skills:
Splunk, PowerShell or Python, Log Monitoring, Confluence, and SharePoint
Skill Requirements:
• Hands-on experience with IT monitoring tools (e.g., Nagios, Zabbix, Prometheus, Splunk, or similar).
• Understanding of scripting (PowerShell, Python, or Shell) for basic automation tasks.
• Understanding of AIOps concepts and automation frameworks.
• Proficiency in Confluence and SharePoint for status updates and documentation.
• Ability to interpret logs and detect anomalies proactively.
• Familiarity with ITIL processes for incident, problem, and change management.
• Experience using ticketing systems (e.g., ServiceNow, Jira, Remedy).
• Skilled in creating and updating runbooks and SOPs.
• Ability to follow documented procedures accurately.
• Strong attention to detail for maintaining health check reports and incident updates.
• Analytical thinking for quick problem identification and escalation.
• Excellent communication and documentation skills.
• Proactive mindset with a passion for reliability and automation.
• Strong problem-solving and debugging skills.
Preferred:
• ITIL Foundation Certification.
• Experience with anomaly detection, time-series forecasting, and log analysis.
• Basic certifications in monitoring tools or cloud platforms (AWS, Azure).
Key Responsibilities:
• Proactively monitor alerts and detect anomalies from logs.
• Perform daily health checks until full automation and application monitoring are implemented.
• Perform status checks as per existing runbooks.
• Create and update runbooks as needed to reflect current processes.
• Update system health status every 2 hours during the shift in Confluence or SharePoint.
• Acknowledge incidents promptly and route them to the correct team.
• Update incident status every 4 hours for P1/P2 tickets.
• Communicate with users and provide timely updates on their requests.
• Ensure timely acknowledgment, follow-up, and closure of incidents within SLA.
• Complete service tasks on time as per SLA to release queues quickly.
• Work strictly as per SOPs documented by the team.
• Apply incident management processes and ITIL principles.
• Follow documented procedures and create or update runbooks.
• Communicate and coordinate effectively.
• Use Confluence, SharePoint, and ticketing systems.
• Implement best practices in ML operations and productionization.
• Ensure compliance with enterprise data security, governance, and regulatory requirements.
• Collaborate with data engineers, analysts, DevOps/SRE teams, and business teams to ensure reliability and security.