Job Description

Role - Site Reliability Engineer (SRE / GenAI Infrastructure / Kubernetes / IaC)

Location - Montreal, QC

Production experience in SRE / Infrastructure / ops for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation

Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and with networking and systems engineering knowledge.

Roles and Responsibilities:
Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
Optimize cost vs. performance tradeoffs in large-scale compute environments
Harden systems for security, compliance, auditability, and data governance
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
Maintain runbooks, operational playbooks, documentation, and training materials
Participate in on-call rotations and respond to production incidents 24/7 as needed
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Site Reliability Engineer (SRE / GenAI Infrastructure / Kubernetes / IaC) Job at Atlantis IT group, Canada
