Role: Senior AI Platform / LLM Infrastructure Engineer - C2C - Charlotte, NC (Hybrid)
Job Description:
Must-Have Skills
• LLM Inference Frameworks: vLLM, TensorRT-LLM, Triton Inference Server, SGLang
• Model Optimization: Continuous Batching, Speculative Decoding, KV Cache / Prefix Caching, FP8 / AWQ / GPTQ
• Distributed/Parallel Systems: Tensor Parallelism
• Platform & Orchestration: Kubernetes, KServe, OpenShift AI, Helm / Operators
• GPU & Performance: CUDA, NCCL, MIG, GPU Orchestration (Run:ai)
• Monitoring: Prometheus, Grafana, ML Observability
• Programming: Python
• GenAI Tools: Arize AI, Claude (CoWork)
• Load / Performance Testing: GuideLLM, Locust
Key Responsibilities
• Build and manage LLM inference platforms on on-premises GPU infrastructure
• Optimize model performance using advanced inference techniques (batching, caching, quantization)
• Deploy and operate ML workloads on Kubernetes (KServe/OpenShift AI)
• Enable GPU scheduling and orchestration for large-scale workloads
• Implement monitoring and performance benchmarking frameworks
• Drive SRE practices for platform reliability and scalability (observability, incident handling)
• Collaborate with AI/ML teams to enable production-grade GenAI deployments