Location: Charlotte, North Carolina (NC)
Contract Type: C2C
Posted: 2 hours ago
Closing Date: 06/08/2026
Skills: Python; LLM inference frameworks; GenAI tools: Arize AI, Claude (CoWork)
Visa Type: H1B, H4 EAD, Other

Role: Senior AI Platform / LLM Infrastructure Engineer - C2C - Charlotte, NC (Hybrid)

Job Description:

Must-Have Skills

• LLM Inference Frameworks: vLLM, TensorRT-LLM, Triton Inference Server, SGLang
• Model Optimization: Continuous Batching, Speculative Decoding, KV Cache / Prefix Caching, FP8 / AWQ / GPTQ
• Distributed/Parallel Systems: Tensor Parallelism
• Platform & Orchestration: Kubernetes, KServe, OpenShift AI, Helm / Operators
• GPU & Performance: CUDA, NCCL, MIG, GPU Orchestration (Run:AI)
• Monitoring: Prometheus, Grafana, ML Observability
• Programming: Python
• GenAI Tools: Arize AI, Claude (CoWork)
• Load / Performance Testing: GuideLLM, Locust

Key Responsibilities

• Build and manage LLM inference platforms on on-prem GPU infrastructure
• Optimize model performance using advanced inference techniques (batching, caching, quantization)
• Deploy and operate ML workloads on Kubernetes (KServe/OpenShift AI)
• Enable GPU scheduling and orchestration for large-scale workloads
• Implement monitoring and performance benchmarking frameworks
• Drive SRE practices for platform reliability and scalability (observability, incident handling)
• Collaborate with AI/ML teams to enable production-grade GenAI deployments