AI Computing Infrastructure Engineer – GPU & High-Performance Computing
Role Overview:
We are looking for a highly capable AI Infrastructure Engineer to design,
implement, and optimize GPU-accelerated compute environments that power
advanced AI and machine learning workloads. This role is critical in building
and supporting scalable, high-performance infrastructure across data centers
and hybrid cloud platforms, enabling training, fine-tuning, and inference of
modern AI models.
Key Responsibilities:
• Design and deploy AI infrastructure built on multi-GPU clusters using
NVIDIA or AMD platforms.
• Configure GPU environments using CUDA, DGX systems, and the NVIDIA
Kubernetes device plugin (see the first sketch after this list).
• Deploy and manage containerized environments with Docker,
Kubernetes, and Slurm.
• Support and optimize training, fine-tuning, and inference pipelines
for LLMs and other deep learning models.
• Enable distributed training using DDP, FSDP, and ZeRO, with support
for mixed precision (see the second sketch after this list).
• Tune infrastructure to optimize model performance, throughput, and
GPU utilization (see the third sketch after this list).
• Design and operate high-bandwidth, low-latency networks using
InfiniBand and RoCE v2.
• Integrate GPUDirect Storage and optimize data flow across Lustre,
BeeGFS, and Ceph/S3.
• Support fast data ingestion, ETL pipelines, and large-scale data
staging.
• Leverage NVIDIA’s AI stack including cuDNN, NCCL, TensorRT, and
Triton Inference Server.
• Conduct performance benchmarking with MLPerf and custom test
suites.
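
For a concrete sense of the Kubernetes GPU work above, the first sketch below shows one way to request a GPU through the NVIDIA Kubernetes device plugin using the official kubernetes Python client; the pod name, container image, and namespace are illustrative assumptions, not part of this role's specific stack.

# Minimal sketch (assumed names/image): schedule a one-shot pod that requests
# one GPU via the nvidia.com/gpu resource advertised by the device plugin.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),  # hypothetical name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder image
                command=["nvidia-smi"],  # print visible GPUs and exit
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one GPU from the device plugin
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)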
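The second sketch illustrates the distributed-training bullet: a minimal PyTorch DistributedDataParallel step with automatic mixed precision, assuming launch via torchrun; the model, data, and hyperparameters are placeholders.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling keeps fp16 gradients stable

    for _ in range(10):  # placeholder training loop
        inputs = torch.randn(32, 1024, device=local_rank)
        targets = torch.randn(32, 1024, device=local_rank)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():  # run the forward pass in mixed precision
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
        scaler.scale(loss).backward()  # gradients are all-reduced over NCCL here
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<gpus> this_script.py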
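The third sketch, for the utilization-tuning bullet, polls per-GPU utilization and memory through NVML via the nvidia-ml-py bindings (import name pynvml); the poll count and interval are arbitrary choices.

import time
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    for _ in range(5):  # a few polls; a real collector would export metrics instead
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % over last window
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU {i}: sm={util.gpu}% mem_used={mem.used / 2**30:.1f} GiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()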
Required Skills & Qualifications:
• Bachelor’s or Master’s degree in Computer Science, Engineering, or a
related field.
• 3–6 years of experience in AI/ML infrastructure engineering or
high-performance computing (HPC).
• Solid experience with GPU-based systems, container orchestration, and
AI/ML frameworks.
• Familiarity with distributed systems, performance tuning, and
large-scale deployments.
• Expertise in modern GPU architectures (e.g., NVIDIA A100/H100, AMD
MI300), multi-GPU interconnect and memory technologies (NVLink, PCIe,
HBM), and accelerator scheduling for AI training and inference workloads.
• Good understanding of modern AI model architectures, including LLMs
(e.g., GPT, LLaMA), diffusion models, and multimodal encoder-decoder
frameworks, with awareness of their compute and scaling
requirements.
• Knowledge of leading AI/ML frameworks (e.g., TensorFlow, PyTorch),
NVIDIA’s AI stack (CUDA, cuDNN, TensorRT), and open-source tools like
Hugging Face, ONNX, and MLPerf for model development and
benchmarking.
• Familiarity with AI pipelines for supervised/unsupervised training,
fine-tuning (PEFT/LoRA/QLoRA), and batch or real-time inference, with
expertise in distributed training, checkpointing, gradient strategies,
and mixed-precision optimization (a minimal fine-tuning sketch follows
this list).
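
To make the fine-tuning expectation concrete, a minimal sketch of attaching LoRA adapters to a causal LM with Hugging Face PEFT follows; the checkpoint name, target modules, and hyperparameters are illustrative assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights remain trainable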
Preferred Certifications:
• NVIDIA Certified Professional – Data Center AI
• Certified Kubernetes Administrator (CKA)
• CCNP or CCIE Data Center
• Cloud certification (AWS, Azure, or GCP)