AI Computing Infrastructure Engineer – GPU & High-Performance Computing

Role Overview:

We are looking for a highly capable AI Infrastructure Engineer to design, implement, and optimize GPU-accelerated compute environments that power advanced AI and machine learning workloads. This role is critical in building and supporting scalable, high-performance infrastructure across data centers and hybrid cloud platforms, enabling training, fine-tuning, and inference of modern AI models.

Key Responsibilities:

• Design and deploy AI infrastructure with multi-GPU clusters on NVIDIA or AMD platforms.

• Configure GPU environments using CUDA, DGX systems, and the NVIDIA Kubernetes device plugin (see the pod-spec sketch after this list).

• Deploy and manage containerized environments with Docker, Kubernetes, and Slurm.

• Support and optimize AI models across training, fine-tuning, and inference pipelines for LLMs and deep learning models.

• Enable distributed training using DDP, FSDP, and ZeRO, with support for mixed precision (a minimal DDP sketch follows this list).

• Tune infrastructure to optimize model performance, throughput, and GPU utilization.

• Design and operate high-bandwidth, low-latency networks using InfiniBand and RoCE v2.

• Integrate GPUDirect Storage and optimize data flow across Lustre, BeeGFS, and Ceph/S3.

• Support fast data ingestion, ETL pipelines, and large-scale data staging.

• Leverage NVIDIA’s AI stack, including cuDNN, NCCL, TensorRT, and Triton Inference Server.

• Conduct performance benchmarking with MLPerf and custom test suites (a CUDA-event timing sketch follows this list).
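
The Kubernetes device-plugin item above is easiest to show concretely. Below is a minimal sketch, using the official Kubernetes Python client, of a pod that requests one GPU through the NVIDIA device plugin; the image, pod name, and namespace are illustrative placeholders, not values from this posting, and the sketch assumes the device plugin is already deployed on the cluster.

```python
# Minimal sketch: requesting a GPU via the NVIDIA Kubernetes device plugin.
# The image, names, and namespace below are hypothetical placeholders.
from kubernetes import client, config

def make_gpu_pod() -> client.V1Pod:
    container = client.V1Container(
        name="cuda-smoke-test",  # hypothetical container name
        image="nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",  # illustrative tag
        command=["nvidia-smi"],
        resources=client.V1ResourceRequirements(
            # The device plugin exposes GPUs as the extended resource
            # "nvidia.com/gpu"; only whole GPUs can be requested this way.
            limits={"nvidia.com/gpu": "1"},
        ),
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )

if __name__ == "__main__":
    config.load_kube_config()  # uses the local kubeconfig
    client.CoreV1Api().create_namespaced_pod(
        namespace="default", body=make_gpu_pod()
    )
```

The same pod could of course be written as plain YAML; the Python client is used here only to keep all examples in one language.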
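For the distributed-training item, here is a minimal sketch of single-node, multi-GPU training with PyTorch DDP over NCCL plus mixed precision. The model, data, and hyperparameters are stand-ins, not a prescribed setup.

```python
# Minimal sketch: single-node multi-GPU training with PyTorch DDP and
# mixed precision. Model, data, and step count are placeholders.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")  # NCCL for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16 stability

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)  # stand-in batch
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():  # run forward pass in mixed precision
            loss = model(x).float().pow(2).mean()
        scaler.scale(loss).backward()  # DDP all-reduces gradients here
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

FSDP or ZeRO would replace the DDP wrapper with a sharded one; the launch and process-group setup stay essentially the same.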
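And for the benchmarking item, a small sketch of measuring average forward-pass latency with CUDA events. Real benchmarking for this role would run MLPerf or workload-specific suites as the bullet says, so the layer and batch sizes here are arbitrary.

```python
# Minimal sketch: timing a forward pass with CUDA events.
# Layer and batch sizes are arbitrary stand-ins.
import torch

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    for _ in range(10):  # warm-up iterations (allocator, autotuning)
        model(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(100):
        model(x)
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish

print(f"avg forward latency: {start.elapsed_time(end) / 100:.3f} ms")
```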

Required Skills & Qualifications:

• Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.

• 3–6 years of experience in AI/ML infrastructure engineering or high-performance computing (HPC).

• Solid experience with GPU-based systems, container orchestration, and AI/ML frameworks.

• Familiarity with distributed systems, performance tuning, and large-scale deployments.

• Expertise in modern GPU architectures (e.g., NVIDIA A100/H100, AMD MI300), multi-GPU interconnect and memory technologies (NVLink, PCIe, HBM), and accelerator scheduling for AI training and inference workloads.

• Good understanding of modern AI model architectures, including LLMs (e.g., GPT, LLaMA), diffusion models, and multimodal encoder-decoder frameworks, with awareness of their compute and scaling requirements.

• Knowledge of leading AI/ML frameworks (e.g., TensorFlow, PyTorch), NVIDIA’s AI stack (CUDA, cuDNN, TensorRT), and open-source tools like Hugging Face, ONNX, and MLPerf for model development and benchmarking.

• Familiarity with AI pipelines for supervised/unsupervised training, fine-tuning (PEFT/LoRA/QLoRA), and batch or real-time inference, with expertise in distributed training, checkpointing, gradient strategies, and mixed-precision optimization (a LoRA sketch follows this list).
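
As a concrete reference for the fine-tuning item above, here is a minimal sketch of attaching LoRA adapters to a causal LM with the Hugging Face peft library. The base model ("gpt2") and all hyperparameters are illustrative stand-ins.

```python
# Minimal sketch: wrapping a causal LM with LoRA adapters via Hugging Face
# peft. Model name and hyperparameters are illustrative stand-ins.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=8,                        # low-rank dimension of the adapter matrices
    lora_alpha=16,              # scaling factor applied to the adapter output
    target_modules=["c_attn"],  # GPT-2's fused attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

QLoRA would differ mainly in loading the base weights quantized (e.g., 4-bit) before attaching the adapters; the adapter configuration itself is the same.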

Preferred Certifications:

• NVIDIA Certified Professional – Data Center AI

• Certified Kubernetes Administrator (CKA)

• CCNP or CCIE Data Center

• Cloud Certification (AWS, Azure, or GCP)
