At STN, we don't just adapt to the digital future; we engineer it. Our mission is to help organizations thrive in a rapidly evolving technology landscape through strategic insight, cutting-edge solutions, and a security-first mindset. We provide end-to-end services spanning cloud consulting, AI infrastructure, and enterprise security, enabling secure, scalable, and future-ready transformation.

As trusted advisors, we align IT investments with business outcomes that drive performance and growth, starting with deep strategic engagement and delivering tailored solutions built for long-term impact.

Our approach is innovation-led and rooted in cybersecurity, with a focus on leveraging the right technologies to solve real-world challenges. We invest in our people and foster a culture of growth, inclusion, and purpose because we believe empowered teams build transformative technology.

Overview
The AI Infrastructure Engineer will be responsible for designing, deploying, and maintaining robust infrastructure systems tailored for AI and machine learning operations. This role focuses on ensuring seamless performance, scalability, and reliability in distributed computing environments. You'll collaborate with data scientists, ML engineers, and DevOps teams to support large-scale AI training and inference pipelines.

Key Responsibilities

  • Design and implement AI infrastructure solutions, including cluster management, resource allocation, and workload orchestration for high-performance computing (HPC) environments.
  • Deploy, configure, and troubleshoot containerized applications using Kubernetes across various flavors (e.g., vanilla Kubernetes, Amazon EKS, Google GKE, Azure AKS, and on-premises setups).
  • Manage job scheduling and resource management using Slurm for efficient utilization of GPU clusters in AI training workflows.
  • Optimize Ubuntu-based systems for AI workloads, including kernel tuning, security hardening, and performance monitoring.
  • Integrate and maintain NVIDIA GPU technologies, ensuring compatibility with CUDA and AI frameworks like TensorFlow and PyTorch.
  • Monitor system performance, identify bottlenecks, and implement automation scripts for infrastructure provisioning and scaling.
  • Collaborate on disaster recovery planning, security compliance, and cost optimization for cloud and on-premises AI infrastructure.
  • Stay updated on emerging technologies in AI infrastructure and contribute to best practices documentation.

Experience & Qualifications

Required

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • Proven expertise as an Ubuntu specialist, with hands-on experience in system administration, networking, and scripting (e.g., Bash, Python) on Ubuntu servers.
  • Extensive experience with Kubernetes in all major flavors, including cluster setup, scaling, networking (e.g., CNI plugins), and security (e.g., RBAC, Pod Security Policies).
  • Strong proficiency in Slurm for managing HPC clusters, including job submission, queue configuration, and integration with GPU resources.
  • 3+ years of experience in infrastructure engineering, preferably in AI/ML or HPC environments.
  • Familiarity with cloud platforms (AWS, GCP, Azure) and container orchestration tools.
  • Excellent problem-solving skills and ability to work in a fast-paced, collaborative environment.

Preferred

  • NVIDIA certifications (e.g., NVIDIA Certified Professional in Data Center GPU Management or CUDA Programming) are a strong plus.
  • Experience with other HPC schedulers (e.g., PBS, LSF) or AI-specific tools like Kubeflow.
  • Knowledge of infrastructure-as-code tools (e.g., Terraform, Ansible) and CI/CD pipelines.
  • Background in AI model deployment, monitoring tools (e.g., Prometheus, Grafana), or edge computing.

Compensation

  • Full-Time, Exempt
  • Salary: $145K-$195K, DOE

Benefits

  • Health Coverage – Medical, Dental & Vision
  • FSA Health and Dependent Care available
  • 401(k) Plan
  • Unlimited Paid Time Off (PTO)
  • Paid Observed Holidays
  • Cell Phone Allowance
  • Collaborative, growth-driven culture

Apply for this Position

If you’re interested in this role and believe your skills are a good match, we’d love to hear from you. Please complete the application form below and submit your details for consideration.