Software Engineering

ML Infra Engineer

Type

Full-time

Location

Zürich, Switzerland

Department

Software Engineering

Description

About Flexion

At Flexion, we're building the intelligence layer powering the next generation of humanoid robots. Our mission is to accelerate the transition from fragile prototypes to real-world humanoid deployment. We were founded by leading scientists in robot reinforcement learning (ex-Nvidia, ex-ETH Zürich) and are backed by top international VC firms. In just months, we've gone from our first line of code to deploying real humanoid capabilities.

The Role

As an ML Infrastructure Engineer at Flexion, you will help us build out our core compute and data platforms. We're building the brain for humanoid robots, which means training large-scale foundation models on vast amounts of data. You'll design training clusters, architect the pipelines that move data from simulators and robots into model training, and create the tooling that lets our AI engineers train, evaluate, and iterate fast.

You'll join Flexion's experienced Infrastructure team (ex-Google, Meta, Amazon) and take significant ownership of the systems behind our data collection, training, and experimentation workflows: from strategic infrastructure decisions, cluster orchestration, and distributed training optimization to data platforms, CI, and experiment tooling. This is a senior, on-site role at our Zürich office.

Key Responsibilities

  • Build and operate training infrastructure: design, deploy, and maintain GPU compute clusters for large-scale model training across multiple cloud providers, including job scheduling (Slurm, Kubernetes).

  • Architect data platforms and pipelines: build the storage, processing, and serving layers that handle the full data lifecycle, from simulator output and robot telemetry to training datasets. This includes building infrastructure on object storage (S3), parallel filesystems (Lustre), and common data formats (Parquet, WebDataset, LeRobot), and using distributed processing frameworks (Ray, Spark) to transform and validate data at scale.

  • Optimize distributed training: work with our AI engineers to scale workloads across multi-node GPU clusters, profiling and improving throughput, device utilization, and communication efficiency. This includes optimizing our distributed IsaacLab-based sim-to-real training.

  • Evaluate and adopt new platforms: compare cloud providers, GPUaaS platforms, and emerging tooling, owning the decisions on what we adopt as we grow our compute footprint.

Requirements

  • 3+ years of professional experience building and operating infrastructure for large-scale deep learning systems.

  • Hands-on experience training or supporting the training of large models (billions of parameters) in distributed multi-node GPU setups, and deep understanding of the underlying concepts (DDP, FSDP, NCCL).

  • Strong experience with at least one major cloud platform (AWS or GCP), including compute provisioning and networking.

  • Experience with job scheduling and orchestration tools: Slurm, Kubernetes, or both.

  • Experience building data pipelines and managing large-scale storage — including object stores (S3 or equivalent) and familiarity with high-performance or parallel filesystems (e.g., Lustre).

  • Proficiency in Python and working knowledge of PyTorch.

  • Ownership mindset: comfortable making architectural decisions, setting direction, and delivering independently in a fast-moving environment.

Nice to have

  • Experience with distributed data processing frameworks (Ray, Spark).

  • Familiarity with common data formats (Parquet, WebDataset, LeRobot).

  • Experience with additional GPU cloud providers (Lambda Labs, CoreWeave, RunPod, Nebius, or similar).

  • Experience managing on-premise compute infrastructure.

  • Familiarity with robotics simulation environments (IsaacLab, IsaacGym, MuJoCo).

  • Experience with infrastructure-as-code and configuration management (Terraform, Ansible).

  • Familiarity with experiment tracking platforms (Weights & Biases, MLflow).

  • Experience with GPU programming and profiling (CUDA, Nsight).

Benefits

  • Competitive compensation package

  • A front-row seat at one of Europe’s most ambitious robotics companies

  • An energetic, collaborative team with a bias for action

Affolternstrasse 42
8050 Zurich, Switzerland

Shape the Future

Whether you're interested in our product, partnerships, or joining our team, we'd love to hear from you.
