Featherless AI

Machine Learning Engineer — Training Optimization

Posted 12 Hours Ago
In-Office or Remote
Hiring Remotely in World Golf Village, FL
Mid level
The ML Engineer will optimize large-scale model training pipelines, improve distributed training strategies, build robust infrastructure, and collaborate on training techniques and performance metrics.
About the Role

We’re looking for an ML Engineer focused on training optimization to help us scale and improve large-scale model training. You’ll work at the intersection of research and production, optimizing training pipelines for speed, stability, and cost—while collaborating closely with researchers pushing model architecture and capability forward.

This is a high-impact role with real ownership: your work directly affects how fast we can iterate, how large we can scale, and how efficiently we deploy new models.

What You’ll Do
  • Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)

  • Improve distributed training strategies (data, model, and pipeline parallelism)

  • Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)

  • Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements

  • Collaborate with researchers on architecture-aware training strategies

  • Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)

  • Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)

  • Own training performance metrics and continuously push them forward

What We’re Looking For
  • Strong experience training large neural networks (LLMs or similarly large models)

  • Hands-on experience with training optimization (not just model usage)

  • Solid understanding of:

    • Backpropagation, optimization algorithms, and training dynamics

    • Distributed systems for ML training

  • Experience with PyTorch (required)

  • Comfort working close to hardware (GPUs, memory, networking constraints)

  • Ability to move fluidly between research ideas and production-ready code

Nice to Have
  • Experience with large-scale distributed training (multi-node, multi-GPU)

  • Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks

  • Experience optimizing training on AMD or NVIDIA GPUs

  • Contributions to open-source ML infrastructure or research codebases

  • Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)

Why Join Us
  • Real ownership at Series-A stage — your work shapes the company’s trajectory

  • Work on cutting-edge models and training systems at scale

  • Small, highly technical team with fast feedback loops

  • Strong emphasis on engineering quality and research rigor

  • Competitive compensation + meaningful equity

Top Skills

PyTorch
