Together

Latest active Together jobs

Product • Startup

AI Tech

Total funding: $533.50M

Latest funding: $305M · Feb 2025

Employees: 51-200

Founded: 2022

Together AI is a cloud-based platform that lets developers build open-source generative AI models and contributes to AI research and infrastructure through decentralized services.


Together jobs

AI Developer

DevOps Engineer

Automation Test Engineer

Backend Developer

Data Engineer

Director

Hardware Engineer

Machine Learning Developer

QA Engineer

ReactJS Developer

System Engineer

GPU Cluster Resource Scheduling and Optimization Engineer


San Francisco

5+ years in resource scheduling, distributed systems, or large-scale machine learning infrastructure.
Experience in distributed computing frameworks (e.g., Kubernetes, Slurm, Ray).
Experience in designing and implementing resource allocation algorithms and scheduling frameworks.
Experience with cloud platforms (e.g., AWS, GCP, Azure) and GPU orchestration.
Experience with AI/ML frameworks (e.g., TensorFlow, PyTorch, JAX).
Experience with AI-specific workloads like DDP, sharded training, or reinforcement learning.

Customer Support Engineer


San Francisco

Hybrid

5+ years in a customer-facing technical role with at least 1 year in a support function in AI
Experience in AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
Experience with infrastructure services (e.g., Kubernetes, Slurm)
Experience with infrastructure as code solutions (e.g., Ansible)
Experience with high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
Experience with operating storage systems in HPC environments such as Vast and Weka
Experience with inspecting and resolving network-related errors
Experience with Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
Experience in the installation, configuration, administration, troubleshooting, and securing of compute clusters
Experience in complex technical problem solving and troubleshooting, with a proactive approach to issue resolution

Senior Product Engineer


San Francisco

5+ years of professional software development experience
Experience building web applications using React, Next.js, and TypeScript
Experience operating, configuring and running services at scale
Experience designing technical solutions/systems
Experience with developer tool web applications
Experience with designing data intensive and highly responsive web applications
Experience building web APIs with Node.js, especially with document databases such as MongoDB

QA Engineer


San Francisco

7+ years in QA engineering.
Experience in release management.
Experience in automated testing.
Experience working with SDETs and engineering teams.

LLM Training Frameworks and Optimization Engineer


San Francisco

5+ years in deep learning frameworks, distributed systems, or machine learning infrastructure.
Experience in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA).
Experience with parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism).
Experience with GPU/TPU hardware and deep learning performance optimizations.
Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding).
Experience with training dynamics for large-scale LLMs, including hyperparameter tuning and optimization.

Machine Learning Operations (MLOps) Engineer


San Francisco

5+ years in production-level ML training or inference systems.
Experience in DevOps practices like CI/CD, automation, containerization (Docker), and orchestration (Kubernetes).
Proficiency in cloud platforms like AWS, Google Cloud, or Azure.
Expertise in programming (Python, Go, etc.) and frameworks for ML (TensorFlow, PyTorch, scikit-learn).

Senior AI Infrastructure Engineer


San Francisco

Remote

5+ years in professional software development.
Experience in infrastructure-as-code.
Experience in Terraform.
Experience in Ansible.
Experience in Kubernetes.
Experience in VMs.
Experience in Bare Metal Compute.
Experience in Edge Deployments.
Experience in AI workloads.
Experience in Blockchain based protocols.
Experience in GPU programming.
Experience in NCCL.
Experience in CUDA.
Experience in PyTorch.
Experience in TensorFlow.
Experience in High Performance or Distributed Cloud Microservices Architectures.
Experience in AWS.
Experience in Azure.
Experience in GCP.

LLM Training Resilience Engineer


San Francisco

5+ years in distributed systems, cloud infrastructure, or large-scale machine learning training.
Experience in distributed computing frameworks (e.g., PyTorch DDP, TensorFlow, Horovod).
Experience with resilience strategies in distributed systems (e.g., leader election, consensus, retry mechanisms).
Experience with observability tools (e.g., Prometheus, Grafana, ELK stack).
Experience with cloud platforms (e.g., AWS, GCP, Azure) and Kubernetes for workload orchestration.

Machine Learning Engineer


San Francisco

5+ years in writing high-performance, well-tested, production quality code
Experience in LLM inference ecosystem, including frameworks and engines
Experience in building large scale, fault tolerant, distributed systems
Expert level programmer in one or more of Python, Go, Rust, or C/C++
Experience in implementing runtime inference services at scale

Systems Research Engineer, GPU Programming


San Francisco

Remote

Experience in GPU programming and parallel computing, such as CUDA and/or Triton.
Knowledge of ML/AI applications and models.
Knowledge of performance profiling and optimization tools for GPU programming.

Senior Backend Engineer - Commerce


San Francisco

5+ years in building large scale, fault tolerant, distributed systems and API microservices
Experience in designing, analyzing and improving efficiency, scalability, and stability of various system resources
Expert-level programmer in one or more of Golang, Rust, Python, Java, or TypeScript
Proficiency in writing and maintaining infrastructure as code (IaC) using tools like Terraform, AWS CDK, or Pulumi
Proficiency in version control practices and integrating IaC with CI/CD pipelines
Experience with payment processors (e.g., Stripe) and billing systems a plus
Experience with Kubernetes or containers a plus
Experience building and operating data infrastructure (Kinesis, Airflow, Kafka, etc.) a plus

Senior Software Development Engineer in Test


San Francisco

5+ years of industry experience.
Proven experience as an SDET or similar role.
Experience in Cypress, REST API testing, or k6.
Experience with CI/CD, Argo CD, or GitHub Actions.
Experience testing sites running on AWS and EKS.

LLM Training Dataset and Checkpoint Optimization Engineer


San Francisco

5+ years in data engineering, distributed systems, or ML infrastructure.
Experience in high-performance data processing libraries (e.g., PyTorch DataLoader, TensorFlow Data, DALI).
Experience in distributed storage systems and data formats (e.g., Parquet, HDF5).
Experience in checkpointing frameworks and file systems (e.g., POSIX, Lustre, GPFS).
Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).
Experience with compression and serialization for large datasets and checkpoints.
Experience with ML frameworks (e.g., PyTorch, TensorFlow, JAX) and distributed training.
Experience with hardware accelerators (e.g., GPUs, TPUs) and storage optimizations.
Experience with incremental and real-time checkpointing solutions.

Events Coordinator, Developer/AI Events (Contract)


San Francisco

2+ years in event coordination, marketing, or a related field
Experience managing or supporting tech events, hackathons, webinars, or developer-focused programs
Experience with event registration platforms and tools, including managing attendee lists and troubleshooting registration issues
Experience communicating with guests, including drafting event invitations, coordinating RSVPs, and providing clear, timely updates to attendees