Location: Open to candidates across Latin America & West Africa
Experience Required: 6+ Years
We’re a coding-first research team working as a trusted partner to a frontier AI lab. Our mission is to build high-quality coding tasks, evaluations, datasets, and tooling that directly improve how large language models (LLMs) think, reason, and write code.
This is a hands-on engineering role where precision, correctness, and reproducibility matter. You’ll work on production-grade code, investigate subtle model failures, and design rigorous evaluations that shape next-generation AI systems.
If you enjoy solving non-obvious technical problems, breaking systems to understand them, and working in developer-centric environments—this role is for you.
What You’ll Be Working On
- Writing, reviewing, and debugging production-quality code across multiple languages
- Designing coding, reasoning, and debugging tasks for LLM evaluation
- Analyzing LLM outputs to identify hallucinations, regressions, and failure patterns
- Building reproducible dev environments using Docker and automation tools
- Developing scripts, pipelines, and tools for data generation, scoring, and validation
- Producing structured annotations, judgments, and high-signal datasets
- Running systematic evaluations to improve model reliability and reasoning
- Collaborating closely with engineers, researchers, and quality owners
What We’re Looking For
Must-Have Skills
- Strong hands-on coding experience (professional or research-based) in:
  ◦ Python
  ◦ JavaScript / Node.js / TypeScript
- Experience using LLM coding tools (Cursor, Copilot, CodeWhisperer)
- Solid knowledge of Linux, Bash, and scripting
- Strong experience with Docker, dev containers, and reproducible environments
- Advanced Git skills (branching, diffs, patches, conflict resolution)
- Strong understanding of testing & QA (unit, integration, edge-case testing)
- Ability to reliably overlap with the 8:00 AM – 12:00 PM PT window
Nice to Have
- Experience with dataset creation, annotation, or evaluation pipelines
- Familiarity with benchmarks such as SWE-bench or Terminal-Bench
- Background in QA automation, DevOps, ML systems, or data engineering
- Experience with additional languages (Go, Java, C++, C#, Rust, SQL, R, Dart, etc.)
Who Will Thrive Here
- Engineers who enjoy breaking things and understanding why
- Builders who like designing tasks, running experiments, and debugging deeply
- Detail-oriented developers who catch subtle bugs and model issues
- Engineers who prefer clean, reusable workflows over quick hacks
Why Join Us?
- Work directly on systems that improve state-of-the-art AI models
- Solve unique, non-routine engineering problems
- Collaborate with smart, quality-driven engineers and researchers
- Build tools and datasets that have real impact at scale