
Senior Software Engineer — AI Evaluation & Benchmarks (Python)
Jobgether • US
No Relocation
Posted: May 15, 2026
Job Description
- This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Software Engineer — AI Evaluation & Benchmarks (Python) in the United States. In this highly specialized engineering role, you will help define how frontier AI systems are evaluated on real-world software engineering tasks. You will design and build the benchmarks, datasets, and evaluation pipelines used to measure coding ability across debugging, reasoning, and production-grade development scenarios. This position sits at the intersection of software engineering and AI research, where your work directly influences how next-generation models are trained and improved. You will develop scalable systems to run evaluations across large and complex codebases, analyze model outputs for correctness and edge-case failures, and translate findings into structured improvements in benchmark design. The role requires deep technical rigor, strong Python expertise, and a product-minded approach to experimentation and iteration. You will operate in a fast-moving, remote-first environment focused on innovation, precision, and impact on the future of AI systems.
- Accountabilities:
  - Design and build coding benchmarks that evaluate frontier AI models on real-world software engineering tasks, including debugging, reasoning, and production-level coding challenges.
  - Develop and maintain scalable evaluation pipelines and data infrastructure to support large-scale model testing workflows.
  - Analyze AI-generated code for correctness, robustness, performance issues, and edge-case failures across diverse programming scenarios.
  - Construct structured evaluation environments across large repositories and multi-language codebases to ensure rigorous model assessment.
  - Provide detailed technical feedback on model behavior, failure modes, and performance patterns to improve benchmarking frameworks.
  - Contribute to the design and evolution of evaluation methodologies that define standards for measuring coding capability in AI systems.
  - Collaborate with research and engineering stakeholders to refine benchmarks and integrate findings into iterative model improvement cycles.
  - Ensure evaluation systems are reliable, reproducible, and optimized for scale and accuracy.
- Requirements:
  - 4+ years of professional software engineering experience in high-quality production environments.
  - Expert-level Python development skills with strong emphasis on clean, performant, and well-tested code.
  - Hands-on experience working within large, complex, and production-grade codebases.
  - Proven experience building or contributing to LLM evaluation systems, coding benchmarks, or AI model testing pipelines.
  - Strong understanding of Git workflows, software engineering best practices, and modern development processes.
  - Experience working in high-growth technology companies or top-tier engineering organizations.
  - Excellent analytical and problem-solving skills with strong attention to detail.
  - Strong written communication skills in English with the ability to articulate technical insights clearly.
  - Experience with CI/CD systems and unit testing frameworks is highly valued.
  - Familiarity with additional programming languages such as JavaScript, Go, or C++ is a plus.
  - Background in ML evaluation methodologies, open-source contributions, or security engineering is considered an advantage.
- Benefits:
  - Competitive hourly compensation ranging from $80 to $100 per hour based on experience and location.
  - Fully remote contract opportunity with global flexibility across approved locations.
  - Weekly payments via PayPal or Stripe.
  - Short-term 3-month contract with potential for extension based on performance and project needs.
  - Opportunity to work on cutting-edge AI systems shaping frontier model evaluation standards.
  - High-impact technical role influencing how future AI coding capabilities are measured and improved.
  - Exposure to advanced AI research workflows, benchmarking methodologies, and large-scale evaluation systems.
  - Flexible, project-based engagement within a fast-evolving AI engineering environment.
- How Jobgether works: We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!
- Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
- We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.