Benchmarkthing Memo - Xiangyi Li and Moritz Wallawitsch
Intro
Benchmarkthing begins as an eval and workflow platform with a standardized cloud environment for execution. Our goal is to become a consumer-facing agent cloud for orchestrated AI workflows, enabling both developers and non-technical users to execute tasks easily.
The path from an eval platform (A) to an agent cloud (B) may seem abrupt, but it's a natural progression. Below, we explain why we start from A and how we transition to B.
Key Insights (Lemmas)
AI will make traditional SaaS a thing of the past. Models become the product, and people will hire outcome-generating agents instead of buying tools.
Generic, NPC-like AI won't suffice. The future lies in customizable, domain-specific agents tailored to user needs. People want chef-cooked risottos rather than off-the-shelf frozen meals.
Agents are like code. Every person will own several agents because we all use things slightly differently, and with an agent orchestration cloud everyone will be able to modify their agents. At roughly ten agents per person, the 340 million people in the US alone would account for at least 3.4 billion agents.
But before we get there, we need AI evals to establish a foothold in both LLM systems and AI infra. Today's agent marketplaces (Coze AI, LlamaIndex) will die before the "agents are the future" vision becomes reality.
Product Strategy
Step 1: Host and execute evals and workflows. We start with an eval platform to address the urgent demand for AI evaluation tools. It lays the foundation for workflow standardization and infra building, and it gets us to PMF before we die building something five years too early.
Step 2: Orchestrate workflows and evals for developers. Next, we add a master AI node that identifies and executes workflows based on user input, creating a seamless experience. Users simply describe a task, and our platform finds the right workflow, iterates on it, and executes it.
Step 3: Distribute agents to everyone. Once workflows are reliable (98% success rate), we launch a consumer-facing agent app store. Users will have personalized agents to perform complex tasks, such as planning a multi-city trip by integrating multiple data sources. Agents will communicate like colleagues, delivering solutions beyond traditional software capabilities.
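To make Steps 2 and 3 concrete, here is a minimal sketch of how a master node could route a task description to a registered workflow and track whether a workflow has reached the 98% success rate. Everything named here (Workflow, MasterNode, the keyword matching, the min_runs parameter) is a hypothetical illustration, not our actual design; in particular, plain keyword overlap stands in for whatever model-based routing the real master node would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Workflow:
    """A hypothetical registered workflow with a running success record."""
    name: str
    keywords: Set[str]
    run: Callable[[str], str]
    attempts: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


class MasterNode:
    """Hypothetical master node: finds and executes a workflow for a task."""

    def __init__(self, threshold: float = 0.98, min_runs: int = 50):
        self.workflows: List[Workflow] = []
        self.threshold = threshold  # the 98% reliability bar from Step 3
        self.min_runs = min_runs    # assumed minimum sample size, not from the memo

    def register(self, workflow: Workflow) -> None:
        self.workflows.append(workflow)

    def find(self, task: str) -> Optional[Workflow]:
        # Keyword overlap is a stand-in for model-based routing.
        words = set(task.lower().split())
        best = max(self.workflows,
                   key=lambda wf: len(wf.keywords & words),
                   default=None)
        if best is None or not (best.keywords & words):
            return None
        return best

    def execute(self, task: str) -> str:
        workflow = self.find(task)
        if workflow is None:
            raise LookupError(f"no workflow matches task: {task!r}")
        workflow.attempts += 1
        result = workflow.run(task)  # an exception here counts as a failure
        workflow.successes += 1
        return result

    def consumer_ready(self) -> List[str]:
        # Workflows reliable enough to ship to the consumer app store.
        return [wf.name for wf in self.workflows
                if wf.attempts >= self.min_runs
                and wf.success_rate >= self.threshold]


node = MasterNode(min_runs=1)
node.register(Workflow("trip_planner", {"trip", "flight", "hotel"},
                       lambda task: "itinerary for: " + task))
node.register(Workflow("eval_runner", {"eval", "benchmark"},
                       lambda task: "eval report for: " + task))
report = node.execute("run a benchmark eval")
```

In this sketch a workflow only becomes eligible for consumer distribution after it has both enough runs and a success rate at or above the threshold, which mirrors the gate between Step 2 and Step 3.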