Thesis: Generalist agents could replace the bulk of manual data‑curation work
Data curation remains one of the most time‑consuming steps in building modern AI systems. The hypothesis driving the latest research is simple: if a single, general‑purpose coding agent can run the same scripts a data engineer writes, it could close the loop that today requires repeated human intervention.
Evidence from the field
A paper posted to arXiv’s cs.AI category introduces Curation‑Bench, an agent‑centric benchmark designed to test that hypothesis. The benchmark fixes the model, the training recipe, and the evaluation suite, then gives agents a command‑line interface through which they propose, implement, evaluate, and revise data policies. Because everything except the agent’s actions is held constant, the test isolates how well a generalist coding agent copes with the noisy feedback that normally drives iterative data‑policy changes [1].
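The propose, evaluate, and revise cycle described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Curation‑Bench’s actual interface: `DataPolicy`, `noisy_evaluate`, and `curation_loop` are hypothetical names, the policy is reduced to a single quality threshold, and the "fixed model and recipe" is replaced by an arbitrary score function with Gaussian noise added.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPolicy:
    """Hypothetical data policy: one quality threshold used to filter examples."""
    min_quality: float


def noisy_evaluate(policy: DataPolicy, rng: random.Random, noise: float = 0.02) -> float:
    """Stand-in for the fixed train-and-evaluate step.

    Assumption: the true score peaks at min_quality = 0.6 (an arbitrary choice),
    and each measurement is perturbed by Gaussian noise.
    """
    true_score = 1.0 - (policy.min_quality - 0.6) ** 2
    return true_score + rng.gauss(0.0, noise)


def mean_score(policy: DataPolicy, rng: random.Random, repeats: int) -> float:
    """Average several noisy measurements to reduce the variance of the estimate."""
    return sum(noisy_evaluate(policy, rng) for _ in range(repeats)) / repeats


def curation_loop(rounds: int = 30, repeats: int = 5, seed: int = 0):
    """Greedy propose/evaluate/revise loop with everything but the policy held fixed."""
    rng = random.Random(seed)
    best = DataPolicy(min_quality=0.1)
    best_score = mean_score(best, rng, repeats)
    for _ in range(rounds):
        # Propose: perturb the current best policy, clamped to [0, 1].
        step = rng.uniform(-0.1, 0.1)
        candidate = DataPolicy(min(1.0, max(0.0, best.min_quality + step)))
        # Evaluate: measure the candidate under the same noisy evaluation.
        score = mean_score(candidate, rng, repeats)
        # Revise: keep the candidate only if its measured score is higher.
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    policy, score = curation_loop()
    print(f"selected min_quality={policy.min_quality:.2f}, measured score={score:.3f}")
```

Averaging several evaluations per candidate is one simple way to keep noise from driving the decision. Because the loop keeps whichever measured score is highest, the reported score can still be optimistic, which is the kind of pitfall a benchmark with noisy feedback is meant to expose.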
The paper’s abstract makes clear that the authors view the data‑curation loop as a series of repeatable steps that a capable agent could automate. To our knowledge, no earlier public benchmark treats data curation as an agent problem, which would make Curation‑Bench the first concrete yardstick for measuring progress.
Context: Agents are already moving beyond research labs
Enterprise adoption of AI agents is accelerating. OpenAI’s blog details how Endava has woven ChatGPT Enterprise and Codex into its software‑delivery pipelines, automating routine workflows and fostering an “AI‑native” culture across the company [2]. The case study shows that teams already trust agents with code generation, test orchestration, and continuous‑integration tasks, activities that share many characteristics with data‑curation scripts.
On the consumer side, TechCrunch reports that Poke, a startup that enables AI agents via simple text messages, became the first AI agent approved for Apple’s Messages for Business platform [3]. While Poke targets conversational interactions, its approval signals that platform owners are ready to vet and surface third‑party agents to end users, expanding the ecosystem in which agents can operate.
From a hardware and research perspective, NVIDIA announced breakthroughs in robot grasping, autonomous‑driving perception, and large‑scale agent training. Their work suggests that the compute infrastructure needed to train and run sophisticated agents at scale is becoming more accessible [4]. The same scaling techniques that power robot grasping might be repurposed to train agents that sift through billions of training examples, although that transfer has yet to be demonstrated.
Counter‑arguments: Limits of current agents
Curation‑Bench still assumes a fixed model and training recipe. Real‑world projects often switch architectures, add new data sources, or change loss functions on the fly. A generalist agent that only knows how to edit scripts for a single pipeline may struggle when the underlying stack shifts.
Another practical hurdle is the quality of feedback. The benchmark’s “noisy benchmark feedback” mirrors the imperfect signals engineers see when evaluating data quality, but it does not capture the business‑level judgment calls—privacy concerns, regulatory compliance, or domain‑specific bias—that still demand human insight.
Enterprise rollouts like Endava’s illustrate that agents are valuable helpers, yet the blog emphasizes a cultural shift toward AI‑native practices rather than a full replacement of engineers. The narrative suggests that agents augment, not eliminate, human expertise.
Prediction: Incremental automation leading to near‑full autonomy
Given the evidence, the most plausible path is a staged takeover. In the next 12‑18 months, generalist coding agents will likely automate the repetitive parts of the data‑curation loop—running cleaning scripts, generating policy diffs, and submitting evaluation jobs—while humans verify edge cases.
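One way to picture this split between automated steps and human verification is a triage rule over agent‑generated policy diffs. The sketch below is purely illustrative: `PolicyDiff`, `needs_human_review`, `triage`, and the 5% threshold are hypothetical choices, not something drawn from Curation‑Bench or any vendor’s tooling.

```python
from dataclasses import dataclass


@dataclass
class PolicyDiff:
    """Hypothetical summary of a data-policy change proposed by an agent."""
    name: str
    rows_before: int
    rows_after: int
    touches_sensitive_fields: bool = False


def needs_human_review(diff: PolicyDiff, max_change_fraction: float = 0.05) -> bool:
    """Flag edge cases: sensitive fields, empty inputs, or large row-count changes."""
    if diff.touches_sensitive_fields:
        return True
    if diff.rows_before == 0:
        return True
    changed = abs(diff.rows_before - diff.rows_after) / diff.rows_before
    return changed > max_change_fraction


def triage(diffs, max_change_fraction: float = 0.05):
    """Split diffs into those applied automatically and those routed to a person."""
    auto, review = [], []
    for diff in diffs:
        bucket = review if needs_human_review(diff, max_change_fraction) else auto
        bucket.append(diff.name)
    return auto, review


if __name__ == "__main__":
    proposed = [
        PolicyDiff("dedupe", rows_before=1000, rows_after=990),
        PolicyDiff("lang_filter", rows_before=1000, rows_after=600),
        PolicyDiff("pii_scrub", rows_before=1000, rows_after=1000,
                   touches_sensitive_fields=True),
    ]
    auto, review = triage(proposed)
    print("auto-apply:", auto)
    print("human review:", review)
```

In this toy setup the small deduplication change is applied automatically, while the large language filter and the change touching sensitive fields are held for a person to check. A real deployment would need richer signals than row counts.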
As benchmarks like Curation‑Bench mature and as platforms such as Apple’s Messages for Business continue to certify third‑party agents, the market will reward tools that can plug into existing pipelines with minimal friction. The scaling tricks showcased by NVIDIA will lower the cost of training agents that understand larger, more varied data‑policy spaces.
Eventually, when agents can reason about policy trade‑offs, negotiate with stakeholders, and adapt to shifting model architectures, the data‑curation workflow could become almost entirely self‑running. Until that point, the human‑in‑the‑loop will remain a safety valve, especially for high‑stakes domains.