From Annotation to Core Infrastructure: How Human-in-the-Loop is Evolving in the Generative AI Era


As AI moves from prototype to production, one constraint consistently defines its limits: the quality of human input. The next generation of AI depends not only on compute scaling and larger models but also on the systematic engineering of human intelligence.

While the industry has historically treated data annotation as a manual, ad-hoc process, human-in-the-loop (HITL) workflows are evolving into scalable, auditable systems integrated directly into AI architectures. The business imperative is clear: the global HITL AI market is expected to grow to $16.4 billion by 2030. As Sergey Polyashov, COO of Toloka, notes, “Human expertise must be treated as a system component – measurable, reproducible, and continuously optimised.”

One of the most notable indicators of this paradigm shift is the increasing investor confidence in human-in-the-loop infrastructure. In May 2025, Bezos Expeditions led a $72 million funding round in Toloka, a company specialising in training and evaluating AI systems through distributed human expertise. Backed by Nebius Group and working with clients such as Amazon, Microsoft, and Anthropic, Toloka’s growth indicates a broader industry realisation that scaling AI is no longer just a compute problem but rather a human systems problem; one which requires structured, high-quality human input embedded directly into the model lifecycle.

The Infrastructure Change

Traditionally, human input in data labelling was considered to sit outside the AI pipeline, something added after the pipeline had already been built. That is no longer true. A new generation of platforms, including Scale AI, Snorkel, and Toloka, has changed how the industry thinks about this layer, treating these tools as key components of the overall architecture.

There are several reasons why the role of human input is changing. One is that building AI systems now involves a much broader set of requirements. For example, supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), two widely used training techniques that are often applied together, require an underlying structure that can handle a wide variety of data types: text, audio, video, action data, or code. To do this at scale, according to Mr Polyashov, a multi-tiered architecture is needed, one that can supply human input at whatever stage of the AI development cycle requires it, whether that is developing foundation models or processing large amounts of complex physical data, such as egocentric video, for robots or wearable devices.

The Cost of Poor Data and the Automated Quality Loop

The major bottleneck in AI development is not how well the models perform but how consistent the human-generated data is. The cost to organisations is high: according to Gartner, companies lose around $12.9 million per year on average due to poor data quality. Furthermore, data teams spend 50-60% of their time cleaning and repairing datasets rather than building models.

Toloka sees organisations moving from isolated, manual workflows to system-level processes. The company currently has three main service offerings. Its managed services provide end-to-end management of complex data workflows for large enterprises such as Anthropic and Microsoft. The Toloka Platform lets teams build and manage their own data projects, with an AI agent helping to configure pipelines, generate guidelines, and structure tasks.

Toloka also now provides Tendem, a hybrid AI+human system designed to orchestrate the full life cycle of data work, including task decomposition, expert matching, and multi-layered validation. Tendem departs from traditional marketplaces and AI-only tools by combining human judgement with AI in a single feedback loop. Toloka’s internal benchmarks across a range of complex real-world tasks indicate that Tendem outperforms both purely AI-based systems and purely human workflows on output quality and productivity.

Quality control within these systems is no longer a discrete downstream step. Instead, it is built into the workflow through automated validation layers in which AI helps evaluate output against defined criteria in real time. Rather than relying solely on LLMs acting as judges, these continuous quality loops check the accuracy and consistency of data before it enters training pipelines. They catch and correct errors, enforce consistent output, and keep low-quality data from ever entering a training pipeline.
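
As a rough illustration of what an in-workflow validation layer could look like, the Python sketch below runs every annotation through a list of checks before it is allowed into a training set. It is a minimal sketch under stated assumptions, not a description of Toloka's implementation: the `Annotation` fields, the allowed label set, the 0.7 confidence threshold, and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Annotation:
    # Hypothetical record shape; real pipelines carry far more metadata.
    item_id: str
    label: str
    annotator_confidence: float


# A check takes one annotation and returns True when it passes.
Check = Callable[[Annotation], bool]

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # assumed label schema


def label_is_allowed(a: Annotation) -> bool:
    return a.label in ALLOWED_LABELS


def confidence_is_sufficient(a: Annotation) -> bool:
    return a.annotator_confidence >= 0.7  # assumed threshold


def quality_gate(
    annotations: List[Annotation], checks: List[Check]
) -> Tuple[List[Annotation], List[Tuple[Annotation, List[str]]]]:
    """Split annotations into accepted ones and rejected ones with reasons."""
    accepted, rejected = [], []
    for a in annotations:
        failed = [c.__name__ for c in checks if not c(a)]
        if failed:
            rejected.append((a, failed))  # sent back for correction
        else:
            accepted.append(a)  # safe to pass to the training pipeline
    return accepted, rejected


batch = [
    Annotation("a1", "positive", 0.95),
    Annotation("a2", "postive", 0.90),   # typo in label
    Annotation("a3", "neutral", 0.40),   # low confidence
]
accepted, rejected = quality_gate(batch, [label_is_allowed, confidence_is_sufficient])
```

Because rejected items carry the names of the checks they failed, they can be routed back to an annotator for correction instead of being silently dropped.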

Balancing Scale and Domain Expertise

The “scale vs quality” dilemma of balancing crowdsourcing and domain expertise has existed in HITL systems since their inception. Crowdsourcing provides large volumes but poor consistency, whereas an expert panel provides a high level of accuracy but low levels of throughput.

To balance these two extremes, the industry is moving toward multi-tiered talent architectures. According to Mr Polyashov, a hierarchical structure is now necessary for today’s workloads. For example, Toloka draws on more than 20,000 verified experts, a pool that includes people with postgraduate education alongside AI tutors and general annotators. Tasks are assigned through adaptive routing, which uses each worker’s past performance to direct work to the most suitable tier. As a result, technology companies (such as Google and Microsoft) and research organisations (such as Anthropic and Hugging Face) can get both high speed and verified accuracy on specialised tasks.
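
The routing idea can be sketched in a few lines of Python. This is an illustrative assumption about how performance-based routing might work, not Toloka's algorithm: the tier names, the `accuracy` field, the 0.9 threshold, and the preference for the lowest sufficient tier (to conserve scarce experts) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical tiers, ordered from most available to most specialised.
TIERS = ["general", "tutor", "expert"]


@dataclass
class Worker:
    worker_id: str
    tier: str
    accuracy: float  # assumed rolling accuracy on tasks with known answers


def route_task(
    required_tier: str, workers: List[Worker], min_accuracy: float = 0.9
) -> Optional[Worker]:
    """Pick a worker at or above the required tier with a good track record.

    Prefers the lowest sufficient tier, then the highest accuracy within it.
    Returns None when nobody qualifies.
    """
    required_rank = TIERS.index(required_tier)
    eligible = [
        w
        for w in workers
        if TIERS.index(w.tier) >= required_rank and w.accuracy >= min_accuracy
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda w: (TIERS.index(w.tier), -w.accuracy))


pool = [
    Worker("w1", "general", 0.95),
    Worker("w2", "tutor", 0.92),
    Worker("w3", "expert", 0.99),
    Worker("w4", "general", 0.80),  # below the accuracy threshold
]
```

With this pool, `route_task("general", pool)` selects `w1` and `route_task("expert", pool)` selects `w3`; raising `min_accuracy` above every worker's score returns `None`, which a real system would treat as a signal to queue the task or recruit.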

Humans as a Runtime Component: Inside the MCP Architecture

Perhaps the most significant design change is the shift from using people as a backup to AI to treating them as a native part of a single runtime that executes both AI and human intelligence (IHM). Autonomous AI systems may run into vague objectives, unclear requirements, or ambiguous scenarios that require domain knowledge. Many systems have attempted to address these limitations through escalation processes; however, this approach has some major flaws. Tendem, built by Toloka, is neither a safety net nor simply an exception-handling mechanism. It is an integrated runtime environment that combines AI capabilities and human expertise to perform complex tasks from start to finish, including research, analysis, and content generation.

In essence, Tendem is a project management platform in which an AI “project manager” decomposes large objectives into smaller sub-tasks, routes each sub-task to the best available AI capability or human expert, and integrates the outputs into a cohesive final product. Execution follows a standard plan-act-observe-verify cycle, but this cycle is an integral part of Tendem’s own agent architecture, not an outside system calling for human escalation.
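
A plan-act-observe-verify cycle of this kind can be sketched generically in Python. The sketch below is not Tendem's code and makes no claim about its internals: the `Step` type, the `needs_judgement` flag, and the stub planner, workers, and verifier are hypothetical stand-ins that only show how human and AI steps can share one execution loop.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    needs_judgement: bool  # assumed flag set by the planner, not by a failure


def execute(
    objective: str,
    plan: Callable[[str], List[Step]],
    ai_worker: Callable[[str], str],
    human_worker: Callable[[str], str],
    verify: Callable[[Step, str], bool],
) -> str:
    """Run one objective through a plan-act-observe-verify loop."""
    results = []
    for step in plan(objective):                       # plan
        worker = human_worker if step.needs_judgement else ai_worker
        output = worker(step.description)              # act, then observe output
        if not verify(step, output):                   # verify before moving on
            raise ValueError(f"step failed verification: {step.description}")
        results.append(output)
    return " | ".join(results)                         # integrate the outputs


# Stub components so the sketch runs end to end.
def demo_plan(objective: str) -> List[Step]:
    return [
        Step(f"gather sources for {objective}", needs_judgement=False),
        Step(f"judge source credibility for {objective}", needs_judgement=True),
    ]


def demo_ai(description: str) -> str:
    return f"AI:{description}"


def demo_human(description: str) -> str:
    return f"HUMAN:{description}"


def demo_verify(step: Step, output: str) -> bool:
    return bool(output.strip())


report = execute("market report", demo_plan, demo_ai, demo_human, demo_verify)
```

The point of the design is that the human step is chosen at planning time, as an ordinary branch of the loop, instead of being triggered only after an AI step has failed.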

In contrast to many other hybrid intelligence platforms, human involvement within Tendem is not reserved for exceptions. It is an essential component of the execution workflow and is introduced at each point that requires judgement, context, or validation. Verification is also built into the execution process so that every output meets the required standard before the workflow proceeds.

Tendem does let users build interfaces through which external systems can request human assistance on demand via the Model Context Protocol (MCP), but this is only one way of interfacing with Tendem, not its entire purpose. As noted above, Tendem’s key differentiator is that it treats hybrid intelligence as the foundation for executing workflows, not as an exception pathway.
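
For readers unfamiliar with MCP, a client invokes a server-side tool by sending a JSON-RPC 2.0 message with the method `tools/call`. The Python sketch below builds such a message for a human-assistance request. The envelope follows the MCP specification, but the tool name `request_human_review` and its `question` and `context` arguments are hypothetical and are not taken from Tendem's documentation.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per session


def build_human_review_request(question: str, context: str) -> dict:
    """Build an MCP tools/call request asking a human to weigh in."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {
            "name": "request_human_review",  # hypothetical tool name
            "arguments": {"question": question, "context": context},
        },
    }


request = build_human_review_request(
    question="Is this contract clause enforceable?",
    context="Clause 4.2 of the draft agreement",
)
payload = json.dumps(request)  # what would be sent over the MCP transport
```

The response to such a call arrives only once a person has answered, so callers would typically need to treat it as a long-running operation.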

Demonstrated Impact and Scientific Validation

The transition to infrastructure-level HITL is showing tangible results in throughput and efficiency. According to Mr Polyashov, recent deployments illustrate the scale of this shift. In a large-scale product variant attribute annotation project, he notes that Toloka mobilised over 3,300 annotators to process thousands of items in just 28 hours.

At the highly specialised end, Mr Polyashov highlights adversarial red-teaming exercises. During a recent stress test of the OpenClaw AI agent, a deployed team of cybersecurity experts didn’t just review text; they actively attempted to manipulate the agent into executing unauthorised API calls, extracting sensitive data, and bypassing system access controls. Within a single working shift, the experts successfully identified and documented over a dozen complex attack vectors.

Beyond commercial throughput, the engineering rigour underpinning these new HITL frameworks is gaining traction in the broader scientific community. These architectural shifts are already being validated and presented at top-tier AI conferences, including NeurIPS, ICML, and VLDB.

Ongoing Challenges and a Systems-Level View

The transition is, however, accompanied by obstacles. Scaling genuine domain expertise (for example, PhD-level mathematicians or experienced attorneys) remains difficult because the number of qualified professionals in each discipline is by definition limited. And although MCP has proven itself to be a high-security way to escalate dynamically in real time, introducing a human in the loop at run time adds latency, which is a significant obstacle for very fast, consumer-orientated AI products.

Although many challenges remain, the path laid out by major industry players is changing how human input is understood. Mr Polyashov’s broader argument reflects an emerging consensus: rather than something unpredictable to be managed, human intelligence should be viewed as a programmable, measurable resource, treated like any other software component.

As AI systems become more independent and self-sufficient, their reliability, safety, and alignment with a given objective will depend on the quality and purity of the human signal that shaped them. The industry is making a significant transition, moving from viewing human intelligence as a bottleneck to seeing it as a fundamental building block of AI infrastructure.
