From Demo to Production: What It Really Takes to Make Computer Vision Work in a Warehouse

A computer vision model might look great in a test environment but still fail just days after being deployed in a real warehouse.

This gap is one of the most misunderstood parts of applied AI. In a controlled setting, the process seems simple: gather images, train a classifier, compare results with catalog data, and return a match score. But in a real warehouse, things get messy fast. Lighting changes during the day. Operators snap photos from whatever angle is quickest. Packages arrive damaged. Labels might be hidden by shrink wrap or extra tape. Catalog data often has gaps that only show up when the model struggles. And the person using the system is not in a lab—they have a truck waiting, items piling up, and no time for an AI tool that slows them down.

In my experience, this human context is where production deployments most often fail. Teams usually focus on the model itself. The model matters, but the system built around it decides whether the business gets value.

I learned this lesson the hard way during a 2022–2023 deployment of a computer vision verification system for warehouse receiving in an industrial supply chain. The client ran a central distribution warehouse near Moscow that handled MRO (maintenance, repair, and operations) spare parts. Receiving processed hundreds of line items daily, and a mismatch missed at receiving could cause a production stoppage weeks later. The goal sounded simple but was not: compare the part that arrived with what the warehouse expected, using images, catalog data, and technical references, and flag mismatches before they entered the warehouse management system.

The project ended up reducing receiving mismatches by 99%. But that headline number isn’t the most important part. What really matters is what we had to build around the model to make that result possible.

The Problem Was Never Just Recognition

Industrial spare parts are tough for computer vision because many look very similar, even though they have important differences. Bearings, valves, fasteners, sensor housings, connectors, and other parts might look almost the same but differ in size, thread type, connector position, supplier code, or technical specs. From a few feet away or through a phone camera, they can seem identical even when they are not interchangeable.

For a receiving clerk, this creates a daily risk. They usually have just seconds to check an item against a delivery note. If a mismatch gets through, the error spreads into inventory, maintenance, procurement, and sometimes even causes production disruptions. In these settings, the real cost isn’t the part itself—it’s finding out too late that the wrong part was accepted.

The first version of the solution seemed simple: capture an image, compare it to reference images and catalog data, calculate a confidence score, and decide if it’s a match. In a notebook, this is mostly about classification and similarity. In real operations, though, it turns into a workflow problem. It’s not just a harder version of the same task—it’s a completely different challenge.

My main job was handling the data and integration layer: tying together receiving events, SKU data, supplier info, confidence scores, operator feedback, exception handling, and post-deployment reporting into one smooth workflow. The hardest part wasn’t just identifying the right part—it was making the verification result actually usable in the real receiving process.

The production setup followed what seems like a standard pattern on paper: capture images from a handheld device or workstation, run image preprocessing and quality checks, classify and search for similarities, compare with SKU, supplier, and catalog data, score confidence, let operators review uncertain cases, write results back into the workflow, log exceptions, monitor with dashboards, and set up a feedback loop for retraining.

What I didn’t realize at first was that every one of those layers was essential.

Pilots Are Cleaner Than Reality

The pilot worked. Production didn’t—at least not right away.

This isn’t a knock on pilots. Their job is to prove technical feasibility, and ours did that. The problem is that pilot conditions are usually much easier than the real world. They hide the kinds of failures that show up later.

The first live warehouse tests revealed problems the pilot dataset never showed. Some items were photographed under mixed lighting from lamps and daylight. Some packages had shiny plastic wrap that made images almost useless. Some operators took photos too close, while others photographed the whole shipping box instead of the part that needed checking. Sometimes the part was correct, but the supplier label had an old code that didn’t match the current catalog. In other cases, the reference image was a perfect manufacturer photo, but the received item was scratched, shrink-wrapped, or half-covered by a logistics label.

One incident early in the rollout summed up the problem. Two sensor housings arrived on the same morning. The model saw the second as a strong match to the first, and visually they did look very similar. But Artem, a senior receiving operator, stopped the second item because the connector orientation didn’t match the technical spec. He couldn’t explain it in model terms, but he knew something was wrong. That override was more valuable than a thousand polished training examples because it showed the real issue: we were still treating image similarity as the main truth, when the real business need was to verify against a specific expected transaction.

That difference is important. Recognition asks, “What does this look like?” Verification asks, “Is this exactly what we expected for this receiving event?” These questions are related, but not the same. A system built for recognition can fail badly at verification.
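The difference can be made concrete with a small sketch. This is an illustration, not the project's code: the `VisualMatch` type, the `margin` parameter, the default cutoffs, and the SKU names are all hypothetical. Recognition returns the best-looking candidate. Verification starts from the SKU the receiving event expects and refuses to decide when a look-alike scores about as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualMatch:
    sku: str           # candidate catalog SKU (hypothetical identifiers)
    similarity: float  # image similarity score in [0, 1]


def recognize(candidates: list[VisualMatch]) -> VisualMatch:
    """Recognition: which catalog item does this image look like most?"""
    return max(candidates, key=lambda m: m.similarity)


def verify(expected_sku: str, candidates: list[VisualMatch],
           min_similarity: float = 0.75, margin: float = 0.05) -> str:
    """Verification: is this the SKU expected for this receiving event?"""
    expected = next((m for m in candidates if m.sku == expected_sku), None)
    if expected is None or expected.similarity < min_similarity:
        return "mismatch"
    # If a different SKU scores about as well as the expected one,
    # the image alone cannot settle the question.
    rivals = [m for m in candidates
              if m.sku != expected_sku
              and m.similarity >= expected.similarity - margin]
    return "needs_review" if rivals else "match"


candidates = [VisualMatch("HOUSING-A", 0.97), VisualMatch("HOUSING-B", 0.95)]
print(recognize(candidates).sku)        # HOUSING-A
print(verify("HOUSING-B", candidates))  # needs_review
print(verify("HOUSING-C", candidates))  # mismatch
```

With two near-identical housings, recognition confidently picks one, while verification against the expected transaction reports that a human should look.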

Confidence Thresholds Are Business Policy

One of the most important design choices was how to handle confidence.

It’s easy to see confidence thresholds as just technical settings. Raise the threshold to cut false positives, lower it to automate more. But in real use, that view is too narrow. A confidence threshold is also a policy choice because it shapes how the business deals with uncertainty.

We settled on three operating bands. High-confidence results were auto-confirmed. Medium-confidence results went to the operator for assisted confirmation, along with the system’s best guess, key catalog details, and a short note explaining the uncertainty. Low-confidence results weren’t forced into a yes-or-no answer. They went to manual review, with possible matches and supporting data shown to the reviewer.

In practice, auto-confirmation happened above about 0.92 when key metadata matched. Assisted confirmation covered the 0.75 to 0.92 range. Below 0.75, or when the image and metadata didn’t agree, the item went to manual review. These numbers weren’t magic—they came from calibrating against the warehouse’s real review capacity and the actual cost of mistakes.
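As a minimal sketch, the three bands could be written as a single routing function. The 0.92 and 0.75 cutoffs come from the text above. The names, and the choice to send a score of exactly 0.92 or 0.75 to assisted confirmation, are assumptions made for illustration.

```python
from enum import Enum

AUTO_CONFIRM_ABOVE = 0.92  # from the text: auto-confirm above about 0.92
ASSISTED_FROM = 0.75       # from the text: assisted band starts at 0.75


class Route(Enum):
    AUTO_CONFIRM = "auto_confirm"
    ASSISTED = "assisted_confirmation"
    MANUAL = "manual_review"


def route(confidence: float, metadata_matches: bool) -> Route:
    """Map a confidence score and a metadata check to an operating band."""
    # Disagreement between image and metadata always goes to a human,
    # no matter how confident the visual match is.
    if not metadata_matches or confidence < ASSISTED_FROM:
        return Route.MANUAL
    if confidence > AUTO_CONFIRM_ABOVE:
        return Route.AUTO_CONFIRM
    return Route.ASSISTED


print(route(0.95, True))   # Route.AUTO_CONFIRM
print(route(0.95, False))  # Route.MANUAL
print(route(0.80, True))   # Route.ASSISTED
```

Keeping the cutoffs as named constants makes the policy visible and easy to review with operations, which matters more than the code itself.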

This is where warehouse AI projects often go off track. When someone asks for a big automation rate, there’s pressure to lower thresholds to make the system look efficient. But more automation can also let through mistakes the operation can’t handle. The real question isn’t “How do we automate 80% of cases?” It’s “Which mistakes are okay to automate, and which need a human to see them?”

This distinction also mattered for accountability. When a high-confidence result was auto-confirmed, the system owned the decision. When an operator confirmed a medium-confidence suggestion, it was logged as human-assisted. Overrides were tracked separately. Without this separation, analyzing incidents becomes guesswork. You might see something went wrong, but not whether the problem was in the model, the catalog, the interface, the threshold policy, or the operator’s choice.

The Human in the Loop Was Not a Temporary Patch

In AI discussions, human review is often seen as a temporary step—something to use until the model is good enough to remove it.

I no longer think that’s a realistic way to approach industrial settings.

The warehouse operators weren’t just making up for a weak model. They were essential because some receiving decisions depend on context that isn’t in the dataset. A part might be damaged in a way that matters for operations. A supplier might send a valid substitute that isn’t yet in the catalog. A label might use an old code. A vendor might be under extra scrutiny that month for quality issues. None of this shows up in a clean training image.

The model could help with those decisions, but it shouldn’t try to handle all of them.

This influenced how we designed the interface. It had to show uncertainty clearly without overwhelming the operator. Just showing a confidence number wasn’t enough. Operators needed the suggested SKU, alternatives, key details, and a hint about why the system was unsure. The workflow also had to stay quick. If the tool was hard to use, the team would find ways around it. A technically correct system that slows down the dock is still a failure.

We tried several versions of the UI before it felt right. In the end, overrides, low-confidence cases, and exception reasons were logged as separate event types. This wasn’t in the original plan. We added it after realizing that mixed logs were almost useless for debugging.

That feedback loop became one of the system’s most valuable features. It let us retrain on the specific SKUs, suppliers, and visual conditions that caused problems, instead of just adding more images and hoping the model would generalize.
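One way to picture the separate event types is a small typed log record plus a query over it. This is a sketch under assumptions: the field names, the event type labels, and the `min_count` cutoff are hypothetical, not the schema we shipped.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EventType(Enum):
    AUTO_CONFIRMED = "auto_confirmed"  # the system owned the decision
    HUMAN_ASSISTED = "human_assisted"  # operator confirmed a suggestion
    OVERRIDE = "override"              # operator rejected the suggestion
    LOW_CONFIDENCE = "low_confidence"  # sent to manual review
    EXCEPTION = "exception"            # e.g. unreadable image, catalog gap


@dataclass(frozen=True)
class DecisionEvent:
    event_type: EventType
    receiving_id: str
    sku: str
    confidence: float
    operator_id: Optional[str] = None
    reason: str = ""


def retraining_candidates(events, min_count=2):
    """SKUs that repeatedly ended in an override or a low-confidence case."""
    problem_types = {EventType.OVERRIDE, EventType.LOW_CONFIDENCE}
    counts = Counter(e.sku for e in events if e.event_type in problem_types)
    return [sku for sku, n in counts.most_common() if n >= min_count]


events = [
    DecisionEvent(EventType.OVERRIDE, "R-1", "SKU-1", 0.94, "op-7", "connector orientation"),
    DecisionEvent(EventType.LOW_CONFIDENCE, "R-2", "SKU-1", 0.61),
    DecisionEvent(EventType.AUTO_CONFIRMED, "R-3", "SKU-2", 0.97),
    DecisionEvent(EventType.OVERRIDE, "R-4", "SKU-3", 0.93, "op-2", "old supplier code"),
]
print(retraining_candidates(events))  # ['SKU-1']
```

Because each record carries its own type, the question "which SKUs keep causing trouble" becomes a filter instead of a log-parsing exercise.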

The Integration Work Nobody Puts in the Demo

A lot of the project’s hardest work was in the integration layer—the part almost nobody talks about in AI demos.

The verification service had to fit right into the receiving workflow, not just sit alongside it as an extra check. It needed to get transaction details from the warehouse system, look up the right SKU and supplier info, return results fast enough not to slow operators down, write decisions back into the process, and keep a full audit trail. Our target was under 900 milliseconds at the 95th percentile for standard checks. After tuning, we aimed to keep the operator-assisted review queue at about 8% to 12% of cases, since anything higher would slow down the team.
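Both targets are easy to check from logged data. The sketch below is illustrative only: it assumes latencies are logged in milliseconds, that each decision is logged with a route label such as "assisted", and it uses the nearest-rank definition of a percentile, which is one of several common definitions.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_targets(latencies_ms, routes, p95_target_ms=900, max_review_share=0.12):
    """Compare logged latencies and routes against the two operating targets."""
    p95 = percentile_nearest_rank(latencies_ms, 95)
    review_share = routes.count("assisted") / len(routes)
    return {
        "p95_ms": p95,
        "latency_ok": p95 < p95_target_ms,
        "review_share": review_share,
        "review_queue_ok": review_share <= max_review_share,
    }


latencies = [300] * 18 + [1500] * 2  # 10% of checks are slow
routes = ["auto"] * 9 + ["assisted"]
print(check_targets(latencies, routes))
```

In this example the average latency looks fine, but the 95th percentile misses the target, which is why the target was stated as a percentile and not a mean.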

Getting to those numbers meant making choices that weren’t about the model itself. Network stability was a real issue because some parts of the warehouse had dead zones, so we needed fallback steps if the service was down. Deciding between synchronous and asynchronous writes took longer than expected, since audit rules meant some events had to be written right away, even if it slowed things down. Manual overrides had to be stored in a way that could actually be searched. We had to tell model errors apart from catalog errors. The exception queue needed an owner, and while giving it to the warehouse operations lead worked here, that might not work everywhere.
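The dead-zone fallback can be sketched in a few lines. Everything here is hypothetical: `call_service` stands in for whatever client talks to the verification service, and the request and response shapes are invented for the example. The idea is only that a network failure degrades to manual review and leaves a record to retry, so that receiving is never blocked.

```python
from collections import deque


def verify_with_fallback(request, call_service, retry_queue):
    """Call the verification service; on network failure, fall back to manual review."""
    try:
        return call_service(request)
    except (TimeoutError, ConnectionError):
        # Keep the request so it can be re-verified once the network is back.
        retry_queue.append(request)
        return {"route": "manual_review", "reason": "verification_unavailable"}


def unreachable_service(request):
    # Simulates a handheld device in a warehouse dead zone.
    raise ConnectionError("no network")


retry_queue = deque()
result = verify_with_fallback({"receiving_id": "R-42"}, unreachable_service, retry_queue)
print(result["route"])   # manual_review
print(len(retry_queue))  # 1
```

Catching only network-style errors is deliberate: a bug in the service should surface as an error, not be silently turned into manual work.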

None of these decisions are glamorous, but together they decide if the system will still work well six months after launch.

Going Live Is the Start of a Different Job

In production, monitoring was less about tracking accuracy and more about watching how the system behaved.

Accuracy by itself wasn’t enough to judge system health. We tracked things like confidence distributions, low-confidence rates, operator overrides, image quality problems, latency, exception patterns by SKU, and catalog data issues. Problems never showed up in a neat way. Sometimes the first sign was a shift in confidence scores. Other times, it was a spike in assisted reviews for one supplier. Sometimes the issue wasn’t the model at all, but a new packaging type or a catalog update that hadn’t gone through yet.

The monitoring dashboard needed to answer real operational questions, not just academic ones. Which SKUs are causing the most exceptions? Are low-confidence cases going up? Are operators overriding the system more than last month? Is the problem focused on one supplier, product family, receiving zone, or capture device?
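Most of these questions reduce to the same grouped rate over the decision log. Here is a minimal sketch, assuming events are available as dictionaries with an "event_type" field and grouping fields such as "supplier" or "device". Those field names are illustrative, not the real schema.

```python
from collections import defaultdict


def override_rate_by(events, key):
    """Share of decisions overridden by operators, grouped by one field."""
    totals = defaultdict(int)
    overrides = defaultdict(int)
    for event in events:
        group = event[key]
        totals[group] += 1
        if event["event_type"] == "override":
            overrides[group] += 1
    return {group: overrides[group] / totals[group] for group in totals}


events = [
    {"supplier": "S1", "device": "handheld", "event_type": "override"},
    {"supplier": "S1", "device": "handheld", "event_type": "auto_confirmed"},
    {"supplier": "S2", "device": "workstation", "event_type": "auto_confirmed"},
    {"supplier": "S2", "device": "handheld", "event_type": "auto_confirmed"},
]
print(override_rate_by(events, "supplier"))  # {'S1': 0.5, 'S2': 0.0}
```

Running the same function with "device", a product family field, or a receiving zone field answers the other versions of the question.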

Post-deployment reporting became a core part of the product. It guided retraining priorities, catalog cleanup, operator support, and workflow improvements. At that stage, my experience in BI and data engineering was just as valuable as my model development skills.

What I Would Do Differently Now

If I started a similar deployment today, I’d focus on data quality much earlier. Profiling SKU data, supplier codes, past substitutions, and receiving exceptions from the start would have saved weeks of debugging issues that seemed like model problems but were really catalog issues. We thought the data was cleaner than it was.
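A first profiling pass does not need much. The sketch below assumes catalog rows exported as dictionaries with "sku", "supplier_code", and "description" fields. These names are assumptions for illustration. It counts missing required fields and flags supplier codes that point to more than one SKU, which is the kind of catalog issue that can look like a model error.

```python
from collections import Counter, defaultdict


def profile_catalog(rows, required=("sku", "supplier_code", "description")):
    """Count missing required fields and find supplier codes mapped to several SKUs."""
    missing = Counter()
    skus_by_code = defaultdict(set)
    for row in rows:
        for field in required:
            if not (row.get(field) or "").strip():
                missing[field] += 1
        code = (row.get("supplier_code") or "").strip()
        sku = (row.get("sku") or "").strip()
        if code and sku:
            skus_by_code[code].add(sku)
    ambiguous = {code: sorted(skus) for code, skus in skus_by_code.items() if len(skus) > 1}
    return {"rows": len(rows), "missing": dict(missing), "ambiguous_supplier_codes": ambiguous}


rows = [
    {"sku": "SKU-1", "supplier_code": "AB-10", "description": "Bearing"},
    {"sku": "SKU-2", "supplier_code": "AB-10", "description": ""},
    {"sku": "SKU-3", "supplier_code": None, "description": "Valve"},
]
print(profile_catalog(rows))
```

Here one supplier code maps to two SKUs and two rows have gaps, all visible before a single image is collected.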

I’d also involve receiving operators much sooner—not just managers, and not only during user testing. The warehouse team knew which items got mixed up, which suppliers labeled things inconsistently, and which packages were almost impossible to photograph. We eventually sat with the receiving shift and watched them work. That should have happened in week two, not week fourteen.

I’d also set the threshold policy earlier, before anyone started aiming for a specific automation number. Once someone says “80% auto-confirmation,” it’s harder to talk honestly about the cost of errors behind that goal. It’s better to start by asking: which mistakes are okay to automate, and which need a human involved?

The Real Gap Between Demo and Production

The main lesson is simple, even if the work wasn’t. A production computer vision system isn’t just a model behind an API. It’s a decision system built into an operational process.

In warehouse receiving, this means the system has to handle imperfect images, messy catalog data, impatient users, old workflows, network issues, audit needs, and changing product mixes. The 99% drop in mismatches happened not because the model was perfect, but because the system was designed to handle uncertainty instead of hiding it. Automatic and assisted decisions were kept separate. Operator feedback was collected in a useful way. Visual recognition was tied to catalog data and real transaction context. Monitoring was built in as a core feature, not an afterthought.

For teams stuck between a promising demo and a real solution on the warehouse floor, the hard truth is this: the model probably isn’t your biggest problem.

The real challenge is everything you still need to build around it.
