Catalog Agents Deep Dive

When Your Supplier Data Looks Complete, But Your Listings Still Fail

Every supplier formats the same data differently. Here's how catalog agents normalize product data by understanding meaning — not matching patterns.

The spreadsheet has 847 rows. Every required field is populated. Someone ran the import, nothing errored. But three weeks later you find out a third of those products aren't appearing in filtered search results — and two of your top SKUs are in the wrong category on two major channels.

The data looked complete. It wasn't normalized. Those are different problems, and only one of them is visible in an import log.

What normalization actually means

A field being populated isn't the same as a field being correct. Normalization is the process of making sure the same real-world value maps to the same structured representation — regardless of how the supplier described it, what unit they used, or what category they filed it under.

Every retailer deals with dozens or hundreds of suppliers, and each has its own conventions. You don't get to tell Supplier A to match Supplier B's format. You have to absorb both and produce consistent output.

The three normalization problems

Units. Voltage is a simple example: one supplier writes "110V," another writes "110 volts," a third writes "110-volt operation," a fourth uses "110VAC." All four describe the same value. A rule that matches the exact string "110V" misses three of them. A channel that does an exact match on "110 volts" accepts only the second. What looks like complete data (every product has a voltage) actually contains four different representations that your channel's filter treats as four different values.

The same problem appears with dimensions, weights, thread counts, chemical concentrations, temperature ratings, and every other numeric attribute with a unit. The specific unit strings your suppliers use are not the unit strings your channels expect.
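As a rough illustration of the voltage example, the sketch below contrasts an exact-string check with a function that collapses the four supplier strings to one canonical value. The canonical format "110V" and the function name normalize_voltage are assumptions made for this example, and the regular expression is itself a hand-written rule that only covers the variants listed above. It shows the target representation, not how an agent arrives at it.

```python
import re
from typing import Optional

# Hypothetical pattern covering only the variants named in the text.
_VOLTAGE = re.compile(r"(\d+(?:\.\d+)?)\s*-?\s*(?:volts?|vac|v)\b", re.IGNORECASE)


def normalize_voltage(raw: str) -> Optional[str]:
    """Return a canonical string such as '110V', or None if no voltage is found."""
    match = _VOLTAGE.search(raw)
    if match is None:
        return None
    value = float(match.group(1))
    # Render whole numbers without a trailing '.0'.
    number = str(int(value)) if value.is_integer() else str(value)
    return f"{number}V"


supplier_values = ["110V", "110 volts", "110-volt operation", "110VAC"]

# Exact matching keeps only the one string that already has the expected form.
exact_matches = [v for v in supplier_values if v == "110V"]

# Normalizing first collapses all four strings to a single value.
normalized = {normalize_voltage(v) for v in supplier_values}
```

Here exact_matches holds one of the four strings, while normalized contains a single canonical value. A supplier string outside the pattern would still return None, which is the gap the rest of this article is about.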

Taxonomy. Supplier categories don't map to your categories. A supplier calls it "Pipe Fittings — Compression Type — Metric." Your channel needs "Plumbing > Fittings > Compression Fittings." A different channel needs "Industrial > Pipe & Fittings > Compression." The supplier's taxonomy is not yours. Mapping between them requires understanding what the supplier means, not just what they wrote.

Taxonomy mismatches are the most expensive normalization errors because they control which channel category your product appears in. A product in the wrong category doesn't just rank poorly — it's invisible to buyers browsing the right category and excluded from filters that narrow by category.
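One way to picture the mapping is as two steps: supplier label to an internal concept, then internal concept to each channel's category path. The sketch below uses plain lookup tables for both steps. The concept ID compression_fitting and the channel names channel_a and channel_b are hypothetical, and the first table stands in for the step that actually requires understanding what the supplier meant.

```python
from typing import Optional

# Step 1 (hypothetical table): supplier category label -> internal concept ID.
SUPPLIER_TO_CONCEPT = {
    "Pipe Fittings — Compression Type — Metric": "compression_fitting",
}

# Step 2: internal concept ID -> category path per channel.
# Channel names are placeholders; the paths come from the example above.
CONCEPT_TO_CHANNEL_PATH = {
    "compression_fitting": {
        "channel_a": "Plumbing > Fittings > Compression Fittings",
        "channel_b": "Industrial > Pipe & Fittings > Compression",
    },
}


def channel_category(supplier_category: str, channel: str) -> Optional[str]:
    """Return the channel's category path, or None if the label is unmapped."""
    concept = SUPPLIER_TO_CONCEPT.get(supplier_category.strip())
    if concept is None:
        # Unmapped label: surface it for review instead of guessing a category.
        return None
    return CONCEPT_TO_CHANNEL_PATH.get(concept, {}).get(channel)
```

Keeping the internal concept in the middle means a new channel only adds entries to the second table. A lookup table like the first one breaks as soon as a supplier rewords the label, which is why that step is the hard part.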

Controlled vocabularies. Many fields have an approved value set: yes/no, color names, compliance certifications, hazmat classes, country of origin codes. Suppliers fill these with their own variants. "Stainless Steel" vs. "stainless" vs. "SS" vs. "304 SS" — which is your controlled vocabulary's term? Is it a color or a material? Is it a separate field or part of the product name?
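A minimal sketch of one possible answer to those questions, assuming "Stainless Steel" is the approved material term and that a grade such as "304" belongs in its own field. Both choices are assumptions for illustration, as are the field names and the needs_review marker. Your own vocabulary may decide differently.

```python
from typing import Dict

# Hypothetical alias table: supplier variant -> structured fields.
ALIASES: Dict[str, Dict[str, str]] = {
    "stainless steel": {"material": "Stainless Steel"},
    "stainless": {"material": "Stainless Steel"},
    "ss": {"material": "Stainless Steel"},
    "304 ss": {"material": "Stainless Steel", "grade": "304"},
}


def resolve_material(raw: str) -> Dict[str, str]:
    """Map a supplier string to controlled-vocabulary fields, or flag it."""
    # Lowercase and collapse whitespace so '304  SS' and '304 ss' share a key.
    key = " ".join(raw.casefold().split())
    fields = ALIASES.get(key)
    if fields is None:
        # Unknown variant: keep the raw value and mark it for review.
        return {"needs_review": raw}
    return dict(fields)
```

For example, resolve_material("SS") yields only a material field, while resolve_material("304 SS") yields a material and a grade. Anything not in the table comes back flagged rather than silently passed through.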

How agents normalize vs. how rules normalize

Supplier input	Rules-based output	Agent output
"110-volt operation"	(no match — field left empty)	110V AC
"Pipe Fittings — Compression Type — Metric"	Pipe Fittings (partial match)	Plumbing > Fittings > Compression Fittings
"UL listed, cCSAus"	(no match — field left empty)	UL, CSA (two separate certification fields)
"marine-grade aluminum alloy"	Aluminum (partial match)	Aluminum alloy, Marine grade — Material + Grade (two fields)
"5kg / 11 lbs"	5kg (first value only)	5 kg (metric canonical), 11 lbs (imperial alternate)

Rules handle the cases the rules were written for. Agents handle the cases they weren't written for — because agents understand what the supplier meant, not just whether the supplier used the expected string.
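To make the last table row concrete, here is a small sketch of splitting a combined weight string into a metric value and an imperial value instead of keeping only the first. The function name split_weight and the output keys are hypothetical, and the pattern handles only kg and lb/lbs. It illustrates the output shape from the table, not the agent's internal method.

```python
import re
from typing import Dict

# Matches a number followed by a mass unit; only kg and lb/lbs are covered.
_MASS = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|lbs?)\b", re.IGNORECASE)

# Unit as written -> (measurement system, unit string used in the output).
_SYSTEM = {
    "kg": ("metric", "kg"),
    "lb": ("imperial", "lbs"),
    "lbs": ("imperial", "lbs"),
}


def split_weight(raw: str) -> Dict[str, str]:
    """Return every weight found in the string, keyed by measurement system."""
    result: Dict[str, str] = {}
    for number, unit in _MASS.findall(raw):
        system, canonical_unit = _SYSTEM[unit.lower()]
        # Keep the first value seen for each system.
        result.setdefault(system, f"{number} {canonical_unit}")
    return result
```

Called on "5kg / 11 lbs", this returns both a metric and an imperial entry, where a first-value-only rule would keep just the kilogram figure.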

The hidden cost of unnormalized data

Unnormalized data creates problems that compound over time. A buyer who filters by "110V" doesn't see your "110-volt" product. They buy a competitor's. Your analytics show zero impressions for that filter, so you assume the channel just doesn't surface your category. Meanwhile the real problem is a voltage string that differs from your channel's format by a few characters.

Across a catalog of 50,000 SKUs with dozens of attribute fields per SKU, the number of these silent mismatches can be enormous. They accumulate invisibly. No error log, no failed import, no suppression notice. Products just don't show up where they should.
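Because these mismatches produce no errors, one way to surface them is to audit an attribute column for values that differ only in formatting. The sketch below is an assumption about how such a check might look, not a method described in this article. The helper names simple_key and find_variant_groups are made up, and the key function only ignores case, spaces, and punctuation, so it would not group "110V" with "110 volts".

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def simple_key(raw: str) -> str:
    """Crude comparison key: lowercase, letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", raw.casefold())


def find_variant_groups(
    values: Iterable[str], normalize: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group raw values by key and keep groups with more than one spelling."""
    groups = defaultdict(set)
    for raw in values:
        groups[normalize(raw)].add(raw)
    return {key: sorted(raws) for key, raws in groups.items() if len(raws) > 1}


voltages = ["110V", "110 v", "110-V", "220V"]
variants = find_variant_groups(voltages, simple_key)
```

Any key that maps to several raw spellings is a candidate silent mismatch: the import succeeded, but an exact-match filter would treat each spelling as a separate value.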

Normalization at scale

The reason normalization is an agent problem rather than a rules problem is scale and variety. A small catalog with three suppliers might be manageable with a carefully maintained mapping table. A catalog with 200 suppliers, 80,000 SKUs, and 40+ required fields per category needs something that can handle novel inputs — suppliers you haven't seen before, formats you didn't anticipate, fields that changed meaning between supplier updates.

Catalog agents normalize by reasoning about meaning. When a new supplier sends data in a format that's never appeared before, the agent figures out what the values mean and maps them to the right fields. The rules table doesn't need to be updated. The normalization just works.

Read: How catalog agents extract attributes in the first place →

