Most teams, when asked how they source product data, can't really answer the question. Not because it isn't happening, but because there's no strategy behind it. It's a survival process. Something breaks, someone fixes it, and nobody revisits it until the next crisis.
The result is a patchwork of methods, some deliberate, some accidental, none of them fully owned. In this episode we ran through each approach honestly: what it's actually good for, where it falls apart, and what we'd pick if we had to choose just one.
1. Manual copy and paste
Still the most common method in smaller catalogs, and for good reason. It works. A human goes to a supplier website, a PDF, or a data sheet, extracts what they need, and enters it into the system. There's genuine value in that human judgment. Someone is making a call about quality, relevance, and accuracy at every step.
The problems are obvious at scale. It's slow. It's expensive in terms of team time, even if it doesn't feel that way because the cost is buried in headcount. Quality depends entirely on who's doing it. And it simply doesn't scale without adding more people.
There's also a version of this that often gets overlooked: when suppliers submit data in Excel onboarding templates, someone still has to copy that into the PIM or ecommerce platform. Designing mapping rules between hundreds of different supplier spreadsheets and a target data model is itself a significant, often underestimated, operational cost.
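To make that cost concrete, here is a minimal sketch of what one supplier-to-target mapping might look like, using pandas; the column headers, the mapping, and the required fields are all invented for illustration, and in practice every supplier template needs its own version of this.

```python
import pandas as pd

# Hypothetical mapping from one supplier's column headers to our target model.
# Every supplier template needs its own version of this dictionary.
COLUMN_MAP = {
    "Item No.": "sku",
    "Description": "product_name",
    "Colour": "color",
    "Pack Qty": "pack_quantity",
}

REQUIRED_FIELDS = ["sku", "product_name", "color"]


def map_supplier_sheet(path: str) -> pd.DataFrame:
    """Load one supplier spreadsheet and rename its columns to the target model."""
    df = pd.read_excel(path).rename(columns=COLUMN_MAP)

    # Surface anything the target model needs but the supplier didn't provide.
    missing = [field for field in REQUIRED_FIELDS if field not in df.columns]
    if missing:
        raise ValueError(f"Supplier sheet is missing required fields: {missing}")

    # Keep only the columns we actually mapped; everything else needs a human decision.
    return df[[c for c in df.columns if c in COLUMN_MAP.values()]]
```

Multiply that by a few hundred suppliers, each with their own headers, units, and quirks, and the scale of the mapping work becomes clear.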
2. Data scraping
The idea: build or buy scripts that pull data from manufacturer or competitor websites automatically. Faster than manual, particularly useful for filling gaps when a supplier hasn't sent anything.
The reality is messier. There are legal gray areas, particularly around images, where watermarking and copyright disputes have caught businesses out. More fundamentally, scraped data isn't normalized. It arrives in whatever structure the source site uses, which means you still need a layer of mapping and cleaning to get it into your target model.
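As a rough illustration of that mapping-and-cleaning layer, here is a hedged sketch using requests and BeautifulSoup; the CSS selector and key names are assumptions about one hypothetical source site, not a general-purpose scraper.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical scrape of one source site's spec table. Every site structures this
# differently, so the selector and the key cleanup below only work for that one site.
def scrape_spec_table(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    raw = {}
    for row in soup.select("table.specs tr"):
        cells = row.find_all("td")
        if len(cells) == 2:
            raw[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)

    # The scraped keys arrive in the source site's wording ("Colour", "Net Weight (kg)"),
    # so a second mapping pass is still needed to reach the target data model.
    key_map = {"Colour": "color", "Net Weight (kg)": "weight_kg"}
    return {key_map.get(key, key): value for key, value in raw.items()}
```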
And the economics often don't work out. Building scrapers takes time. Maintaining them as source sites change takes more. Adding the quality control layer needed to trust what you're getting adds further. When you run the numbers, the cost of scraping properly is frequently comparable to just doing it manually with a cheaper resource.
There's also a data quality risk that's easy to miss: scraping from competitor sites means trusting that their data is accurate. You have no way of knowing where they sourced it from. You can end up moving bad data around faster, not fixing it.
3. Data pools
The concept is appealing: suppliers publish their product data once, into a shared pool, and distributors and retailers pull it out as needed. It eliminates duplication of effort across the supply chain.
In practice, it works well in a narrow set of circumstances. Grocery, particularly in the EU, has good GS1-based infrastructure. The electrical distribution sector has done solid work through the EDA. Where there is a regulated industry body orchestrating things and suppliers are mature enough to participate properly, data pools can be a genuine asset.
Outside of those conditions, the picture is significantly worse. Coverage is the primary issue. One DIY retailer looking at a data pool found around 15% of their supplier base represented in it. That means 85% still need a separate process anyway, which undermines the whole argument.
Even where coverage is adequate, the data still has to be mapped from the pool's schema into your own. The translation layer doesn't disappear, it just starts from a different place. And most commercially run data pools charge ongoing license fees to access the data, often on both the supplier and merchant side. There's a reasonable argument that this model charges businesses to access information about their own products.
4. AI ingestion and structuring
This is the one attracting the most interest right now, and for legitimate reasons. Taking unstructured inputs like PDFs, spreadsheets, data sheets, and product descriptions, running them through AI extraction, and mapping the outputs to a target schema is genuinely fast and genuinely scalable. Productivity gains of 80 to 90% compared to manual processing are achievable.
But two things are often misunderstood about it.
First, humans cannot be removed from the loop. Anyone selling a fully automated AI enrichment solution with no human review is selling something that will produce bad data at speed. AI is only as good as the sources you feed it and the model you're mapping to. Gaps in source data produce gaps in output. Weak schema design produces weak results. The human role shifts from doing the work to validating and correcting it, which is a meaningful improvement, but it is not elimination.
Second, AI doesn't fix bad structure. It processes it faster. If your attribute model is unclear, your taxonomy is inconsistent, or your source data is unreliable, AI will confidently map all of that mess into your target system at scale. The foundation has to be right first.
When it does work well, the scale argument is compelling. A prospect with two and a half million SKUs and three people in their ecommerce team is a realistic candidate for this approach, provided their products have good source data and a manageable attribute model. That is genuinely not achievable any other way.
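For a sense of what the extraction-and-mapping step can look like, here is a minimal sketch using the OpenAI Python client; the model name, prompt wording, and target attributes are assumptions, and the output would still go through human review rather than straight into the PIM.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical target attributes; in reality these come from the channel-facing taxonomy.
TARGET_ATTRIBUTES = ["material", "color", "width_mm", "ip_rating"]


def extract_attributes(source_text: str) -> dict:
    """Ask the model to map unstructured source text onto a fixed attribute list."""
    prompt = (
        "Extract the following attributes from the product text below. "
        f"Return a JSON object with exactly these keys: {', '.join(TARGET_ATTRIBUTES)}. "
        "Use null for anything the text does not state.\n\n" + source_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; substitute whatever you actually use
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    # Output still goes to a reviewer; gaps in the source produce gaps here too.
    return json.loads(response.choices[0].message.content)
```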
5. Supplier feeds, templates, and portals
This one deserves a more nuanced treatment than it usually gets, because the right answer depends entirely on which suppliers you're talking about.
Large, digitally mature suppliers, the Siemens and Schneider Electrics of the world, already have structured data in ETIM files. Some have APIs. They don't need a form-based portal. They need a mechanism to deliver a file and have it ingested cleanly.
Small and mid-sized suppliers, particularly those without dedicated product data teams, need something much simpler. A form they can open and fill in, with clear field definitions, guidance on what's mandatory, and validation that tells them when something is wrong.
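What that validation might look like is sketched below, with invented field rules; the point is that the supplier is told immediately, in plain language, what is missing or wrong.

```python
# Hypothetical field rules for a supplier submission form: what is mandatory,
# plus simple checks that can tell the supplier immediately what is wrong.
FIELD_RULES = {
    "sku": {"required": True},
    "name": {"required": True, "max_length": 120},
    "weight_kg": {"required": False, "numeric": True},
}


def validate_row(row: dict) -> list[str]:
    """Return plain-language problems the supplier can fix before resubmitting."""
    problems = []
    for field, rules in FIELD_RULES.items():
        value = row.get(field)
        if value is None or str(value).strip() == "":
            if rules.get("required"):
                problems.append(f"'{field}' is mandatory and is missing.")
            continue
        if rules.get("numeric"):
            try:
                float(value)
            except (TypeError, ValueError):
                problems.append(f"'{field}' should be a number, got '{value}'.")
        if "max_length" in rules and len(str(value)) > rules["max_length"]:
            problems.append(f"'{field}' is longer than {rules['max_length']} characters.")
    return problems
```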
Most supplier portals on the market are built around the merchant's data model, not the way suppliers actually work. When a supplier has to spend hours figuring out how to use a portal before they can submit 50 products, you'd be better off capturing the data yourself from their website. One business we spoke to had invested in a supplier onboarding tool that, nine months in, had just three suppliers using it out of a base of several hundred. The tool wasn't the problem. The experience wasn't designed around the supplier.
The solution is to offer a range of submission methods that meet suppliers at their maturity level, and to invest in adoption and training as seriously as you invest in the tooling itself.
6. Internal reuse
This is the one that most businesses overlook, often completely. Before going anywhere else, it's worth taking stock of what you already have.
Years of accumulated data sit in folders, SharePoint drives, legacy systems, and old spreadsheets. Product descriptions often contain structured attribute information that was never extracted into the right fields. Back-of-pack information exists but was never digitized. Category managers have knowledge that never made it into the PIM.
One customer ran an extraction pass over their existing product descriptions and achieved around 50% attribute fill rate without sourcing a single new piece of data. Another walked their warehouse photographing backs of packs.
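A crude version of that kind of extraction pass might look like the sketch below; the patterns and attribute names are made up for illustration, and a real pass would use the AI approach described above or a much richer rule set.

```python
import re

# Hypothetical patterns for attributes that often hide in free-text descriptions.
PATTERNS = {
    "width_mm": re.compile(r"(\d+(?:\.\d+)?)\s*mm\s+wide", re.IGNORECASE),
    "ip_rating": re.compile(r"\b(IP\d{2})\b"),
    "color": re.compile(r"\b(white|black|grey|chrome|brass)\b", re.IGNORECASE),
}


def extract_from_description(description: str) -> dict:
    """Pull whatever structured attributes are already sitting in the description text."""
    found = {}
    for attribute, pattern in PATTERNS.items():
        match = pattern.search(description)
        if match:
            found[attribute] = match.group(1)
    return found


# e.g. extract_from_description("Chrome towel rail, 600 mm wide, rated IP44")
# -> {"width_mm": "600", "ip_rating": "IP44", "color": "Chrome"}
```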
This should almost always be the starting point. It's fast, it's free, and it tells you exactly how big the actual gap is before you go and spend money trying to fill it from external sources.
The mistakes to avoid
Two patterns come up repeatedly in businesses that aren't getting results from their enrichment work.
The first is measuring success by volume rather than quality. Enriching a million SKUs with two attributes each will make no difference to your filters, your navigation, or your conversion. The measure that matters is quality against your actual channel-facing taxonomy, the one your customers interact with.
The second is treating enrichment as a one-off project. A big push, followed by nothing, followed by another crisis. Product data is not a project. It's an ongoing operational process that needs to account for new product introductions, supplier updates, and catalog changes continuously.
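Returning to the first of those mistakes: one way to make "quality against the channel-facing taxonomy" measurable is a fill rate computed against the attributes each category actually requires, rather than a raw count of populated fields. A rough sketch, with invented category requirements, follows.

```python
# Hypothetical required attributes per channel-facing category.
REQUIRED_BY_CATEGORY = {
    "taps": ["material", "finish", "spout_height_mm", "mounting"],
    "lighting": ["wattage", "lumen_output", "ip_rating", "color_temperature"],
}


def channel_fill_rate(products: list[dict]) -> float:
    """Share of required, customer-facing attributes that are actually populated."""
    required = filled = 0
    for product in products:
        for attribute in REQUIRED_BY_CATEGORY.get(product.get("category"), []):
            required += 1
            if product.get(attribute) not in (None, ""):
                filled += 1
    return filled / required if required else 0.0
```

A million SKUs with two generic attributes each would score poorly on this measure, which is exactly the point.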
If you had to pick one
Start with internal reuse. Understand what you already have before investing in any external sourcing method. It's the fastest route to a baseline, and it tells you exactly where the gaps are.
From there, if you need to scale and your structure is solid, AI ingestion is the most powerful lever available. Not as a replacement for human judgement, but as a way to make a small team capable of handling a very large catalog.