The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Product Normalization Library

Dear Lazyweb,

Suppose you import a product feed.

This feed has many identical products but they're all listed slightly differently.  On purpose, of course.

When you do a search against the database, you no doubt have product listings like this:

iPod Nano Black 2GB
Apple - iPod Nano Black 2GB
iPod Nano Black 2GB (Apple)

It seems like an easy enough problem to solve, but I can imagine spending years of my life tuning a Bayesian or neural network product normalization system.  Before I go into hermit-mode on this, I thought I'd ask if I really have to be doing this.

Anon for this one
Tuesday, July 31, 2007
This is just like address matching. It will require some preprocessing to get it set up. So you'll have some tables like:



ID (which is the PK of the above table)
Hash (cannot be PK)

So you get something like

34..."Apple - iPod Nano Black 2GB"
2..."Apple - iPod Nano Purple 2GB"

34...Hash(2GB Apple Black iPod Nano) (note alpha sorted words, you should make them all upper/lower case)
2...Hash(2GB Apple iPod Nano Purple)

So you get some new candidate to match:
"iPod Nano Black 2GB"
Step 1 - alpha sort the words, removing "noise" words like "the" or "a". Becomes "2GB Black iPod Nano"
Step 2 - hash and look for matches. No match
Step 3 - if you're not confident enough in the match, then hash each word, and you'll get sets of matches for each word. The closest match should have larger numbers of hits on the words.

Lookup(Hash(iPod Nano Black 2GB)) returns null.
Lookup(Hash(2GB)) returns 34, 2
Lookup(Hash(Black)) returns 34
Lookup(Hash(iPod)) returns 34,2
Lookup(Hash(Nano)) returns 34,2

Largest number of hits: #34. So you should be reasonably confident that the input matches item #34 in your TableOfInternalPartNumbers.

For a slightly longer worked out sample of this, look in the post office CASS technical guide.
Peter Send private email
Tuesday, July 31, 2007
The two approaches: Manual matching of common misspellings vs. AI and/or regex each have drawbacks -- you just need to decide which approach is least worst.

Manual : Is, uhh, manual.  Someone has to enter in the original list, plus keep it updated as people find new creative ways to mangle product names.

AI/Regex : Will occasionally have a false hit or extra negative.  You need to decide what your accuracy tolerance is.
xampl Send private email
Thursday, August 02, 2007

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz