The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

simple algorithm for weighting search results

i am looking for information on a simple system for weighting search results

i am not looking to make the next killer websearch
just scan an sql database and get the most relevant matches

the fields are mainly text, but might not be dictionary words. there are between 5-20k rows in each table.
i only need to search one table at a time

at the moment we are using LIKE keyword% and LIKE %keyword%
on preselected columns in each table

they get the right results, just not in the best order

thanks
bumperbox
Wednesday, June 13, 2007
 
 
You need a definition of 'most relevant'
For google's page rank it is the number of other pages linking to this one.
For my CRM system it is the most complete match to this customers name / email
For my todo list it's the loudness of the person screaming for this feature.
Martin Send private email
Wednesday, June 13, 2007
 
 
Try SQL full text indexing or Lucene.NET
Ruatara P Send private email
Wednesday, June 13, 2007
 
 
the database i am using doesn't support full text search

i had look at lucence.net, it is not an option at the moment as we not developing in dotnet, but i have bookmarked it for later use, thanks
bumperbox
Wednesday, June 13, 2007
 
 
You can approach this like an algorithms problem (Google) or an AI problem.  If you think of it as an AI problem, the best solution is likely to be a simple and clever hack highly specific to your circumstances. 

For example, lets say I work for an online brokerage and I'm implementing search for company press releases.  Lets say all I've got is the title of the press release, the date, and the text.  If I wanted to spend an hour and greatly increase customer satisfaction from the search results, I would create an index for the table of all tickers mentioned in the release (trivial) and then, since I have the luxury of peeking at your portfolio since you're logged into my site, priortize releases from companies in your portfolio.
Patrick McKenzie (Bingo Card Creator) Send private email
Wednesday, June 13, 2007
 
 
You can use a point system to guage the importance, and sort descending on number of points garnered by each matched row.

1. All search terms occur in a single field - 3 points
2. Search term found in name field - 1 point
3. Search term found in title field - 2 points
4. two search terms are contiguous in a field - 1 point* number of such pairs
Donald Duck
Wednesday, June 13, 2007
 
 
Yes, the simple algorithm just calculates a relevancy weight, based on whatever rules are appropriate for your audience.  This is really a requirements analysis problem.
Mike Stockdale Send private email
Thursday, June 14, 2007
 
 
The "suite link" chapter of this pdf should give you some ideas:
http://ribbs.usps.gov/FILES/CASS/TECHNICAL_GUIDES/CASSTECH.PDF
(pages 133/134 of the printed doc, which adobe might represent as pages 138/139)
This can give a "weighting" for data that doesn't change much (the index table will be larger than the original data table). When doing address/name matching, many of the words won't be dictionary words.
Peter Send private email
Thursday, June 14, 2007
 
 
You are asking a question that has been the subject of many PhD theses over the last thirty years. Something based on the  relative frequency of relevant words is recommended, but after that, your guess is probably as good as anyone elses.
DJ Clayworth
Thursday, June 14, 2007
 
 
Here's a technique that I've used many times in the past:

http://en.wikipedia.org/wiki/Vector_space_model

Though, from what I gather, it's a little more complex than what you're looking for.
BenjiSmith Send private email
Thursday, June 14, 2007
 
 
Step 1: Define clearly what "most relevant" means; is the term occuring in field1 of more relevance than in field2? Is the term occuring in full of more relevance than as part of another word? Is the term occuring earlier in the text of more relevance than right at the end? Are there other indicators that influence the decision?

Step 2: Create a points scoring algorythym based upon your conclusions from step 1, so perhaps the relevance "score" can start at 10 for the first character in field1, dropping by 1 point down to zero for each three characters in; the score is multiplied by 1.3 if the term is found as a word (i.e. whitespace to either side) in field2 and loses 5 points if the term is not present at all in field4.

Step3: Write a (probably mind bogglingly complicated) query to return the results with a relevance column worked out as in step 2. Make sure that, no matter where the query is held, it is commented very clearly as to how AND why the relevance is calculated. Miss out the comment and you WILL regret it later, I promise you.
Paul Brown Send private email
Thursday, June 14, 2007
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz