The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Make SQL Full Text search behave like Google

Hi all,

Our software indexes a large database of text content using SQL full text index.  It works, but the search results aren't as good as they could be.

A lot of users have high expectations for search based on their experience with Google.  Clearly something like Pagerank isn't going to work on this kind of database, but it would be great to be able improve the rankings

In addition, better search term behaviour would be good as well.  Stemming, quoted phrases etc.

SQL Full text search does parts of this, but it will require more functionality to make it more user accessible.  Does anyone know of any libraries (or approaches) that build on and enhance sql full text search to provide better results?

I'm aware of third party search tools such as Lucene.NET, but our preference is to avoid dependening on external indexing solutions.

Suggestions?


Derek
Derek Send private email
Tuesday, April 01, 2008
 
 
Search is a bit of a tall order my friend. Lucene is open source and IMO pretty effective. If you have to tightly integrate it, perhaps analyse the source code.

However, I have done exactly what your doing now and I ended up using LDAP to store large volumes of text as opposed to a database. The overhead of those queries grows exponentially as time goes on and the text volumes increase. use SQL to store indexes to the various content. and most importantly  you can keep the content zipped to save disk space.
Richard Corsale Send private email
Tuesday, April 01, 2008
 
 
I'm pretty sure SQL Server text searching supports both matching of phrases and stemming.

http://www.simple-talk.com/sql/learn-sql-server/sql-server-full-text-search-language-features,-part-2/

You have to build the query appropriately but that's not too difficult.
Arethuza Send private email
Wednesday, April 02, 2008
 
 
One thing I've learned is that search is very domain-specific. 

I built a generic engine to handle term vectorspace searching (in use in several of our products, and customer websites), and I can tell you that for effective search in a given domain, you have to tweak the results. You can see it at: http://www.bigattichouse.com/vectorspace.html

I know google has some algorithmic magic, but much of what they do is based on normalizing the data between page links.  Searching a database, unless you have something to provide contextual links between the data - will never look quite like google.  In a way you only get "one layer" of information as each record is discreet, if you have context (like using some of your other referential tables to provide context), you can add a layer of meaning on the data - and get more relevant searching.

Pop me an email, and check out our online example using VSDB and Ajax for knowledgebase (it has a subset of the microsoft knowledgebase in it) at:

http://www.bigattichouse.com/knowledgebase_demo/login.php
user: ash
pass: boomstick
Start typing a problem like "Error message when user clicks..." and it will start finding solutions.
Michael Johnson Send private email
Wednesday, April 02, 2008
 
 
@Michael Johnson

Awesome!

so I'm wondering. do you query after each keystroke or wait for a break in key strokes?
Richard Corsale Send private email
Wednesday, April 02, 2008
 
 
When you return results, do you have a preview (sample text from around the search term) displayed? If so, you could "vote" for a document if it gets selected, and return the commonly selected documents first (so popular documents bubble to the head of the queue).

You could also use some heuristics (document word count, document owner), to populate the initial order, but I would imagine you hit diminishing returns pretty quick.

You won't get page-rank like behavior (google algorithm) unless documents link to each other, and even then it could be tricky to implement. Besides, google has a patent.
PeterR
Thursday, April 03, 2008
 
 
@Richard

  I actually set up a timer(1 second), but I check to see if the text changed before I ping it.  I tried various other ways, but they were either ridiculously "expensive" over the wire (on change), or choppy.  1 second seems to be reasonable for most typing.  Its not perfect from a programming standpoint, but it seems to be reliable enough for production use.
Michael Johnson Send private email
Thursday, April 03, 2008
 
 
Lucene is really good at this sort of thing, and fast too.
Ruatara P Send private email
Thursday, April 03, 2008
 
 
I would suggest you look at our product. We are fast Windows SQL server, with t-sql and free-text (using Lucene) build in.

Our integration with Lucene and how to use it is documented here:
http://www.scimore.com/doc2/Full-Text_Indexing.html
Because we rely on Lucene, all the standard query functionality that you are used to from Google, is also present.

You can use it as embedded, standalone or distributed.
We have an embedded sample, that uses Lucene to search your local media files. It comes with the installer.

And using the distributed version of Scimore, you can massively scale freetext search over many machines to handle either massive requests or massive data.
Its mentioned here:
http://www.scimore.com/doc2/Scale-out_Lucene_full-text.html

Regards,
Thomas
Scimore Send private email
Tuesday, April 08, 2008
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz