The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Full text indexing - how's it done?

Just curious if anyone knows of white papers (the ACM site is $pay$ and I don't belong) or documented open source implementations of full text indexing and search and retrieval.

A thumbnail overview of the subject would be appreciated.

And would it be totally stupid/wasteful to simply use a database engine to index occurrences of non-trivial words in a large body of text?
Bored Bystander Send private email
Friday, October 29, 2004
 
 
As a start you can take a look at http://sourceforge.net/projects/lucenedotnet/

That's the open source indexing system used as the back-end to the Outlook plug-in "Lookout".
Dennis Forbes Send private email
Friday, October 29, 2004
 
 
You can kind of fake it, and fake it well enough to make it look as if it's full text indexing.

For any application we build that has textual fragments attached to records (memos on orders, invoices, notes; user-entered free text, in other words), we add those words (above some size limit, usually 4 letters) into a keywords table that records the table and key value each word appears in.

Then it's straightforward to query based on words (or parts of words) and get back the matching records and relations.
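
In Python-and-SQLite terms the idea looks roughly like this (a sketch only; the table and column names are illustrative, not from anything we actually ship):

    import re
    import sqlite3

    MIN_LEN = 4  # words of this length or shorter are not indexed

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE keywords (
        word      TEXT,
        src_table TEXT,    -- which table the record lives in
        src_key   INTEGER  -- primary key of that record
    )""")

    def index_text(text, src_table, src_key):
        # Index each qualifying word once per record.
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)
                 if len(w) > MIN_LEN}
        con.executemany("INSERT INTO keywords VALUES (?, ?, ?)",
                        [(w, src_table, src_key) for w in words])

    index_text("Customer asked about the delivery schedule", "orders", 42)

    # Query on whole words or prefixes and get the owning records back.
    print(con.execute("SELECT src_table, src_key FROM keywords "
                      "WHERE word LIKE ?", ("deliver%",)).fetchall())
    # -> [('orders', 42)]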

It's fake because if the user goes back and edits the record, new words are added but deletions are ignored. The major reason is performance, but the justification is that if they used a word for that record in the past and it was long enough, it's probably still relevant even if they deleted it.

It's also fake because we don't try to highlight the word or find it within the text, or even within a specific field (since a particular record might have more than one free text field). Again, in our applications that doesn't matter, because it's the record that matters, not the word.

But it's certainly good enough as a fake, and if the users' data really were dynamic enough that edits of free text made a significant difference, I'd add handling of deletes as well.

If the free text is the point of the application, then you have to store the offset of each word within the field or document, and you have to index every occurrence of the word within the same tuple (in the fake version we just index each word once per record).
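
Continuing the sketch above, that full version stores one row per occurrence with its offset (again purely illustrative):

    # One row per occurrence, with the word's offset in the text, so a
    # match can later be located, highlighted, or used for phrase queries.
    con.execute("""CREATE TABLE positions (
        word    TEXT,
        src_key INTEGER,
        offset  INTEGER  -- character offset of the word in the document
    )""")

    def index_positions(text, src_key):
        for m in re.finditer(r"[A-Za-z]+", text):
            if len(m.group()) > MIN_LEN:
                con.execute("INSERT INTO positions VALUES (?, ?, ?)",
                            (m.group().lower(), src_key, m.start()))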

But all in all we've had success with the technique. For information management apps (which have lots of free text) it still has considerable utility, though you tend to also need an allowable-keyword list for short jargon words and acronyms.

We've never bothered blacklisting words other than by size (leaving out "because", "therefore" and so on), for a couple of reasons, but mostly because it would be more expensive to look up a blacklist before adding each word, and if users want to search for connective syntax, why should we care about the slight overhead in storage space?
Simon Lucy Send private email
Friday, October 29, 2004
 
 
PostgreSQL has all-open-source full-text indexing and searching.
Flamebait Sr.
Friday, October 29, 2004
 
 
==>And would it be totally stupid/wasteful to simply use a database engine to index occurrences of non trivial words in a large body of text?

Not at all. Many SQL DB engines have built-in full text indexing and sell it as a feature. If you're already using the database for something else, why reinvent the wheel and roll your own? On the other hand, I wouldn't install a database server just for its full text indexing features. There are too many other ways to skin that cat that don't involve installing, configuring and supporting a full-blown database server platform.
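
A minimal illustration of such a built-in, using SQLite's optional FTS5 module (just one example of the feature; the module has to be compiled into the SQLite build you're using):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
    con.execute("INSERT INTO docs(body) VALUES (?)",
                ("indexing occurrences of words in a large body of text",))

    # MATCH queries the inverted index the engine maintains for you.
    for rowid, body in con.execute(
            "SELECT rowid, body FROM docs WHERE docs MATCH ?", ("indexing",)):
        print(rowid, body)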
Sgt. Sausage
Friday, October 29, 2004
 
 
"If you're already using the database for something else, why reinvent the wheel and roll your own?"

Features.  I did it.  It wasn't my idea -- came down from management.  They wanted more features / control than the built-in full text indexing provides.  So I rolled my own.

It uses two tables: SearchWords and SearchIndex.  The words table is just a mapping of search words to an autonumber for performance reasons.  The SearchIndex maps together the content, the word id, and a score. 

Whenever content is added, it is separated into words.  The common words are filtered out.  The remaining words are stemmed (resolved to their root form), "synonymed", and then counted up.  Using the SearchWords table, the word ids are found for the words (and new words are added).  The count is turned into a score "round((log10(Count)+1)*10)" and that's used for ranking.  Then for each word, an entry is added in the SearchIndex table.

Searching involves a similar process.  The search terms are stemmed, "synonymed", and turned into word ids.  You can also have required terms (prefix with "+") and not allowed terms (prefix with "-").  The search query is pretty complicated -- mine involves grouping bit operations and temporary tables -- but is pretty fast.  The temporary table is used so I can then join with the data table and pull up meaningful result information.
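
The indexing half looks roughly like this in Python-with-SQLite form (a simplified sketch of what I described; the common-word filtering, stemming and synonym steps are stubbed out):

    import math
    import re
    import sqlite3
    from collections import Counter

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE SearchWords "
                "(WordID INTEGER PRIMARY KEY, Word TEXT UNIQUE)")
    con.execute("CREATE TABLE SearchIndex "
                "(ContentID INTEGER, WordID INTEGER, Score INTEGER)")

    def add_content(content_id, text):
        # Split into words; filtering, stemming and "synonyming"
        # would happen here before counting.
        words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
        for word, count in Counter(words).items():
            con.execute("INSERT OR IGNORE INTO SearchWords(Word) VALUES (?)",
                        (word,))
            (word_id,) = con.execute(
                "SELECT WordID FROM SearchWords WHERE Word = ?",
                (word,)).fetchone()
            # The ranking formula from above: round((log10(Count)+1)*10),
            # so a count of 1 scores 10 and a count of 10 scores 20.
            score = round((math.log10(count) + 1) * 10)
            con.execute("INSERT INTO SearchIndex VALUES (?, ?, ?)",
                        (content_id, word_id, score))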

I would have preferred using the built-in full text indexing, but it's pretty limited.  You'd have to be VERY exact with the terms or the search wouldn't find anything.  The addition of the stemming algorithm makes searching fuzzier -- producing better results.
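
To give the flavor of the stemming step, a toy version (illustrative only; a production system would use a proper algorithm such as Porter's):

    def toy_stem(word):
        # Strip a few common suffixes so variants resolve to one root.
        # Real stemmers handle far more cases (and exceptions) than this.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    print(toy_stem("indexing"))  # index
    print(toy_stem("searches"))  # search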
Almost Anonymous Send private email
Friday, October 29, 2004
 
 
Hey guys!

As always, I get the best help here. Thanks. Even the fellow purgatory dweller in Ohio helped out. :-)

No, I'm actually quite nasty, so I am definitely not sucking up. :-)
Bored Bystander Send private email
Friday, October 29, 2004
 
 
Try CiteSeer:

http://www.neci.nj.nec.com/homepages/lawrence/citeseer.html

Most recent papers are freely available. I've actually been able to find quite a bit of information about that particular topic.
Just Me
Saturday, October 30, 2004
 
 
Bored Bystander, Google for Lucene (Java) or Plucene (its Perl port) for a decent, modular OSS implementation of a full-text search engine. Also check out xapitan.org (a more advanced one in C++).

Using an RDBMS to store an inverted index is generally considered suboptimal, but there are successful implementations that do just that (Mnogosearch and ASPseek, for example). As for rolling your own engine, it's difficult for me to imagine a scenario where that would really be justified.
Egor
Saturday, October 30, 2004
 
 
Sorry, http://xapian.org, not xapitan.org
Egor
Saturday, October 30, 2004
 
 
These guys have scored some runs - been going since CP/M days:

http://www.isysusa.com/products/whitepapers/
trollop
Sunday, October 31, 2004
 
 
I've been down this rabbit hole indexing chromosomes. Without knowing the specific application it's hard to give much practical advice. Are you stemming? Case folding? Indexing words or whole documents? But... for me at least:

1) As many others have already said, use SQL & as many canned solutions as possible. Get creative with them because creatively using known solutions is usually a better strategy than new "brilliant" ideas.  Sure they'll waste some gigs but at least you'll get SOMETHING up asap. This should be SOP right?

2) BUT when the canned solutions get bogged down, don't get too wrapped up in citeseer or ACM articles unless you've got some serious time to write up & test the alternatives. I got burned on this one bad. All those papers tend not to tell you when the good idea fails. I bit on a pretty algo & the hype around it, then the [blankity] thing just wouldn't scale from 10s of megs into 100s of megs.  Page thrash till you crash. Weeks down the drain. My boss was, rightly, pissed & so was I for being so naive.

3) Because this stuff was important to my job I went to amazon.com & got a textbook: "Managing Gigabytes" by Witten, Moffat & Bell. For my work it was only tangentially relevant to my immediate need, but still a good primer that got me thinking in the right direction.  Sure I continue to read citeseer, but now I look at what _isn't_ being said in the papers too & that's just as important.
Twisted Dweeb Send private email
Sunday, October 31, 2004
 
 
Just to put some interim closure on this topic, here is what I am doing.

I want to find an open source (FreeBSD or LGPL style license) implementation that I could build into my pet project, which is a new breed of email client that I may try to sell some day.

The language would have to be Win32 compatible - this is not a dot net or Java project.

After checking out all the links everyone posted, CLucene (a C++ port of Lucene) seems to fit my parameters best. The add-on to PostgreSQL seemed similar in capabilities, but I didn't feel like peeling the text indexing out of it.

Now fighting to get the (CLucene) sumbitch to build in *some* environment. It wouldn't build in Visual Studio 8 Express. It builds but the test executable crashes in VC++ 6. So now I'm trying the GNU toolchain on Fedora.

If this doesn't work, a fallback will be to hand-roll my own and try the general strategy of using the DB engine to create an index of significant words.

Reading about the terminology, like "stemming", helps me understand the landscape.

There is a commercial embeddable engine, dtSearch, but the price is completely out of reach for my purposes.

Thanks again, everyone.
Bored Bystander Send private email
Sunday, October 31, 2004
 
 
Another option for you to look at: lqtext [1], a full text indexing engine in C available under LGPL or a quirky default license (I particularly like clause 5 :-). I don't know how it compares to the various Lucene ports or other toolkits you've looked at but it certainly stacked up well against the competition at one time. Liam (the author) loves talking about indexing and text retrieval so he'd be happy to help you understand the topic more.

Disclaimer: I should probably disclose that Liam is my brother!

[1] http://www.holoweb.net/~liam/lq-text/
Laurie Harperr
Monday, November 01, 2004
 
 
There's one caveat I'd give about stemming.  If you want to support it, support it in the query rather than the storage if you're doing your own keyword index approach.

It's much simpler to let the user decide what they're looking for than to try to work some language magic to decide that a word has a root worth storing instead of the word as it was used.
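
Against a keywords table like the one I sketched earlier in the thread, that just means treating whatever the user typed as a prefix:

    # Query-side "stemming": the user supplies as much of the word as
    # they care about and we match it as a prefix; no language analysis
    # is done at indexing time. (Assumes the keywords(word, src_table,
    # src_key) table from the earlier sketch.)
    term = "deliver"  # e.g. whatever the user typed
    rows = con.execute(
        "SELECT src_table, src_key FROM keywords WHERE word LIKE ?",
        (term + "%",)).fetchall()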
Simon Lucy Send private email
Monday, November 01, 2004
 
 
Simon,

I think the overhead of stemming at the query level would make it too slow to be of any use. When stemming at the index you get a much quicker lookup, and the stemmed index is actually smaller (and therefore faster) than an unstemmed one.  Stemming at the query is more work.

I think ultimately such an option would just end up confusing users anyway.
Almost Anonymous Send private email
Monday, November 01, 2004
 
 
I'm with Simon here - the precision is crap when the stemming is done in the index rather than the query - you tend to get things like 'communism' stemming the same as 'communications' (and when you run a search engine, your boss will get phone calls complaining about exactly this).

In the query, rather than stemming you need word expansion, so that a search for 'society' matches 'societal' and 'societies' (but with less strength than a full match - this really helps the ranking). As far as I know, a dictionary-based system is better here than a purely algorithmic system like stemming.
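
Schematically, with a toy dictionary (a real one would be large and language-specific):

    # Dictionary-based query expansion: search for known variants of the
    # query term as well, but weight them below an exact match so the
    # ranking still prefers exact hits.
    EXPANSIONS = {
        "society": ["societal", "societies"],
        "search":  ["searches", "searching", "searched"],
    }

    def expand(term):
        weighted = [(term, 1.0)]                                   # full strength
        weighted += [(v, 0.5) for v in EXPANSIONS.get(term, [])]  # weaker
        return weighted

    print(expand("society"))
    # [('society', 1.0), ('societal', 0.5), ('societies', 0.5)]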
Jamie Send private email
Monday, November 01, 2004
 
 
Bored- I've been fighting with CLucene for a month or so.  There are a number of bugs in the current code, so feel free to drop me a line if you get stuck.  I'm using it with VC++ 6.0.  Also the CLucene-dev mailing list is somewhat helpful.
Ken Send private email
Monday, November 01, 2004
 
 
By stemming at the query I meant letting the user decide what portion of the word they want to search for.  Again, it's more about specific implementations for specific knowledge domains, but a very generally applied free text retrieval system will be outperformed by a restricted, fake, full text indexing system when you have a relatively small vocabulary.

A small number would be on the order of ten thousand words or so; a completely general syntactic algorithm would need hundreds of thousands of words in its vocabulary before it began to be more useful.
Simon Lucy Send private email
Tuesday, November 02, 2004
 
 
Also, doesn't ACM let you search for documents anyway? If you happen to hit a particularly interesting document in ACM, try Googling the document title (with or without quotation marks). Many papers are available elsewhere too, often including CiteSeer.
Mystran
Thursday, November 04, 2004
 
 

This topic is archived. No further replies will be accepted.
