The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Indexing large amount of text

In a recent interview question I was asked what is the best way to index the contents of a book.

I couldn't think of anything better than a hashmap mapping a word to a collection of page numbers, but the guy didn't seem impressed and I didn't think to get an answer out of him. 

Obviously the hashing slows down index creation, and I could have used the hash as a key to a multimap - but I got the feeling that it was a much more involved question.

What are the issues here?
Chris
Wednesday, November 02, 2005
 
 
Question 1: what type of book?
Question 2: what type of query are you going to have?
Question 3: what kind of performance is needed?
etc
new nick, new rep
Wednesday, November 02, 2005
 
 
I picked up on the 'what type of book' question.

He said to presume Moby Dick.  I guess the fact that it's a childs book with probably a number of repeated words from a small vocabulary has some impact on this but I still don't know how and if this points towards doing it differently.
Chris
Wednesday, November 02, 2005
 
 
some thoughts (you've probably already had):
probably whole word and phrase queries
not going to rank pages since arbitrary
the whole book is like a single page (page numbers not significant)
index to book text offsets not page numbers
common words (is the this) not included
speed of building index separate issue from speed of querying
definitely a hash table, but what size?
average book vocabulary 1500 words? less?
so hash table size some prime around 500 to average just a few collisions, or 3000 to try to reduce collisions
issues of the best hash algorithm for speed and spread
rank by concentration of words in vicinity?
more weight for chapter titles?
issues of word forms: water, waters, watering
output a book style index with only proper nouns?
fun stuff
Ben Bryant
Wednesday, November 02, 2005
 
 
Building a hash of words:page numbers is the fastest and most reasonable approach to take given such a vague set of requirements.
Michael B
Wednesday, November 02, 2005
 
 
If you wanted to generate a printed index, for the rear of the book, then you'd want to sort it alphabetically.  And you'd want to remove 'common words' -- either through using a 'common word' list, or through a histogram analysis (don't index words that appear more than 'X' times, etc).

Otherwise, I would think a hash algorithm would be a pretty good answer.  Though you'd still want to do something to remove 'common words' from that implementation, too.

And Moby Dick is NOT a 'child's book', the original is quite wordy and long-winded, full of philosophical wanderings.  That could have been the problem.
AllanL5
Wednesday, November 02, 2005
 
 
You've never read Moby Dick, have you?
Jonas Grumby Send private email
Wednesday, November 02, 2005
 
 
Maybe the fact you assumed Moby Dick is a children's book lead the interviewer to believe that you're illiterate.
a2800276 Send private email
Wednesday, November 02, 2005
 
 
The usual approach (there are others) for building a search engine is to store an inverted index. So for each unique word  you store a list of occurrences, probably right down to the location of the word.
Arethuza
Wednesday, November 02, 2005
 
 
Agree with the others that you probably lost more credibility for your ignorance of Moby Dick than your approach to the problem.
Developer #13
Wednesday, November 02, 2005
 
 
"In a recent interview question I was asked what is the best way to index the contents of a book."

The correct answer, if this was the exact wording of the question, most likely would have been: 'There is no BEST way, it depends on the circumstances and requirements. Nonetheless, a hash map may be good enough for most cases of simple word indexings as actually found in most books blah fast blah simple blah blah... while on the other hand, a binary tree gives the words in alphabetic order without a further sorting step blah blah...'
Secure
Wednesday, November 02, 2005
 
 
And BTW, a hash map is a bad approach when you want to search the index, allowing a fuzzy search with spelling errors. There is more than just build the index. You eventually want to use it, too.
Secure
Wednesday, November 02, 2005
 
 
I agree with everyone else, you sir, are an idi0t.
bh
Wednesday, November 02, 2005
 
 
bh,

excellent example. You meant idiot and mistyped idi0t instead. Thus instead of finding the word on first or second try, you would have to search the whole hash table entries, with the final result of "idi0t not found" if no fuzzy search is available.

With a binary tree approach, at least all words beginning with "idi" could be given as an option quite easy:

"Did you mean one of the following?
idiocy
idiolect
idiom
idiosyncrasy
idiot"
Secure
Thursday, November 03, 2005
 
 
The kind of trees normally used to store indices are called tries. They're a little different to binary trees, but not much so. There's a python implementation I've used successfully in the past here:

http://www.jtauber.com/blog/2005/02/10/updated_python_trie_implementation
R. C. James Harlow Send private email
Thursday, November 03, 2005
 
 
Everything you need to know about Search (which is really what you want an Index for) was written by Tim Bray in his series of articles called "On Search":

http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC
Oren Send private email
Thursday, November 03, 2005
 
 
I may be dating myself, but an 'eye'-'dee'-'ten'-'tee' error used to be useful shorthand for discussing the 'ID10T's in their presence :).
a former big-fiver Send private email
Thursday, November 03, 2005
 
 
man...I would've never seen that interview question coming.  Is this for the publishing industry or something?
grover
Thursday, November 03, 2005
 
 
By the way, Moby Dick is _not_ a large amount of text, by modern standards.
Pakter
Thursday, November 03, 2005
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz