The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Forum searching by traditional search engines (ok, Google)

Has anyone else tried specifically making their own forums Google-able?  Will this take a crippling amount of bandwidth to pull off?

In my head I'd like all forum searching to work as well as Google's does, but unfortunately I don't own a multi-million (billion?) dollar server farm.  Google also cheats with their USENET search because they only reindex the 'new posts' that come in every so often.  It's not so easy for me.

Of course everyone not in Google implements their own (typically awful) search for all of the major discussion group/forum products, but I'd really like to design something such that each post is Google-searchable.

We can do this with "fuzzy directory structures," where the server essentially fakes a static-looking URL so Google sees each page as static.  We can also do Slashdot-style HTML generation entirely on the server side (in other words: build actual HTML files for each page; rebuild the file(s) when they need updating).  This part is doable.  I can at least envision this.
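The "fake static URL" trick can be sketched with Apache's mod_rewrite (the paths and script name here are made-up examples, not from any particular forum package):

```apache
# Hypothetical rewrite rule: spiders request what looks like a static
# HTML file, and the server quietly maps it onto the real dynamic script.
RewriteEngine On
# /forum/topic-12345.html  ->  /forum/show.php?topic=12345
RewriteRule ^forum/topic-([0-9]+)\.html$ /forum/show.php?topic=$1 [L]
```

With this in place the crawler only ever sees `/forum/topic-12345.html`, so the pages index like a plain static site.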

But I'm worried about every major search engine coming in and attempting to index every single page on the forum, every evening or however often they drop by.  For medium or large forums, that means tens of thousands to millions of pages, indexed by every search engine, every day or so.  Am I correct in assuming that this is a BIG chunk of bandwidth?

Or is there some sort of trick I can use to minimize bandwidth usage by search engine spiders?
pds Send private email
Thursday, November 03, 2005
Either you want your forums spidered or you don't, right?
Sassy Send private email
Thursday, November 03, 2005
Let me take option 3, which is "I want it spidered but I don't want it using several/tens/hundreds of GBs per day to do so."
pds Send private email
Thursday, November 03, 2005
Spiders aren't interested in wasting bandwidth either. If you create static files and serve them as such, robots should respect file modification date and not re-crawl what has not changed. The same can be achieved with dynamic content if you use/check Last-Modified and If-Modified-Since HTTP headers.
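A minimal sketch of that Last-Modified / If-Modified-Since check, assuming a dynamic page whose content has a known modification timestamp (the function names are my own, not from any forum product):

```python
# Sketch: decide whether a request for a forum page can be answered with
# "304 Not Modified" instead of re-sending the whole page to a spider.
from email.utils import formatdate, parsedate_to_datetime

def should_send_body(last_modified_ts, if_modified_since_header):
    """Return True if the full page must be sent, False if a 304 suffices."""
    if if_modified_since_header is None:
        return True  # spider sent no cache validator: send the page
    try:
        cached = parsedate_to_datetime(if_modified_since_header).timestamp()
    except (TypeError, ValueError):
        return True  # unparseable header: play it safe and send the body
    # Re-send only if the page changed after the spider's cached copy.
    return last_modified_ts > cached

def last_modified_header(ts):
    # Format a Unix timestamp as an HTTP-date for the Last-Modified header.
    return formatdate(ts, usegmt=True)
```

On each crawl after the first, an unchanged topic page then costs only a few hundred bytes of headers rather than the full HTML.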

As for unwanted spiders, you can block them in robots.txt or .htaccess file.
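For the robots.txt route, a sketch (the bot name "BadBot" is a placeholder for whatever spider you want to keep out):

```
# Hypothetical robots.txt: welcome Googlebot, block an unwanted spider.
User-agent: Googlebot
Disallow:

User-agent: BadBot
Disallow: /
```

Well-behaved crawlers fetch this file before crawling and honor it; truly abusive ones have to be blocked at the server level (e.g. .htaccess) instead.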
Friday, November 04, 2005
You're telling me what I want to hear--I LIKE THAT.

pds Send private email
Friday, November 04, 2005

This topic is archived. No further replies will be accepted.

Other recent topics
Powered by FogBugz