The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Web crawlers for a newbie...

I would like to create a web crawler that can scrape search results from a number of websites. The results I'm seeking are very specific portions of content from these sites. It's similar to what www.simplyhired.com does, indexing job postings across various online job sites.

I'm very new to web crawlers, both to how they are implemented and to search in general. I've seen examples of free crawlers that simply follow links recursively and save content for indexing. However, if I'm only interested in a subset of a site's content, generated as the result of a search query ("all jobs with Java, or in zipcode 12345"), how would a spider fetch those results? Are spiders like these issuing tons of search expressions and then indexing the results? I don't understand how one could get at the search results in the first place if only following hyperlinks.

Any insights are appreciated.
mellon_helmet
Monday, June 12, 2006
 
 
http://codesnipers.com/?q=node/228

That should give you something to cut your teeth on. Basically, you're going to open a socket to the page in question and parse the returned data for what you need (the example at that link uses regular expressions).
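A minimal sketch of that approach in Python, assuming a plain HTTP/1.0 request over a raw socket; the host, path, and regex pattern are placeholders to replace with the site you're actually scraping:

import re
import socket

# Hypothetical target site and query path.
HOST = "example.com"
PATH = "/jobs?zipcode=12345"

# Open a socket and issue a minimal HTTP/1.0 GET request.
sock = socket.create_connection((HOST, 80))
sock.sendall(f"GET {PATH} HTTP/1.0\r\nHost: {HOST}\r\n\r\n".encode())

# Read the full response, then strip the headers off.
chunks = []
while True:
    data = sock.recv(4096)
    if not data:
        break
    chunks.append(data)
sock.close()
body = b"".join(chunks).split(b"\r\n\r\n", 1)[-1].decode("utf-8", "replace")

# Pull out the fragments you care about; this pattern is made up.
for title in re.findall(r'<h2 class="job-title">(.*?)</h2>', body):
    print(title)

In practice you'd use an HTTP library rather than a raw socket, but the fetch-then-parse shape is the same.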
Aspiring College Developer
Monday, June 12, 2006
 
 
Not sure if it's more than you want, but the Alexa web crawler's data is available for you to search via their APIs, and your searches can run on their server farms. Making a custom search engine is one of their example uses. Pretty cool, if perhaps overkill:

http://websearch.alexa.com/welcome.html
PA
Monday, June 12, 2006
 
 
Most websites contain direct links to popular queries, and those result pages in turn link to similar queries. In such cases, a spider can index most of the site's content without any human interaction.
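As an illustration, here is a minimal breadth-first spider in Python that follows only links that look like query URLs; the starting URL and the search.asp link pattern are hypothetical (they match the example range below):

import re
import urllib.parse
import urllib.request
from collections import deque

# Hypothetical entry point; only links containing search.asp? are followed.
START = "http://mysite.com/search.asp?zipcode=10001"
QUERY_LINK = re.compile(r'href="([^"]*search\.asp\?[^"]*)"')

seen, queue = {START}, deque([START])
while queue:
    url = queue.popleft()
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        continue
    # ...index the page's content here...
    for href in QUERY_LINK.findall(html):
        link = urllib.parse.urljoin(url, href)
        if link not in seen:
            seen.add(link)
            queue.append(link)

A real crawler would also cap the number of pages fetched and respect robots.txt.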

There are also tools that can automatically create a range of query URLs for spidering, for example:

http://mysite.com/search.asp?zipcode=10001
...
http://mysite.com/search.asp?zipcode=99999


One such easy-to-use tool is the SuperBot Address Range Generator:
http://www.sparkleware.com/superbot/help/help.htm#URLRanges
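
If you'd rather generate such a range yourself, a few lines of Python will do it; the site and parameter here are the hypothetical ones from the example above:

# Build query URLs for every five-digit zipcode in the range.
BASE = "http://mysite.com/search.asp?zipcode={:05d}"
urls = [BASE.format(z) for z in range(10001, 100000)]

# Feed these to the crawler's fetch queue; print a few to check.
for url in urls[:3]:
    print(url)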
Chris Marshall
Tuesday, June 13, 2006
 
 

This topic is archived. No further replies will be accepted.

 