The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Large text data sets

I've been working for a while on some text analysis algorithms, using texts from Project Gutenberg and Wikipedia.

Does anyone have pointers to other freely available large English text collections? Something different in style would be good. It can be anything from Hollywood gossip to travel writing to news. No scientific texts though; I don't want to train in the direction of such things.

Thanks!
crunch4profit
Sunday, December 31, 2006
 
 
You could try archiving a few weeks' worth of articles from big newspapers and wire services. Combine the Washington Post, New York Times, Los Angeles Times, Chicago Tribune and the Miami Herald at a dozen articles per day, 10 KB/article, and you have over 3 MB per week. You can probably buy archival data from these newspapers on CD-ROM, but I have no idea what it would cost.
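The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constants are the figures quoted in the post, and the function name `weekly_corpus_kb` and the choice of 1 MB = 1000 KB are assumptions made for illustration.

```python
# Rough corpus-size estimate using the figures quoted in the post.
NEWSPAPERS = 5          # Post, NYT, LA Times, Tribune, Herald
ARTICLES_PER_DAY = 12   # "a dozen articles per day" from each paper
KB_PER_ARTICLE = 10     # assumed average article size


def weekly_corpus_kb(days_per_week: int) -> int:
    """Return the collected text volume in KB for one week."""
    return NEWSPAPERS * ARTICLES_PER_DAY * KB_PER_ARTICLE * days_per_week


if __name__ == "__main__":
    for days in (5, 7):
        kb = weekly_corpus_kb(days)
        print(f"{days} days/week: {kb} KB (~{kb / 1000:.1f} MB)")
```

Collecting on weekdays only gives 3000 KB per week, and collecting every day gives 4200 KB, so "over 3 MB per week" holds for daily collection.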

I'd say you could simply run a web spider, but I suspect you want to train on grammatically correct English, not the sort of drivel found on most web pages. By restricting your collection to major newspapers you guarantee some minimum level of editorial competence.

As for any concerns about copyright infringement, I would think that this falls under fair use, since you won't be redistributing the archived data.
Jeffrey Dutky
Monday, January 01, 2007
 
 
He is writing an engine to produce semi-coherent spam messages using Markov chains.

There's some money in that, but it's been done before. The new methods are much more interesting.
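For readers unfamiliar with the technique named above, a word-level Markov chain text generator can be sketched roughly as follows. This is a generic illustration, not anything the original poster described: the function names `build_chain` and `generate`, the `order` parameter, and the sample sentence are all assumptions.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` consecutive words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, start, length, rng):
    """Extend `start` by up to `length` words, sampling followers at random."""
    state = tuple(start)
    out = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state never had a successor in the corpus
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)


if __name__ == "__main__":
    corpus = "the cat sat on the mat"
    chain = build_chain(corpus)
    print(generate(chain, ["the"], 5, random.Random()))
```

The style of the output follows the training text closely, which is one reason a corpus with a different register (news, gossip, travel writing) changes what such a model produces.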
Call Me Dr. Spam
Monday, January 01, 2007
 
 
I think DMOZ (http://dmoz.org/) is often used as a training set. The same goes for Yahoo.
son of parnas
Tuesday, January 02, 2007
 
 

This topic is archived. No further replies will be accepted.
