The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

converting email file into image and OCR

I was wondering if anyone has an idea about how I can turn an email into one single image. 'Screengrab' on Firefox does it if I open my email in Firefox, but I was wondering if there is an easier/better way to do it. Does anyone know what kind of script I need to write, and where?
Also, I need this image to be of decent quality, because I need to OCR it. Is there any good free OCR out there? I use gocr but the results aren't great.
This is a research project of mine that I work on in my spare time. If anyone has an idea or wants to discuss this further, please reply here or contact me. I would love to talk about this. Appreciate your help.
Arvind Ashok
Friday, December 08, 2006
 
 
Wait... you want to convert something that is already text into an image, just to OCR it back to text? Let me guess... you are researching how to effectively create spam mails (with images instead of plain text) that can't be automatically detected by OCR'ing spam filters?

OK, this isn't helpful for your question. But what do you need this for? Instructions for getting something onto thedailywtf? Print the code out, place it on a wooden table, take a photo of it, import the photo into Excel, put the Excel file into a database, ...

Yes, I'm curious. ;)
Secure
Saturday, December 09, 2006
 
 
No, not creating spam. Spam comes with images instead of text, or with images and text. So the idea is to get a picture of everything, and then convert it back to text. The result would be an email with just text, which can't bypass a spam filter.
Arvind Ashok
Saturday, December 09, 2006
 
 
Yes, you already said that. But the original mail is already text. What's wrong with copy&paste, or do you have some deeper reason for doing this?
Secure
Saturday, December 09, 2006
 
 
Why not just replace the images with the output of OCR'ing them? Don't introduce more complexity & sources of error than you have to.
Oscar
Saturday, December 09, 2006
 
 
Too late, those bastards are already onto you.

Take a look at the newer versions of SpamAssassin - there are plugins that do OCR on attached images.
http://wiki.apache.org/spamassassin/OcrPlugin
http://antispam.imp.ch/patches/patch-ocrtext
http://wiki.apache.org/spamassassin/FuzzyOcrPlugin

The result? I've got an inbox of spam that has JPEGs with random noise to fool the OCR. There is a very close analogy between spam and antibiotic-resistant bacteria. The stronger the "cure", the more resilient the parasite.
http://fm.vix.com/internet/security/superbugs.html
Cory R. King
Saturday, December 09, 2006
 
 
Er, but I guess I didn't answer your question.  I really don't understand what you are trying to do...
Cory R. King
Saturday, December 09, 2006
 
 
Just use GDI functions to print it onto a bitmap in an OCR-friendly font, with maybe black type on a white background. Even the worst OCR programs should be able to read that.
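
A minimal sketch of the idea, not GDI itself: it uses only the Python standard library and a toy 5x7 glyph table to paint black text on a white bitmap and serialize it as a plain PBM image. The `GLYPHS` table, `render_text`, and `to_pbm` are all made up for illustration, and only the letters needed for the demo are defined. A real version would use a proper font renderer, such as GDI on Windows.

```python
# Toy 5x7 glyphs, '#' is ink. Hypothetical stand-in for a real font renderer.
GLYPHS = {
    "A": [".###.", "#...#", "#...#", "#####", "#...#", "#...#", "#...#"],
    "G": [".###.", "#...#", "#....", "#.###", "#...#", "#...#", ".###."],
    "I": ["#####", "..#..", "..#..", "..#..", "..#..", "..#..", "#####"],
    "R": ["####.", "#...#", "#...#", "####.", "#.#..", "#..#.", "#...#"],
    "V": ["#...#", "#...#", "#...#", "#...#", "#...#", ".#.#.", "..#.."],
    " ": ["....."] * 7,
}
GLYPH_H = 7


def render_text(text, scale=4, margin=2):
    """Return rows of 0/1 pixels, where 1 is black ink on a white page."""
    cells = [GLYPHS[ch] for ch in text.upper()]  # KeyError for undefined glyphs
    rows = []
    for y in range(GLYPH_H):
        line = [0] * margin
        for glyph in cells:
            # One glyph row, followed by a one-pixel gap between letters.
            line += [1 if c == "#" else 0 for c in glyph[y]] + [0]
        rows.append(line + [0] * margin)
    blank = [0] * len(rows[0])
    rows = [blank] * margin + rows + [blank] * margin
    # Enlarge every pixel to a scale x scale block so the strokes are thicker.
    return [[p for p in row for _ in range(scale)]
            for row in rows for _ in range(scale)]


def to_pbm(pixels):
    """Serialize pixel rows as plain PBM (magic number P1), where 1 means black."""
    width, height = len(pixels[0]), len(pixels)
    bits = "".join(str(p) for row in pixels for p in row)
    # The plain PBM format asks for lines of at most 70 characters;
    # whitespace inside the raster is ignored, so wrapping is safe.
    lines = [bits[i:i + 70] for i in range(0, len(bits), 70)]
    return f"P1\n{width} {height}\n" + "\n".join(lines) + "\n"
```

Writing the result of `to_pbm(render_text("VIAGRA"))` to a file gives a black-on-white image that can be handed to an OCR program. PBM should be among the formats gocr accepts, but check that against your version.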
Steve
Sunday, December 10, 2006
 
 
Fuzzy OCR does not do what I want to do. Fuzzy OCR will fail if I have VIAGRA but the IA and RA are JPEGs and the other letters are plain text. So, by composing them into one image and extracting the text out of that with OCR, I will get back VIAGRA.
Arvind Ashok
Sunday, December 10, 2006
 
 
Steve, thanks for the suggestion of GDI, but I don't really know how that works. What language would I write it in, and how would it work - as a plugin to something, etc.?

Thanks
Arvind Ashok
Sunday, December 10, 2006
 
 
Using Arvind's latest example, the solution to VIAGRA isn't to convert the entire email into an image. Doing that runs the risk of nested images, which can be more difficult to OCR.

As an example, say I make a xerox copy of a magazine article. Then I make a xerox of the copy. Then I make another copy of that copy and so on. After four or five such copies, the letters will be blurred enough that OCR has a difficult time reading them.

We use OCR here extensively and we automatically reject any second generation or later copies; it's just simply too expensive to deal with the potential problems that might pop up.  (We only accept originals or direct copies of the originals.)

What you really want to do is OCR the IA and RA images and then insert the resulting text back into the emails. For that, you can use a Perl script to scan for images in emails, mark their locations, extract the images, OCR them, parse the resulting text and insert it back where it belongs.
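
The steps above could be sketched in Python rather than Perl, using only the standard library. This is a rough outline under several assumptions: the message body is HTML that references its images inline as `cid:` attachments, `email_to_text` and `_ImgToText` are names invented here, and `ocr` is a placeholder for whatever callable turns image bytes into text (for example a wrapper around gocr), which the sketch does not provide.

```python
from email import message_from_bytes, policy
from html.parser import HTMLParser


class _ImgToText(HTMLParser):
    """Collects text, replacing each <img src="cid:..."> with OCR output."""

    def __init__(self, images, ocr):
        super().__init__()
        self.images = images  # Content-ID -> raw image bytes
        self.ocr = ocr        # assumed callable: bytes -> str
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        cid = src[4:] if src.startswith("cid:") else None
        if cid in self.images:
            # The image's position in the HTML is where its text goes back.
            self.out.append(self.ocr(self.images[cid]))

    def handle_data(self, data):
        self.out.append(data)


def email_to_text(raw, ocr):
    """Return the message text with inline images swapped for their OCR text."""
    msg = message_from_bytes(raw, policy=policy.default)
    images, html = {}, None
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            cid = (part["Content-ID"] or "").strip().strip("<>")
            images[cid] = part.get_content()  # decoded image bytes
        elif part.get_content_type() == "text/html" and html is None:
            html = part.get_content()
    if html is None:
        # No HTML body: fall back to the plain-text body, if there is one.
        body = msg.get_body(preferencelist=("plain",))
        return body.get_content() if body is not None else ""
    parser = _ImgToText(images, ocr)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Passing the OCR step in as a function keeps the email handling separate from the OCR engine, so a different engine can be swapped in without touching the parsing code. Images that are linked by URL rather than attached are ignored by this sketch.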
TheDavid
Monday, December 11, 2006
 
 
Awesome. Thanks a lot. I will try that out, though it will probably take me a while. It definitely makes sense, and should be better than my method. Again, thanks a lot, TheDavid.
Arvind
Monday, December 11, 2006
 
 
I've been using dspam for a while now, and none of the image-spams make it through the filter. The downside may be a false positive if someone sends me an email with just an image, but that hasn't happened yet.
Matthias W.
Tuesday, December 12, 2006
 
 

This topic is archived. No further replies will be accepted.
