I’m starting to work on the E-mail harvesting program now. The other day I went to myspace and took a look around. Guess what? No e-mail addresses are visible anywhere. There’s no specific place to pull e-mail addresses from. That’s when I decided to go check out facebook. These guys are crafty. They include your e-mail address but they include it as an image. That way you can’t just copy and paste the text. Well I think to think that I am craftier. I started doing a little Google research on linux-based OCR software. For those that don’t already know OCR stands for optical character recognition. This software will read an image and turn the text located within it into an actual editable text document.
I found this awesome article comparing many different OCR engines designed for linux. I’ve decided that gocr is the simplest solution that should do everything I need it too. I just need a program I can send an image too and have that program send me back text. That is exactly how gocr works. Now i just have to get it installed on CentOS.
I found the source for gocr at http://jocr.sourceforge.net. I just run the command:
wget http://prdownloads.sourceforge.net/jocr/gocr-0.45.tar.gz
Then I extract the file:
tar -xzvf gocr-0.45.tar.gz
configure, make, and install:
./configure
make
sudo make install
The image files on facebook are png images. gocr uses a utility called pngtopnm to convert the image to a format it can understand. This utility is included in the netpbm package.
sudo yum install netpbm
sudo yum install netpbm-progs
Now that everything is installed I can just try running the program with a downloaded facebook email image.
gocr -i test.png
The image I gave it contained my email address “ricosgoo@uat.edu”. The result: “ricgoouat.edu”. It seems as though gocr didn’t pick it up correctly. I’m pretty sure the reason is that the ‘o’ and the ’s’ in the image are touching each other. gocr probably thinks it is one character and cannot recognize it so it is just leaving it out. Also, it missed the @ symbol. I tried a different facebook image and the @ sign was missing from that as well. It would seem as though gocr does not support the @ sign in its dictionary. I might need to try a different OCR program.
Doing some more google research, I found that many people feel that HP’s tesseracr-ocr is one of the best open-source OCRs there is. That was my next logical step. I followed this guide again to get the software up and running.
wget http://tesseract-ocr.googlecode.com/files/tesseract-2.01.tar.gz
tar -xzvf tesseract-2.01.tar.gz
cd tesseract-2.01
./configure
make
sudo make install
Now I have to install the English language dictionary files for tesseract.
wget http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz
tar -xzvf tesseract-2.00.eng.tar.gz
cd tesseract-2.00.eng
sudo cp * /usr/local/share/tessdata/
I also needed to install ImageMagick so that I can convert the facebook images to tiff files. I have to do this because tesseract-ocr only supports tiff images right now.
sudo yum install ImageMagick.i386
Now I convert the image to a tiff file.
convert test.png test.tiff
Now I try out the OCR.
tesseract test.tiff test.txt
No good. I get error messages. Here is Tesseract’s output:
Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:test.tiff
IMAGE::read_header:Error:Can’t read this image type:test.tiff
tesseract:Error:Read of file failed:test.tiff
Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3
I have to take a break from all this now, so I’ll deal with these problems later.