Posts Tagged ‘linux’

New Project Completed

Thursday, May 22nd, 2008

It’s been a while since I posted on here. There are several reasons for that. The main reason is that my latest project has been taking all my spare time and it was a secret. I didn’t log any of it until just a few minutes ago because I didn’t want the secret to get out. It is an anniversary present for my girlfriend. You can check out the project page for more details on that.

The second reason is that my web server was down and I didn’t fix it until recently. My server rebooted one day when I lost power, and Apache refused to start afterward. Rather than sitting down to fix that, I just spent all my time working on the anniversary project. It turns out another instance of httpd was running in the background, hogging port 81. I have no idea why. I’ll have to reboot the system again to see if the problem recurs. At least now I’ll know what the problem is.
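A stray listener like this can usually be tracked down by filtering the output of netstat. The snippet below is only a minimal sketch: the function name listeners_on_port is made up for illustration, and it assumes the usual Linux `netstat -tlnp` column layout (local address in column 4, state in column 6).

```shell
# listeners_on_port PORT
# Hypothetical helper: reads `netstat -tlnp`-style lines on stdin and prints
# only the rows in LISTEN state whose local address ends in ":PORT".
listeners_on_port() {
    awk -v port="$1" '$6 == "LISTEN" && $4 ~ (":" port "$")'
}

# Usage (root is needed to see the PID/program column for other users):
#   sudo netstat -tlnp | listeners_on_port 81
```

The last column of any matching row is the PID and program name, which is enough to see whether a leftover httpd is the one holding the port.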

In other news, I started the Near Space class at school last week. I am really excited for this class. We will be sending a balloon equipped with a computer, science experiments, and a camera into near space in just a few months. Ryan is splitting the class into teams and should have them posted on the e-shell at some point this weekend. Hopefully I’ll have access to the shell soon. I just e-mailed a local enthusiast to see if he wants to come to class to share his experiences and offer some words of wisdom. Hopefully that will go over well.

My dad should be sending me another radio, an antenna, and a TinyTrak3 module next week. I can’t wait to get that stuff. I want to start messing with APRS tracking as soon as possible to get a feel for it before we actually do a launch. I’m hoping to be on the tracking and telemetry team for the Near Space class.

I suppose that’s enough updating for now. I have to take some photos of the anniversary lamp to stick on that page, as well as get a schematic up. Man, I still need to get a schematic up on the graduation hacks page… I’ll get on that soon. I’ll also post a video of the lamp in action. Until then.

E-mail Harvest

Thursday, April 10th, 2008

I’m starting to work on the e-mail harvesting program now. The other day I went to MySpace and took a look around. Guess what? No e-mail addresses are visible anywhere. There’s no specific place to pull e-mail addresses from. That’s when I decided to go check out Facebook. These guys are crafty. They include your e-mail address, but they include it as an image. That way you can’t just copy and paste the text. Well, I like to think that I am craftier. I started doing a little Google research on Linux-based OCR software. For those who don’t already know, OCR stands for optical character recognition. This software reads an image and turns the text in it into an actual editable text document.

I found this awesome article comparing many different OCR engines designed for Linux. I’ve decided that gocr is the simplest solution that should do everything I need it to. I just need a program I can send an image to and have that program send me back text. That is exactly how gocr works. Now I just have to get it installed on CentOS.

I found the source for gocr at http://jocr.sourceforge.net. I just run the command:

wget http://prdownloads.sourceforge.net/jocr/gocr-0.45.tar.gz

Then I extract the file:

tar -xzvf gocr-0.45.tar.gz

Then I change into the source directory, configure, make, and install:

cd gocr-0.45
./configure
make
sudo make install

The image files on Facebook are PNG images. gocr uses a utility called pngtopnm to convert the image to a format it can understand. This utility is included in the netpbm package.

sudo yum install netpbm
sudo yum install netpbm-progs
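Before running anything, it can be worth confirming that both gocr and pngtopnm actually ended up on the PATH. This is a small sketch; the helper name missing_tools is hypothetical and not part of either package.

```shell
# missing_tools TOOL...
# Hypothetical helper: prints "missing: NAME" for every tool that cannot be
# found on the PATH and returns non-zero if at least one is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Usage for this setup (gocr built from source, pngtopnm from netpbm):
#   missing_tools gocr pngtopnm && gocr -i test.png
```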

Now that everything is installed, I can try running the program with a downloaded Facebook email image.

gocr -i test.png

The image I gave it contained my email address “ricosgoo@uat.edu”. The result: “ricgoouat.edu”. gocr didn’t pick it up correctly. I’m pretty sure the reason is that the ‘o’ and the ‘s’ in the image are touching each other. gocr probably treats them as one character, cannot recognize it, and just leaves it out. It also missed the @ symbol. I tried a different Facebook image and the @ sign was missing from that as well. It would seem that gocr does not support the @ sign in its dictionary. I might need to try a different OCR program.
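A failure like the missing @ sign is easy to catch automatically with a rough shape check on the OCR output. The function below is a sketch under loose assumptions: the name looks_like_email is made up, and the pattern only checks for something@something.something, which is far from a full address validator.

```shell
# looks_like_email STRING
# Hypothetical sanity check: succeeds only if STRING has exactly one "@",
# no whitespace, and a dot somewhere in the part after the "@".
looks_like_email() {
    printf '%s\n' "$1" |
        grep -Eq '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$'
}

# Usage: flag OCR lines that cannot be a complete address.
#   gocr -i test.png | while IFS= read -r line; do
#       looks_like_email "$line" || echo "suspect OCR result: $line"
#   done
```

With the result above, “ricgoouat.edu” fails the check because it contains no @ at all. A result that lost only the ‘o’ and ‘s’ would still pass, so this catches structural damage rather than every misread character.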

Doing some more Google research, I found that many people feel that HP’s tesseract-ocr is one of the best open-source OCR engines there is. That was my next logical step. I followed this guide again to get the software up and running.

wget http://tesseract-ocr.googlecode.com/files/tesseract-2.01.tar.gz
tar -xzvf tesseract-2.01.tar.gz
cd tesseract-2.01
./configure
make
sudo make install

Now I have to install the English language dictionary files for tesseract.

wget http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz
tar -xzvf tesseract-2.00.eng.tar.gz
cd tesseract-2.00.eng
sudo cp * /usr/local/share/tessdata/
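A quick check that the copy put the English data where Tesseract is expected to look can save some head-scratching later. This is a sketch: has_lang_data is a hypothetical helper, and it only tests that at least one eng.* file exists in the directory, not that the set is complete.

```shell
# has_lang_data DIR LANG
# Hypothetical helper: succeeds if DIR contains at least one file named
# LANG.<something>, e.g. eng.* for the English data files.
has_lang_data() {
    dir=$1
    lang=$2
    for f in "$dir/$lang".*; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Usage with the path used above:
#   has_lang_data /usr/local/share/tessdata eng || echo "no English data found"
```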

I also needed to install ImageMagick so that I can convert the Facebook images to TIFF files. I have to do this because tesseract-ocr only supports TIFF images right now.

sudo yum install ImageMagick.i386

Now I convert the image to a tiff file.

convert test.png test.tiff

Now I try out the OCR.

tesseract test.tiff test.txt

No good. I get error messages. Here is Tesseract’s output:

Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:test.tiff
IMAGE::read_header:Error:Can't read this image type:test.tiff
tesseract:Error:Read of file failed:test.tiff
Signal_exit 31 ABORT. LocCode: 3  AbortCode: 3

I have to take a break from all this now, so I’ll deal with these problems later.
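One thing worth trying, going by the name_to_image_type error above: the message suggests Tesseract decides the image type from the file name, so the .tiff extension itself may be the problem. The sketch below is untested against this setup and rests on two assumptions: that this Tesseract version accepts a .tif extension with uncompressed data, and that the second argument is an output base name to which Tesseract appends .txt. The helper names to_tif_name and ocr_png are hypothetical.

```shell
# to_tif_name FILE
# Hypothetical helper: prints FILE with its last extension replaced by .tif.
to_tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# ocr_png FILE.png
# Sketch only: converts the PNG to an uncompressed .tif with ImageMagick,
# runs tesseract on it, and prints the recognized text.
ocr_png() {
    src=$1
    base=${src%.*}
    tif=$(to_tif_name "$src")
    convert "$src" -compress none "$tif" &&
        tesseract "$tif" "$base" &&   # output base "test" gives test.txt
        cat "$base.txt"
}

# Usage:
#   ocr_png test.png
```

If the same error still shows up with a .tif file, the extension was not the cause and the image data itself (compression, bit depth) would be the next thing to look at.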