Posts Tagged ‘security’

E-mail Harvest

Thursday, April 10th, 2008

I’m starting to work on the e-mail harvesting program now. The other day I went to MySpace and took a look around. Guess what? No e-mail addresses are visible anywhere, so there’s no specific place to pull them from. That’s when I decided to go check out Facebook. These guys are crafty. They include your e-mail address, but they include it as an image. That way you can’t just copy and paste the text. Well, I like to think that I am craftier. I started doing a little Google research on Linux-based OCR software. For those who don’t already know, OCR stands for optical character recognition. This software reads an image and turns the text in it into an actual editable text document.

I found this awesome article comparing many different OCR engines designed for Linux. I’ve decided that gocr is the simplest solution that should do everything I need it to. I just need a program I can send an image to and have it send me back text. That is exactly how gocr works. Now I just have to get it installed on CentOS.

I found the source for gocr at http://jocr.sourceforge.net. I just run the command:

wget http://prdownloads.sourceforge.net/jocr/gocr-0.45.tar.gz

Then I extract the file:

tar -xzvf gocr-0.45.tar.gz

Then I change into the extracted directory, configure, make, and install:

cd gocr-0.45
./configure
make
sudo make install

The image files on Facebook are PNG images. gocr uses a utility called pngtopnm to convert the image to a format it can understand. This utility is included in the netpbm packages.

sudo yum install netpbm
sudo yum install netpbm-progs

Now that everything is installed, I can just try running the program with a downloaded Facebook e-mail image.

gocr -i test.png

The image I gave it contained my e-mail address, “ricosgoo@uat.edu”. The result: “ricgoouat.edu”. It seems gocr didn’t pick it up correctly. I’m pretty sure the reason is that the ‘o’ and the ‘s’ in the image are touching each other, so gocr probably treats them as one character, can’t recognize it, and just leaves it out. It also missed the @ symbol. I tried a different Facebook image and the @ sign was missing from that one as well, so it would seem that gocr does not have the @ sign in its dictionary. I might need to try a different OCR program.
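A bad read like this one is easy to catch automatically by checking whether the OCR output even has the shape of an e-mail address. This is only a rough sketch and not part of gocr: the function name looks_like_email and the pattern are my own assumptions, and the pattern is deliberately loose.

```shell
#!/bin/sh
# Hypothetical helper (not part of gocr): succeed only if the text
# looks roughly like an e-mail address, i.e. something@domain.tld
looks_like_email() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
}

# Compare the address that was in the image with what gocr returned.
for candidate in "ricosgoo@uat.edu" "ricgoouat.edu"; do
    if looks_like_email "$candidate"; then
        echo "$candidate: looks like an e-mail address"
    else
        echo "$candidate: rejected"
    fi
done
```

The real address passes and the gocr result is rejected because it has no @ sign. A check like this would not repair the text, it would only tell me which images need a second look.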

Doing some more Google research, I found that many people feel that HP’s tesseract-ocr is one of the best open-source OCR engines there is. That was my next logical step. I followed this guide to get the software up and running.

wget http://tesseract-ocr.googlecode.com/files/tesseract-2.01.tar.gz
tar -xzvf tesseract-2.01.tar.gz
cd tesseract-2.01
./configure
make
sudo make install

Now I have to install the English language dictionary files for tesseract.

wget http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz
tar -xzvf tesseract-2.00.eng.tar.gz
cd tesseract-2.00.eng
sudo cp * /usr/local/share/tessdata/

I also needed to install ImageMagick so that I can convert the Facebook images to TIFF files. I have to do this because tesseract-ocr only supports TIFF images right now.

sudo yum install ImageMagick.i386

Now I convert the image to a TIFF file.

convert test.png test.tiff

Now I try out the OCR.

tesseract test.tiff test.txt

No good. I get error messages. Here is Tesseract’s output:

Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:test.tiff
IMAGE::read_header:Error:Can't read this image type:test.tiff
tesseract:Error:Read of file failed:test.tiff
Signal_exit 31 ABORT. LocCode: 3  AbortCode: 3

I have to take a break from all this now, so I’ll deal with these problems later.

New Idea

Monday, April 7th, 2008

Working for a company like CCBill gives me perspective on something that I hope never happens to me: identity theft. I usually talk to at least a few people every day who claim that their credit card has been stolen. Some of these people don’t even realize it has happened until more than a year afterward. Most of these unauthorized charges could have been prevented if the person had just checked their credit card statement more often. This makes me realize that I need to check my statement more often. Well, I’m lazy like everyone else and I often don’t think to check my statement. Hence, a new project idea was born.

Most banks and credit card companies have some feature that notifies you if they see suspicious activity on your account. It’s obviously not always available, and when it is, it’s not foolproof. I was thinking today that I should be able to write a script to check my bank account daily. This script would send me an e-mail or text message alert if it sees suspicious activity. To start out, I think “suspicious” will simply mean a charge over a certain amount. Eventually it could have more complicated algorithms programmed in to determine whether something fits my normal spending habits. Things like gas and food under certain amounts would be all right, whereas movies and internet subscriptions of any dollar amount would be flagged.
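The flagging rule itself is simple enough to sketch. The example below is only an illustration under assumptions of my own: a hypothetical CSV layout of date,category,amount, a made-up limit of 100 dollars, and made-up category names. It doesn’t log in to any bank or send any alert; it only shows the decision logic.

```shell
#!/bin/sh
# Hypothetical sketch of the "suspicious charge" rule.
# Assumed input: CSV lines of date,category,amount (no header, no quoted commas).
# LIMIT and the category names are placeholders, not real bank data.
LIMIT=${LIMIT:-100}

flag_suspicious() {
    awk -F, -v limit="$LIMIT" '
        # Categories that are flagged at any dollar amount
        $2 == "movies" || $2 == "subscription" { print "FLAG (category): " $0; next }
        # Everything else is flagged only when it is over the limit
        $3 + 0 > limit + 0 { print "FLAG (amount): " $0 }
    '
}

# Example run with made-up transactions
flag_suspicious <<'EOF'
2008-04-01,gas,42.10
2008-04-02,food,18.75
2008-04-03,subscription,9.99
2008-04-04,electronics,349.00
EOF
```

With the sample data, the subscription is flagged because of its category and the electronics charge because of its amount, while gas and food pass. Whatever ends up sending the e-mail or text message could simply read the lines this prints.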

Also, on the topic of security, I found a few websites that describe ways to secure WordPress. I’ll have to go through these tonight to beef up my security. Here are the links:

WordPress security plugins

WordPress security tips