Posts Tagged ‘script’

E-mail Harvest

Thursday, April 10th, 2008

I’m starting to work on the e-mail harvesting program now. The other day I went to MySpace and took a look around. Guess what? No e-mail addresses are visible anywhere, so there’s no specific place to pull them from. That’s when I decided to go check out Facebook. These guys are crafty. They include your e-mail address, but they include it as an image. That way you can’t just copy and paste the text. Well, I like to think that I am craftier. I started doing a little Google research on Linux-based OCR software. For those who don’t already know, OCR stands for optical character recognition. This software reads an image and turns the text in it into an actual editable text document.

I found this awesome article comparing many different OCR engines designed for Linux. I’ve decided that gocr is the simplest solution that should do everything I need it to. I just need a program I can send an image to and have it send me back text. That is exactly how gocr works. Now I just have to get it installed on CentOS.

I found the source for gocr at http://jocr.sourceforge.net. I just run the command:

wget http://prdownloads.sourceforge.net/jocr/gocr-0.45.tar.gz

Then I extract the file:

tar -xzvf gocr-0.45.tar.gz

Then I change into the extracted directory, configure, make, and install:

cd gocr-0.45
./configure
make
sudo make install

The image files on Facebook are PNG images. gocr uses a utility called pngtopnm to convert the image to a format it can understand. This utility is included in the netpbm package, so I install that too:

sudo yum install netpbm
sudo yum install netpbm-progs

Now that everything is installed, I can try running the program on a downloaded Facebook e-mail image.

gocr -i test.png

The image I gave it contained my e-mail address, “ricosgoo@uat.edu”. The result: “ricgoouat.edu”. gocr didn’t pick it up correctly. I’m pretty sure the reason is that the ‘o’ and the ‘s’ in the image are touching each other. gocr probably treats them as one character, cannot recognize it, and just leaves it out. It also missed the @ symbol. I tried a different Facebook image and the @ sign was missing from that one as well, so it would seem that gocr does not recognize the @ sign at all. I might need to try a different OCR program.

Doing some more Google research, I found that many people feel that HP’s tesseract-ocr is one of the best open-source OCR engines there is, so trying it was my next logical step. I followed this guide again to get the software up and running.

wget http://tesseract-ocr.googlecode.com/files/tesseract-2.01.tar.gz
tar -xzvf tesseract-2.01.tar.gz
cd tesseract-2.01
./configure
make
sudo make install

Now I have to install the English language data files for tesseract.

wget http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz
tar -xzvf tesseract-2.00.eng.tar.gz
cd tesseract-2.00.eng
sudo cp * /usr/local/share/tessdata/

I also need to install ImageMagick so that I can convert the Facebook images to TIFF files. I have to do this because tesseract-ocr only supports TIFF images right now.

sudo yum install ImageMagick.i386

Now I convert the image to a tiff file.

convert test.png test.tiff

Now I try out the OCR.

tesseract test.tiff test.txt

No good. I get error messages. Here is Tesseract’s output:

Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:test.tiff
IMAGE::read_header:Error:Can't read this image type:test.tiff
tesseract:Error:Read of file failed:test.tiff
Signal_exit 31 ABORT. LocCode: 3  AbortCode: 3
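
One possible cause, not verified here: Tesseract 2.x appears to choose its image reader from the file extension, and it may accept “.tif” but not “.tiff”. Builds without libtiff may also need an uncompressed file. The sketch below works under those assumptions. The helper names tif_name and ocr_image are hypothetical, made up for illustration.

```shell
# Sketch only: assumes the failure comes from the ".tiff" extension
# and/or TIFF compression, which has not been confirmed.

# Hypothetical helper: swap whatever extension the file has for ".tif".
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Hypothetical helper: convert an image to an uncompressed .tif, then run OCR.
ocr_image() {
    src="$1"
    tif="$(tif_name "$src")"
    # -compress none asks ImageMagick to write an uncompressed TIFF.
    convert "$src" -compress none "$tif" || return 1
    # tesseract takes an output base name and appends ".txt" itself.
    tesseract "$tif" "${src%.*}"
}
```

Running `ocr_image test.png` would then write test.tif and, if the assumptions hold, leave the recognized text in test.txt.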

I have to take a break from all this now, so I’ll deal with these problems later.

Another new idea and a cantenna update

Tuesday, April 8th, 2008

Today I only went to one class: Law370. Normally, I really hate the thought of going to the class, but it’s always a lot of fun. That professor really knows how to teach. I always learn something new from that class. Today, we were separated into groups and each group had to research a specific law regarding cyber-crime. This whole activity spawned a new project idea.

My group was assigned the CAN-SPAM Act of 2003. This act basically sets out rules regulating how spam e-mail can be sent. I’m not going into that because it’s long, it’s complicated, and it really doesn’t matter for my project idea. My project will basically be a script that crawls social networking sites like Facebook and MySpace to collect e-mail addresses. It gets more diabolical than that, though. The script will log onto someone’s MySpace account and get their e-mail. Then the script will go to each of that person’s “Top 8” friends and get THEIR e-mail addresses. Now the script can send a phishing e-mail to each of the friends on the “Top 8” list and spoof the address it originates from so it looks like it is coming from the original person. I think this would be an awesome and fun proof of concept. I would never actually use this for my own malicious purposes, although I would be interested to see how well it would work. I really just want to write this script for the sake of doing it. It would give me an excuse to brush up on my scripting and programming skills.

I think I’ll get started on this idea soon, seeing as it won’t cost me any money.

Another update here. I started working on the cantenna project some more. I bought the pigtail that I need, cut off one end, and soldered it to the PCMCIA card. I’ve also soldered the piece of copper wire to the jack that attaches to the can. All I need now is a can to attach this thing to. The solder points on the PCB were so small that I’m not sure the connections will be good enough. Hopefully I’ll find out tomorrow. I don’t have any class, so I have the entire day off. I plan on getting a can either from the cafe at school or from the supermarket. I shall update the cantenna page as time permits.