Mining Craigslist

Craigslist holds a huge amount of data across many subjects.

Not just items for sale, but also jobs, personals, discussion forums and more.

I believe this enormous amount of information is more powerful than what general sales platforms such as eBay or Amazon offer.

Craigslist lets you see what people around the world want.

Python, BeautifulSoup

04.30.2014

This time, I am going to visualize the data with multidimensional scaling.

I will use a two-dimensional representation of the Craigslist job tightness dataset.

The basic procedure is:

1. Calculate the distance between every pair of items in the dataset and store the results as a reference list.

2. Scatter the items randomly on a grid.

3. For a given number of iterations, reduce the difference between

distance(positions of the randomly placed pairs) and

distance(reference list).

Note: the actual correction process is more involved than this summary suggests.

https://github.com/fkkcloud/D.M-Craigslist/blob/master/datafromcraiglist.py

Some pairs have a distance of 0.0, which causes a problem when a float is divided by it.

float/0 --> ZeroDivisionError!!

For now, I am skipping those pairs with try/except, but I will have to come back to this.
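
The three steps above, together with one way of handling zero distances, can be sketched roughly as follows. This is not the code from the repository: the function name scale_down, its parameters, and the choice to fall back to an absolute error when the reference distance is 0.0 (instead of skipping the pair) are all assumptions made for illustration. Step 1 is taken as given, so the function receives the precomputed pairwise distance matrix.

```python
import math
import random


def scale_down(dist, dims=2, rate=0.01, iterations=1000, eps=1e-9, seed=None):
    """Place len(dist) items in `dims` dimensions so that their pairwise
    distances approximate the symmetric reference matrix `dist`."""
    rng = random.Random(seed)
    n = len(dist)
    # Step 2: scatter the items randomly.
    loc = [[rng.random() for _ in range(dims)] for _ in range(n)]

    # Step 3: repeatedly nudge every item to shrink the distance errors.
    for _ in range(iterations):
        grad = [[0.0] * dims for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                diff = [loc[i][d] - loc[j][d] for d in range(dims)]
                # Clamp so coincident points never divide by zero below.
                current = max(math.sqrt(sum(x * x for x in diff)), eps)
                target = dist[i][j]
                if target > eps:
                    error = (current - target) / target  # relative error
                else:
                    # Assumed fallback: a 0.0 reference distance would raise
                    # ZeroDivisionError, so use the absolute error instead.
                    error = current - target
                for d in range(dims):
                    grad[i][d] += diff[d] / current * error
        for i in range(n):
            for d in range(dims):
                loc[i][d] -= rate * grad[i][d]
    return loc
```

Calling scale_down(matrix) returns one [x, y] position per item, which can then be drawn on a 2D canvas. Skipping zero-distance pairs also works, but then two identical items are never pulled together, which is why the sketch uses the fallback.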

04.29.2014

Further steps have been completed successfully.

Using collaborative filtering, I was able to prepare the datasets for hierarchical clustering,

which becomes quite slow as the dataset grows.

It might be worth switching to k-means clustering later.
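
To show where the slowness comes from, here is a minimal sketch of naive agglomerative (hierarchical) clustering. It is an illustration only, not the repository code: the name hcluster, the use of plain numeric vectors, and merging by averaging the two vectors are assumptions. Every merge rescans all remaining pairs, so in this sketch the work grows roughly with the cube of the number of items.

```python
def hcluster(vectors, distance):
    """Naive agglomerative clustering of at least one vector.

    Returns the merge tree as nested tuples of the original indices.
    """
    # Each cluster is (representative vector, tree of item indices).
    clusters = [(list(vec), idx) for idx, vec in enumerate(vectors)]
    while len(clusters) > 1:
        best = None
        # Scan every remaining pair to find the closest two clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (vec_a, tree_a), (vec_b, tree_b) = clusters[i], clusters[j]
        # Represent the merged cluster by the average of the two vectors.
        merged = [(a + b) / 2.0 for a, b in zip(vec_a, vec_b)]
        del clusters[j]  # j > i, so delete the later index first
        del clusters[i]
        clusters.append((merged, (tree_a, tree_b)))
    return clusters[0][1]
```

The returned tree is what a dendrogram drawing routine would walk. Caching pairwise distances between iterations avoids recomputing them, but the pair scan per merge remains.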

To find the tightness between cities, I used the Pearson correlation with a small modification:

distance * (number of items the two cities have in common / average of the two cities' item counts)

This gives a normalized value that can be compared with the distances between other pairs of cities.
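
One possible reading of this modification is sketched below. Several things here are assumptions rather than the actual implementation: each city is represented as a dict of item counts, the Pearson score is computed over the union of both cities' items, "item count" means the number of distinct items, and the value being scaled is the Pearson score itself. The function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two count dicts over the union of their keys."""
    keys = sorted(set(a) | set(b))
    n = len(keys)
    if n == 0:
        return 0.0
    xs = [a.get(k, 0) for k in keys]
    ys = [b.get(k, 0) for k in keys]
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    # A constant vector has zero variance; treat that as "no correlation".
    return cov / denom if denom else 0.0


def overlap_ratio(a, b):
    """Common item count divided by the average of the two item counts."""
    common = len(set(a) & set(b))
    average = (len(a) + len(b)) / 2.0
    return common / average if average else 0.0


def tightness(a, b):
    """Pearson score scaled by how much the two cities overlap (assumed reading)."""
    return pearson(a, b) * overlap_ratio(a, b)
```

With this reading, two cities that correlate well but share only a handful of items end up with a smaller value than two cities that correlate equally well over a large shared set.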

Each thread subject on Craigslist is split into words and re-merged into two-word pairs.

Subject "SERVICE WRITER,TRUCK SHOP MANAGER" will become

['service writer', 'writer truck', 'truck shop', 'shop manager']

Below is the list of words that are discarded from the subject before it is split.

['for', 'a', 'of', 'get', 'the', 'with', 'per', 'now', 'hiring', 'included',
 'wanted', 'and', 'you', 'hr', 'hrs', 'rate', 'to', 'up', 'are', 'at',
 'some', 'our', 'in', 'want', 'et cetera', 'around', 'your', 'year old',
 'yrs old', 'from', 'needed', 'need', 'am', 'pm', 'time']
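
The split-and-pair step with the discard list can be sketched like this. It is a minimal illustration and not the repository code: the name subject_to_pairs is hypothetical, and lowercasing, dropping everything except letters, and removing discarded words as whole words only are assumptions.

```python
import re

STOP_WORDS = [
    'for', 'a', 'of', 'get', 'the', 'with', 'per', 'now', 'hiring', 'included',
    'wanted', 'and', 'you', 'hr', 'hrs', 'rate', 'to', 'up', 'are', 'at',
    'some', 'our', 'in', 'want', 'et cetera', 'around', 'your', 'year old',
    'yrs old', 'from', 'needed', 'need', 'am', 'pm', 'time',
]


def subject_to_pairs(subject, stop_words=STOP_WORDS):
    """Turn a thread subject into a list of adjacent two-word pairs."""
    # Lowercase and replace every run of non-letters with a single space.
    text = re.sub(r"[^a-z]+", " ", subject.lower())
    # Longest entries first so 'year old' is tried before shorter words.
    ordered = sorted(stop_words, key=len, reverse=True)
    pattern = r"\b(?:%s)\b" % "|".join(re.escape(w) for w in ordered)
    # Discard the stop words (whole words only), then split.
    words = re.sub(pattern, " ", text).split()
    # Pair each word with the one that follows it.
    return [" ".join(pair) for pair in zip(words, words[1:])]


print(subject_to_pairs("SERVICE WRITER,TRUCK SHOP MANAGER"))
# ['service writer', 'writer truck', 'truck shop', 'shop manager']
```

Because the discarded words are removed before pairing, the words on either side of a removed word become neighbours and form a pair.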

The result shows that Tokyo and Las Vegas have 15 common jobs out of 200 threads, and that 9 of those 15 common jobs

are also shared with Portland, and so on.

I used the PIL library to draw the clusters as a dendrogram.

I got a lot of help from the excellent book "Programming Collective Intelligence" by Toby Segaran.

https://github.com/fkkcloud/D.M-Craigslist

04.27.2014

Craigslist does not provide any kind of API for accessing its data, so I decided to use BeautifulSoup to parse Craigslist pages, gather useful information, and run data clustering to find out what people want in different cities.

For the first test, I am only doing this for the 'job' category.

https://github.com/fkkcloud/CollectiveIntelligence/blob/master/Cluster/datafromcraiglist.py
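
As a rough idea of the parsing step, the sketch below pulls thread subjects out of a listing page that has already been downloaded as an HTML string. The function name extract_titles and the link class "hdrlnk" are assumptions for illustration; the real markup may differ and changes over time, so check the page source before relying on it. Fetching the page is left out on purpose.

```python
from bs4 import BeautifulSoup


def extract_titles(html, link_class="hdrlnk"):
    """Return the text of every <a> tag carrying the given CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ is BeautifulSoup's keyword for filtering on the class attribute.
    links = soup.find_all("a", class_=link_class)
    return [link.get_text(strip=True) for link in links]
```

Keeping the parser separate from the download makes it easy to test against a saved page without hitting the site again.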

The result is quite interesting.

I iterated over 6 different cities: Las Vegas, Los Angeles, SF Bay Area, Seoul, Tokyo and London.

It seems that the North American region (Las Vegas, Los Angeles and SF Bay Area)

shares a similar interest in the job category, with 'beverage hospitality' on top, compared to

the Asian region (Seoul, Tokyo), where 'education teaching' is on top.

London's result was not that useful for this simulation.

I still need to tweak the filtering algorithm to get more precise information, but it is already quite interesting.
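
Per-city rankings like the ones listed below can be produced by counting word pairs across all threads of a city. This is a minimal sketch under assumptions: the name top_pairs is hypothetical, each thread is already a list of word pairs, and every occurrence is counted (not once per thread).

```python
from collections import Counter


def top_pairs(pairs_per_thread, n=10):
    """Count word pairs over all threads and return the n most common."""
    counts = Counter()
    for pairs in pairs_per_thread:
        counts.update(pairs)
    # most_common returns (pair, count) tuples, highest count first.
    return counts.most_common(n)
```

Printing the returned tuples for each city gives a listing in the same (pair, count) shape as the results that follow.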

Location : lasvegas

(u'beverage hospitality', 9)

(u'food beverage', 9)

(u'trades artisan', 7)

(u'skilled trades', 7)

(u'admin office', 5)

(u'advertising pr', 3)

(u'general labor', 3)

(u'marketing advertising', 3)

(u'retail wholesale', 2)

(u'assistant bookkeeper', 2)

Location : losangeles

(u'beverage hospitality', 12)

(u'food beverage', 12)

(u'spa fitness', 4)

(u'admin office', 4)

(u'trades artisan', 4)

(u'salon spa', 4)

(u'skilled trades', 4)

(u'retail wholesale', 3)

(u'legal paralegal', 3)

(u'line cook', 2)

Location : sfbay

(u'beverage hospitality', 14)

(u'food beverage', 14)

(u'admin office', 5)

(u'customer service', 4)

(u'retail wholesale', 3)

(u'dba etc', 3)

(u'qa dba', 3)

(u'software qa', 3)

(u'line cook', 2)

(u'juice shop', 2)

Location : seoul

(u'education teaching', 46)

(u'kinder elementary', 4)

(u'part time', 3)

(u'beverage hospitality', 3)

(u'near seoul', 3)

(u'after school', 3)

(u'food beverage', 3)

(u'pt pm', 3)

(u'wed or', 2)

(u'seoul mtthf', 2)

Location : tokyo

(u'education teaching', 22)

(u'part time', 6)

(u'beverage hospitality', 4)

(u'film video', 4)

(u'tv film', 4)

(u'video radio', 4)

(u'general labor', 4)

(u'english teacher', 4)

(u'food beverage', 4)

(u'in tokyo', 3)

Location : london

(u'et cetera', 25)

(u'student girls', 6)

(u'general labor', 6)

(u'blonde petite', 6)

(u'new blonde', 5)

(u'petite blonde', 3)

(u'yr only', 3)

(u'work tonight', 3)

(u'yr new', 3)

(u'only start', 2)
