Website of Intrepid

1.   Background

Sina Weibo is a Chinese microblogging site, similar to Twitter, launched in 2009 by Sina Corporation, China’s biggest web portal. As the earliest and biggest microblogging platform, Sina Weibo has more than 100 million users and millions of posts per day. Its audience ranges from regular users to celebrities, company representatives, politicians, and even heads of state. It is therefore possible to collect text posts from users in different social and interest groups. The biggest challenge in this project is Chinese information processing. Much research has been conducted on English-language data from Twitter, and many crawlers and data retrieval tools have been built to collect information from it. Because of the complexity of Chinese, however, only a few studies address Chinese microblogging platforms. Compared to other researchers, we have a significant advantage: Chinese is our native language. We also hope that, through our project, Chinese voices can be heard much more widely.

 

2.   System Components 

2.1      System Schema

       Sina Weibo API and Java SDK

Sina Weibo provides an Application Programming Interface (API) and a Software Development Kit (SDK) for its developers. We interface with the API and perform data extraction through the Java SDK.

       Oracle Database

The data we collect from Weibo is stored in an Oracle Database. Each tweet, together with its related information, is stored in one table as one record with five fields.

       Hadoop

Hadoop’s cluster computing and parallel processing techniques are employed to save time and energy as our collected data grows to terabytes or more. We also use its wordcount method to generate our own Chinese dictionary of online network words.

       IK Analyzer

IK Analyzer performs Chinese word segmentation, a preliminary step applied to the collected data before it is sent to the wordcount method or to a natural language processing program.

       User Window and Applet

This part of the design lets users query topics or locations of interest over the Internet. Our database returns the search results.

       Text re-structuring program

This program converts multiple unstructured txt files into a database-friendly format and performs encoding conversion if necessary.

       JFreeChart

Based on the results returned by our database, this program automatically generates bar and pie charts, providing a graphical view of the user’s query results.

 

Fig 1: The general procedures of our system design

  

Fig 2: The flow chart of our whole system

2.2      Sina Micro Blog—Weibo

2.2.1         Why Sina Weibo

a)       Sina Weibo has more than 100 million users, and broadcasts 25 million tweets each day.

b)      Sina Weibo is the biggest microblog platform in China, with a market share close to 40%-50%; the other two major platforms are TengXun and Sohu. It contains an enormous number of text posts, and the number grows every day.

c)       Weibo’s audience ranges from regular users to celebrities, company representatives, and politicians. It is therefore possible to collect text posts from users in different social and interest groups.

d)      Weibo’s audience includes Chinese users in many countries, so it is possible to hear voices from a global perspective.

2.2.2         Sina Weibo API & SDK

Sina Weibo is an open microblogging information platform that aims to share and exchange information between users and developers. The open platform provides a large volume of microblog information, follower (fan) relationships, and so on. Its easy access channels enable communication anytime, anywhere.

 

Developers can log in to the Sina Weibo platform and use the interfaces it provides to create interesting applications or to add social features to their own websites.

 

Sina Weibo provides Software Development Kits for many programming languages, such as Java, PHP, Python, and C++. Each SDK comes with sample code, which makes application development easier and more enjoyable.

 

As first-level developers, when writing our own program to interface with the Sina Weibo platform, we followed its secure-access rules and ran our programs within its connection limit of one query every 10 seconds. Each query returns 20 tweets. This limits our collection efficiency badly; however, the accuracy of each tweet is guaranteed. [10]
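To see what this limit means in practice, the following minimal Java sketch computes the upper bound on the number of tweets one collector can gather, using only the two figures quoted above (one query every 10 seconds, 20 tweets per query). The class and method names are illustrative and are not part of the Sina Weibo SDK.

```java
// Illustrative only: names are hypothetical, not part of the Sina Weibo SDK.
class CollectionRate {
    static final int SECONDS_PER_QUERY = 10; // connection limit quoted in the text
    static final int TWEETS_PER_QUERY = 20;  // tweets returned by each query

    // Upper bound on the tweets one collector can gather in the given time.
    static long maxTweets(long seconds) {
        return (seconds / SECONDS_PER_QUERY) * TWEETS_PER_QUERY;
    }

    public static void main(String[] args) {
        System.out.println("Per hour: " + maxTweets(3600));
        System.out.println("Per day:  " + maxTweets(86400));
    }
}
```

Under these assumptions a single collector gathers at most 7,200 tweets per hour, or 172,800 per day.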

2.3      Oracle Database     

2.3.1         Oracle Database 11g Release 2

 A database is an organized collection of data. The data can be textual, like order or inventory data, or it can be pictures, programs or anything else that can be stored on a computer in binary form. 

 

A relational database stores the data in the form of tables and columns. A table is the category of data, like Employee, and the columns are information about the category, like name or address. 

 

Oracle is a program that is running in the background, maintaining your data for you and figuring out where it should go on your hard drive. 

 

Data is accessed through SQL, or Structured Query Language. It allows you to SELECT your data, INSERT new records, UPDATE existing records and DELETE records. [8]

 

We chose the Oracle Database 11g Release 2 distribution (Standard Edition, Standard Edition One, and Enterprise Edition, version 11.2.0.1.0). We installed the following two versions on three different machines to keep three copies of our collected data:

 Microsoft Windows (32-bit)

 Microsoft Windows (x64)

 

2.3.2         PL/SQL Developer 8.0.0.1480

PL/SQL is the procedural language extension to SQL. It is a programming language like C, Java or Pascal, and SQL can be natively embedded in PL/SQL programs. In the Oracle world, there is no better way to access your data from inside a program. [9]

 

PL/SQL is a feature-rich language geared toward developing database applications. PL/SQL is the procedural language of the database, but it is also the procedural language for most of Oracle's tools. Programs that run inside the database are called stored procedures. These stored procedures are almost always PL/SQL, but can be written in Java. [5]

2.3.3         Interface with Oracle Database Directly

There are also many third-party tools for accessing the database. For our purposes, we used Java to communicate with Oracle Database because Java is natively supported by Oracle.

 

We wrote our own Java program to connect to the Sina Weibo API, collect data (tweets) from it, and store the data in our database. The program stores each piece of data as one record with five fields: ID, USERNAME, TEXT, TIME, and LOCATION.

 

See Appendix D-a & D-e for details of this program.
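A minimal sketch of how one tweet could be written as a single five-field record through JDBC is given below. It is not the program from the appendix: the table name TEST is taken from the caption of Fig 3 and the column names from the field list above, while the column types, the Tweet class, and the insert helper are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical holder for one collected tweet (five fields, as in the text).
class Tweet {
    final long id;
    final String username, text, time, location;

    Tweet(long id, String username, String text, String time, String location) {
        this.id = id;
        this.username = username;
        this.text = text;
        this.time = time;
        this.location = location;
    }
}

class TweetStore {
    // Parameterized statement: values are bound, never concatenated into the SQL.
    static final String INSERT_SQL =
        "INSERT INTO TEST (ID, USERNAME, TEXT, TIME, LOCATION) VALUES (?, ?, ?, ?, ?)";

    // Stores one tweet as one record; the caller supplies an open connection.
    static int insert(Connection conn, Tweet t) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setLong(1, t.id);
            ps.setString(2, t.username);
            ps.setString(3, t.text);
            ps.setString(4, t.time);     // assumption: TIME is kept as text
            ps.setString(5, t.location);
            return ps.executeUpdate();   // number of rows inserted
        }
    }

    public static void main(String[] args) {
        // No database is contacted here; this only shows the statement used.
        System.out.println(INSERT_SQL);
    }
}
```

In a real run the Connection would come from java.sql.DriverManager.getConnection with the Oracle JDBC driver on the classpath; the connection URL and credentials depend on the local installation and are not shown here.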

 

Fig 3: PL/SQL view of the content of table “Test”

2.4      Text re-structuring Program

2.4.1    Encoding Format Conversion

Because GBK, an encoding for Simplified Chinese characters, is not readable by many other applications, we need to convert GBK-encoded txt files to UTF-8, the most widely used encoding. We wrote Java methods to convert both single files and batches of files to the desired format.

 

See Appendix D-g for details of this program.
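The conversion itself can be sketched in a few lines of Java: decode the bytes as GBK, then encode the resulting string as UTF-8. This is a minimal illustration rather than the appendix program; the class and method names are hypothetical, and it assumes the running JDK ships the GBK charset (standard JDK builds do).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class GbkToUtf8 {
    // Assumption: the GBK charset is available in the running JDK.
    static final Charset GBK = Charset.forName("GBK");

    // Decodes GBK bytes to a String, then re-encodes that String as UTF-8.
    static byte[] convert(byte[] gbkBytes) {
        String text = new String(gbkBytes, GBK);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Converts one file; source and target paths are supplied by the caller.
    static void convertFile(Path source, Path target) throws IOException {
        Files.write(target, convert(Files.readAllBytes(source)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: GbkToUtf8 <gbk-input.txt> <utf8-output.txt>");
            return;
        }
        convertFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Converting multiple files is then a loop that calls convertFile once per input file.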

2.4.2    Interface with Oracle Database Indirectly

On some occasions, people may need to process text from plain txt files. Our tools support this: all collected data can be stored in txt files in a uniform format for other uses. To make this data importable into the database as well, we wrote a program that converts the data in a txt file into a format the database can recognize. Each piece of data becomes one record, and the fields (such as ID, Name, etc.) are separated by tabs. An example of the input and output of this program:

 

Input:

 

No. 83

ID: 1401432364

Name:峡客

ScreenName: 峡客

Text: 我正在上海黄兴公园  http://t.cn/hBNIXq

Time: Mon Apr 04 03:20:33 EDT 2011

Location: 上海

 

Output:

1401432364    峡客       我正在上海黄兴公园  http://t.cn/hBNIXq Mon Apr 04 03:20:33 EDT 2011 上海      
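The transformation shown in this example can be sketched as follows. This is a hedged illustration, not the code from the appendix: it assumes every field sits on its own "Label: value" line, and that the output keeps ID, Name, Text, Time and Location in that order while dropping the record number and ScreenName, as the example suggests. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class RecordFlattener {
    // Fields kept in the output, in order (assumption based on the example).
    static final String[] OUTPUT_FIELDS = {"ID", "Name", "Text", "Time", "Location"};

    // Turns one multi-line "Label: value" record into a single tab-separated line.
    static String flatten(String record) {
        Map<String, String> fields = new HashMap<>();
        for (String line : record.split("\\R")) {
            int colon = line.indexOf(':');   // first colon ends the label
            if (colon < 0) continue;         // skip blank lines and the "No. 83" line
            fields.put(line.substring(0, colon).trim(),
                       line.substring(colon + 1).trim());
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < OUTPUT_FIELDS.length; i++) {
            if (i > 0) out.append('\t');
            out.append(fields.getOrDefault(OUTPUT_FIELDS[i], ""));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String record = String.join("\n",
            "No. 1",
            "ID: 42",
            "Name: alice",
            "ScreenName: alice",
            "Text: hello http://example.com",
            "Time: Mon Apr 04 03:20:33 EDT 2011",
            "Location: Shanghai");
        System.out.println(flatten(record));
    }
}
```

Only the first colon on each line is treated as the separator, so URLs and timestamps inside a value survive intact. A tweet whose text itself contains a tab would need extra escaping, which this sketch does not handle.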

 

See Appendix D-h for the detailed code.

 

2.5           User Window and Applet

We designed this user-friendly, easy-to-use window and also embedded it in a website for Internet users.

Fig 4: User Window

 

Enter:              to enter the topic that the user is interested in.

Next:               to enter the places that the user is interested in, 4 places at most. The number of places can be modified by the developer.

Start:               to show the full query results in text format in the window.

Place:              to input 4 different places that the user is interested in.

AutoGenerate:  to generate bar and pie charts of the returned results based on 4 locations: Beijing, Shanghai, Guangdong, and Overseas.

Generate:        to generate a bar chart of the returned results based on the 4 locations specified by the user.

Exit:                to close the window and exit the program.
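Both chart buttons need a count of matching tweets per place before anything can be drawn. A minimal sketch of that counting step is shown below. It assumes the query has already returned the LOCATION field of each matching tweet as a string and that a tweet belongs to a place when its location contains the place name. The class and method names are hypothetical, and the charting call itself is omitted.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LocationCounter {
    // Counts how many result locations match each requested place.
    // Places keep their input order so chart categories stay stable.
    static Map<String, Integer> countByPlace(List<String> locations, List<String> places) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String place : places) {
            counts.put(place, 0);
        }
        for (String location : locations) {
            if (location == null) continue;
            for (String place : places) {
                if (location.contains(place)) {
                    counts.merge(place, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> places = Arrays.asList("Beijing", "Shanghai", "Guangdong", "Overseas");
        List<String> results = Arrays.asList("Beijing", "Shanghai", "Beijing", "Overseas");
        System.out.println(countByPlace(results, places));
    }
}
```

Each entry of the resulting map corresponds to one bar or one pie slice.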

 

Fig 5: Demo Result

 

See Appendix D-d for details of this program.

 

2.6           JFreeChart

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart's extensive feature set includes [3]:

·         a consistent and well-documented API, supporting a wide range of chart types;

·         a flexible design that is easy to extend, and targets both server-side and client-side applications;

·         support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

The following are two Charts we generated for use in our project:

 

Fig 6: Tweets on Qaddafi

 

Fig 7: Tweets on Qaddafi 2

 

See Appendix D-b & D-c for details of this program.

2.7           IK Analyzer

2.7.1         Difference between English and Chinese

English is word-based: each word carries its own meaning. Although there are many ambiguities, a word means something on its own, so we can tag it with its most common meaning. Chinese is character-based: a single character may mean nothing in isolation and has to be combined with one or more adjacent characters to form a useful word.

2.7.2         Solution

Because of this difference, we cannot use Hadoop to process Chinese text directly. We found a piece of software called IK Analyzer, which can break a Chinese sentence into separate meaningful words that can be used in our later study.

It has two methods for separating a sentence into segments. [11]

1           Longest Word Segment

In this method, the software separates the sentence into the longest words that match entries in its dictionary.

Example: 我是中国人 (I am Chinese)

After segment: 我|是|中国人

I|am|Chinese

2           Finest-Grained Segment

In this method, the software scans the sentence and performs the longest-word segmentation first; it then searches its dictionary to check whether each word can be cut into finer-grained words.

Example: 我是中国人 (I am Chinese)

After segment: 我|是|中国人|中国|国人|人

I|am|Chinese|China|Citizen from a Country|Person
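The two modes can be illustrated with a toy dictionary-based segmenter. The sketch below is not IK Analyzer's implementation or API: it uses plain forward maximum matching for the longest-word mode and, for the finest-grained mode, simply lists every dictionary word that starts at each position. The six-word dictionary is an assumption chosen to reproduce the example sentence, and Chinese characters are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy segmenter for illustration only; not IK Analyzer's algorithm or API.
class ToySegmenter {
    private final Set<String> dict;
    private final int maxLen;

    ToySegmenter(Set<String> dict) {
        this.dict = dict;
        int longestWord = 1;
        for (String word : dict) longestWord = Math.max(longestWord, word.length());
        this.maxLen = longestWord;
    }

    // Longest-word mode: at each position take the longest dictionary word,
    // falling back to a single character when nothing matches.
    List<String> longest(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int len = Math.min(maxLen, s.length() - i);
            while (len > 1 && !dict.contains(s.substring(i, i + len))) len--;
            out.add(s.substring(i, i + len));
            i += len;
        }
        return out;
    }

    // Finest-grained mode: at each position list every dictionary word
    // that starts there, longest first.
    List<String> finest(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            for (int len = Math.min(maxLen, s.length() - i); len >= 1; len--) {
                String word = s.substring(i, i + len);
                if (dict.contains(word)) out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // I, am, Chinese, China, citizen of a country, person
        Set<String> dict = new HashSet<>(Arrays.asList(
            "\u6211", "\u662f", "\u4e2d\u56fd\u4eba",
            "\u4e2d\u56fd", "\u56fd\u4eba", "\u4eba"));
        ToySegmenter segmenter = new ToySegmenter(dict);
        String sentence = "\u6211\u662f\u4e2d\u56fd\u4eba";
        System.out.println(String.join("|", segmenter.longest(sentence)));
        System.out.println(String.join("|", segmenter.finest(sentence)));
    }
}
```

With this dictionary the first call yields three segments and the second yields six, matching the two examples above.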

 

See Appendix D-f for details of this program.

 

2.8           Hadoop and Experiments Results     

2.8.1         Why do We Need Hadoop

Hadoop is open-source software for reliable, scalable, distributed computing. We chose it because, as we collect more and more data, a single computer’s limited computing ability can no longer handle all the jobs. After some research, we found Hadoop to be the best fit for our project: we want to combine computers to do parallel computing so that we can handle files of any size. Another fact that supports our choice is that more than 100 companies use Hadoop today, including Adobe, eBay, Facebook, Google, IBM and so on. Hadoop’s wordcount method, which is based on MapReduce, helps us generate our dictionary in around 4 minutes (the same job takes 17 minutes in the database); the dictionary can be used in our future study.

 

In our project, we use Hadoop’s wordcount function to generate our own dictionary [Appendix A] [6]. Since we focus on Chinese text, we first had to pre-process the data before Hadoop could do this job.

2.8.2         Hadoop’s Main Components

2.8.2.1   HDFS (Hadoop Distributed File System)

Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations (Hadoop).

Fig 8: Hadoop Distributed File System (HDFS)

 

2.8.2.2   MapReduce

MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes (MapReduce).

Fig 9: MapReduce Software Framework

 

Fig 10: Work Process of MapReduce

 

A picture is worth a thousand words, and this one shows how the MapReduce framework works. There are six key-value pairs at the top of the picture. Each key is a document ID and each value is the text of that document. The framework assigns these pairs to the mappers, either singly or combined in pairs. Each mapper counts the frequency of every word it finds. For instance, “a” appears once and “b” twice in the first mapper, while the second mapper, which holds two documents, reports “c” three times and “c” six times. After this step, the framework shuffles the results so that all the counts for “a”, for “b” and for “c” are brought together at a reducer. The reducer sums the numbers and outputs the total frequency of each word. This is a toy example of how the framework works; for more technical details, please refer to our midterm report.
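The three phases described here can be imitated in plain Java inside a single process. The sketch below does not use the Hadoop API and runs no cluster: it only mirrors the map, shuffle and reduce steps on in-memory lists. The class and method names are hypothetical, and the input is assumed to be text whose words are already separated by whitespace (for Chinese, the output of the segmentation step).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of wordcount; not the Hadoop API.
class MiniWordCount {
    // Map phase: emit a (word, 1) pair for every word in one document.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : document.trim().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) sum += value;
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    // Runs the three phases in order over all documents.
    static Map<String, Integer> wordCount(List<String> documents) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String document : documents) emitted.addAll(map(document));
        return reduce(shuffle(emitted));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b b", "c c c", "a c")));
    }
}
```

In Hadoop the map calls run on different machines and the shuffle moves data across the network; here everything happens in one loop, which is enough to show why the reducer sees all counts for a given word together.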

 

2.8.3         System Set Up

·         Single-Node Set Up

Set up everything on every computer that you want to make part of your Hadoop framework.

·         Multi-Node Set Up

A multi-node cluster is a group of single nodes combined through a common network.

Details are shown in Appendix A and B.

2.8.4         Hardware

Master: Dell Optiplex 745

CPU: Intel® Core™2 CPU

HDD: 80GB

RAM: 2 GB of DDR2 RAM

 

Slave: Dell Inspiron 1300 (Laptop)

CPU: Intel Celeron M 1.4 GHz

HDD: Fujitsu (MHV2040AH) 40GB

RAM: 1GB

 

Slave: Dell Optiplex 960 (Desktop)

CPU: Intel Core 2 Duo E8400 3GHz

HDD: 80GB

RAM: 3GB of DDR2-800

 

 

 

 

2.8.5         Experiments

2.8.5.1   First experiment

                 Fig 11: Line chart of the Experiment Result

 

We ran several experiments to check whether Hadoop really increases the speed. In the first experiment, we ran wordcount on a 1GB file. As the table shows, the job takes about 5 minutes with Hadoop instead of 17 minutes with the database, a substantial improvement. Hadoop also has great potential: we can add more slaves to the system to deal with really large amounts of data, as Google does now. By contrast, the speed-up achievable on a single computer is very limited. We believe that Hadoop or some other form of cluster computing is necessary for data analysis.

2.8.5.2  Second experiment

In this experiment, we compared the data obtained from Oracle with the data obtained from wordcount and plotted both as a line chart. The X-axis is the time period and the Y-axis is the quantity of tweets collected. There is a small difference between the two lines in the first few periods, which is unavoidable. From the midpoint to the end the two lines almost overlap, which indicates that our two ways of analyzing the data agree with each other.