So a while back, Twitter rolled out the option to download the entire archive of all the tweets. This got me thinking what if we wanted to analyze the tweets using Hadoop & its ecosystem to draw out interesting facts about the way we tweet, retweet etc.

Introduction

Hadoop Twitter Analytics aims to develop a solution for analyzing tweets using Hadoop & related technologies. This project currently contains fetching the Twitter IDs of those users whose statuses have been retweeted the most by the user whose tweets are being analyzed. Multiple analytics will continue to be added as I work on developing this project.

Fork me on GitHub

Fork me on GitHub

Grocery Shopping

Before we actually get our hands cloudy, we’re going to go grocery shopping and pick up the ingredients.

  1. Log in to your twitter account.
  2. Go to Settings. Near the end of the page you will see the option of downloading your entire tweet archive.
  3. Click on the same and wait for a confirmation mail.
  4. Once the archive is ready and you’ve got the download link, click on it to download the zip file.

This zip file contains a full offline browsable interface of all your tweets. Tweets are present in 2 formats:

  • CSV
  • JSON

This project currently uses the CSV format for analysis since it’s easier to parse and process in Map Reduce jobs. Later on, we will switch to JSON format since the latter contains much more details about the tweet than the former one.

Let the fun begin

Assuming that you have a working hadoop installation- either standalone or pseudo-distributed or fully distributed. I also assumes that you have maven installed and correctly setup.

Building & Compiling

  1. Open a shell in the directory where pom.xml of this project exists.
  2. Execute
    mvn hadoop:pack
  3. A jar named analytics-hdeploy will be created will all the necessary extra dependencies packed in the /target/hadoop-deploy folder

Pre-heating the oven

Before we actually execute the jar, we’ll have to decide which way we wanna go- either process the csv files directly or pre process them to make it easier for hadoop. You may have noticed that there is 1 csv file for each month you’ve been active on twitter. That may been many files and we all know hadoop is not very good at processing large number of files. So, we can either provide the entire csv/* folder as input to hadoop or combine these csv files into 1 single csv on which hadoop can work. If you wish to choose the latter, I’m including one script to do all the work for you. The same in included in the /data/ folder.

  1. Copy all your csv files in the /data/csv folder.
  2. Execute
    ./process_csv.sh

    This will read all the csv files from the folder and combine them into 1 csv named combined.csv. This is written in awk scripting language.

Deploying & Executing

If you are running hadoop in pseduo-distributed or fully distributed mode, be sure to copy the input csv folder or combined.csv file into HDFS before executing the map reduce jobs.

Assuming hadoop’s bin folder is in the Path variable, execute the following command after entering the appropriate path to the jar file created in the previous step and also the path to csv/combined.csv (HDFS or local as the case may be):

hadoop jar /path/to/analytics-hdeploy.jar com.github.hadoop.twitter.TweetsAnalyzer /path/to/input /path/to/output

Which method you choose, the task will output the count of retweets and the Twitter ID of the users whose status were retweeted by you, separated by a tab. Then, you can sort the list by piping its output. For example,

hadoop fs -cat /path/to/output/* | sort -n -r > leaderboard.txt

This command will sort the output of map reduce jobs in descending order and output the results in a text file named leaderboard.txt

This project is open source at Github. Please contribute here.

Advertisements