Monday, August 01, 2011

Data Sets

1. TIGER/Line dataset (http://www.census.gov/geo/www/tiger/)
The 2.2 million California roads in the TIGER/Line dataset are widely used in spatial database research.

2. Kristina Lerman at USC ISI
http://www.isi.edu/integration/people/lerman/downloads.html

a. Digg 2009
This anonymized data set consists of the voting records for 3553 stories promoted to the front page over a period of a month in 2009. The voting record for each story contains the id of the voter and the time stamp of the vote. In addition, data about the friendship links of voters was collected from Digg.

Download Digg 2009 data set

b. Flickr personal taxonomies
This anonymized data set contains personal taxonomies constructed by 7,000+ Flickr users to organize their photos, as well as the tags they associated with the photos. Personal taxonomies are shallow hierarchies (trees) containing collections and their constituent sets (aka photo-albums).

Download Flickr data set

c. Wrapper maintenance
Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. When a wrapper stops working due to changes in the layout of web pages, our task is to automatically reinduce the wrapper. The data sets used for the experiments in our JAIR 2003 paper contain web pages downloaded from two dozen sources over a period of a year.

Data set

Wednesday, July 27, 2011

Market Basket Analysis Algorithm with Map/Reduce and HBase on AWS EC2

Slides and papers on Hadoop MapReduce and HBase at PDPTA 2011 (http://www.world-academy-of-science.org/worldcomp11/ws/conferences/pdpta11) and at EDB 2011 (http://dke.khu.ac.kr/edb2011/)

(1) (Submitted on June 30, 2011) “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, The Third International Conference on Emerging Databases (EDB 2011), Songdo Park Hotel, Incheon, Korea, Aug. 25-27, 2011 (pdf - only pages 1 and 12)

(2) “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011) (slides, pdf)

Tuesday, June 21, 2011

How to set up Hadoop and HBase together with Whirr on Amazon EC2

It is not easy to set up both Hadoop and HBase on EC2 at the same time. This post illustrates how to set them up together with the Apache Incubator project Whirr. It also describes how to log in to the master node so that you can easily run your Hadoop code and work with HBase data on the node remotely.

References

[1] Phil Whelan, http://www.philwhln.com/run-the-latest-whirr-and-deploy-hbase-in-minutes
[2] http://incubator.apache.org/whirr/quick-start-guide.html
[3] http://incubator.apache.org/whirr/whirr-in-5-minutes.html
[4] http://stackoverflow.com/questions/5113217/installing-hbase-hadoop-on-ec2-cluster
[5] http://www.philwhln.com/map-reduce-with-ruby-using-hadoop
[5.1] http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/

********************** Install Hadoop/HBase on Whirr [1] on Ubuntu 10.04 **********************
NOTE: install JDK 1.6, not just the JRE
1) mvn clean install
First run: failed with an "hbsql not found" install error
Second run: completed successfully

2) set the following:
export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

2.a) 5-minute test of Whirr [3]; the last command sends ZooKeeper's "ruok" health check to the first instance, which replies "imok" if it is running:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr

echo "ruok" | nc $(awk '{print $3}' ~/.whirr/zookeeper/instances | head -1) 2181; echo


2.b) bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties


3) jongwook@localhost:~/whirr$ bin/whirr launch-cluster --config hbase-ec2.properties

3.a) Exception in thread "main" org.apache.commons.configuration.ConfigurationException: Invalid key pair: (/home/jongwook/.ssh/id_rsa, /home/jongwook/.ssh/id_rsa.pub)

Solution)
ssh-keygen -t rsa -P ''
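
For reference, the hbase-ec2.properties recipe used in step 3 looks roughly like the sketch below, based on the stock Whirr recipe; the cluster name, instance counts, and hardware id are assumptions to adjust for your own cluster:

whirr.cluster-name=hbase
whirr.instance-templates=1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master, 5 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.large
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${whirr.private-key-file}.pub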

4) You will see messages like the following for about 5 minutes while the instances bootstrap; they resolve once the nodes come up:
<< (ubuntu@184.72.173.143:22) error acquiring Session(ubuntu@184.72.173.143:22): Session.connect: java.io.IOException: End of IO Stream Read
<< bad status -1 ExecResponse(ubuntu@50.17.19.46:22)[./setup-jongwook status]
<< bad status -1 ExecResponse(ubuntu@174.129.131.50:22)[./setup-jongwook status]

5) Then an "hbase" folder with shell scripts and XML files is generated under ~/.whirr:
jongwook@localhost:~/whirr$ ls -al /home/jongwook/.whirr/
total 12
drwxr-xr-x 3 jongwook jongwook 4096 2011-06-17 16:19 .
drwxr-xr-x 46 jongwook jongwook 4096 2011-06-17 16:09 ..
drwxr-xr-x 2 jongwook jongwook 4096 2011-06-17 16:19 hbase

6) Set up a proxy server at System > Preferences > Network Proxy [5, 5.1]:
Mark SOCKS Proxy
Proxy Server: localhost
port: 6666
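
The SOCKS proxy works because the Whirr-generated hadoop-site.xml under ~/.whirr/hbase routes Hadoop RPC through it; that file typically contains properties along these lines (contents may vary by Whirr version, so treat this as a sketch):

<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:6666</value>
</property>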

6.a) At another terminal - term2, set up the Hadoop environment:
jongwook@localhost:~/whirr$ source ~/Documents/setupHadoop0.20.2.sh
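
setupHadoop0.20.2.sh is a local environment script whose exact contents are not shown here; a minimal sketch of what such a script needs to do would be (the paths are assumptions):

export HADOOP_HOME=$HOME/apache/hadoop-0.20.2
export PATH=$HADOOP_HOME/bin:$PATH
# point the Hadoop client at the Whirr-generated cluster configuration
export HADOOP_CONF_DIR=$HOME/.whirr/hbase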

6.b) And, at the same terminal - term2, run the Hadoop proxy to connect the external client and the internal cluster:
jongwook@localhost:~/whirr$ sh ~/.whirr/hbase/hadoop-proxy.sh
Running proxy to Hadoop cluster at ec2-184-xx-xxx-0.compute-1.amazonaws.com. Use Ctrl-c to quit.
Warning: Permanently added 'ec2-184-xx-xxx-0.compute-1.amazonaws.com,184.72.152.0' (RSA) to the list of known hosts.

7) Run a sample Hadoop shell command, e.g. "hadoop fs -ls /", at the original terminal - term1:
11/06/17 17:03:20 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
Found 4 items
drwxr-xr-x - hadoop supergroup 0 2011-06-17 16:19 /hadoop
drwxr-xr-x - hadoop supergroup 0 2011-06-17 16:19 /hbase
drwxrwxrwx - hadoop supergroup 0 2011-06-17 16:18 /tmp
drwxrwxrwx - hadoop supergroup 0 2011-06-17 16:18 /user


8) At another terminal - term3, set up the Hadoop environment:
jongwook@localhost:~/whirr$ source ~/Documents/setupHadoop0.20.2.sh

8.a) And, at the same terminal - term3, run the HBase proxy to connect the external client and the internal cluster. NOTE: you need to stop the Hadoop proxy first because the two proxies share port 6666.
jongwook@localhost:~/whirr$ sh ~/.whirr/hbase/hbase-proxy.sh
Running proxy to HBase cluster at ec2-184-72-152-0.compute-1.amazonaws.com. Use Ctrl-c to quit.
Warning: Permanently added 'ec2-184-72-152-0.compute-1.amazonaws.com,184.xx.xxx.0' (RSA) to the list of known hosts.

9) Log in to the master node to run Hadoop code with HBase data; the user name is your local login, e.g., jongwook for me.
jongwook@localhost:~/whirr$ ssh -i /home/jongwook/.ssh/id_rsa jongwook@ec2-75-xx-xx-xx.compute-1.amazonaws.com
OR
jongwook@localhost:~/whirr$ ssh -i /home/jongwook/.ec2/id_rsa-dal_keypair jongwook@ec2-75-xx-xx-x.compute-1.amazonaws.com

10) Now run the Hadoop pi demo (the arguments are the number of map tasks and the number of samples per map):
[root@ip-10-116-94-104 ~]# cd /usr/local/hadoop-0.20.2/
[root@ip-10-116-94-104 hadoop-0.20.2]# bin/hadoop jar hadoop-0.20.2-examples.jar pi 20 1000

11) Set PATH and CLASSPATH to run HBase and Hadoop code:
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HBASE_HOME=/usr/local/hbase-0.89.20100924
export PATH=$HADOOP_HOME/bin:$HBASE_HOME/bin:$PATH

# CLASSPATH for HADOOP
export CLASSPATH=$HADOOP_HOME/hadoop-0.20.2-core.jar:$HADOOP_HOME/hadoop-0.20.2-ant.jar:$CLASSPATH
export CLASSPATH=$HADOOP_HOME/hadoop-0.20.2-examples.jar:$HADOOP_HOME/hadoop-0.20.2-test.jar:$CLASSPATH
export CLASSPATH=$HADOOP_HOME/hadoop-0.20.2-tools.jar:$CLASSPATH
#export CLASSPATH=$HADOOP_HOME/commons-logging-1.0.4.jar:$HADOOP_HOME/commons-logging-api-1.0.4.jar:$CLASSPATH

# CLASSPATH for HBASE
export CLASSPATH=$HBASE_HOME/hbase-0.89.20100924.jar:$HBASE_HOME/lib/zookeeper-3.3.1.jar:$CLASSPATH
export CLASSPATH=$HBASE_HOME/lib/commons-logging-1.1.1.jar:$HBASE_HOME/lib/avro-1.3.2.jar:$CLASSPATH
export CLASSPATH=$HBASE_HOME/lib/log4j-1.2.15.jar:$HBASE_HOME/lib/commons-cli-1.2.jar:$CLASSPATH
export CLASSPATH=$HBASE_HOME/lib/jackson-core-asl-1.5.2.jar:$HBASE_HOME/lib/jackson-mapper-asl-1.5.2.jar:$CLASSPATH
export CLASSPATH=$HBASE_HOME/lib/commons-httpclient-3.1.jar:$HBASE_HOME/lib/jetty-6.1.24.jar:$CLASSPATH
export CLASSPATH=$HBASE_HOME/lib/jetty-util-6.1.24.jar:$HBASE_HOME/lib/hadoop-core-0.20.3-append-r964955-1240.jar:$CLASSPATH
export CLASSPATH=$HBASE_HOME/lib/hbase-0.89.20100924.jar:$HBASE_HOME/lib/hsqldb-1.8.0.10.jar:$CLASSPATH
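
With PATH and CLASSPATH set, a Java client that uses the Hadoop and HBase APIs can be compiled and submitted; for example (MyHBaseJob is a hypothetical class name):

javac MyHBaseJob.java
jar cf myhbasejob.jar MyHBaseJob*.class
HADOOP_CLASSPATH=$CLASSPATH hadoop jar myhbasejob.jar MyHBaseJob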

12) Run the HBase shell demo:
jongwook@ip-10-xx-xx-xx:/usr/local$ cd hbase-0.89.20100924/
jongwook@ip-10-xx-xx-xx:/usr/local/hbase-0.89.20100924$ ls
bin CHANGES.txt conf docs hbase-0.89.20100924.jar hbase-webapps lib LICENSE.txt NOTICE.txt README.txt
jongwook@ip-10-108-155-6:/usr/local/hbase-0.89.20100924$ bin/hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version: 0.89.20100924, r1001068, Tue Oct 5 12:12:44 PDT 2010

hbase(main):001:0> status 'simple'
5 live servers
ip-10-71-70-182.ec2.internal:60020 1308520337148
requests=0, regions=1, usedHeap=158, maxHeap=1974
domU-12-31-39-0F-B5-21.compute-1.internal:60020 1308520337138
requests=0, regions=0, usedHeap=104, maxHeap=1974
domU-12-31-39-0B-90-11.compute-1.internal:60020 1308520336780
requests=0, regions=0, usedHeap=104, maxHeap=1974
domU-12-31-39-0B-C1-91.compute-1.internal:60020 1308520336747
requests=0, regions=1, usedHeap=158, maxHeap=1974
ip-10-108-250-193.ec2.internal:60020 1308520336863
requests=0, regions=0, usedHeap=102, maxHeap=1974
0 dead servers
Aggregate load: 0, regions: 2
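
With the shell connected, a quick smoke test with standard HBase shell commands looks like this ('test' and 'cf' are arbitrary table and column-family names):

hbase(main):002:0> create 'test', 'cf'
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
hbase(main):004:0> scan 'test'
hbase(main):005:0> disable 'test'
hbase(main):006:0> drop 'test'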

Tuesday, March 29, 2011

Market Basket Analysis Example in Hadoop

Market Basket Analysis is one of the important approaches for analyzing associations in data mining. The basic idea is to find pairs of items that customers purchase together in a store, given huge volumes of transaction data such as the following:
trax1: cracker, icecream, beer
trax2: chicken, pizza, coke, bread
...
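
To make the pair-counting idea concrete, here is a minimal sketch of the Map/Reduce, not the actual ItemCount.java below: the mapper emits every sorted two-item combination per transaction and the reducer sums the counts. The real code also handles n-item associations and extracts the top 10; class and method names here are illustrative only.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PairCount {

  // Mapper: each input line is one transaction, e.g. "cracker, icecream, beer".
  // Emit every 2-item combination in canonical (sorted) order with count 1.
  public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] items = value.toString().split(",\\s*");
      Arrays.sort(items); // sorting makes (beer, cracker) == (cracker, beer)
      for (int i = 0; i < items.length; i++) {
        for (int j = i + 1; j < items.length; j++) {
          context.write(new Text(items[i] + "," + items[j]), ONE);
        }
      }
    }
  }

  // Reducer (also usable as a combiner): sum the counts per item pair.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "pair count");
    job.setJarByClass(PairCount.class);
    job.setMapperClass(PairMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}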

The code I implemented on Hadoop 0.21.0 takes the input "AssociationsSP.txt" and generates the top 10 associated items that customers purchased together; the downloads are below. After I complete a conference paper with this example code, I will post more detailed info.

Download
- ItemCount.java source file, to get an idea of how the code looks
- cloud9-csulaud-0.1.jar file, to execute the code
- AssociationsSP.txt input file
- itemscount_sort2.txt and itemscount_sort4.txt, sample outputs for two- and four-item associations

(1) You need to create a directory "data" and upload the input file to it on HDFS:
> hadoop fs -mkdir data
> hadoop fs -put AssociationsSP.txt data/

(2) Run the example code (output dir: itemcount, 5 reducers, 2-item associations):
> hadoop jar cloud9-csulaud-0.1.jar edu.calstatela.hadoop.example.associations.ItemCount data/AssociationsSP.txt itemcount 5 2

(3) Type in the following to see the analysis:
> hadoop jar cloud9-csulaud-0.1.jar edu.calstatela.hadoop.utils.analysis.AnalyzeInputCount itemcount
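
You can also inspect the raw pair counts directly from the reducer output files (this assumes Hadoop's default part-file naming under the output directory):

> hadoop fs -cat itemcount/part-r-* | head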

Saturday, February 26, 2011

Set up Hadoop 0.19.2 on Eclipse 3.3.2 for Ubuntu 8.10

a. Refer to: http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html

b.
Download Eclipse 3.3.2 Europa: http://www.eclipse.org/downloads/packages/release/europa/winter

c.
Download Hadoop 0.19.2: http://apache.osuosl.org//hadoop/core/hadoop-0.19.2/

Set up Hadoop, including the following hadoop-site.xml: http://hadoop.apache.org/common/docs/r0.19.2/quickstart.html#Local


<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>


d.
cp [yourpath]/hadoop-0.19.2/contrib/eclipse-plugin/hadoop-0.19.2-eclipse-plugin.jar [yourpath]/eclipse/plugins/

e.
Start Eclipse; it can be started from the "File Browser" of Ubuntu.

f.
Window > open perspective > other > map/reduce

g.
To start Hadoop, open five terminals and run the following, respectively (the first terminal formats the namenode once, then starts it):
~/apache/hadoop-0.19.2/bin/hadoop namenode -format
~/apache/hadoop-0.19.2/bin/hadoop namenode

~/apache/hadoop-0.19.2/bin/hadoop secondarynamenode
~/apache/hadoop-0.19.2/bin/hadoop jobtracker

~/apache/hadoop-0.19.2/bin/hadoop datanode
Issue: if there is an error "Unexpected version of storage directory /tmp/hadoop-jongwook/dfs/data", remove the "data" folder named in the error and restart the datanode.

~/apache/hadoop-0.19.2/bin/hadoop tasktracker
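
Alternatively, after formatting the namenode once, the bundled start-all.sh script launches all five daemons in the background in a single step:

~/apache/hadoop-0.19.2/bin/start-all.sh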


h.
At eclipse: http://ebiquity.umbc.edu/Tutorials/Hadoop/17%20-%20set%20up%20hadoop%20location%20in%20the%20eclipse.html
A new Hadoop location has the following default values, as defined in hadoop-site.xml:
map/reduce master: localhost:9001
DFS master: localhost:9000
user name: jongwook
mapred.job.tracker: localhost:9001

i.
How to run a Hadoop example in Eclipse
Refer to: http://dal-cloudcomputing.blogspot.com/2009/08/hadoop-example-mymaxtemperaturewithcomb.html
- Create (or import) the Hadoop example shown in the blog above as a Hadoop project in Eclipse: File > New > Map/Reduce Project
- The 1901 data file does not exist, so use only 1902
- Open the Map/Reduce perspective view in Eclipse
- You can create DFS folders or upload files to DFS in this view

Friday, February 25, 2011

The Technical Demand of Cloud Computing (no-SQL DB, Map/Reduce Hadoop)

The Technical Demand of Cloud Computing, in Korean
Technical report granted by KISTI (Korea Institute of Science and Technology Information, 한국과학기술정보연구원)
by Jongwook Woo, California State University, Los Angeles
