Pig Tutorial

Installation

Download

 

Untar

tar xvfz pig-0.16.0.tar.gz

Move

mv pig-0.16.0/ ~/

Environment Variables

Copy the following into the /home/amar/sourceme file:

export PIG_HOME=/home/amar/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
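
After updating the file, load the variables and confirm that Pig is on the PATH (a quick check, assuming the paths above):

source /home/amar/sourceme
pig -version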

------------------------------------------------------------------

Data Set

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
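
Save the records above as student_data.txt (the path used by the LOAD statement below) and copy the file into HDFS first:

hadoop dfs -put student_data.txt /user/student_data.txt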

Load Data



student = LOAD '/user/student_data.txt' 
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
   city:chararray );
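
To verify that the relation loaded correctly, it can be dumped to the console (this launches a MapReduce job):

grunt> Dump student;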

Store Data



STORE student INTO '/gitam/arm' USING PigStorage(',');
hadoop dfs -ls /gitam/arm
hadoop dfs -cat /gitam/arm/part-m-00000


Explain and Illustrate

The explain operator displays the logical, physical, and MapReduce execution plans of a relation, while the describe operator displays its schema.

grunt> explain relation_name;
grunt> describe relation_name;


The illustrate operator gives you the step-by-step execution of a sequence of statements.

grunt> illustrate Relation_name;

New Data Set

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

student_details = LOAD '/user/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int,
   phone:chararray, city:chararray);
group_data = GROUP student_details by age;
Dump group_data;


Describe group_data;

group_multiple = GROUP student_details by (age, city);
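
As an illustrative follow-up (not part of the original script), the grouped relation can be reduced to a count of students per age:

grunt> count_by_age = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS total;
grunt> Dump count_by_age;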







customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00 
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00




orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060



grunt> customers = LOAD '/user/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:double);
  
grunt> orders = LOAD '/user/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Inner Join

grunt> customer_orders = JOIN customers BY id, orders BY customer_id;

Left Outer Join

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Right Outer Join

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Full Outer Join

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Union

The UNION operator of Pig Latin merges the contents of two relations. To perform a UNION operation on two relations, their columns and domains (types) must be identical.

grunt> student = UNION student1, student2;

Split

The SPLIT operator is used to split a relation into two or more relations.


SPLIT student_details into student_details1 if age<23, student_details2 if (age > 22 and age < 25);
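
The two resulting relations can be checked with Dump (a quick verification of the SPLIT above):

grunt> Dump student_details1;
grunt> Dump student_details2;
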
Foreach

grunt> foreach_data = FOREACH student_details GENERATE id, age, city;

Distinct, Order By, and Limit

grunt> distinct_data = DISTINCT student_details;
grunt> order_by_data = ORDER student_details BY age DESC;
grunt> limit_data = LIMIT student_details 4;

Data

– https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-08/

To get the top 4 most used pages –

records = LOAD '/user/pagecounts-20160801-000000' USING PigStorage(' ')
   as (projectName:chararray, pageName:chararray, pageCount:int, pageSize:int);

filtered_records = FILTER records BY projectName == 'en';
grouped_records = GROUP filtered_records BY pageName;
results = FOREACH grouped_records GENERATE group, SUM(filtered_records.pageCount);
sorted_result = ORDER results BY $1 DESC;
limit_data = LIMIT sorted_result 4;
dump limit_data;
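
To keep the result in HDFS rather than only dumping it to the console, the relation can be stored (the output path below is only an example and must not already exist):

STORE limit_data INTO '/user/top4_pages' USING PigStorage(',');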

 

 

 

HBase Commands

./bin/hbase shell

woir> list

woir> status

woir> version

woir> table_help

woir> whoami

woir> create 'woir', 'family data', 'work data'

woir> disable 'woir'

woir> is_disabled 'table name'

woir> is_disabled 'woir'

woir> disable_all 'amar.*'

woir> enable 'woir'

woir> scan 'woir'

woir> is_enabled 'table name'

woir> is_enabled 'woir'

woir> describe 'woir'

woir> alter 'woir', NAME => 'family data', VERSIONS => 5

woir> alter 'woir', READONLY

woir> alter 't1', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'

woir> alter 'table name', 'delete' => 'column family'

woir> scan 'woir'

woir> alter 'woir', 'delete' => 'work data'

woir> scan 'woir'

woir> exists 'woir'

woir> exists 'student'


woir> drop 'woir'

woir> exists 'woir'

woir> drop_all 't.*'

./bin/stop-hbase.sh

woir> put 'woir','1','family data:name','kaka'

woir> put 'woir','1','work data:designation','manager'

woir> put 'woir','1','work data:salary','50000'

woir> scan 'woir'

woir> put 'woir','row1','family data:city','Delhi'

woir> scan 'woir'

woir> get 'woir', '1'

woir> get 'table name', 'rowid', {COLUMN => 'column family:column name'}

woir> get 'woir', 'row1', {COLUMN => 'family data:city'}

woir> delete 'woir', '1', 'family data:city', 1417521848375

woir> deleteall 'woir','1'

woir> scan 'woir'

woir> count 'woir'

woir> truncate 'table name'




Command Usage

hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
Displays all the metadata information for the table regions stored in HBase.

hbase> scan 'woir', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'abc'}
Displays the contents of table 'woir' for column families c1 and c2, limited to 10 rows, starting from the row key 'abc'.

hbase> scan 'woir', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
Displays the contents of 'woir' for column c1, showing only values whose timestamps fall within the given time range.

hbase> scan 'woir', {RAW => true, VERSIONS => 10}
RAW => true displays all cell versions present in the table 'woir' (up to 10 per cell), including delete markers.
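
Applied to the table created earlier in this section, a comparable scan (an illustrative example using the 'family data' column family defined above) would be:

hbase> scan 'woir', {COLUMNS => ['family data:name'], LIMIT => 2}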

Hive Installation

Download
woir@woir-VirtualBox:/tmp$ wget http://archive.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz
 

woir@woir-VirtualBox:/tmp$ tar xvzf apache-hive-2.1.0-bin.tar.gz -C /home/woir
Copy and paste the following into the /home/woir/sourceme file:
export HIVE_HOME=/home/woir/apache-hive-2.1.0-bin
export HIVE_CONF_DIR=/home/woir/apache-hive-2.1.0-bin/conf
export PATH=$HIVE_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:/home/woir/hadoop-2.6.0/lib/*:.
export CLASSPATH=$CLASSPATH:/home/woir/apache-hive-2.1.0-bin/lib/*:.
export HADOOP_HOME=~/hadoop-2.6.0


hduser@laptop:/home/woir/apache-hive-2.1.1-bin$ source ~/.bashrc
$ echo $HADOOP_HOME
/home/woir/hadoop-2.6.0
$ hive --version
Hive 2.1.0
Subversion git://jcamachguezrMBP/Users/jcamachorodriguez/src/workspaces       /hive/HIVE-release2/hive -r 9265bc24d75ac945bde9ce1a0999fddd8f2aae29
Compiled by jcamachorodriguez on Fri Jun 17 01:03:25 BST 2016
From source with checksum 1f896b8fae57fbd29b047d6d67b75f3c
woir@woir-VirtualBox:~$ hadoop dfs -ls /
drwxr-xr-x   - hduser supergroup          0 2016-11-23 11:17 /hbase
drwx------   - hduser supergroup          0 2016-11-18 16:04 /tmp
drwxr-xr-x   - hduser supergroup          0 2016-11-18 09:13 /user

woir@woir-VirtualBox:~$ hadoop dfs -mkdir -p /user/hive/warehouse
woir@woir-VirtualBox:~$ hadoop dfs -mkdir /tmp
woir@woir-VirtualBox:~$ hadoop dfs -chmod g+w /tmp 
woir@woir-VirtualBox:~$ hadoop dfs -chmod g+w /user/hive/warehouse 
woir@woir-VirtualBox:~$ hadoop dfs -ls /
   drwxr-xr-x - hduser supergroup 0 2016-11-23 11:17 /hbase 
   drwx-w---- - hduser supergroup 0 2016-11-18 16:04 /tmp 
   drwxr-xr-x - hduser supergroup 0 2016-11-23 17:18 /user 
woir@woir-VirtualBox:~$ hadoop dfs -ls /user 
   drwxr-xr-x - hduser supergroup 0 2016-11-18 23:17 /user/hduser 
   drwxr-xr-x - hduser supergroup 0 2016-11-23 17:18 /user/hive
woir@woir-VirtualBox:~$ cd $HIVE_HOME/conf
woir@woir-VirtualBox:~/home/woir/apache-hive-2.1.0-bin/conf$ vi hive-env.sh.template
# Set HADOOP_HOME to point to a specific hadoop install directory
# HADOOP_HOME=${bin}/../../hadoop

# Hive Configuration Directory can be controlled by:
# export HIVE_CONF_DIR=/home/woir/apache-hive-2.1.0-bin/conf


hduser@laptop:/home/woir/apache-hive-2.1.0-bin/conf$ cp hive-env.sh.template hive-env.sh

Add the following line to hive-env.sh:

export HADOOP_HOME=/home/woir/hadoop-2.6.0
$ cd /tmp

$ wget http://archive.apache.org/dist/db/derby/db-derby-10.13.1.1/db-derby-10.13.1.1-bin.tar.gz

$ tar xvzf db-derby-10.13.1.1-bin.tar.gz -C /home/woir
Copy and paste the following into the /home/woir/sourceme file:
export DERBY_HOME=/home/woir/db-derby-10.13.1.1-bin
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
$ mkdir $DERBY_HOME/data
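
To confirm that the Derby tools are visible on the classpath, the bundled sysinfo utility can be run (an optional check, assuming the variables above have been added to sourceme and sourced):

$ source /home/woir/sourceme
$ java org.apache.derby.tools.sysinfo
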
hduser@laptop:~$ cd $HIVE_HOME/conf

Create a new file named hive-site.xml:
woir@woir-VirtualBox:/home/woir/apache-hive-2.1.0-bin/conf$ vi hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/home/woir/apache-hive-2.1.0-bin/metastore_db;create=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value/>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.PersistenceManagerFactoryClass</name>
    <value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
    <description>class implementing the jdo persistence</description>
  </property>
</configuration>

 

Metastore schema initialisation
woir@woir-VirtualBox:/home/woir/apache-hive-2.1.0-bin/bin$ schematool -initSchema -dbType derby
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/woir/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/woir/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:     jdbc:derby:;databaseName=/home/woir/apache-hive-2.1.0-bin/metastore_db;create=true
Metastore Connection Driver :     org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User:     APP
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.derby.sql
Initialization script completed
schemaTool completed
 Launch Hive
woir@woir-VirtualBox:/home/woir/apache-hive-2.1.0-bin/bin$ hive

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/woir/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/woir/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/home/woir/apache-hive-2.1.0-bin/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
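
As a quick sanity check of the metastore and warehouse setup (an illustrative example, not part of the original transcript), a throwaway table can be created and dropped from the prompt:

hive> show databases;
hive> create table woir_test (id int, name string);
hive> show tables;
hive> drop table woir_test;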

 

Running an external JAR from AWS Hadoop (EMR)


  • First, open the Services menu and click on EMR under Analytics.

Screenshot (45)

 

  • Then click on the add cluster.

Screenshot (46)

 

  • Fill in the details of the cluster:
  1. Cluster name: Ananthapur-jntu
  2. Enable Logging
  3. Browse to the S3 folder amar2017/feb for the log location
  4. Launch mode: Step execution
  5. Then select the step type as Custom JAR and click Configure.
  6. The image below shows the details.

Screenshot (49)

  • After clicking on the Configure button, we will see a popup as shown below:
  1. Name: Custom JAR
  2. JAR location: s3://amar2017/inputJar/wcount.jar
  3. Fill the Arguments field with org.myorg.wordcount, s3://amar2017/deleteme.txt and s3://amar2017/output3
  4. Select the Action on failure as Terminate cluster
  5. Then click the Add button.
  6. The filled-in details are shown below.

 

Screenshot (48)

  • Software configuration
  1. Select the vendor as Amazon
  2. Select the release as emr-5.3.1
  • Hardware configuration
  1. Select the instance type as m1.medium
  2. Set the number of instances to 3
  • Security and access
  1. Leave the permissions as Default
  2. After that, click on the Create cluster button
  • The details are shown in the image below

Screenshot (50)
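
For reference, roughly the same cluster and step can also be created from the AWS CLI. The following is an illustrative sketch based on the values used in this walkthrough (it assumes the aws CLI is configured and the default EMR roles already exist); it is not part of the original console-based procedure:

aws emr create-cluster \
  --name "Ananthapur-jntu" \
  --release-label emr-5.3.1 \
  --use-default-roles \
  --instance-type m1.medium \
  --instance-count 3 \
  --log-uri s3://amar2017/feb \
  --auto-terminate \
  --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=TERMINATE_CLUSTER,Jar=s3://amar2017/inputJar/wcount.jar,Args=[org.myorg.wordcount,s3://amar2017/deleteme.txt,s3://amar2017/output3]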

  • You will see that the cluster Ananthapur is in the Starting state, as shown below.

Screenshot (51)

  • The image below shows the cluster Ananthapur starting in the cluster list.

Screenshot (52)

  • After the process completes, the cluster list looks like the image below.

Screenshot (53)

  • To see the result of the AWS Hadoop job, go to Services.
  • Select S3 under Storage.

Screenshot (142)

 

  • After clicking on S3, select the amar2017 bucket.

Screenshot (149)

  • And then select the output3 folder.
  • You will see the list of files as shown below.

Screenshot (146)

  • Open the part-r-00000 file
  • You will see the page as shown below

Screenshot (147)

  • Click on the Download button.
  • Open the downloaded file; you will see the result as shown below.

Screenshot (57)

Hadoop Word Count Problem


A few basics of Unix –

UNIX TUTORIAL

  • How to check whether a process is running
~> ps -eaf | grep 'java'  lists all the processes that use java
  • How to kill a process forcefully
~> ps -eaf | grep 'java'

The above command shows the process IDs of the processes that use java.

~> kill -9 'process id'

This kills the process with that process ID.
  • What does sudo do –
    • It runs the command with root's privileges

Start Desktop

  • Start Desktop Box
  • Login as amar
    • Click on the top-right corner and choose Hadoop User.
    • Enter password – <your password>

screen-shot-2016-12-16-at-2-14-27-am

  • Click the Ubuntu button in the top-left corner, search for the terminal, and click on it.

screen-shot-2016-12-16-at-2-18-44-am

  • You should see something similar as below

screen-shot-2016-12-16-at-2-20-24-am

Start with HDFS

  • Setup the environment
~> source /home/woir/sourceme
  • Stop all the processes
~> /home/woir/stop_all.sh
  • Start hadoop if not already started –
~> /home/woir/start_hadoop.sh
  • Check if Hadoop is running fine
~> jps

It lists the running Hadoop processes.

The output should look like below -

woir@woir-VirtualBox:/usr/local/hadoop/sbin$ jps
 14416 SecondaryNameNode
 14082 NameNode
 14835 Jps
 3796 Main
 14685 NodeManager
 14207 DataNode
 14559 ResourceManager
  • Make directories for the purpose of demonstration

The following command creates /user/hduser/dir/dir1 and /user/hduser/employees/salary

~> hadoop fs -mkdir -p /user/hduser/dir/dir1 /user/hduser/employees/salary
  • Copy contents into the directory. It can copy directories as well.
~> hadoop fs -copyFromLocal /home/woir/example/WordCount1/file* /user/hduser/dir/dir1
  • The hadoop ls command is used to list out the directories and files –
~> hadoop fs -ls /user/hduser/dir/dir1/
  • The hadoop lsr command recursively displays the directories, sub directories and files in the specified directory. The usage example is shown below:
~> hadoop fs -lsr /user/hduser/dir
  • Hadoop cat command is used to print the contents of the file on the terminal (stdout). The usage example of hadoop cat command is shown below:
~> hadoop fs -cat /user/hduser/dir/dir1/file*
  • The hadoop chmod command is used to change the permissions of files. The -R option can be used to recursively change the permissions of a directory structure.

Note the permissions before –

~> hadoop fs -ls /user/hduser/dir/dir1/

Change the permissions –

~> hadoop fs -chmod 777 /user/hduser/dir/dir1/file1

See it again –

~> hadoop fs -ls /user/hduser/dir/dir1/
  • The hadoop chown command is used to change the ownership of files. The -R option can be used to recursively change the owner of a directory structure.
~> hadoop fs -chown amars:amars /user/hduser/dir/dir1/file1

Check the ownership now –

~> hadoop fs -ls /user/hduser/dir/dir1/file1
  • The hadoop copyFromLocal command is used to copy a file from the local file system to the hadoop hdfs. The syntax and usage example are shown below:
~> hadoop fs -copyFromLocal /home/woir/example/WordCount1/file* /user/hduser/employees/salary
  • The hadoop copyToLocal command is used to copy a file from the hdfs to the local file system. The syntax and usage example is shown below:
~> hadoop fs -copyToLocal /user/hduser/dir/dir1/file1 /home/woir/Downloads/
  • The hadoop cp command is for copying the source into the target.
~> hadoop fs -cp /user/hduser/dir/dir1/file1 /user/hduser/dir/
  • The hadoop moveFromLocal command moves a file from local file system to the hdfs directory. It removes the original source file. The usage example is shown below:
~> hadoop fs -moveFromLocal /home/woir/Downloads/file1  /user/hduser/employees/
  • The hadoop mv command moves files from a source HDFS path to a destination HDFS path. It can also move multiple source files into a target directory, in which case the target must be a directory. The syntax is shown below:
~> hadoop fs -mv /user/hduser/dir/dir1/file2 /user/hduser/dir/
  • The du command displays the aggregate length of files contained in the directory, or the length of a file in case it is just a file. The syntax and usage are shown below:
~> hadoop fs -du /user/hduser
  • Removes the specified list of files and empty directories. An example is shown below:
~> hadoop fs -rm /user/hduser/dir/dir1/file1
  • Recursively deletes the files and sub directories. The usage of rmr is shown below:
~> hadoop fs -rmr /user/hduser/dir

Web UI

NameNode daemon
  • http://localhost:50070/
Log Files
  • http://localhost:50070/logs/
Explore Files
  • http://localhost:50070/explorer.html#/
Status
  • http://localhost:50090/status.html

Hadoop Word Count Example

Go to the home directory and take a look at the directories present

  • ~> cd /home/woir
  • ~> the 'pwd' command should show the path as '/home/woir'.
  • execute 'ls -lart' to take a look at the files and directories in general.
  • Confirm whether the services are running
    • ~> run 'jps' - you should see something similar to the following -

screen-shot-2016-12-09-at-12-44-15-pm

Go to example directory –

  • ~> cd /home/woir/example/WordCount1/
  • Run the command 'ls' – if there is a directory named 'build', delete it and recreate it. This step ensures that your program does not use precompiled jars and other stale files
~> ls -lart 
~> rm -rf build
~> mkdir build
  • Remove JAR file if already existing
    • ~> rm /home/woir/example/WordCount1/wcount.jar
  • Ensure JAVA_HOME and PATH variables are set appropriately
~> echo $PATH
~> echo $JAVA_HOME

JAVA_HOME should be something like /home/woir/JAVA
PATH should have /home/woir/JAVA/bin in it.
  • If the above variables are not set please do that now
~> export JAVA_HOME=/home/woir/JAVA
~> export PATH=$JAVA_HOME/bin:$PATH
  • Set HADOOP_HOME
~> export HADOOP_HOME=/home/woir/hadoop-2.6.0
  • Build the example (please make sure that copy-pasting does not introduce extra spaces or line breaks into the command) –
  • ~> $JAVA_HOME/bin/javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-2.6.0.jar:$HADOOP_HOME/share/hadoop/common/lib/hadoop-annotations-2.6.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0.jar -d build WordCount.java

screen-shot-2016-12-09-at-11-31-18-am

  • Create Jar –
    • ~> jar -cvf wcount.jar -C build/ .
  • Now prepare the input for the program (give the 'output' directory a name of your own – it must not already exist)
    • Make your own input directory –
      • ~> hadoop dfs -mkdir /user/hduser/input
    • Copy the input files ( file1, file2, file3 ) to hdfs location
      • ~> hadoop dfs -put file* /user/hduser/input
    • Check if the output directory already exists.
      ~> hadoop dfs -ls /user/hduser/output
    • If it already exists, delete it with the help of the following commands –
~> hadoop dfs -rm /user/hduser/output/*

~> hadoop dfs -rmdir /user/hduser/output
  • Run the program
~> hadoop jar wcount.jar org.myorg.WordCount /user/hduser/input/ /user/hduser/output

At the end you should see something similar –

screen-shot-2016-12-09-at-11-44-33-am

  • Check if the output files have been generated

screen-shot-2016-12-09-at-11-37-51-am

~> hadoop dfs -ls /user/hduser/output

you should see something similar to below screenshot

screen-shot-2016-12-09-at-11-46-35-am

  • Get the contents of the output files –
~> hadoop dfs -cat /user/hduser/output/part-r-00000

screen-shot-2016-12-09-at-11-48-22-am

  • Verify the word counts against the input files –
~> cat file1 file2 file3

The word counts should match.
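
For a rough cross-check from the shell (an optional step, assuming the input files contain whitespace-separated words), the same counts can be produced locally:

~> cat file1 file2 file3 | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head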

Installer for Ubuntu


  • Please create a new user with sudo permission
    • username – hduser
  • Login to the system with the new user
  • Setup ssh server
    • sudo apt-get install ssh
  • Uncheck the option in Startup Application
  • Screen Shot 2017-01-30 at 11.56.14 AM
  • Screen Shot 2017-01-30 at 11.56.43 AM
  • RESTART BOX
  • Install JPS –
    • Please ensure the 'jps' command is available. If not, type the following to install it
      • sudo apt-get install openjdk-7-jdk
  • Download the package
    • https://drive.google.com/open?id=0B3z01aLb6U-JNWFoUEpYMV81Q2s
    • Check its md5 sum
      • md5sum woir_workshop.tar.gz
      • MD5 ( woir_workshop.tar.gz) = 3dc3efcf732c4222c31ca8dac28c47c1
    • Unpack it in the home directory.
      • tar xvfz woir_workshop.tar.gz
  • Setup the environment
    • add the following at the end of your .bashrc (/home/hduser/.bashrc)
    • source sourceme
  • Install python packages
    • sudo apt-get install python-pip
    • sudo pip install pika
    • sudo apt-get install erlang
  • Cleanup the directory
    • rm -rf ~/data_mongodb
    • mkdir -p ~/data_mongodb
  • Create symbolic link to the JAVA
    • ln -s ~/jdk1.8.0_112    ~/JAVA
  • Generate key
    • ssh-keygen -t rsa -P ""
    • Sample o/p

jagdeep@jagdeep-VirtualBox:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/jagdeep/.ssh/id_rsa):
Your identification has been saved in /home/jagdeep/.ssh/id_rsa.
Your public key has been saved in /home/jagdeep/.ssh/id_rsa.pub.
The key fingerprint is:
24:b2:dd:ec:cb:5f:d4:cc:3a:c6:f1:95:6e:bc:9a:26 jagdeep@jagdeep-VirtualBox
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|    . . .        |
|     + =     +  .|
|    . . S   o +..|
|       .   o +o. |
|        .   * .+ |
|       . . E oo .|
|        o.. oo.. |
+-----------------+

  • Copy the Key

jagdeep@jagdeep-VirtualBox:~$ cat /home/$USER/.ssh/id_rsa.pub >> /home/$USER/.ssh/authorized_keys

  • Validate
    jagdeep@jagdeep-VirtualBox:~$ ssh localhost

You should be able to login without any password requirement.

If it asks for a password, please don't proceed; call me instead.

 

 

  • Modify the following files (replace amar with the user name you created above) –
    • vi  ~/stock-logstash.conf_1.csv
    • hadoop-2.6.0/etc/hadoop/core-site.xml
    • hadoop-2.6.0/etc/hadoop/hadoop-env.sh
    • hadoop-2.6.0/etc/hadoop/hdfs-site.xml
    • Sample output

core-site.xml:  <value>/home/amar/app/hadoop/tmp</value>
hadoop-env.sh:export JAVA_HOME=/home/amar/JAVA
hdfs-site.xml:   <value>file:/home/amar/hadoop_store/hdfs/namenode</value>
hdfs-site.xml:   <value>file:/home/amar/hadoop_store/hdfs/datanode</value>

  • Start Hadoop
    • ./start_hadoop.sh (the first time you will have to enter 'yes')
    • Sample o/p:
      localhost: starting namenode, logging to /home/jagdeep/hadoop-2.6.0/logs/hadoop-jagdeep-namenode-jagdeep-VirtualBox.out
      localhost: starting datanode, logging to /home/jagdeep/hadoop-2.6.0/logs/hadoop-jagdeep-datanode-jagdeep-VirtualBox.out
      Starting secondary namenodes [0.0.0.0]
      The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
      ECDSA key fingerprint is 8c:f5:a0:bb:63:c0:0e:36:50:cc:4a:c0:60:4c:f6:b5.
      Are you sure you want to continue connecting (yes/no)? yes
      0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
  • If required, format the HDFS partition (don't do it unless instructed)
    • hadoop namenode -format

 

Steps to Verify Hadoop/Elasticsearch/ActiveMQ/Cassandra Installation in One Box Setup VM


Hadoop Installation

  • Start Virtual Box, choose the machine you prepared in the earlier step and click the "Start" button (green colour).

screen-shot-2016-12-09-at-11-08-22-am

  • Please log in as the Hadoop user (user id hduser); if asked, please enter the password 'abcd1234'

Screen Shot 2017-01-27 at 12.41.46 AM

  • Click the Ubuntu button in the top-left corner, look for the terminal and click on it

screen-shot-2016-12-09-at-11-10-51-am

  • Once the terminal is up and running it should look similar to the following –

Screen Shot 2017-01-27 at 12.42.28 AM

  • Go to the home directory and take a look at the directories present
    • cd /home/hduser
    • the 'pwd' command should show the path as '/home/hduser'.
    • execute 'ls -lart' to take a look at the files and directories in general.
  • Close already running applications
    • /home/hduser/stop_all.sh
  • Start hadoop
    • /home/hduser/start_hadoop.sh
  • Confirm whether the services are running
    • run 'jps' – you should see something similar to the following –

screen-shot-2016-12-09-at-12-44-15-pm

  • Run the wordcount program using the following command –
    • /home/hduser/run_helloword.sh
  • At the end you should see something similar –

screen-shot-2016-12-09-at-11-44-33-am

  • Check if the output files have been generated
  • hadoop dfs -ls /user/hduser/output     – you should see something similar to the screenshot below

screen-shot-2016-12-09-at-11-46-35-am

  • Get the contents of the output files ( similar to following ) –
    • hadoop dfs -cat /user/hduser/output/part-r-00000

screen-shot-2016-12-09-at-11-48-22-am

  • Finally, shut down the Hadoop services
    • /home/hduser/stop_hadoop.sh

Elasticsearch Installation

  • Close already running applications
    • /home/hduser/stop_all.sh
  • Start Elasticsearch –
    • /home/hduser/start_elasticsearch.sh
    • tail /home/hduser/elastic123.log
      • You should see some messages (there should not be any ERROR); at the end you should see something similar –

screen-shot-2016-12-09-at-12-22-59-pm

  • Verify Elasticsearch instance
    • Open browser ( firefox )
    • goto http://localhost:9200
    • You should see following output

screen-shot-2016-12-09-at-12-26-20-pm

 

  • Start Kibana –
    • /home/hduser/start_kibana.sh
    • tail /home/hduser/kibana123.log
  • You should see some messages (there should not be any ERROR); at the end you should see something similar –

 

Screen Shot 2017-01-27 at 1.04.45 AM

  • Verify Kibana instance
    • Open browser ( firefox )
    • goto http://localhost:5601/app/kibana#
    • You should see similar output

screen-shot-2016-12-16-at-1-26-42-am

  • Shutdown Elasticsearch and Kibana
    • /home/hduser/stop_elasticsearch.sh
    • /home/hduser/stop_kibana.sh

ActiveMQ Installation

  • Close already running applications
    • /home/hduser/stop_all.sh
  • Start ActiveMQ
    • /home/hduser/start_activemq.sh
  • Run validation test – send messages
    • cd /home/hduser/activemq-5.14.3
    • /home/hduser/activemq-5.14.3/send_message.sh
  • See following output on the screen

Screen Shot 2017-01-27 at 1.19.50 AM

  • Continue – receive messages
    • cd /home/hduser/activemq-5.14.3
    • /home/hduser/activemq-5.14.3/receive_message.sh
  • See following output on the screen

Screen Shot 2017-01-27 at 1.24.01 AM

  • Stop ActiveMQ
    • /home/hduser/stop_activemq.sh

 

Steps to Install VM for workshop


  1. Install VirtualBox (follow the steps) on your computer.
  2. Once VirtualBox is installed, the VM can be downloaded from the following link (please see the license terms and conditions before using it for any commercial/business purpose) –
    • https://drive.google.com/open?id=0B2vqFbCIJR_USXBzNVZYZGloOVU
  3. The downloaded image name will be 'woir.vdi'.
  4. Create a new Virtual Machine in VirtualBox using the uncompressed VDI file as the hard drive.
    • Run VirtualBox
    • Click the "New" button
    • Enter the name "Ubuntu-Vasvi"
    • Select "Linux" in the OS Type dropdown
    • Select "Next"
    • On the "Memory" panel choose around 4 GB of memory and click "Next"
    • On the "Virtual Hard Disk" panel select "Existing" – this opens the VirtualBox "Virtual Disk Manager"
    • Select the "Add" button.
    • Select the hard disk file you have downloaded (in this case Vijayawada.vdi).
    • Click "Select"
    • Click "Next"
    • Click "Finished"
    • Click RUN to start the VM (you should see Ubuntu running)
    • Use woir as the username and abcd1234 as the password whenever required.