PIG tutorials

Installation

Download

 

Untar

tar xvfz pig-0.16.0.tar.gz

Move

mv pig-0.16.0/ ~/

Environment Variables

Copy the following into the /home/amar/sourceme file:

export PIG_HOME=/home/amar/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
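
After sourcing that file, the installation can be verified; this is a quick check assuming the paths above match your machine:

source /home/amar/sourceme
pig -version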

------------------------------------------------------------------

Data Set

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

Load Data



student = LOAD '/user/student_data.txt' 
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
   city:chararray );

Store Data



STORE student INTO '/gitam/arm' USING PigStorage (',');
hadoop dfs -ls /gitam/arm
hadoop dfs -cat /gitam/arm/part-m-00000


Explain and Illustrate

The EXPLAIN operator is used to display the logical, physical, and MapReduce execution plans of a relation, and the DESCRIBE operator displays the schema of a relation.
grunt> EXPLAIN relation_name;
grunt> DESCRIBE relation_name;
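
For example, with the student relation loaded earlier (the exact plan output depends on your Pig version):

grunt> DESCRIBE student;
grunt> EXPLAIN student;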


The illustrate operator gives you the step-by-step execution of a sequence of statements.

grunt> illustrate Relation_name;

New Data Set

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
student_details = LOAD '/user/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
group_data = GROUP student_details by age;
Dump group_data;


Describe group_data;

group_multiple = GROUP student_details by (age, city);
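
As a small illustration beyond the original exercise, the grouped relation can also be aggregated, for example to count students per age (COUNT operates on the bag of grouped tuples):

grunt> count_by_age = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS total;
grunt> Dump count_by_age;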







customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00 
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00




orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060



grunt> customers = LOAD '/user/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> orders = LOAD '/user/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Inner Join

grunt> customer_orders = JOIN customers BY id, orders BY customer_id;

Left Outer Join

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Right Outer Join

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Full Outer Join

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION operation on two relations, their columns and domains must be identical.

grunt> student = UNION student1, student2;

The SPLIT operator is used to split a relation into two or more relations.


SPLIT student_details into student_details1 if age<23, student_details2 if (age > 22 and age < 25);
The FOREACH operator generates specified columns of a relation:
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;

grunt> distinct_data = DISTINCT student_details;
grunt> order_by_data = ORDER student_details BY age DESC;
 
grunt> limit_data = LIMIT student_details 4;

Data

– https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-08/

To get the top 4 most used pages –

records = LOAD '/user/pagecounts-20160801-000000' USING PigStorage(' ') as (projectName:chararray,pageName:chararray,pageCount:int,pageSize:int);

filtered_records = FILTER records by projectName == 'en';
 grouped_records = GROUP filtered_records by pageName;
 results = FOREACH grouped_records generate group, SUM(filtered_records.pageCount);
 sorted_result = ORDER results by $1 desc;
 limit_data = LIMIT sorted_result 4;
 dump limit_data;
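
If the result needs to be kept, it can also be written back to HDFS instead of dumped; the output path below is only an example and must not already exist:

 store limit_data into '/user/top_pages' using PigStorage(',');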

 

 

 

HBase Commands

./bin/hbase shell

woir> list

woir> status

woir> version

woir> table_help

woir> whoami

woir> create 'woir', 'family data', 'work data'

woir> disable 'woir'

woir> is_disabled 'table name'

woir> is_disabled 'woir'

woir> disable_all 'amar.*'

woir> enable 'woir'

woir> scan 'woir'

woir> is_enabled 'table name'

woir> is_enabled 'woir'

woir> describe 'woir'

woir> alter 'woir', NAME => 'family data', VERSIONS => 5

woir> alter 'woir', READONLY

woir> alter 't1', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'

woir> alter 'table name', 'delete' => 'column family'

woir> scan 'woir'

woir> alter 'woir','delete'=>'work data'

woir> scan 'woir'

woir> exists 'woir'

woir> exists 'student'

woir> exists 'student'

woir> drop 'woir'

woir> exists 'woir'

woir> drop_all 't.*'

./bin/stop-hbase.sh

woir> put 'woir','1','family data:name','kaka'


woir> put 'woir','1','work data:salary','50000'

woir> scan 'woir'

woir> put 'woir','row1','family:city','Delhi'

woir> scan 'woir'

woir> get 'woir', '1'

woir> get 'table name', 'row id', {COLUMN => 'column family:column name'}

woir> get 'woir', 'row1', {COLUMN=>'family:name'}

woir> delete 'woir', '1', 'family data:city', 1417521848375

woir> deleteall 'woir','1'

woir> scan 'woir'

woir> count 'woir'

woir> truncate 'table name'




Command Usage
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'} – displays the metadata (region info) of the tables present in HBase
hbase> scan 'woir', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'abc'} – displays the contents of table 'woir' for the column families c1 and c2, limited to 10 rows, starting from row key 'abc'
hbase> scan 'woir', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]} – displays the contents of 'woir' for column c1, restricted to cells whose timestamps fall within the given time range
hbase> scan 'woir', {RAW => true, VERSIONS => 10} – RAW => true is an advanced option that displays all cell versions (up to 10) present in the table 'woir'
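
As a concrete example against the 'woir' table created above (which has the column families 'family data' and 'work data'):

woir> scan 'woir', {COLUMNS => 'family data', LIMIT => 2}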
HIVE Queries


Sample data employee.csv ( save it in /home/woir/Downloads/ )
1,Amar Sharma1,42,Gachibowli – Hyderabad-1,40000,Technology1
2,Amar Sharma2,43,Gachibowli – Hyderabad-2,40005,Technology2
3,Amar Sharma3,44,Gachibowli – Hyderabad-3,40010,Technology3
4,Amar Sharma4,45,Gachibowli – Hyderabad-4,40015,Technology4
5,Amar Sharma5,46,Gachibowli – Hyderabad-5,40020,Technology5
6,Amar Sharma6,47,Gachibowli – Hyderabad-6,40025,Technology6
7,Amar Sharma7,48,Gachibowli – Hyderabad-7,40030,Technology7
8,Amar Sharma8,49,Gachibowli – Hyderabad-8,40035,Technology8
9,Amar Sharma9,50,Gachibowli – Hyderabad-9,40040,Technology9
10,Amar Sharma10,51,Gachibowli – Hyderabad-10,40045,Technology10
11,Amar Sharma11,52,Gachibowli – Hyderabad-11,40050,Technology11
12,Amar Sharma12,53,Gachibowli – Hyderabad-12,40055,Technology12
13,Amar Sharma13,54,Gachibowli – Hyderabad-13,40060,Technology13
14,Amar Sharma14,55,Gachibowli – Hyderabad-14,40065,Technology14
15,Amar Sharma15,56,Gachibowli – Hyderabad-15,40070,Technology15
16,Amar Sharma16,57,Gachibowli – Hyderabad-16,40075,Technology16
17,Amar Sharma17,58,Gachibowli – Hyderabad-17,40080,Technology17
18,Amar Sharma18,59,Gachibowli – Hyderabad-18,40085,Technology18
19,Amar Sharma19,60,Gachibowli – Hyderabad-19,40090,Technology19
20,Amar Sharma20,61,Gachibowli – Hyderabad-20,40095,Technology20
21,Amar Sharma21,62,Gachibowli – Hyderabad-21,40100,Technology21
22,Amar Sharma22,63,Gachibowli – Hyderabad-22,40105,Technology22
23,Amar Sharma23,64,Gachibowli – Hyderabad-23,40110,Technology23
24,Amar Sharma24,65,Gachibowli – Hyderabad-24,40115,Technology24
25,Amar Sharma25,66,Gachibowli – Hyderabad-25,40120,Technology25
26,Amar Sharma26,67,Gachibowli – Hyderabad-26,40125,Technology26
27,Amar Sharma27,68,Gachibowli – Hyderabad-27,40130,Technology27
28,Amar Sharma28,69,Gachibowli – Hyderabad-28,40135,Technology28
29,Amar Sharma29,70,Gachibowli – Hyderabad-29,40140,Technology29
30,Amar Sharma30,71,Gachibowli – Hyderabad-30,40145,Technology30

hadoop dfs -ls /user/hive/warehouse/
hadoop dfs -copyFromLocal /home/woir/Downloads/*.csv /user/amar
hadoop dfs -ls /user/hive/warehouse/woir.db

Create database <DatabaseName>

-> create database woir ;
create database woir_training;
show databases;
drop database woir_training;
use woir;
-> show tables ;
-> create table employees_woir(Id INT, Name STRING, Age INT, Address STRING, Salary FLOAT, Department STRING) Row format delimited Fields terminated by ',';

-> LOAD DATA LOCAL INPATH '/home/woir/Downloads/employee.csv' INTO table employees_woir;
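
A quick way to confirm the load worked is a row count (this starts a MapReduce job and may take a little while):

-> select count(*) from employees_woir;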

-> create TABLE order_history (OrderId INT,Date1 TIMESTAMP, Id INT, Amount FLOAT) ROW Format delimited Fields terminated by ',';

-> LOAD DATA INPATH '/user/woir/orders.csv' INTO table order_history;
-> create TABLE order_history_tmp (OrderId INT,Date1 TIMESTAMP, Id INT, Amount FLOAT) ROW Format delimited Fields terminated by ',';
-> show tables;
-> alter table order_history_tmp rename to order_history_deleteme;
-> show tables;
-> drop table order_history_deleteme;

To create the internal table

-> CREATE TABLE woirhive_internaltable (id INT,Name STRING) Row format delimited Fields terminated by ',';

Load the data into internal table

-> LOAD DATA INPATH '/user/names.csv' INTO table woirhive_internaltable;

Display the content of the table

-> select * from woirhive_internaltable;
-> select * from woirhive_internaltable where id=1;

To drop the internal table

-> DROP TABLE woirhive_internaltable;

If you drop woirhive_internaltable, both its metadata and its data are deleted from Hive, because it is an internal (managed) table.
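
One way to confirm this, assuming the default warehouse location used earlier, is to check that the table and its directory are gone:

-> show tables;
hadoop dfs -ls /user/hive/warehouse/woir.db/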

Create External table

-> CREATE EXTERNAL TABLE woirhive_external(id INT,Name STRING) Row format delimited Fields terminated by ',' LOCATION '/user/woirhive_external';

We can load the data manually

-> LOAD DATA INPATH '/user/names.txt' INTO TABLE woirhive_external;

Display the content of the table

-> select * from woirhive_external;

To drop the external table

-> DROP TABLE woirhive_external;
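
For an external table, dropping it removes only the metadata; the data files stay at the LOCATION given in the CREATE statement. A quick check (path as used above):

hadoop dfs -ls /user/woirhive_external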

Creation of Table allstates

-> create table allstates(state string, District string,Enrolments string) row format delimited fields terminated by ',';

Loading data into created table all states

-> Load data local inpath '/home/woir/Downloads/states.csv' into table allstates;

Creation of partition table

-> create table state_part(District string,Enrolments string) PARTITIONED BY(state string);
-> set hive.exec.dynamic.partition.mode=nonstrict

For dynamic partitioning we have to set this property before loading data into the partition table.

Loading data into partition table

-> INSERT OVERWRITE TABLE state_part PARTITION(state) SELECT district,enrolments,state from allstates;
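
To see the partitions that were created, the standard Hive command can be used:

-> show partitions state_part;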

In Hive, we have to enable bucketing before inserting into a bucketed table:

-> set hive.enforce.bucketing = true;

-> create table samplebucket (    
    first_name string,
    job_id    int,
    department string,
    salary string
) clustered by  (department) into 4 buckets 
  row format delimited
fields terminated by ',' ;
-> from employees_woir insert overwrite table samplebucket select Name, id, Department, Salary;
# Id INT, Name STRING, Age INT, Address STRING, Salary FLOAT, Department STRING
->  hadoop dfs -lsr /user/hive/warehouse/woir.db/samplebucket/
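
As an illustrative follow-up (not part of the original walkthrough), a bucketed table can be sampled bucket-wise:

-> SELECT * FROM samplebucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON department);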

Example:

-> Create VIEW Sample_View AS SELECT * FROM employees_woir  WHERE salary>40100;

-> select * from Sample_View;
-> Create INDEX sample_Index ON TABLE woirhive_internaltable(id) AS 'COMPACT' WITH DEFERRED REBUILD;
-> # create table employees_woir(Id INT, Name STRING, Age INT, Address STRING, Salary FLOAT, Department STRING) Row format delimited Fields terminated by ',';
-> SELECT * FROM employees_woir ORDER BY Department;

-> SELECT Department, count(*) FROM employees_woir GROUP BY Department;

-> SELECT * from employees_woir SORT BY Id DESC;

-> SELECT Id, Name from employees_woir CLUSTER BY Id;

-> SELECT Id, Name from employees_woir DISTRIBUTE BY Id;

Join queries Different type of joins

Joins are of 4 types, these are –

  • Inner join
  • Left outer Join
  • Right Outer Join
  • Full Outer Join

Inner Join:

The records common to both tables are retrieved by the inner join.

-> SELECT c.Id, c.Name, c.Age, o.Amount FROM employees_woir c JOIN order_history o ON(c.Id=o.Id);

Left Outer Join:

Hive query language LEFT OUTER JOIN returns all the rows from the left table, even when there are no matches in the right table. If the ON clause matches zero records in the right table, the join still returns a row in the result, with NULL in each column from the right table.

-> SELECT c.Id, c.Name, o.Amount, o.Date1 FROM employees_woir c LEFT OUTER JOIN order_history o ON(c.Id=o.Id);

Right outer Join:

Hive query language RIGHT OUTER JOIN returns all the rows from the right table, even when there are no matches in the left table. If the ON clause matches zero records in the left table, the join still returns a row in the result, with NULL in each column from the left table. A RIGHT join always returns all records from the right table and the matched records from the left table; if the left table has no matching value for a column, NULL is returned in its place.

-> SELECT c.Id, c.Name, o.Amount, o.Date1 FROM employees_woir c RIGHT OUTER JOIN order_history o ON(c.Id=o.Id);

Full outer join:

It combines the records of both tables (here employees_woir and order_history) based on the JOIN condition given in the query. It returns all the records from both tables and fills in NULL values for the columns missing matching values on either side.

-> SELECT c.Id, c.Name, o.Amount, o.Date1 FROM employees_woir c FULL OUTER JOIN order_history o ON(c.Id=o.Id);

 Download – employee names order states

Hive Installation

Download
woir@woir-VirtualBox:/tmp$ wget http://archive.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz
 

woir@woir-VirtualBox:/tmp$ tar xvzf apache-hive-2.1.0-bin.tar.gz -C /home/woir
Copy and paste the following into /home/woir/sourceme:
export HIVE_HOME=/home/woir/apache-hive-2.1.0-bin
export HIVE_CONF_DIR=/home/woir/apache-hive-2.1.0-bin/conf
export PATH=$HIVE_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:/home/woir/hadoop-2.6.0/lib/*:.
export CLASSPATH=$CLASSPATH:/home/woir/apache-hive-2.1.0-bin/lib/*:.
export HADOOP_HOME=~/hadoop-2.6.0


hduser@laptop:/home/woir/apache-hive-2.1.1-bin$ source ~/.bashrc
$ echo $HADOOP_HOME
/home/woir/hadoop-2.6.0
$ hive --version
Hive 2.1.0
Subversion git://jcamachguezrMBP/Users/jcamachorodriguez/src/workspaces/hive/HIVE-release2/hive -r 9265bc24d75ac945bde9ce1a0999fddd8f2aae29
Compiled by jcamachorodriguez on Fri Jun 17 01:03:25 BST 2016
From source with checksum 1f896b8fae57fbd29b047d6d67b75f3c
woir@woir-VirtualBox:~$ hadoop dfs -ls /
drwxr-xr-x   - hduser supergroup          0 2016-11-23 11:17 /hbase
drwx------   - hduser supergroup          0 2016-11-18 16:04 /tmp
drwxr-xr-x   - hduser supergroup          0 2016-11-18 09:13 /user

woir@woir-VirtualBox:~$ hadoop dfs -mkdir -p /user/hive/warehouse
woir@woir-VirtualBox:~$ hadoop dfs -mkdir /tmp
woir@woir-VirtualBox:~$ hadoop dfs -chmod g+w /tmp 
woir@woir-VirtualBox:~$ hadoop dfs -chmod g+w /user/hive/warehouse 
woir@woir-VirtualBox:~$ hadoop dfs -ls /
   drwxr-xr-x - hduser supergroup 0 2016-11-23 11:17 /hbase 
   drwx-w---- - hduser supergroup 0 2016-11-18 16:04 /tmp 
   drwxr-xr-x - hduser supergroup 0 2016-11-23 17:18 /user 
woir@woir-VirtualBox:~$ hadoop dfs -ls /user 
   drwxr-xr-x - hduser supergroup 0 2016-11-18 23:17 /user/hduser 
   drwxr-xr-x - hduser supergroup 0 2016-11-23 17:18 /user/hive
woir@woir-VirtualBox:~$ cd $HIVE_HOME/conf
woir@woir-VirtualBox:~/home/woir/apache-hive-2.1.0-bin/conf$ vi hive-env.sh.template
# Set HADOOP_HOME to point to a specific hadoop install directory
# HADOOP_HOME=${bin}/../../hadoop

# Hive Configuration Directory can be controlled by:
# export HIVE_CONF_DIR=/home/woir/apache-hive-2.1.0-bin/conf


hduser@laptop:/home/woir/apache-hive-2.1.0-bin/conf$ cp hive-env.sh.template hive-env.sh
export HADOOP_HOME=/home/woir/hadoop-2.6.0
$ cd /tmp

$ wget http://archive.apache.org/dist/db/derby/db-derby-10.13.1.1/db-derby-10.13.1.1-bin.tar.gz

$ tar xvzf db-derby-10.13.1.1-bin.tar.gz -C /home/woir
Copy and paste the following into /home/woir/sourceme:
export DERBY_HOME=/home/woir/db-derby-10.13.1.1-bin
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
$ mkdir $DERBY_HOME/data
hduser@laptop:~$ cd $HIVE_HOME/conf

create a new file hive-site.xml
woir@woir-VirtualBox:/home/woir/apache-hive-2.1.1-bin/conf$ vi hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/home/woir/apache-hive-2.1.0-bin/metastore_db;create=true</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>hive.metastore.uris</name>
<value/>
<description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.PersistenceManagerFactoryClass</name>
<value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
<description>class implementing the jdo persistence</description>
</property>
</configuration>

 

Metastore schema initialisation
woir@woir-VirtualBox:/home/woir/apache-hive-2.1.0-bin/bin$ schematool -initSchema -dbType derby
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/woir/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/woir/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:     jdbc:derby:;databaseName=/home/woir/apache-hive-2.1.0-bin/metastore_db;create=true
Metastore Connection Driver :     org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User:     APP
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.derby.sql
Initialization script completed
schemaTool completed
 Launch Hive
woir@woir-VirtualBox:/home/woir/apache-hive-2.1.0-bin/bin$ hive

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/woir/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/woir/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/home/woir/apache-hive-2.1.0-bin/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>

 

Running an external jar from Aws Hadoop


Hadoop AWS

  • First select Services and click on EMR under Analytics.

Screenshot (45)

 

  • Then click on the add cluster.

Screenshot (46)

 

  • Fill in the details of the cluster.
  1. Cluster name as Ananthapur-jntu
  2. Here we are enabling Logging
  3. Browse to the S3 folder amar2017/feb
  4. Launch mode should be Step execution
  5. After that select the step type as Custom JAR and click on Configure.
  6. The image below shows the details.

Screenshot (49)

  • After clicking on the Configure button you will see a popup like the one shown below
  1. Name as Custom JAR
  2. JAR location should be s3://amar2017/inputJar/wcount.jar
  3. Fill in the Arguments with org.myorg.wordcount, s3://amar2017/deleteme.txt, s3://amar2017/output3
  4. Select the Action on failure as Terminate cluster
  5. Then click on the Add button.
  6. How to fill in the details is shown below.

 

Screenshot (48)

  • Software configuration
  1. Select Vendor as Amazon
  2. Select Release as emr-5.3.1
  • Hardware configuration
  1. Select instance type as m1.medium
  2. And number of instances as 3
  • Security and access
  1. Keep Permissions as Default
  2. After that click on the Create cluster button
  • Details are shown in the image below

Screenshot (50)

  • You will see the Cluster Ananthapur is Starting as shown below.

Screenshot (51)

  • The below image is showing that in cluster list Ananthapur is starting.

Screenshot (52)

  • After the process completes, you will see something like the image below.

Screenshot (53)

  • To see the result of AWS Hadoop go to the services.
  • Select S3 under storage.

Screenshot (142)

 

  • After clicking on S3
  • select amar2017

Screenshot (149)

  • And then select the output3 folder
  • You will see the list of files as shown below

Screenshot (146)

  • Open the part-r-00000 file
  • You will see the page as shown below

Screenshot (147)

  • Click on the Download button
  • Open the downloaded file and you will see the result as shown below.

Screenshot (57)

Creating an Amazon EC2


Creating an Amazon EC2 instance

An EC2 instance is nothing but a virtual server in Amazon Web Services terminology. It stands for Elastic Compute Cloud. It is a web service where an AWS subscriber can request and provision a compute server in AWS cloud.

  • First Create an AWS account
  • Login and access to AWS services.

Step 1) In this step,

  • Login to your AWS account and go to the AWS Services tab at the top left corner.
  • Here, you will see all of the AWS services categorized by area, viz. Compute, Storage, Database, etc. For creating an EC2 instance, we have to choose Compute → EC2 as in the next step.

100

  • Open all the services and click on EC2 under Compute services. This will launch the dashboard of EC2.

Here is the EC2 dashboard. Here you will get all the information in gist about the AWS EC2 resources running.

101

Step 2) On the top right corner of the EC2 dashboard, choose the AWS Region in which you want to provision the EC2 server.

Here we are selecting Asia Pacific (Singapore). AWS provides 10 Regions all over the globe.

102

Step 3) In this step

  • Once your desired Region is selected, come back to the EC2 Dashboard.
  • Click on ‘Launch Instance’ button in the section of Create Instance (as shown below).

103

Choose AMI

Step 4) In this step we will do,

  • You will be asked to choose an AMI of your choice. (An AMI is an Amazon Machine Image. It is a template basically of an Operating System platform which you can use as a base to create your instance). Once you launch an EC2 instance from your preferred AMI, the instance will automatically be booted with the desired OS. (We will see more about AMIs in the coming part of the tutorial).
  • Here we are choosing the default Amazon Linux (64 bit) AMI.

104

Choose Instance Types

Step 5) In the next step, you have to choose the type of instance you require based on your business needs.

  • We will choose t2.micro instance type, which is a 1vCPU and 1GB memory server offered by AWS.
  • Click on “Configure Instance Details” for further configurations

105

Configure Instance

Step 6)

  • No. of instances- you can provision up to 20 instances at a time. Here we are launching one instance.
  • Under Purchasing Options, keep the option of ‘Request Spot Instances’ unchecked as of now. (This is done when we wish to launch Spot instances instead of on-demand ones.

106

Step 7) Next, we have to configure some basic networking details for our EC2 server.

  • You have to decide here, in which VPC (Virtual Private Cloud) you want to launch your instance and under which subnets inside your VPC.
  • Network section will give a list of VPCs available in our platform.
  • Select an already existing VPC, or create your own VPC by clicking on the Create new VPC link.

Here I have selected a default VPC where I want to launch my instance.

107

Step 8) In this step,

  • A VPC consists of subnets, which are IP ranges that are separated for restricting access.
  • Below,
  1. Under Subnets, you can choose the subnet where you want to place your instance.
  2. I have chosen a default existing public subnet.
  3. You can also create a new subnet in this by clicking on the Create new subnet link.

108

Step 9) In this step,

  • You can choose if you want AWS to assign it an IP automatically, or you want to do it manually later. You can enable/ disable ‘Auto assign Public IP’ feature here likewise.
  • Here we are going to assign this instance a static IP called as EIP (Elastic IP) later. So we keep this feature Enabled as of now.

 

110

Step 10) In this step,

  • In the following step, keep the option of IAM role ‘None’ as of now.
  • Shutdown Behavior – when you accidentally shut down your instance, you surely don’t want it to be deleted, only stopped.
  • Here we are defining my shutdown behavior as Stop.

110

Step 11) In this step,

  • In case you have accidentally terminated your instance, AWS has a layer of security mechanism: it will not delete your instance if you have enabled accidental termination protection.
  • Here we are checking the option for further protecting our instance from accidental termination.

111_LI

Step 12) In this step,

  • Under Monitoring- you can enable Detailed Monitoring if your instance is a business critical instance. Here we have kept the option unchecked. AWS will always provide Basic monitoring on your instance free of cost.
  • Under Tenancy – select the option of shared tenancy. If your application is highly secure, then you should go for dedicated capacity. AWS provides both options.
  • Next,Click on ‘Add Storage’ to add data volumes to your instance in next step.

111

Add Storage

Step 13) In this step we do following things,

  • In the Add Storage step, you’ll see that the instance has been automatically provisioned a General Purpose SSD root volume of 8GB. (The maximum size of a General Purpose SSD volume is 16TB.)
  • You can change your volume size, add new volumes, change the volume type, etc.
  • AWS provides 3 types of EBS volumes- Magnetic, General Purpose SSD, Provisioned IOPs. You can choose a volume type based on your application’s IOPs needs.
  • Here we have selected the General Purpose SSD (GP2) option.

112

Tag Instance

Step 14) In this step

  • You can tag your instance with a key-value pair. This gives visibility to the AWS account administrator when there are a large number of instances.
  • The instances should be tagged based on their department, environment (like Dev/SIT/Prod), etc. This gives a clear view of the costing of the instances under one common tag.
  1. Here we have tagged the instance as Ananthapur-jntu
  2. Go to configure Security Groups later

113

Configuring Security Groups

Step 15) In this next step of configuring Security Groups, you can restrict traffic on your instance ports. This is an added firewall mechanism provided by AWS apart from your instance’s OS firewall.

You can define open ports and IPs.

  • Since our server is a web server, we will do the following things
  1. Creating a new Security Group
  2. Naming our SG for easier reference
  3. Defining protocols which we want enabled on my instance
  4. Assigning IPs which are allowed to access our instance on the said protocols
  5. Once, the firewall rules are set- Review and launch

114

Review Instances

Step 16) In this step, we will review all our choices and parameters and go ahead to launch our instance.

115

Step 17) In the next step you will be asked to create a key pair to log in to your instance. A key pair is a set of public-private keys.

AWS stores the public key on the instance, and you are asked to download the private key. Make sure you download the key and keep it safe and secure; if it is lost you cannot download it again.

  1. Create a new key pair
  2. Give a name to your key
  3. Download and save it in your secured folder

116

  • When you download your key, you can open and have a look at your RSA private key.

142

Step 18) Once you are done downloading and saving your key, launch your instance.

  • You can see the launch status meanwhile.

143

  • You can also see the launch log.

118

  • After that click on the View Instances button; it will show your instance.

119

  • Click on the ‘Instances’ option on the left pane where you can see the status of the instance as ‘Pending’ for a brief while.
  • Once your instance is up and running, you can see its status as ‘Running’ now.
  • Note that the instance has received a Private IP from the pool of AWS.

121

Creating a EIP and connecting to your instance

An EIP is a static public IP provided by AWS. It stands for Elastic IP. Normally when you create an instance, it will receive a public IP from AWS’s pool automatically. If you stop/reboot your instance, this public IP will change – it is dynamic. In order for your application to have a static IP from which you can connect via public networks, you can use an EIP.

Step 19) On the left pane of EC2 Dashboard, you can go to ‘Elastic IPs’ as shown below.

144

Step 20) Allocate a new Elastic IP Address.

123

  • After allocating the new address you will see the success message as shown below.

124

 

Step 21) Now assign this IP to your instance.

  • Select the IP
  • Click on Actions -> Associate Address

126

 

Step 22) In the next page,

  • Search for your instance and
  • Associate the IP to it.

127 128

  • After that, click on the Associate button; you can see the success message as shown below.

130

Step 23) Come back to your instances screen, you’ll see that your instance has received your EIP.

131

Step 24) Now open putty from your programs list and add your same EIP in there as below.

132

Step 25) In this step,

Add your private key in putty for secure connection

  1. Go to Auth
  2. Browse your private key in .ppk (putty private key) format

Once done click on “Open” button

133

  • Once you connect, you will successfully see the Linux prompt.
  • Please note that the machine you are connecting from should be enabled on the instance Security Group for SSH (like in the steps above).

134

  • Now you can see that the Running Instances count is 1 in the EC2 Dashboard, as shown below.

135

Step 26)

If you want to stop or terminate the instance, select it from the Instances list in the EC2 Dashboard.

  • click on Actions > Instance State > Terminate/Stop

136

  • You will see something like the image below; then click on the Yes, Terminate button.

138

  • After clicking on that you will see something like the images below.

139 140

  • Now check the EC2 Dashboard; you should see that the running instances count is ‘0’.

141

 

 

 

Linux Commands Hands-on


Linux Main Commands – useful to complete the entire workshop

Start Desktop

  • Login as hduser
  • Click on the right top corner and choose Hadoop User.
  • Enter password – <your password>
  • screen-shot-2016-12-16-at-2-14-27-am
    • Click on the Top Left Ubuntu Button and search for the terminal and click on it.

    screen-shot-2016-12-16-at-2-18-44-am

    • You should see something similar as below

    screen-shot-2016-12-16-at-2-20-24-am

 

Learn Unix on a different directory

  • Create a directory ‘scratch’ to learn Unix in, and change into it.
  • mkdir /home/hduser/scratch
  • cd /home/hduser/scratch

 

  • mkdir – make directories
    Usage: mkdir [OPTION] DIRECTORY…

    mkdir pvpsit
  • mkdir btech3rdyear
  • mkdir btech4thyear

 

  • ls – list directory contents
    Usage: ls [OPTION]… [FILE]…
    eg. ls, ls -l, ls -lhn
  • ls -lart # it should show you the directory you have just created.

 

 

  • cd – changes directories
    Usage: cd [DIRECTORY]
    eg. cd btech4thyear
cd btech4thyear # it will change your current directory
ls -lart   # it should not show anything as this directory does not have anything yet

 

  • pwd – print name of current working directory
    Usage: pwd

 

pwd # should print the current working directory.

o/p should show - 
/home/hduser/scratch/btech4thyear

 

  • nano – text editor
    Usage: nano <file name>
    eg. nano test.txt
nano test.txt # opens a file, write something

o/p
it should open a file to be edited in the command line. Please use CTRL+X and choose the save option to come out from there.

check using -
ls -lart

 

  • cp – copy files and directories
    Usage: cp [OPTION]… SOURCE DEST
    eg. cp sample.txt sample_copy.txt
    cp sample_copy.txt target_dir

 

cp test.txt test2.txt  # make a copy of your file just now you have created

check using -
ls -lart

 

  • mv – move (rename) files
    Usage: mv [OPTION]… SOURCE DEST
    eg. mv source.txt target_dir
    mv old.txt new.txt

 

mv test.txt  test1.txt # it will rename your file to test1.txt
check using -
ls -lart

 

  • rm – remove files or directories
    Usage: rm [OPTION]… FILE…
    eg. rm file1.txt, rm -rf some_dir

 

rm test2.txt # it will delete the test2.txt
check using - 
ls -lart #you should see only one file test1.txt

 

  • find – search for files in a directory hierarchy
    Usage: find [OPTION] [path] [pattern] eg. find file1.txt, find -name file1.txt

 

find /home/hduser -name "*.txt" # will display all the files with extension .txt

 

  • history – prints recently used commands
    Usage: history

 

history  # should show all the commands just now you ran - helpful or not ?

 

  • cat – concatenate files and print on the standard output
    Usage: cat [OPTION] [FILE]…
    eg. cat file1.txt file2.txt
    cat -n file1.txt

 

cat test1.txt # it will show the contents which just now you have created.

 

  • echo – display a line of text
    Usage: echo [OPTION] [string] …
    eg. echo I love India
    echo $HOME
echo "Hello Amar" # just any string

echo $PATH   # environment variable - tells where the files can be present.

echo $JAVA_HOME #environment variable


  • grep – print lines matching a pattern
    Usage: grep [OPTION] PATTERN [FILE]…
    eg. grep -i apple sample.txt

 

grep <word> test1.txt # should show only the lines which contain the word

 

  •  wc – print the number of newlines, words, and bytes in files
    Usage: wc [OPTION]… [FILE]…
    eg.  wc file1.txt
    wc -L file1.txt

 

wc test1.txt # will show you how many lines, words and characters your file has.

 

  • sort – sort lines of text files
    Usage: sort [OPTION]… [FILE]…
    eg. sort file1.txt
    sort -r file1.txt

 

sort test1.txt # show the contents sorted.

 

  • tar – to archive a file
    Usage: tar [OPTION] DEST SOURCE
    eg. tar -cvf /home/archive.tar /home/original
tar -cvf scratch.tar /home/hduser/scratch # will make a tar archive of your directory.

 

  • kill – to kill a process(using signal mechanism)
    Usage: kill [OPTION] pid
    eg. kill -9 2275

 

kill -9 2275 # will stop the process with process id 2275

 

  • ps – report a snapshot of the current processes
    Usage: ps [OPTION]
    eg. ps, ps -el

 

ps -eaf  # will show all the processes running currently

 

  • who – show who is logged on
    Usage: who [OPTION]
    eg. who, who -b, who -q

 

who # will list the total number of users currently logged in; in your case it will show only one user 'hduser'

 

  • passwd – update  a user’s authentication tokens(s)
    Usage: passwd [OPTION]
    eg. passwd

 

passwd # will prompt to take a new password - please don't do this during the workshop

 

  •  su – change user ID or become superuser
    Usage: su [OPTION] [LOGIN]
    eg. su remo, su

 

su akhil # it will ask password for akhil and then you can login as user akhil

 

  • chown – change file owner and group
    Usage: chown [OPTION]… OWNER[:[GROUP]] FILE…
    eg. chown remo myfile.txt

 

chown root test1.txt # requires you to be working as root - please don't try this during the workshop

 

  • chmod – change file access permissions
    Usage: chmod [OPTION] [MODE] [FILE]
    eg. chmod 744 calculate.sh
ls -lart test1.txt # observe file permission

chmod -x test1.txt # observe the permission again

ls -lart test1.txt

chmod 777 test1.txt # observe the permission again

ls -lart test1.txt
  • zip – package and compress (archive) files
    Usage: zip [OPTION] DEST SOURCE
    eg. zip original.zip original

 

zip test1.txt.zip test1.txt # (or: gzip test1.txt) it will compress the file into a .zip (or .gz) archive

check using -

ls -lart

 

  • unzip – list, test and extract compressed files in a ZIP archive
    Usage: unzip filename
    eg. unzip original.zip

 

unzip test1.txt.zip  # (or: gunzip test1.txt.gz) it will uncompress your file again

 

  • ssh – SSH client (remote login program)
    “ssh is a program for logging into a remote machine and for
    executing commands on a remote machine”
    Usage: ssh [options] [user]@hostname
    eg. ssh -X guest@10.105.11.20

 

ssh akhil@192.168.1.19 # you will login to the box with IP address 192.168.1.19 with user akhil

 

  • scp – secure copy (remote file copy program)
    “scp copies files between hosts on a network”
    Usage: scp [options] [[user]@host1:file1] [[user]@host2:file2]
    eg. scp file1.txt guest@10.105.11.20:~/Desktop/
scp test1.txt akhil@192.168.1.19:/home/akhil/  # it copies the file test1.txt to system having IP 192.168.1.19 in user akhil's home directory.

 

  • du – estimate file space usage
    Usage:  du [OPTION]… [FILE]…
    eg. du
du -s -h /home/hduser  # will show the total space occupied in /home/hduser directory.
  • df – report filesystem disk space usage
    Usage: df [OPTION]… [FILE]…
    eg. df

 

df -h . # it will show you the disk usage - very useful command

 

  • reboot – reboot the system
    Usage: reboot [OPTION]
    eg. reboot
reboot # it reboots the system - however it require you to be root. Please don't do it during the workshop

 

  • gedit ­ A text Editor. Used to create and edit files.
    Usage: gedit [OPTION] [FILE]…
    eg. gedit

 

gedit test1.txt # opens your file in GUI mode for editing - very useful for beginners

 

  • bg – make a foreground process to run in background
    Usage: type ‘ctrl+z’  and then ‘bg ‘
type ‘ctrl+z’  and then ‘bg ‘ # please run ctrl+z command in any foreground process and then type bg

 

  • fg – to make background process as foreground process
    Usage: fg [jobid]
fg # it bring backs the background process in the foreground

 

 

  • jobs – displays the names and ids of background jobs
    Usage: jobs
jobs # display all the jobs

  • locate – find or locate a file
    Usage: locate [OPTION]… FILE…
    eg. locate file1.txt

locate test1.txt # find the file for you in your system.

 

  • sed – stream editor for filtering and transforming text
    Usage: sed [OPTION] [input-file]…
    eg. sed 's/love/hate/g' loveletter.txt

 

sed 's/tutorial/workshop/g' test1.txt # replace all the occurence of tutorial with workshop. Very useful command
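
To modify the file in place instead of printing to the screen, GNU sed's -i option can be used (this rewrites test1.txt, so keep a copy if needed):

sed -i 's/tutorial/workshop/g' test1.txt # replaces every occurrence of tutorial with workshop inside the file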

 

  • awk – pattern scanning and processing language
    eg.  awk -F: '{ print $1 }' sample_awk.txt
awk -F ' ' '{print $1}' test1.txt # print first column of your file separated with white space

 

 

Elasticsearch, logstash and Kibana


  • Start Desk Top

 

  • Login as amar
    • Click on the right top corner and choose Hadoop User.
    • Enter password – <your password>

screen-shot-2016-12-16-at-2-14-27-am

  • Click on the Top Left Ubuntu Button and search for the terminal and click on it.

screen-shot-2016-12-16-at-2-18-44-am

  • You should see something similar as below

screen-shot-2016-12-16-at-2-20-24-am

Start all the services-

  • cd /home/amar
  • set JAVA_HOME
    • export JAVA_HOME=/home/amar/JAVA
  • Start elasticsearch
    ~/elasticsearch-5.0.2/bin/elasticsearch 2>&1 > ~/elastic123.log &

    Check – open firefox and type –

  • http://localhost:9200
  • Start Kibana
    • ~/kibana-5.0.2-linux-x86/bin/kibana 2>&1 > ~/kibana123.log &
      
      
  • Open firefox and type –
     http://localhost:5601

Reference to Unix Commands –

  • http://www.thegeekstuff.com/2010/11/50-linux-commands/

Elasticsearch

  • Run all the commands given below on the terminal which you have opened in the previous step.

Terminology:

Table comparing Elasticsearch terminology with traditional relational database terminology:

MySQL (RDBMS) terminology      Elasticsearch terminology
Database                       Index
Table                          Type
Row                            Document

Used Restful methods

  • HTTP Methods used: GET, POST, PUT, DELETE

Exercises and Solutions


To start, please open a terminal on your Ubuntu Box
  • Check the status of the elasticsearch from the command line.

i/p

curl http://localhost:9200

o/p

Expected output on the screen –

{
 "name" : "MRPWrOy",
 "cluster_name" : "elasticsearch",
 "cluster_uuid" : "-YtQj9REQjaCbROg0Nc74w",
 "version" : {
 "number" : "5.0.2",
 "build_hash" : "f6b4951",
 "build_date" : "2016-11-24T10:07:18.101Z",
 "build_snapshot" : false,
 "lucene_version" : "6.2.1"
 },
 "tagline" : "You Know, for Search"
}
  • Create an index named “company”

i/p

curl -XPUT http://localhost:9200/company

o/p

{"acknowledged":true,"shards_acknowledged":true}

  • Create another index named “govtcompany”

i/p

curl -XPUT http://localhost:9200/govtcompany

o/p

{"acknowledged":true,"shards_acknowledged":true}
  • Get list of indices created so far –

i/p

curl -XGET http://localhost:9200/_cat/indices?pretty

o/p

yellow open nyc_visionzero BlWR26RdQYaHHym9JpYq_w 5 1 424707 0 436.7mb 436.7mb

yellow open govtcompany    6Wp2J3AxRoa9eCl82jPL2A 5 1      0 0    650b    650b

yellow open company        HB89IjvdSVez_In5nqSzWQ 5 1      0 0    650b    650b

yellow open employee       Xt3nJVgFRiWs3kwKEhY6XQ 5 1      0 0    650b    650b

yellow open .kibana        6Wl9qr8DSLm-g__2oQbX-w 1 1     14 0  34.6kb  34.6kb
  • Delete an Index and check the indices again –

i/p

curl -XDELETE http://localhost:9200/govtcompany

curl -XGET http://localhost:9200/_cat/indices?pretty

o/p

yellow open nyc_visionzero BlWR26RdQYaHHym9JpYq_w 5 1 424707 0 436.7mb 436.7mb

yellow open company        HB89IjvdSVez_In5nqSzWQ 5 1      0 0    650b    650b

yellow open employee       Xt3nJVgFRiWs3kwKEhY6XQ 5 1      0 0    650b    650b

yellow open .kibana        6Wl9qr8DSLm-g__2oQbX-w 1 1     14 0  34.6kb  34.6kb
  • Status of index ‘company’

i/p

curl -XGET  http://localhost:9200/company?pretty

o/p

{
 "company" : {
 "aliases" : { },
 "mappings" : { },
 "settings" : {
 "index" : {
 "creation_date" : "1481891504208",
 "number_of_shards" : "5",
 "number_of_replicas" : "1",
 "uuid" : "WNW9bNGRTdqbXVWFhBcYRg",
 "version" : {
 "created" : "5000299"
 },
 "provided_name" : "company"
 }
 }
 }
 }
  • Delete company also

i/p

curl -XDELETE http://localhost:9200/company

o/p

{"acknowledged":true }
  • Create index with specified data type –

i/p

curl -XPUT http://localhost:9200/company -d '{

"mappings": {
"employee": {
"properties": {
"age": {
"type": "long"
},

"experience": {
"type": "long"
},

"name": {
"type": "string",
"analyzer": "standard"
}
}
}
}
}'

o/p

{"acknowledged":true,"shards_acknowledged":true}
  • Status of index ‘company’

i/p

curl -XGET  http://localhost:9200/company?pretty

o/p

curl -XGET  http://localhost:9200/company?pretty

{

"company" : {

"aliases" : { },

"mappings" : {

"employee" : {

"properties" : {

"age" : {

"type" : "long"

},

"experience" : {

"type" : "long"

},

"name" : {

"type" : "text",

"analyzer" : "standard"

}

}

}

},

"settings" : {

"index" : {

"creation_date" : "1481744753984",

"number_of_shards" : "5",

"number_of_replicas" : "1",

"uuid" : "Wqsco08iROCqiwRr7eXnwA",

"version" : {

"created" : "5000299"

},

"provided_name" : "company"

}

}

}

}
  • Data Insertion

i/p

 curl -XPOST http://localhost:9200/company/employee -d '{

"name": "Amar Sharma",

"age" : 45,

"experience" : 10

}'

o/p

{"_index":"company","_type":"employee","_id":"AVj-48T3Vl1JB8XO-oqO","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}

i/p

 curl -XPOST http://localhost:9200/company/employee -d '{

"name": "Sriknaht Kandi",

"age" : 35,

"experience" : 7

}'

o/p

{"_index":"company","_type":"employee","_id":"AVj-5CrJVl1JB8XO-oqP","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}

i/p

curl -XPOST http://localhost:9200/company/employee -d '{

"name": "Abdul Malik",

"age" : 25,

"experience" : 3

}'

o/p

{"_index":"company","_type":"employee","_id":"AVj-5HiXVl1JB8XO-oqQ","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}
  • Retreive Data

i/p

curl -XGET http://localhost:9200/company/employee/_search?pretty

o/p

{

"took" : 2,

"timed_out" : false,

"_shards" : {

"total" : 5,

"successful" : 5,

"failed" : 0

},

"hits" : {

"total" : 4,

"max_score" : 1.0,

"hits" : [

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-5HiXVl1JB8XO-oqQ",

"_score" : 1.0,

"_source" : {

"name" : "Abdul Malik",

"age" : 25,

"experience" : 3

}

},

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-5CrJVl1JB8XO-oqP",

"_score" : 1.0,

"_source" : {

"name" : "Sriknaht Kandi",

"age" : 35,

"experience" : 7

}

},

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-48T3Vl1JB8XO-oqO",

"_score" : 1.0,

"_source" : {

"name" : "Amar Sharma",

"age" : 45,

"experience" : 10

}

},

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-40-mVl1JB8XO-oqN",

"_score" : 1.0,

"_source" : {

"name" : "Andrew",

"age" : 45,

"experience" : 10

}

}

]

}

}
  • Conditional Search:

i/p

curl -XPOST http://localhost:9200/company/employee/_search?pretty -d '{

"query": {

"match_all": {}

}

}'

o/p

{

"took" : 1,

"timed_out" : false,

"_shards" : {

"total" : 5,

"successful" : 5,

"failed" : 0

},

"hits" : {

"total" : 4,

"max_score" : 1.0,

"hits" : [

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-5HiXVl1JB8XO-oqQ",

"_score" : 1.0,

"_source" : {

"name" : "Abdul Malik",

"age" : 25,

"experience" : 3

}

},

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-5CrJVl1JB8XO-oqP",

"_score" : 1.0,

"_source" : {

"name" : "Sriknaht Kandi",

"age" : 35,

"experience" : 7

}

},

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-48T3Vl1JB8XO-oqO",

"_score" : 1.0,

"_source" : {

"name" : "Amar Sharma",

"age" : 45,

"experience" : 10

}

},

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-40-mVl1JB8XO-oqN",

"_score" : 1.0,

"_source" : {

"name" : "Andrew",

"age" : 45,

"experience" : 10

}

}

]

}

}
  • Another Variant

i/p

curl -XGET http://localhost:9200/company/employee/_search?pretty

o/p

{

"took" : 1,

"timed_out" : false,

"_shards" : {

"total" : 5,

"successful" : 5,

"failed" : 0

},

"hits" : {

"total" : 4,

"max_score" : 1.0,

"hits" : [

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-5HiXVl1JB8XO-oqQ",

"_score" : 1.0,

"_source" : {

"name" : "Abdul Malik",

"age" : 25,

"experience" : 3

}

},

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-5CrJVl1JB8XO-oqP",

"_score" : 1.0,

"_source" : {

"name" : "Sriknaht Kandi",

"age" : 35,

"experience" : 7

}

},

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-48T3Vl1JB8XO-oqO",

"_score" : 1.0,

"_source" : {

"name" : "Amar Sharma",

"age" : 45,

"experience" : 10

}

},

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-40-mVl1JB8XO-oqN",

"_score" : 1.0,

"_source" : {

"name" : "Andrew",

"age" : 45,

"experience" : 10

}

}

]

}

}
  • Fetch all employees with a particular name

i/p

curl -XGET http://localhost:9200/_search?pretty -d  '{

"query": {

"match": {

"name": "Amar Sharma"

}

}

}'

o/p

{

"took" : 11,

"timed_out" : false,

"_shards" : {

"total" : 11,

"successful" : 11,

"failed" : 0

},

"hits" : {

"total" : 1,

"max_score" : 0.51623213,

"hits" : [

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-48T3Vl1JB8XO-oqO",

"_score" : 0.51623213,

"_source" : {

"name" : "Amar Sharma",

"age" : 45,

"experience" : 10

}

}

]

}

}
  • Employees with age greater than a number :

i/p

curl -XGET http://localhost:9200/_search?pretty -d '

{

"query": {

"range": {

"age": {"gt": 35 }

}

}

}'

o/p

{

"took" : 7,

"timed_out" : false,

"_shards" : {

"total" : 11,

"successful" : 11,

"failed" : 0

},

"hits" : {

"total" : 2,

"max_score" : 1.0,

"hits" : [

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-48T3Vl1JB8XO-oqO",

"_score" : 1.0,

"_source" : {

"name" : "Amar Sharma",

"age" : 45,

"experience" : 10

}

},

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-40-mVl1JB8XO-oqN",

"_score" : 1.0,

"_source" : {

"name" : "Andrew",

"age" : 45,

"experience" : 10

}

}

]

}

}
  • Fetch data with multiple conditions

i/p

curl -XGET http://localhost:9200/_search?pretty -d '{

"query": {   "bool": {

"must":     { "match": {"name": "Andrew" }},

"should":   { "range": {"age": { "gte":  35 }}}

}

}}'

o/p

{

"took" : 9,

"timed_out" : false,

"_shards" : {

"total" : 11,

"successful" : 11,

"failed" : 0

},

"hits" : {

"total" : 1,

"max_score" : 1.287682,

"hits" : [

{

"_index" : "company",

"_type" : "employee",

"_id" : "AVj-40-mVl1JB8XO-oqN",

"_score" : 1.287682,

"_source" : {

"name" : "Andrew",

"age" : 45,

"experience" : 10

}

}

]

}

}
  • Create the records in Elasticsearch with specific id ( here it is ‘2’ )
i/p

curl -XPUT 'http://localhost:9200/company/employee/2' -d '{

"name": "Amar3 Sharma",

"age" : 45,

"experience" : 10

}'

 

 

o/p

{"_index":"company","_type":"employee","_id":"2","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}
  • Update the record which you created ( please note the version number in the o/p ) –

 

i/p

curl -XPUT 'http://localhost:9200/company/employee/2' -d '{

"name": "Amar4 Sharma",

"age" : 45,

"experience" : 10

}'

o/p

{"_index":"company","_type":"employee","_id":"2","_version":2,"result":"updated","_shards":{"total":2,"successful":1,"failed":0},"created":false}
  • Update the record which you created again (note the version number in the o/p again)

i/p

curl -XPUT http://localhost:9200/company/employee/2 -d '{

"name": "Amar5 Sharma",

"age" : 45,

"experience" : 10

}'

o/p

{"_index":"company","_type":"employee","_id":"2","_version":3,"result":"updated","_shards":{"total":2,"successful":1,"failed":0},"created":false}
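
To fetch the document by its id and confirm the latest version, a standard document GET can be used; the response should show "_version" : 3 and the latest "name" value:

i/p

curl -XGET http://localhost:9200/company/employee/2?pretty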

Assignment (PDF): http://woir.in/wp-content/uploads/2016/12/AssignmentCVR-ES-watermark.pdf

Logstash –

Please run following commands from terminal login as user –

  • Login as amar
  • cd /home/amar
  • Cleanup the environment
curl -XDELETE http://localhost:9200/stock
\rm -rf ~/logstash-5.0.2/data/plugins/inputs/file/.sincedb_*
  • set JAVA_HOME
    export JAVA_HOME=~/JAVA
  • Download the data to be inserted into ES
    • Sample input file to be used with Logstash – table / larger file table-3
wget -O /home/amar/Downloads/table-3.csv http://woir.in/wp-content/uploads/2016/12/table-3.csv

 

 

wget -O /home/amar/Downloads/stock-logstash.conf_1.csv http://woir.in/wp-content/uploads/2016/12/stock-logstash.conf_1.csv
  • Point the config file and run the logstash – it will insert data into elasticsearch
/home/amar/logstash-5.0.2/bin/logstash -f /home/amar/Downloads/stock-logstash.conf_1.csv
  • Check data insertion is done or not –

 

curl -XGET http://localhost:9200/stock/_search?pretty

 

You should see o/p similar to following –

{
   "took" : 5,
   "timed_out" : false,
   "_shards" : {
     "total" : 5,
     "successful" : 5,
     "failed" : 0
   },
   "hits" : {
     "total" : 22,
     "max_score" : 1.0,
     "hits" : [
       {
         "_index" : "stock",
         "_type" : "logs",
         "_id" : "AVkBgrZ2Vl1JB8XO-oqh",
         "_score" : 1.0,
         "_source" : {
          "High" : "High", ........................................................

Kibana

Start Kibana
export JAVA_HOME=~/JAVA

~/kibana-5.0.2-linux-x86/bin/kibana 2>&1 > ~/kibana123.log &
  • Open firefox type –
 http://localhost:5601

It should look something similar to following –

screen-shot-2016-12-16-at-1-26-42-am

  • Configure Index in Kibana ( Click on Management )

screen-shot-2016-12-16-at-1-27-00-am

  • Click on Add New , fill in stock and choose Date field.

screen-shot-2016-12-16-at-1-28-01-am

  • Mark it as default Index to be used by clicking on the star

screen-shot-2016-12-16-at-1-27-13-am

  • Time is to create visualization – click on the tab right side.

screen-shot-2016-12-16-at-1-14-31-am

  • Choose stock ( if there is more than one index available )

screen-shot-2016-12-16-at-1-14-53-am

  • Choose Line Chart for this example

screen-shot-2016-12-16-at-1-28-45-am

  • Fill in the fields as below

screen-shot-2016-12-16-at-1-29-03-am

  • Fill in the fields as below

screen-shot-2016-12-16-at-1-29-35-am

  • Fill in the fields as below

screen-shot-2016-12-16-at-1-29-52-am

  • Fill in the fields as below

screen-shot-2016-12-16-at-1-31-15-am

  • Click on the little caret sign at the top right corner.

screen-shot-2016-12-16-at-1-31-48-am

  • Choose the time interval of the last two years

screen-shot-2016-12-16-at-1-33-15-am

  • Apply your changes again

screen-shot-2016-12-16-at-1-33-35-am

  • Click on the save button and give it a name ( Line Chart in this case )

screen-shot-2016-12-16-at-1-33-58-am

  • Prepare another chart yourself ( Bar Chart )

screen-shot-2016-12-16-at-1-36-40-am screen-shot-2016-12-16-at-1-37-00-am

  • Visualizations are ready – we need to add them in the Dashboard – click on the Dashboard on the left side.

screen-shot-2016-12-16-at-1-37-28-am

  • Click on New and then on the Add button, and now choose the visualizations which have just been added.

screen-shot-2016-12-16-at-1-51-48-am

NYC Motor Vehicle Collision

  • Please open a readymade report to play around with the dashboard.
  • screen-shot-2016-12-23-at-12-13-14-am
Hadoop Word Count Problem


Few basics in Unix –

UNIX TUTORIAL

  • How to check if a process is running or not
~> ps -eaf | grep 'java'  will list all the processes which use java
  • How to kill a process forcefully
~> ps -eaf | grep 'java'

The above command shows the process ids of the process which uses java

~> kill -9 'process id'

it will kill the job with that process id
  • What does sudo do –
    • It runs the command with root’s privilege

Start Desktop

  • Start Desktop Box
  • Login as amar
    • Click on the right top corner and choose Hadoop User.
    • Enter password – <your password>

screen-shot-2016-12-16-at-2-14-27-am

  • Click on the Top Left Ubuntu Button and search for the terminal and click on it.

screen-shot-2016-12-16-at-2-18-44-am

  • You should see something similar as below

screen-shot-2016-12-16-at-2-20-24-am

Start with HDFS

  • Setup the environment
~> source /home/woir/sourceme
  • Stop all the processes
~> /home/woir/stop_all.sh
  • Start hadoop if not already started –
~> /home/woir/start_hadoop.sh
  • Check if Hadoop is running fine
~> jps

it will list down the running hadoop processes.

o/p should look like below -

woir@woir-VirtualBox:/usr/local/hadoop/sbin$ jps
 14416 SecondaryNameNode
 14082 NameNode
 14835 Jps
 3796 Main
 14685 NodeManager
 14207 DataNode
 14559 ResourceManager
  • Make directory for the purpose of demonstration

The command creates the /user/hduser/dir/dir1 and /user/hduser/employees/salary

~> hadoop fs -mkdir -p /user/hduser/dir/dir1 /user/hduser/employees/salary
  • Copy contents in to the directory. It can copy directory also.
~> hadoop fs -copyFromLocal /home/woir/example/WordCount1/file* /user/hduser/dir/dir1
  • The hadoop ls command is used to list out the directories and files –
~> hadoop fs -ls /user/hduser/dir/dir1/
  • The hadoop lsr command recursively displays the directories, sub directories and files in the specified directory. The usage example is shown below:
~> hadoop fs -lsr /user/hduser/dir
  • Hadoop cat command is used to print the contents of the file on the terminal (stdout). The usage example of hadoop cat command is shown below:
~> hadoop fs -cat /user/hduser/dir/dir1/file*
  • The hadoop chmod command is used to change the permissions of files. The -R option can be used to recursively change the permissions of a directory structure.

Note the permission before –

~> hadoop fs -ls /user/hduser/dir/dir1/

Change the permission –

~> hadoop fs -chmod 777 /user/hduser/dir/dir1/file1

See it again –

~> hadoop fs -ls /user/hduser/dir/dir1/
  • The hadoop chown command is used to change the ownership of files. The -R option can be used to recursively change the owner of a directory structure.
~> hadoop fs -chown amars:amars /user/hduser/dir/dir1/file1

Check the ownership now –

~> hadoop fs -ls /user/hduser/dir/dir1/file1
  • The hadoop copyFromLocal command is used to copy a file from the local file system to the hadoop hdfs. The syntax and usage example are shown below:
~> hadoop fs -copyFromLocal /home/woir/example/WordCount1/file* /user/hduser/employees/salary
  • The hadoop copyToLocal command is used to copy a file from the hdfs to the local file system. The syntax and usage example is shown below:
~> hadoop fs -copyToLocal /user/hduser/dir/dir1/file1 /home/woir/Downloads/
  • The hadoop cp command is for copying the source into the target.
~> hadoop fs -cp /user/hduser/dir/dir1/file1 /user/hduser/dir/
  • The hadoop moveFromLocal command moves a file from local file system to the hdfs directory. It removes the original source file. The usage example is shown below:
~> hadoop fs -moveFromLocal /home/woir/Downloads/file1  /user/hduser/employees/
  • It moves the files from source hdfs to destination hdfs. Hadoop mv command can also be used to move multiple source files into the target directory. In this case the target should be a directory. The syntax is shown below:
~> hadoop fs -mv /user/hduser/dir/dir1/file2 /user/hduser/dir/
  • The du command displays aggregate length of files contained in the directory or the length of a file in case its just a file. The syntax and usage is shown below:
~> hadoop fs -du /user/hduser
  • Removes the specified list of files and empty directories. An example is shown below:
~> hadoop fs -rm /user/hduser/dir/dir1/file1
  • Recursively deletes the files and sub directories. The usage of rmr is shown below:
~> hadoop fs -rmr /user/hduser/dir

—Web UI

NameNode daemon
  • ~> http://localhost:50070/
—Log Files
  • ~> http://localhost:50070/logs/
—Explore Files
  • ~> http://localhost:50070/explorer.html#/
—Status
  • ~> http://localhost:50090/status.html

Hadoop Word Count Example

Go to home directory and take a look on the directory presents

  • ~> cd /home/woir
  • ~> 'pwd' command should show path as '/home/woir'.
  • execute 'ls -lart' to take a look on the files and directory in general.
  • Confirm that service is running successfully or not
    • ~> run 'jps' - you should see something similar to following -

screen-shot-2016-12-09-at-12-44-15-pm

Go to example directory –

  • ~> cd /home/woir/example/WordCount1/
  • Run command ‘ls’ – if there is a directory named ‘build’ please delete that and recreate the same directory. This step will ensure that your program does not use precompiled jars and other files
~> ls -lart 
~> rm -rf build
~> mkdir build
  • Remove JAR file if already existing
    • ~> rm /home/woir/example/WordCount1/wcount.jar
  • Ensure JAVA_HOME and PATH variables are set appropriately
~> echo $PATH
~> echo $JAVA_HOME

JAVA_HOME should be something like /home/woir/JAVA
PATH should have /home/woir/JAVA/bin in it.
  • If the above variables are not set please do that now
~> export JAVA_HOME=/home/woir/JAVA
~> export PATH=$JAVA_HOME/bin:$PATH
  • Set HADOOP_HOME
~> export HADOOP_HOME=/home/woir/hadoop-2.6.0
  • Build the example ( please make sure that when you copy – paste it does not leave any space between the command) –
  • ~> $JAVA_HOME/bin/javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-2.6.0.jar:$HADOOP_HOME/share/hadoop/common/lib/hadoop-annotations-2.6.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0.jar -d build WordCount.java

screen-shot-2016-12-09-at-11-31-18-am

  • Create Jar –
    • ~> jar -cvf wcount.jar -C build/ .
  • Now prepare the input for the program ( please give the ‘output’ directory your own name – it should not already exist )
    • Make your own input directory –
      • ~> hadoop dfs -mkdir /user/hduser/input
    • Copy the input files ( file1, file2, file3 ) to hdfs location
      • ~> hadoop dfs -put file* /user/hduser/input
    • Check if the output directory already exists.
      ~> hadoop dfs -ls /user/hduser/output
    • If it already existing delete with the help of following command –
~> hadoop dfs -rm /user/hduser/output/*

~> hadoop dfs -rmdir /user/hduser/output
  • Run the program
~> hadoop jar wcount.jar org.myorg.WordCount /user/hduser/input/ /user/hduser/output

At the end you should see something similar –

screen-shot-2016-12-09-at-11-44-33-am

  • Check if the output files have been generated

screen-shot-2016-12-09-at-11-37-51-am

~> hadoop dfs -ls /user/hduser/output

you should see something similar to below screenshot

screen-shot-2016-12-09-at-11-46-35-am

  • Get the contents of the output files –
~> hadoop dfs -cat /user/hduser/output/part-r-00000

screen-shot-2016-12-09-at-11-48-22-am

  • Verify the word count with the input files-
~> cat file1 file2 file3

The word counts should match.
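
A quick, approximate cross-check can also be done locally with a plain Unix pipeline (independent of Hadoop; it splits on spaces only, so counts for words separated by other whitespace may differ slightly):

~> cat file1 file2 file3 | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head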