Hadoop Word Count Problem

A few Unix basics –

UNIX TUTORIAL

  • How to check whether a process is running
~> ps -eaf | grep 'java' lists all processes that use java
  • How to kill a process forcefully
~> ps -eaf | grep 'java'

The above command shows the process IDs of the processes that use java

~> kill -9 'process id'

This forcefully kills the process with that process ID.
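
If the pgrep utility is installed on your system, the lookup and the kill can be combined into a single line (an optional shortcut):

~> kill -9 $(pgrep java)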
  • What does sudo do –
    • It runs the command with root privileges.
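
For example, refreshing the package lists on Ubuntu needs root privileges:

~> sudo apt-get update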

Start Desktop

  • Start Desktop Box
  • Login as amar
    • Click on the top-right corner and choose Hadoop User.
    • Enter password – <your password>

[Screenshot: choosing the Hadoop User at the login screen]

  • Click the Ubuntu button at the top left, search for Terminal, and click on it.

[Screenshot: searching for Terminal in the Ubuntu launcher]

  • You should see something similar to the screenshot below

[Screenshot: a fresh terminal window]

Start with HDFS

  • Setup the environment
~> source /home/amar/sourceme
  • Stop all the processes
~> /home/amar/stop_all.sh
  • Start Hadoop if it is not already running –
~> /home/amar/start_hadoop.sh
  • Check if Hadoop is running fine
~> jps

This lists the running Hadoop processes.

The output should look like this –

amar@amar-VirtualBox:/usr/local/hadoop/sbin$ jps
 14416 SecondaryNameNode
 14082 NameNode
 14835 Jps
 3796 Main
 14685 NodeManager
 14207 DataNode
 14559 ResourceManager
  • Make directories for the demonstration

The command creates the /user/hduser/dir/dir1 and /user/hduser/employees/salary directories (-p also creates any missing parent directories):

~> hadoop fs -mkdir -p /user/hduser/dir/dir1 /user/hduser/employees/salary
  • Copy files into the directory (the command can copy directories as well).
~> hadoop fs -copyFromLocal /home/amar/example/WordCount1/file* /user/hduser/dir/dir1
  • The hadoop ls command lists the directories and files –
~> hadoop fs -ls /user/hduser/dir/dir1/
  • The hadoop lsr command recursively displays the directories, subdirectories, and files in the specified directory. The usage example is shown below:
~> hadoop fs -lsr /user/hduser/dir
  • The hadoop cat command prints the contents of a file to the terminal (stdout). A usage example is shown below:
~> hadoop fs -cat /user/hduser/dir/dir1/file*
  • The hadoop chmod command is used to change the permissions of files. The -R option can be used to recursively change the permissions of a directory structure (a recursive example follows the sequence below).

Note the permissions before –

~> hadoop fs -ls /user/hduser/dir/dir1/

Change the permissions (mode 777 grants read, write and execute to the owner, the group, and everyone else) –

~> hadoop fs -chmod 777 /user/hduser/dir/dir1/file1

See it again –

~> hadoop fs -ls /user/hduser/dir/dir1/
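
As an illustration of the -R option (the path is reused from above; running this is optional):

~> hadoop fs -chmod -R 755 /user/hduser/dir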
  • The hadoop chown command is used to change the ownership of files. The -R option can be used to recursively change the owner of a directory structure.
~> hadoop fs -chown amars:amars /user/hduser/dir/dir1/file1

Check the ownership now –

~> hadoop fs -ls /user/hduser/dir/dir1/file1
  • The hadoop copyFromLocal command is used to copy a file from the local file system to HDFS. The syntax and usage example are shown below:
~> hadoop fs -copyFromLocal /home/amar/example/WordCount1/file* /user/hduser/employees/salary
  • The hadoop copyToLocal command is used to copy a file from HDFS to the local file system. The syntax and usage example are shown below:
~> hadoop fs -copyToLocal /user/hduser/dir/dir1/file1 /home/amar/Downloads/
  • The hadoop cp command copies the source to the target.
~> hadoop fs -cp /user/hduser/dir/dir1/file1 /user/hduser/dir/
  • The hadoop moveFromLocal command moves a file from the local file system to an HDFS directory, removing the original source file. The usage example is shown below:
~> hadoop fs -moveFromLocal /home/amar/Downloads/file1  /user/hduser/employees/
  • The hadoop mv command moves files from a source HDFS location to a destination HDFS location. It can also move multiple source files, in which case the target must be a directory (see the second example below). The syntax is shown below:
~> hadoop fs -mv /user/hduser/dir/dir1/file2 /user/hduser/dir/
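
A hypothetical multi-source move (the target must be a directory; adjust the paths to files that actually exist in your HDFS):

~> hadoop fs -mv /user/hduser/dir/file1 /user/hduser/dir/file2 /user/hduser/employees/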
  • The du command displays the aggregate length of files contained in a directory, or the length of a file if it is just a file. The syntax and usage are shown below:
~> hadoop fs -du /user/hduser
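
For human-readable sizes, the -h option can be added (available in Hadoop 2.x):

~> hadoop fs -du -h /user/hduser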
  • Removes the specified list of files and empty directories. An example is shown below:
~> hadoop fs -rm /user/hduser/dir/dir1/file1
  • Recursively deletes the files and subdirectories. The usage of rmr is shown below:
~> hadoop fs -rmr /user/hduser/dir

Web UI

Open these addresses in a browser.

NameNode daemon –
  • http://localhost:50070/
Log Files –
  • http://localhost:50070/logs/
Explore Files –
  • http://localhost:50070/explorer.html#/
SecondaryNameNode Status –
  • http://localhost:50090/status.html

Hadoop Word Count Example

Go to the home directory and take a look at the directories present

  • ~> cd /home/amar
  • ~> pwd should show the path as '/home/amar'.
  • ~> ls -lart lists the files and directories.
  • Confirm that the Hadoop services are running –
    • ~> jps – you should see something similar to the following –

[Screenshot: jps output]

Go to the example directory –

  • ~> cd /home/amar/example/WordCount1/
  • Run 'ls' – if a directory named 'build' already exists, delete it and recreate it. This ensures that your program does not use precompiled jars or other stale files –
~> ls -lart 
~> rm -rf build
~> mkdir build
  • Remove the JAR file if it already exists
    • ~> rm /home/amar/example/WordCount1/wcount.jar
  • Ensure JAVA_HOME and PATH variables are set appropriately
~> echo $PATH
~> echo $JAVA_HOME

JAVA_HOME should be something like /home/amar/JAVA.
PATH should include /home/amar/JAVA/bin.
  • If the above variables are not set, set them now –
~> export JAVA_HOME=/home/amar/JAVA
~> export PATH=$JAVA_HOME/bin:$PATH
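
You can verify that the right JVM is picked up with:

~> java -version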
  • Set HADOOP_HOME
~> export HADOOP_HOME=/home/amar/hadoop-2.6.0
  • Build the example (when you copy-paste, make sure no stray spaces or line breaks are introduced into the command); a reference sketch of WordCount.java follows the screenshot below –
  • ~> $JAVA_HOME/bin/javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-2.6.0.jar:$HADOOP_HOME/share/hadoop/common/lib/hadoop-annotations-2.6.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0.jar -d build WordCount.java

[Screenshot: compiling WordCount.java]
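
For reference, here is a minimal sketch of what WordCount.java typically contains. It follows the standard Hadoop MapReduce word-count example, with the package set to org.myorg to match the class name used when the job is run below; the copy shipped in /home/amar/example/WordCount1/ may differ in detail.

package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count job, following the standard Hadoop example.
public class WordCount {

  // Mapper: emits (word, 1) for every whitespace-separated token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] = HDFS input directory, args[1] = HDFS output directory.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}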

  • Create Jar –
    • ~> jar -cvf wcount.jar -C build/ .
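
To confirm that the classes were packaged, you can list the jar contents:

~> jar -tf wcount.jar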
  • Now prepare the input for the program (give the 'output' directory a name of your own – it must not already exist)
    • Make your own input directory –
      • ~> hadoop fs -mkdir /user/hduser/input
    • Copy the input files (file1, file2, file3) to the HDFS location –
      • ~> hadoop fs -put file* /user/hduser/input
    • Check whether the output directory already exists –
      • ~> hadoop fs -ls /user/hduser/output
    • If it already exists, delete it with the following commands –

~> hadoop fs -rm /user/hduser/output/*

~> hadoop fs -rmdir /user/hduser/output
  • Run the program
~> hadoop jar wcount.jar org.myorg.WordCount /user/hduser/input/ /user/hduser/output

At the end you should see something similar to the screenshot below –

[Screenshot: job completion output]

  • Check if the output files have been generated


~> hadoop fs -ls /user/hduser/output

You should see something similar to the screenshot below –

[Screenshot: listing of /user/hduser/output]

  • Get the contents of the output file (part-r-00000 is the output of the single reducer) –
~> hadoop fs -cat /user/hduser/output/part-r-00000

[Screenshot: word counts printed by hadoop fs -cat]

  • Verify the word counts against the input files –
~> cat file1 file2 file3

The word counts should match.
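
To compute the counts locally for comparison, a pipeline like the following approximates what the job does (assuming whitespace-separated words):

~> cat file1 file2 file3 | tr -s '[:space:]' '\n' | sort | uniq -c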