Hadoop Distributed File System (HDFS) Cluster: Documentation

Appliance Details

https://docs.google.com/document/d/1v-2I9-krzFfymHr3sQaJZ0QoNnoVVYJw8iuf7JIUIPQ

Tutorial: Hadoop Distributed File System (HDFS)

This tutorial is adapted from the GENI tutorial “Distributed Computing on GENI: Hadoop in a Slice”.

Overview

In this tutorial you will create a Hadoop cluster composed of three machines. The tutorial will lead you through creating the cluster, observing the properties of the cluster, and running a Hadoop example which sorts a large dataset.

After completing the tutorial you should be able to:

  1. Use Chameleon to create a virtual distributed computational cluster.
  2. Use complex appliances in Chameleon.
  3. Use the basic functionality of a Hadoop filesystem.
  4. Observe resource utilization of compute nodes in a virtual distributed computational cluster.

Prerequisites

This tutorial assumes you have a Chameleon account and basic experience logging into and using Chameleon.

An introductory tutorial is available here: https://www.chameleoncloud.org/docs/getting-started/

Create the HDFS Cluster

  1. Navigate to the Hadoop Distributed File System (HDFS) appliance in the Chameleon appliance catalog: https://www.chameleoncloud.org/appliances/53/. Note the “Get Template” link. This is where you can download or view the Heat template you will need to launch the cluster.

  2. Click Launch Complex Appliance at CHI@TACC (or CHI@UC). This will take you to the Orchestration section of the Chameleon dashboard.

  3. Click “Launch Stack”

  4. Set the template source to URL and copy/paste the following URL: https://www.chameleoncloud.org/appliances/api/appliances/53/template. Note that this is the URL available at the “Get Template” link on the appliance page seen in step 1.

  5. Click “Next”

  6. Fill in the complex appliance parameters. The stack name should be unique (e.g. include your login ID). Password is your Chameleon password. Hadoop_worker_count is the number of Hadoop workers you want to deploy. Network_cidr is an arbitrary network subnet your cluster will use (the subnet does not need to be unique). Network name is a unique name for your network (e.g. include your login ID). Reservation ID is a drop-down list that should include any reservations currently owned by your project.

  7. Click “Launch”. You will now see your stack (and possibly other stacks) listed.

  8. Click your stack name. Explore the Topology, Overview, Resources, Events, and Template tabs.

  9. Click the Overview tab to find and note the public “hadoop_master_ip” (i.e. the public IP you will use to ssh to your cluster). In the example, the master IP is 129.114.109.123. Also, note the private IPs of the workers. In the example, they are 192.168.100.9 and 192.168.100.6. The private IPs should be from the subnet you specified in the cluster template.

  10. As your cluster is being created, explore the rest of the Chameleon dashboard to see the status of your cluster and/or its individual parts. First the network will be created, then the worker nodes, and finally the master node. After all the nodes are created, boot scripts will configure the Hadoop cluster. This will take more than 10 minutes, so go get some coffee.

  11. After your coffee, your stack will be in the “Create Complete” state and will soon be configured and ready to use. Once it is ready, you will be able to login to the master node using the IP from step 9.

  12. Login to the master node. You will need to ssh using the private key corresponding to the public key you used to configure the cluster. SSH as the “cc” user using the IP you noted while deploying the cluster.

  13. Observe the properties of the network. Notice that your master node also has a private IP from the subnet you specified. In this case it is 192.168.100.10.

  14. Confirm that you can ping between the private IPs of your nodes.
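The network_cidr parameter from step 6 and the private IPs noted in steps 9 and 13 can be sanity-checked with Python's standard ipaddress module. A small sketch, assuming the example subnet 192.168.100.0/24 and the example node IPs used above:

```python
import ipaddress

# The subnet entered as network_cidr in step 6 (example value from this tutorial).
subnet = ipaddress.ip_network("192.168.100.0/24")

# Private IPs reported in the stack Overview: two workers and the master.
node_ips = ["192.168.100.9", "192.168.100.6", "192.168.100.10"]

# Every node address should fall inside the subnet you specified.
in_subnet = all(ipaddress.ip_address(ip) in subnet for ip in node_ips)
print(in_subnet)             # True when the cluster used the expected CIDR
print(subnet.num_addresses)  # 256 addresses in a /24
```

If a node's private IP is not inside the subnet you entered, the stack was likely launched with different parameters than you intended.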

Run the “Hello, World” experiment

  1. Create a small test file

    [cc@master ~]$  echo Hello Chameleon World > hello.txt
    
  2. Push the file into the Hadoop filesystem

    [hadoop@master ~]$ hdfs dfs -put hello.txt /hello.txt
    
  3. Check for the file's existence

    [hadoop@master ~]$ hdfs dfs -ls /
    Found 1 items
    -rw-r--r--   2 hadoop supergroup  22 2018-05-09 22:29 /hello.txt
    
  4. Check the contents of the file

    [hadoop@master ~]$ hdfs dfs -cat /hello.txt
    Hello Chameleon World
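The 22-byte size reported by hdfs dfs -ls above lines up with the echoed text: echo appends a trailing newline to the 21-character string "Hello Chameleon World". A quick check of that arithmetic:

```python
# "echo Hello Chameleon World" writes the words plus a trailing newline.
content = "Hello Chameleon World\n"

# HDFS reports file sizes in bytes; ASCII text is one byte per character.
size_bytes = len(content.encode("utf-8"))
print(size_bytes)  # 22, matching the hdfs dfs -ls output above
```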
    

Run the Hadoop sort experiment

Test the true power of the Hadoop filesystem by creating and sorting a large random dataset. It may be useful/interesting to login to the master and/or worker VMs and use tools like top, iotop, and iftop to observe the resource utilization on each of the VMs during the sort test. Note: on these VMs iotop and iftop must be run as root.

Create a 10 GB random data set. After the data is created, use the ls functionality to confirm the data exists. Note that the data is composed of several files in a directory.

[hadoop@master ~]$ hadoop jar \
    /home/hadoop/hadoop-2.9.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar \
    teragen 100000000 /input
18/05/09 22:34:37 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.100.10:8032
18/05/09 22:34:37 INFO terasort.TeraGen: Generating 100000000 using 2
18/05/09 22:34:37 INFO mapreduce.JobSubmitter: number of splits:2
18/05/09 22:34:37 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/05/09 22:34:37 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1525904324496_0001
18/05/09 22:34:38 INFO impl.YarnClientImpl: Submitted application application_1525904324496_0001
18/05/09 22:34:38 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1525904324496_0001/
18/05/09 22:34:38 INFO mapreduce.Job: Running job: job_1525904324496_0001
18/05/09 22:34:44 INFO mapreduce.Job: Job job_1525904324496_0001 running in uber mode : false
18/05/09 22:34:44 INFO mapreduce.Job: map 0% reduce 0%
18/05/09 22:35:00 INFO mapreduce.Job: map 12% reduce 0%
18/05/09 22:35:01 INFO mapreduce.Job: map 24% reduce 0%
18/05/09 22:35:06 INFO mapreduce.Job: map 31% reduce 0%
18/05/09 22:35:07 INFO mapreduce.Job: map 37% reduce 0%
18/05/09 22:35:12 INFO mapreduce.Job: map 43% reduce 0%
18/05/09 22:35:13 INFO mapreduce.Job: map 50% reduce 0%
18/05/09 22:35:18 INFO mapreduce.Job: map 56% reduce 0%
18/05/09 22:35:19 INFO mapreduce.Job: map 63% reduce 0%
18/05/09 22:35:24 INFO mapreduce.Job: map 69% reduce 0%
18/05/09 22:35:25 INFO mapreduce.Job: map 76% reduce 0%
18/05/09 22:35:30 INFO mapreduce.Job: map 77% reduce 0%
18/05/09 22:35:31 INFO mapreduce.Job: map 80% reduce 0%
18/05/09 22:35:36 INFO mapreduce.Job: map 86% reduce 0%
18/05/09 22:35:37 INFO mapreduce.Job: map 93% reduce 0%
18/05/09 22:35:39 INFO mapreduce.Job: map 94% reduce 0%
18/05/09 22:35:42 INFO mapreduce.Job: map 100% reduce 0%
18/05/09 22:35:42 INFO mapreduce.Job: Job job_1525904324496_0001 completed successfully
18/05/09 22:35:42 INFO mapreduce.Job: Counters: 31
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=402440
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=170
        HDFS: Number of bytes written=10000000000
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters
        Launched map tasks=2
        Other local map tasks=2
        Total time spent by all maps in occupied slots (ms)=107757
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=107757
        Total vcore-milliseconds taken by all map tasks=107757
        Total megabyte-milliseconds taken by all map tasks=110343168
    Map-Reduce Framework
        Map input records=100000000
        Map output records=100000000
        Input split bytes=170
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=827
        CPU time spent (ms)=124230
        Physical memory (bytes) snapshot=432975872
        Virtual memory (bytes) snapshot=4561047552
        Total committed heap usage (bytes)=303562752
    org.apache.hadoop.examples.terasort.TeraGen$Counters
        CHECKSUM=214760662691937609
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=10000000000
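The teragen arguments map directly to the output size: TeraGen writes fixed 100-byte records, so 100,000,000 rows produce exactly the 10 GB reported by the HDFS "Number of bytes written" counter. A quick check of that arithmetic:

```python
ROW_BYTES = 100     # TeraGen writes fixed 100-byte records
rows = 100_000_000  # first argument to teragen above

total_bytes = rows * ROW_BYTES
print(total_bytes)          # 10000000000, matching "Bytes Written"
print(total_bytes / 10**9)  # 10.0 (decimal gigabytes)
```

This is also a handy way to size the advanced-task runs below: pick a row count that fits the HDFS capacity of your worker configuration (remember that HDFS stores each block with replication, 2 copies in the -ls output above).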

Sort the dataset

[hadoop@master ~]$ hadoop jar \
    /home/hadoop/hadoop-2.9.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar \
    terasort /input /output
18/05/09 22:39:24 INFO terasort.TeraSort: starting
18/05/09 22:39:24 INFO input.FileInputFormat: Total input files to process : 2
Spent 93ms computing base-splits.
Spent 2ms computing TeraScheduler splits.
Computing input splits took 96ms
Sampling 10 splits of 76
Making 1 from 100000 sampled records
Computing parititions took 332ms
Spent 430ms computing partitions.
18/05/09 22:39:25 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.100.10:8032
18/05/09 22:39:25 INFO mapreduce.JobSubmitter: number of splits:76
18/05/09 22:39:25 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/05/09 22:39:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1525904324496_0002
18/05/09 22:39:26 INFO impl.YarnClientImpl: Submitted application application_1525904324496_0002
18/05/09 22:39:26 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1525904324496_0002/
18/05/09 22:39:26 INFO mapreduce.Job: Running job: job_1525904324496_0002
18/05/09 22:39:31 INFO mapreduce.Job: Job job_1525904324496_0002 running in uber mode : false
18/05/09 22:39:31 INFO mapreduce.Job: map 0% reduce 0%
18/05/09 22:39:41 INFO mapreduce.Job: map 12% reduce 0%
18/05/09 22:39:42 INFO mapreduce.Job: map 18% reduce 0%
18/05/09 22:39:56 INFO mapreduce.Job: map 21% reduce 0%
18/05/09 22:39:57 INFO mapreduce.Job: map 29% reduce 0%
18/05/09 22:39:58 INFO mapreduce.Job: map 29% reduce 7%
18/05/09 22:39:59 INFO mapreduce.Job: map 32% reduce 7%
18/05/09 22:40:00 INFO mapreduce.Job: map 35% reduce 7%
18/05/09 22:40:01 INFO mapreduce.Job: map 36% reduce 7%
18/05/09 22:40:05 INFO mapreduce.Job: map 36% reduce 10%
18/05/09 22:40:08 INFO mapreduce.Job: map 37% reduce 10%
18/05/09 22:40:09 INFO mapreduce.Job: map 42% reduce 10%
18/05/09 22:40:11 INFO mapreduce.Job: map 42% reduce 12%
18/05/09 22:40:15 INFO mapreduce.Job: map 43% reduce 12%
18/05/09 22:40:17 INFO mapreduce.Job: map 50% reduce 14%
18/05/09 22:40:19 INFO mapreduce.Job: map 52% reduce 14%
18/05/09 22:40:21 INFO mapreduce.Job: map 54% reduce 14%
18/05/09 22:40:22 INFO mapreduce.Job: map 59% reduce 14%
18/05/09 22:40:23 INFO mapreduce.Job: map 59% reduce 18%
18/05/09 22:40:26 INFO mapreduce.Job: map 61% reduce 18%
18/05/09 22:40:28 INFO mapreduce.Job: map 66% reduce 18%
18/05/09 22:40:29 INFO mapreduce.Job: map 66% reduce 20%
18/05/09 22:40:32 INFO mapreduce.Job: map 67% reduce 20%
18/05/09 22:40:34 INFO mapreduce.Job: map 70% reduce 20%
18/05/09 22:40:35 INFO mapreduce.Job: map 70% reduce 21%
18/05/09 22:40:38 INFO mapreduce.Job: map 72% reduce 21%
18/05/09 22:40:40 INFO mapreduce.Job: map 77% reduce 21%
18/05/09 22:40:42 INFO mapreduce.Job: map 81% reduce 21%
18/05/09 22:40:44 INFO mapreduce.Job: map 85% reduce 21%
18/05/09 22:40:45 INFO mapreduce.Job: map 86% reduce 21%
18/05/09 22:40:47 INFO mapreduce.Job: map 86% reduce 22%
18/05/09 22:40:49 INFO mapreduce.Job: map 87% reduce 22%
18/05/09 22:40:54 INFO mapreduce.Job: map 88% reduce 22%
18/05/09 22:40:56 INFO mapreduce.Job: map 89% reduce 22%
18/05/09 22:40:58 INFO mapreduce.Job: map 93% reduce 22%
18/05/09 22:40:59 INFO mapreduce.Job: map 95% reduce 23%
18/05/09 22:41:00 INFO mapreduce.Job: map 97% reduce 23%
18/05/09 22:41:11 INFO mapreduce.Job: map 97% reduce 24%
18/05/09 22:41:16 INFO mapreduce.Job: map 99% reduce 24%
18/05/09 22:41:17 INFO mapreduce.Job: map 99% reduce 25%
18/05/09 22:41:19 INFO mapreduce.Job: map 100% reduce 25%
18/05/09 22:41:35 INFO mapreduce.Job: map 100% reduce 26%
18/05/09 22:41:41 INFO mapreduce.Job: map 100% reduce 27%
18/05/09 22:41:47 INFO mapreduce.Job: map 100% reduce 28%
18/05/09 22:41:53 INFO mapreduce.Job: map 100% reduce 29%
18/05/09 22:42:05 INFO mapreduce.Job: map 100% reduce 32%
18/05/09 22:42:11 INFO mapreduce.Job: map 100% reduce 33%
18/05/09 22:42:23 INFO mapreduce.Job: map 100% reduce 37%
18/05/09 22:42:29 INFO mapreduce.Job: map 100% reduce 67%
18/05/09 22:42:35 INFO mapreduce.Job: map 100% reduce 69%
18/05/09 22:42:41 INFO mapreduce.Job: map 100% reduce 71%
18/05/09 22:42:47 INFO mapreduce.Job: map 100% reduce 73%
18/05/09 22:42:53 INFO mapreduce.Job: map 100% reduce 75%
18/05/09 22:42:59 INFO mapreduce.Job: map 100% reduce 77%
18/05/09 22:43:05 INFO mapreduce.Job: map 100% reduce 79%
18/05/09 22:43:11 INFO mapreduce.Job: map 100% reduce 81%
18/05/09 22:43:17 INFO mapreduce.Job: map 100% reduce 83%
18/05/09 22:43:23 INFO mapreduce.Job: map 100% reduce 85%
18/05/09 22:43:29 INFO mapreduce.Job: map 100% reduce 88%
18/05/09 22:43:35 INFO mapreduce.Job: map 100% reduce 90%
18/05/09 22:43:41 INFO mapreduce.Job: map 100% reduce 92%
18/05/09 22:43:47 INFO mapreduce.Job: map 100% reduce 94%
18/05/09 22:43:53 INFO mapreduce.Job: map 100% reduce 96%
18/05/09 22:43:59 INFO mapreduce.Job: map 100% reduce 98%
18/05/09 22:44:05 INFO mapreduce.Job: map 100% reduce 100%
18/05/09 22:44:07 INFO mapreduce.Job: Job job_1525904324496_0002 completed successfully
18/05/09 22:44:07 INFO mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=30850224848
        FILE: Number of bytes written=41265825515
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=10000007752
        HDFS: Number of bytes written=10000000000
        HDFS: Number of read operations=231
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Killed map tasks=2
        Launched map tasks=78
        Launched reduce tasks=1
        Data-local map tasks=78
        Total time spent by all maps in occupied slots (ms)=1071665
        Total time spent by all reduces in occupied slots (ms)=263028
        Total time spent by all map tasks (ms)=1071665
        Total time spent by all reduce tasks (ms)=263028
        Total vcore-milliseconds taken by all map tasks=1071665
        Total vcore-milliseconds taken by all reduce tasks=263028
        Total megabyte-milliseconds taken by all map tasks=1097384960
        Total megabyte-milliseconds taken by all reduce tasks=269340672
    Map-Reduce Framework
        Map input records=100000000
        Map output records=100000000
        Map output bytes=10200000000
        Map output materialized bytes=10400000456
        Input split bytes=7752
        Combine input records=0
        Combine output records=0
        Reduce input groups=100000000
        Reduce shuffle bytes=10400000456
        Reduce input records=100000000
        Reduce output records=100000000
        Spilled Records=396636764
        Shuffled Maps =76
        Failed Shuffles=0
        Merged Map outputs=76
        GC time elapsed (ms)=10306
        CPU time spent (ms)=875310
        Physical memory (bytes) snapshot=24264806400
        Virtual memory (bytes) snapshot=176019595264
        Total committed heap usage (bytes)=15184429056
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=10000000000
    File Output Format Counters
        Bytes Written=10000000000
18/05/09 22:44:07 INFO terasort.TeraSort: done
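The "Sampling 10 splits of 76 / Making 1 from 100000 sampled records" log lines describe how TeraSort picks its reducer partition boundaries: it samples input keys and chooses cut points so that each reducer receives a contiguous, roughly equal key range, which is why concatenating the reducer outputs yields a globally sorted file. A minimal sketch of that idea (toy integer keys and hypothetical helper names, not the real TeraSort code):

```python
import random

random.seed(0)

# Toy records: (key, payload) pairs; real TeraSort keys are 10-byte strings.
records = [(random.randrange(10**6), None) for _ in range(10_000)]

def choose_cutpoints(records, num_reducers, sample_size=1_000):
    """Sample keys and pick num_reducers-1 evenly spaced cut points."""
    sample = sorted(k for k, _ in random.sample(records, sample_size))
    step = len(sample) // num_reducers
    return [sample[i * step] for i in range(1, num_reducers)]

def partition(key, cuts):
    """Index of the first cut point greater than the key (its reducer)."""
    for i, cut in enumerate(cuts):
        if key < cut:
            return i
    return len(cuts)

cuts = choose_cutpoints(records, num_reducers=4)
buckets = [[] for _ in range(4)]
for key, _ in records:
    buckets[partition(key, cuts)].append(key)

# Each reducer sorts only its own bucket; concatenating the buckets in
# partition order then yields a globally sorted output, as in TeraSort.
output = [key for bucket in buckets for key in sorted(bucket)]
print(output == sorted(key for key, _ in records))  # True
```

The run above used a single reducer ("Making 1 from 100000 sampled records"), so there was only one partition; with more reducers the sampling is what keeps the per-reducer workloads balanced.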

Look at the output: You can use Hadoop's cat and/or get functionality to look at the random and sorted files to confirm their size and that the sort actually worked. Try some or all of these commands. Does the output make sense to you?

hdfs dfs -ls /input/
hdfs dfs -ls /output/
hdfs dfs -cat /input/part-m-00000 | less
hdfs dfs -cat /output/part-r-00000 | less

Use a hex dump to see the sorted file. Because the files are binary, it is hard to see the sorted output in ASCII, so use xxd. The output should look something like the following. The first five columns after the colon are the key field and should be in sorted order. Try this on the input data and you should see unsorted data.

[hadoop@master ~]$ hdfs dfs -get /output/part-r-00000 part-r-00000
[hadoop@master ~]$ xxd -c 100 part-r-00000 | less
0000000: 0000 02a0 e217 d738 d776 0011 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3833 4639 3045 8899 aabb 3939 3939 3333 3333 4545 4545 3838 3838 3333 3333 3131 3131 4242 4242 3939 3939 3737 3737 3737 3737 4141 4141 3333 3333 ccdd eeff  .......8.v..0000000000000000000000000083F90E....99993333EEEE888833331111BBBB999977777777AAAA3333....
0000064: 0000 0336 98a4 0790 b743 0011 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3038 3937 4232 8899 aabb 3333 3333 3131 3131 3737 3737 3434 3434 3636 3636 4141 4141 4242 4242 3939 3939 4545 4545 3131 3131 4545 4545 4646 4646 ccdd eeff  ...6.....C..000000000000000000000000000897B2....33331111777744446666AAAABBBB9999EEEE1111EEEEFFFF....
00000c8: 0000 04c6 251d e52d 9124 0011 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3842 3330 4438 8899 aabb 4444 4444 3737 3737 4343 4343 3838 3838 3131 3131 3131 3131 4343 4343 4646 4646 3131 3131 3030 3030 3434 3434 3939 3939 ccdd eeff  ....%..-.$..000000000000000000000000008B30D8....DDDD7777CCCC888811111111CCCCFFFF1111000044449999....
000012c: 0000 0630 4115 f4b5 416b 0011 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3745 3534 3439 8899 aabb 3333 3333 4141 4141 4444 4444 4646 4646 4545 4545 3737 3737 3333 3333 3434 3434 3333 3333 4646 4646 4242 4242 4545 4545 ccdd eeff  ...0A...Ak..000000000000000000000000007E5449....3333AAAADDDDFFFFEEEE7777333344443333FFFFBBBBEEEE....
0000190: 0000 06d8 f96c 785c 5054 0011 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3238 3743 4544 8899 aabb 4242 4242 3131 3131 4545 4545 3838 3838 3838 3838 4141 4141 4343 4343 3636 3636 3737 3737 4444 4444 3535 3535 4141 4141 ccdd eeff  .....lx\PT..00000000000000000000000000287CED....BBBB1111EEEE88888888AAAACCCC66667777DDDD5555AAAA....
00001f4: 0000 0768 67de 847c baec 0011 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3030 3443 3032 4231 8899 aabb 4242 4242 3636 3636 3030 3030 4646 4646 3737 3737 4545 4545 3939 3939 4444 4444 3838 3838 3535 3535 3131 3131 3636 3636 ccdd eeff  ...hg..|....000000000000000000000000004C02B1....BBBB66660000FFFF7777EEEE9999DDDD8888555511116666....
...
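Each xxd line above is exactly one 100-byte record. Assuming the standard TeraGen record layout, which matches the bytes shown (10-byte binary key, 2-byte separator 0x0011, 32-byte ASCII row id, 4-byte separator 0x8899aabb, 48-byte ASCII filler, 4-byte terminator 0xccddeeff), a small parser sketch:

```python
def parse_record(rec: bytes):
    """Split one 100-byte TeraSort record into its fields.

    Layout assumed from the xxd dump: 10-byte binary key, 2-byte
    separator, 32-byte ASCII row id, 4-byte separator, 48-byte ASCII
    filler, 4-byte terminator.
    """
    assert len(rec) == 100
    key    = rec[0:10]
    row_id = rec[12:44].decode("ascii")
    filler = rec[48:96].decode("ascii")
    return key, row_id, filler

# Two synthetic records with ascending keys (made-up data, not from the dump).
rec_a = (bytes(range(10)) + b"\x00\x11" + b"0" * 32
         + b"\x88\x99\xaa\xbb" + b"A" * 48 + b"\xcc\xdd\xee\xff")
rec_b = (bytes(range(5, 15)) + b"\x00\x11" + b"0" * 32
         + b"\x88\x99\xaa\xbb" + b"B" * 48 + b"\xcc\xdd\xee\xff")

keys = [parse_record(r)[0] for r in (rec_a, rec_b)]
# After terasort, consecutive keys compare in ascending byte order:
print(keys == sorted(keys))  # True
```

This is the same comparison you are doing by eye in the dump: the first five hex columns (10 bytes) of each line increase from record to record.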

Advanced tasks: Re-do the tutorial with a different number of workers, amount of bandwidth, and/or worker instance types. Warning: be courteous to other users and do not use too many of the resources. Time the performance of runs with different resources. Observe the largest file you can create with different resources.
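For the timing runs, elapsed time and aggregate throughput can be read straight off the job log timestamps. As an example, using the sample teragen log above (first line at 22:34:37, job completion at 22:35:42, and the 10 GB HDFS bytes-written counter):

```python
from datetime import datetime

# Timestamps taken from the sample teragen log in this tutorial.
start = datetime.strptime("18/05/09 22:34:37", "%y/%m/%d %H:%M:%S")
end   = datetime.strptime("18/05/09 22:35:42", "%y/%m/%d %H:%M:%S")

elapsed = (end - start).total_seconds()
bytes_written = 10_000_000_000  # HDFS "Number of bytes written" counter

print(elapsed)                        # 65.0 seconds
print(bytes_written / elapsed / 1e6)  # ~153.8 MB/s aggregate write rate
```

Comparing this figure across runs with different worker counts or instance types gives a simple scaling measurement for your cluster.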