
HDFS Installation

In this guide I will set up HDFS on 3 machines: 1 for the name node and the other 2 for data nodes. I refer to the official SingleCluster and ClusterSetup documents. I will install HDFS version 3.3.4, so I need Java 8, which is the recommended Java version for this Hadoop release.

First we install Java on all 3 machines. There are many ways to install Java, but I will use these two commands:

sudo apt-get update
sudo apt-get install openjdk-8-jdk
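
To confirm the installation on each machine, check the Java version; exact output varies by distribution, but it should report 1.8:

java -version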

Next we need to download a Hadoop distribution, extract it, and go inside the directory:

$ curl -L -o hadoop-3.3.4.tar.gz https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
$ tar xvf hadoop-3.3.4.tar.gz
$ ls
hadoop-3.3.4 hadoop-3.3.4.tar.gz
$ cd hadoop-3.3.4
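
Optionally, verify the tarball before using it. Apache publishes a .sha512 file next to each release artifact; this is a quick sketch, run from the directory that contains the tarball and assuming GNU coreutils' sha512sum:

$ curl -L -O https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz.sha512
$ sha512sum -c hadoop-3.3.4.tar.gz.sha512
hadoop-3.3.4.tar.gz: OK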

Then, inside the directory, we edit the file etc/hadoop/hadoop-env.sh to set a suitable environment. Uncomment and modify the lines below to set these variables:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/<your_directory>/hdfs/hadoop-3.3.4
export HADOOP_LOG_DIR=/<your_directory>/hdfs/logs
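
If you are not sure where the JDK landed, you can resolve the real path behind the java binary (a quick check, assuming java is on your PATH; the output below is what the Ubuntu openjdk-8-jdk package gives, and JAVA_HOME is that path without the trailing /jre/bin/java):

$ readlink -f "$(which java)"
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java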

Next, edit the file etc/hadoop/core-site.xml with content like this. Note that the value tag must hold your name node's address:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://<your_name_node_ip_address>:12000</value>
    </property>
</configuration>
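
Port 12000 here is just the port I picked (the Hadoop 3 default is 8020), so make sure nothing else listens on it. To confirm Hadoop picks the value up, you can query it with the stock getconf tool from inside the Hadoop directory:

$ bin/hdfs getconf -confKey fs.defaultFS
hdfs://<your_name_node_ip_address>:12000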

Now we have finished the common setup for all three machines. From here on, the setup differs between the namenode and the datanodes. On the node where we want to run the namenode, we create a file that lists the datanodes. For example, in my case I create datanodes_list.txt with content like this:

$ cat /root/user_xyz/hdfs/datanodes_list.txt
<datanode_1_address>
<datanode_2_address>

Then create a directory for storing namenode runtime data

mkdir -p /<your_directory>/hdfs/root_data_path_for_namenode

Next, edit the file etc/hadoop/hdfs-site.xml with content like this:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///<your_directory>/hdfs/root_data_path_for_namenode</value>
    </property>
    <property>
        <name>dfs.hosts</name>
        <value>/<your_directory>/hdfs/datanodes_list.txt</value>
    </property>
</configuration>
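
Before formatting anything, you can sanity-check that these keys resolve to the paths you expect, again with getconf:

$ bin/hdfs getconf -confKey dfs.namenode.name.dir
$ bin/hdfs getconf -confKey dfs.hosts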

That's it for the namenode. Now, on the two nodes where you will deploy datanodes, create a directory to store datanode runtime data:

mkdir -p /<your_directory>/hdfs/root_data_path_for_datanode

Next, edit the file etc/hadoop/hdfs-site.xml with content like this (dfs.datanode.data.dir is the Hadoop 3 name of this setting; older guides call it dfs.data.dir):

<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///<your_directory>/hdfs/root_data_path_for_datanode</value>
    </property>
</configuration>

We have finished the configuration. Now go to the namenode machine, change into the Hadoop directory, format the file system, and start the namenode with these commands:

bin/hdfs namenode -format
bin/hdfs --daemon start namenode
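
To verify the namenode came up, check for the NameNode process with jps (ships with the JDK) or probe the web UI, which listens on port 9870 by default in Hadoop 3; the pid below is illustrative:

$ jps
2345 NameNode
$ curl -s -o /dev/null -w '%{http_code}\n' http://<your_name_node_ip_address>:9870
200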

Then go to the other two datanode machines, change into the Hadoop directory, and start the datanode with this command:

bin/hdfs --daemon start datanode
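
Once both datanodes are up, go back to the namenode machine and confirm they registered; the report should show 2 live datanodes:

$ bin/hdfs dfsadmin -report | grep 'Live datanodes'
Live datanodes (2):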

We have finished the setup for HDFS. Now we have to create a directory to store data. Go to the namenode machine and, in the Hadoop directory, execute the following commands:

bin/hdfs dfs -mkdir -p /user/clickhouse/
bin/hdfs dfs -chown clickhouse /user/clickhouse
bin/hdfs dfs -chmod -R 775 /user/clickhouse
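
You can confirm the directory exists with the expected owner and permissions (the group, size, and timestamp below are illustrative):

$ bin/hdfs dfs -ls /user
Found 1 items
drwxrwxr-x   - clickhouse supergroup          0 2024-01-01 00:00 /user/clickhouse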