Fully Distributed Mode

Official documentation

https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/ClusterSetup.html

https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-common/ClusterSetup.html

Node Cloning and Passwordless Login

Clone the pre-configured machine into three nodes named node1, node2, and node3, then assign each a suitable IP address based on the VM network settings, for example:

| Host  | IP              |
| ----- | --------------- |
| node1 | 192.168.179.101 |
| node2 | 192.168.179.102 |
| node3 | 192.168.179.103 |

Changing the Hostname

Omitted.
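Although the original text skips this step, for reference: on a systemd-based distribution such as CentOS 7 (assumed here, given the hosts-file format below), the hostname can be set with hostnamectl:

```bash
# Run on each node with its own name (CentOS 7 / systemd assumed)
hostnamectl set-hostname node1   # use node2 / node3 on the other machines
```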

Changing the IP Address

Omitted.
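Also skipped in the original; for reference only, on CentOS 7 a static IP is typically set in the interface file under /etc/sysconfig/network-scripts. The interface name ens33 and the gateway below are assumptions and may differ in your VM:

```bash
# /etc/sysconfig/network-scripts/ifcfg-ens33 (interface name is an assumption)
# Key entries for a static address; restart the network service afterwards.
BOOTPROTO=static
IPADDR=192.168.179.101
NETMASK=255.255.255.0
GATEWAY=192.168.179.2
ONBOOT=yes
```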

Configuring hosts

/etc/hosts (apply the following configuration on all three nodes)

```
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.179.101 node1
192.168.179.102 node2
192.168.179.103 node3
```

You can verify the hostname mappings with the ping command:

```
[root@node1 hadoop]# ping node2
PING node2 (192.168.179.102) 56(84) bytes of data.
64 bytes from node2 (192.168.179.102): icmp_seq=1 ttl=64 time=0.634 ms
```

Once these changes are done, log out of the root account and switch to the zhangsan user.

Passwordless Login

| Host  | HDFS                              | YARN                         |
| ----- | --------------------------------- | ---------------------------- |
| node1 | DataNode, NameNode, historyserver | NodeManager                  |
| node2 | DataNode                          | NodeManager, ResourceManager |
| node3 | DataNode, SecondaryNameNode       | NodeManager                  |

Note: only passwordless login from node1 and node2 to the other hosts is configured here. node1 runs the NameNode and node2 runs the ResourceManager, and both need passwordless access to every other node.

(1) Generate a key pair on node1:

```
[zhangsan@node1 .ssh]$ ssh-keygen -t rsa
```

Press Enter three times at the prompts; this generates two files: id_rsa (private key) and id_rsa.pub (public key).

(2) Copy node1's public key to each machine that should accept passwordless login:

```
[zhangsan@node1 .ssh]$ ssh-copy-id node1
[zhangsan@node1 .ssh]$ ssh-copy-id node2
[zhangsan@node1 .ssh]$ ssh-copy-id node3
```

(3) Generate a key pair on node2:

```
[zhangsan@node2 .ssh]$ ssh-keygen -t rsa
```

Again, press Enter three times; this generates id_rsa (private key) and id_rsa.pub (public key).

(4) Copy node2's public key to each machine that should accept passwordless login:

```
[zhangsan@node2 .ssh]$ ssh-copy-id node1
[zhangsan@node2 .ssh]$ ssh-copy-id node2
[zhangsan@node2 .ssh]$ ssh-copy-id node3
```
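As a quick sanity check (a minimal sketch, not part of the original steps), loop over the nodes and run a remote command; if no password prompt appears, passwordless login is working:

```bash
# Run from node1, then again from node2; each ssh should print
# the remote hostname without asking for a password.
for host in node1 node2 node3; do
    ssh "$host" hostname
done
```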

See [how SSH passwordless login works](../Linux/Linux_SSH_Passwordless_login.md) for the underlying mechanism.

Cluster Planning

| Host  | HDFS                              | YARN                         |
| ----- | --------------------------------- | ---------------------------- |
| node1 | DataNode, NameNode, historyserver | NodeManager                  |
| node2 | DataNode                          | NodeManager, ResourceManager |
| node3 | DataNode, SecondaryNameNode       | NodeManager                  |

Uploading and Extracting the Hadoop Tarball

```
# This guide extracts the tarball into the directory below:
(base) [zhangsan@node1 hadoop]$ pwd
/opt/bigdata/hadoop
```

Creating a Symlink

```
(base) [zhangsan@node1 hadoop]$ ln -s hadoop-3.1.3/ default
(base) [zhangsan@node1 hadoop]$ ll
total 12
lrwxrwxrwx.  1 zhangsan zhangsan   12 Feb 28 12:53 default -> hadoop-3.1.3
drwxr-xr-x. 11 zhangsan zhangsan 4096 Feb 28 15:57 hadoop-3.1.3
```

Configuring Environment Variables

```
(base) [zhangsan@node1 ~]$ vim ~/.bash_profile
export HADOOP_HOME=/opt/bigdata/hadoop/default
# sbin is included so that start-all.sh and the other daemon scripts used below resolve
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Source the file so the environment variables take effect:

```
(base) [zhangsan@node1 ~]$ source ~/.bash_profile
```
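As a quick check (not part of the original steps), hadoop should now resolve on the PATH and report its release:

```bash
# Should report Hadoop 3.1.3 if HADOOP_HOME and PATH are set correctly
(base) [zhangsan@node1 ~]$ hadoop version
```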

Configuration Files

core-site.xml

```xml
<!-- NameNode address -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:8020</value>
    <!-- Hadoop 2.x typically uses port 9000 -->
</property>
<!-- Hadoop data storage directory -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/bigdata/hadoop/default/tmp</value>
</property>
<property>
    <name>hadoop.proxyuser.zhangsan.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.zhangsan.groups</name>
    <value>*</value>
</property>
```

hdfs-site.xml

```xml
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node3:9868</value>
</property>
```

mapred-site.xml

```xml
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
<!-- JobHistoryServer RPC address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>node1:10020</value>
</property>
<!-- JobHistoryServer web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node1:19888</value>
</property>
```

yarn-site.xml

```xml
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node2</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Log server URL -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs</value>
</property>
<!-- Keep aggregated logs for 7 days (value in seconds) -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
```

slaves/workers

In Hadoop 2.x, the worker-list file is named slaves.

In Hadoop 3.x, it is named workers.

```
node1
node2
node3
```

Use scp to copy the modified configuration files to the other two machines:

```
[zhangsan@node1 ~]$ scp -r /opt/bigdata/hadoop/hadoop-3.1.3/etc/hadoop/* zhangsan@node2:/opt/bigdata/hadoop/hadoop-3.1.3/etc/hadoop/

[zhangsan@node1 ~]$ scp -r /opt/bigdata/hadoop/hadoop-3.1.3/etc/hadoop/* zhangsan@node3:/opt/bigdata/hadoop/hadoop-3.1.3/etc/hadoop/
```
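If you iterate on the configs often, a small distribution loop saves retyping. This is a sketch, assuming rsync is installed on all three nodes:

```bash
# Push the local Hadoop config directory to the other two nodes
for host in node2 node3; do
    rsync -av /opt/bigdata/hadoop/hadoop-3.1.3/etc/hadoop/ \
        "zhangsan@${host}:/opt/bigdata/hadoop/hadoop-3.1.3/etc/hadoop/"
done
```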

Formatting the NameNode

All three machines were cloned from node0, so they carry stale data from node0. Before formatting, delete the contents of the $HADOOP_HOME/tmp and $HADOOP_HOME/logs directories on all three machines, as sketched below.
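A minimal sketch of that cleanup, run from node1; it assumes the same /opt/bigdata/hadoop/hadoop-3.1.3 layout exists on every node:

```bash
# Remove stale HDFS data and old logs on every node before formatting
for host in node1 node2 node3; do
    ssh "$host" "rm -rf /opt/bigdata/hadoop/hadoop-3.1.3/tmp/* /opt/bigdata/hadoop/hadoop-3.1.3/logs/*"
done
```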

```
# Format the NameNode on node1
[zhangsan@node1 ~]$ hdfs namenode -format
```

Startup

```
# Run start-all.sh on node1 to start HDFS and YARN
[zhangsan@node1 ~]$ start-all.sh

# Start the JobHistoryServer (web UI port: 19888)
[zhangsan@node1 ~]$ mr-jobhistory-daemon.sh start historyserver
```

Node Process Status

node1

```
[zhangsan@node1 ~]$ jps
14258 NodeManager
14579 Jps
13783 DataNode
13644 NameNode
```

node2

```
[zhangsan@node2 ~]$ jps
4113 ResourceManager
8211 NodeManager
8382 Jps
8095 DataNode
```

node3

```
[zhangsan@node3 ~]$ jps
3955 SecondaryNameNode
7928 DataNode
8044 NodeManager
8220 Jps
```
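Instead of logging in to each node, the same check can be run from node1 in one loop (a small convenience sketch, relying on the passwordless login set up earlier):

```bash
# Print the Java process list of every node
for host in node1 node2 node3; do
    echo "=== $host ==="
    ssh "$host" jps
done
```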

Web UI

HDFS

NameNode: http://node1:50070 (Hadoop 2.x)

NameNode: http://node1:9870 (Hadoop 3.x)

YARN

http://node2:8088


Testing

Preparing the Input File
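The upload below assumes /home/zhangsan/bigdata.txt already exists; for a quick test you can create one first (the content here is purely illustrative):

```bash
# Create a small word-count input file (contents are made up)
echo "hadoop hdfs yarn hadoop mapreduce hdfs" > /home/zhangsan/bigdata.txt
```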

```
[zhangsan@node1 ~]$ hdfs dfs -mkdir /input

[zhangsan@node1 ~]$ hdfs dfs -ls -R /
drwxr-xr-x   - zhangsan supergroup          0 2022-02-15 00:12 /input

[zhangsan@node1 ~]$ hdfs dfs -put /home/zhangsan/bigdata.txt /input
```


WordCount

```
[zhangsan@node1 ~]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input/bigdata.txt /out/02161
```
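When the job finishes, the word counts land in the output directory; part-r-00000 is the conventional reducer output file name:

```bash
# List and print the job output
hdfs dfs -ls /out/02161
hdfs dfs -cat /out/02161/part-r-00000
```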


Cluster Management Script

/home/zhangsan/bin/hdp.sh

```bash
#!/bin/bash
if [ $# -lt 1 ]; then
    echo "No Args Input..."
    exit 1
fi

case $1 in
"start")
    echo " =================== Starting the Hadoop cluster ==================="

    echo " --------------- starting HDFS ---------------"
    ssh node1 "/opt/bigdata/hadoop/default/sbin/start-dfs.sh"
    echo " --------------- starting YARN ---------------"
    ssh node2 "/opt/bigdata/hadoop/default/sbin/start-yarn.sh"
    echo " --------------- starting historyserver ---------------"
    ssh node1 "/opt/bigdata/hadoop/default/bin/mapred --daemon start historyserver"
    ;;
"stop")
    echo " =================== Stopping the Hadoop cluster ==================="

    echo " --------------- stopping historyserver ---------------"
    ssh node1 "/opt/bigdata/hadoop/default/bin/mapred --daemon stop historyserver"
    echo " --------------- stopping YARN ---------------"
    ssh node2 "/opt/bigdata/hadoop/default/sbin/stop-yarn.sh"
    echo " --------------- stopping HDFS ---------------"
    ssh node1 "/opt/bigdata/hadoop/default/sbin/stop-dfs.sh"
    ;;
*)
    echo "Input Args Error..."
    ;;
esac
```
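Make the script executable; since /home/zhangsan/bin is typically already on the PATH on CentOS, it can then be invoked by name:

```bash
chmod +x /home/zhangsan/bin/hdp.sh
hdp.sh start   # bring the whole cluster up
hdp.sh stop    # shut it down
```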