Big Data Analytics Practice Report

Experiment Tasks

  1. Install and configure Java and Hadoop on the platform
  2. Run Hadoop's sample wordcount program

Key Steps and Screenshots

Preface: A Few Words for Beginners

Here, whenever you see text like rm -rf / --no-preserve-root, or a block styled like this:

dd if=/dev/zero of=/dev/vda

then, unless otherwise specified, it is a command to be run in the terminal.

Connecting to the Server on the Platform

The provided materials describe connecting through a graphical desktop, but most of the operations don't really benefit from a GUI, and a GUI is also relatively demanding on network bandwidth, so I chose the Remote - SSH extension for Visual Studio Code as my way of connecting instead.

The interface after connecting

Remote - SSH connects to the server over SSH, runs the VS Code server and extensions on the remote machine, and uses the local VS Code as the display front end. It provides quick file opening, editing, syntax highlighting, and so on, striking a good balance between convenience and bandwidth usage.

However, because it relies on SSH, an OpenSSH client must be installed on your local system. Recent builds of Windows 10 already ship with one, but some Windows 7 users will have to install it themselves.

Appendix: how to use it

  1. Install VS Code (download link)

  2. Install the Remote - SSH extension and the Remote - SSH: Editing Configuration Files extension (switch to the Extensions view on the left and search for them)

  3. In the newly added Remote Explorer on the left, add the server provided by the school (you need to bind an external IP first)

  4. The input looks something like ssh root@192.16.1.1

  5. It will then ask where to save the configuration; choose the config file under your own user directory,

that is, the C:\Users\****\.ssh\config entry.
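For reference, an entry in that config file ends up looking roughly like the sketch below (the alias, address, and user here are placeholders; use your own server's values):

Host school-server
    HostName 192.16.1.1
    User root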

The host you just added will show up on the left. Right-click it and choose Connect to Host in New Window. If the configuration is correct, it will ask whether this is a Linux or Windows host; pick Linux. The first connection is fairly slow, so just wait a little while.

Because VS Code generally prefers to work inside a folder/workspace, a dialog may pop up asking you to choose a working directory; just confirm it. If it doesn't appear, type code ~ in the terminal.

Appendix: shortcut to open the terminal after connecting

Ctrl + ` (that's the key right next to the 1 on your keyboard!)

Passwordless Login (public-key login)

This is usually covered much later in the materials; I have no idea why it isn't put up front. Baffling...

Open the local file C:\Users\<your username>\.ssh\id_rsa.pub in VS Code.

  • If that file doesn't exist, run ssh-keygen -t rsa in a local command prompt; it will ask a few questions, and you can just press Enter through all of them.

Copy everything in it, then in the remote VS Code terminal run code ~/.ssh/authorized_keys to open authorized_keys and paste the key in; if the file already has content, paste it on a new line.

Because Hadoop will also need this passwordless login later, run ssh-keygen -t rsa on the server as well, paste the contents of .ssh/id_rsa.pub (you should be able to find it in the Explorer on the left) into authorized_keys too, and save.
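If you'd rather do the server-side part entirely in the terminal instead of copy-pasting in the editor, a minimal sketch (assuming the default key path ~/.ssh/id_rsa) looks like this:

ssh-keygen -t rsa                                      # press Enter through all the prompts
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys        # let the server SSH into itself
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys   # sshd insists on strict permissions
ssh localhost exit                                     # accept the host key once; it should not ask for a password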

Connecting to the Internet

Emmm, here you also need to log in to the campus network; after all, having no Internet access would be annoying~

Even in an environment with no GUI, you can of course still log in!

Below is the bash command I use to log in. I suggest saving it as a .sh file and adding it to cron so it runs once an hour (see the crontab sketch after the command).

curl 'http://学校登录/drcom/login?callback=dr&DDDDD=用户名&upass=密码&0MKKey=123456&R1=0&R3=0&R6=0&para=00&v6ip=' \
-H 'Connection: keep-alive' \
-H 'Accept: text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36' \
-H 'X-Requested-With: XMLHttpRequest' \
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7' \
--compressed \
--insecure

If you want to use it, substitute your own values where appropriate!
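A minimal sketch of hooking this into cron, assuming you saved the command above as /root/netlogin.sh (the file name and schedule are just examples):

# append an hourly job (minute 0 of every hour) to the current crontab
( crontab -l 2>/dev/null; echo '0 * * * * /bin/bash /root/netlogin.sh' ) | crontab -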

Setting the hostname

This step is usually placed rather late, but I've decided to put it near the front!

code /etc/hostname

This opens the hostname file; change its content to whatever you want this server to be called. In general it must be a single line with no spaces, using plain ASCII letters. I used shugenatschool; whenever you see that name in this article, substitute your own hostname.

Press Ctrl+S to save, then

reboot

reboot!
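One extra step I'd suggest here (my own addition, not from the course materials): Hadoop will later address this machine by its hostname (e.g. hdfs://shugenatschool:9000), so the new name must resolve to the server's internal IP. If it doesn't already, add a mapping to /etc/hosts, for example:

echo "10.101.3.195   shugenatschool" >> /etc/hosts   # replace the IP and name with your own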

Installing Java and Configuring Environment Variables

Since we already have Internet access, why would we download it from the course platform instead of pulling it straight off the net???

apt install openjdk-8-jdk -y

Of course we just install it with apt!

Emmm, once it's installed, it's time to configure the environment variables!

The system's stock profile automatically sources every .sh file under /etc/profile.d/; with such a nice mechanism available, why not use it?

So we use the following command to create a new file there and open it in VS Code:

code /etc/profile.d/java.sh

The file's content is just these two lines:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

After saving, run source /etc/profile to apply the environment variables (if everything is correct there is no output).
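An optional sanity check that the variable really took effect:

echo $JAVA_HOME   # should print /usr/lib/jvm/java-8-openjdk-amd64
which java        # should print the path of the java binary now on PATH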

Next, check the Java version with java -version; you should get output roughly like this:

Java Version

Next Up: Hadoop

The tutorial the teacher provided uses Hadoop 2.7.3, which is clearly no longer current, so let's move up a bit and pick a stable release within the same major version line.

The Tsinghua mirror to the rescue! (half joking)

Screenshot of the web page

You can see several versions here, along with folders like current and stable. Stability first: opening stable and stable2 shows that they hold 3.3.0 and 2.10.1 respectively. 3.3.0 is a whole major version away, so we skip it and go with 2.10.1. The next question someone is bound to ask: there are several files in there, which one should I download?

stable2

Simple: it's the hadoop-2.10.1.tar.gz file. CHANGES.md and RELEASENOTES.md above it are release notes and change logs, the rat one is a report, site is the project website (you can browse it as a manual), and src is the source code, which leaves only this roughly 390 MB file.

Let's download this file onto the platform!

First, open the /usr/local folder:

code /usr/local

A new VS Code window will pop up; then download the file:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/stable2/hadoop-2.10.1.tar.gz

Then extract it:

tar zxvf hadoop-2.10.1.tar.gz

After extraction, refresh the Explorer list on the left side of VS Code; you will see a new hadoop-2.10.1 folder. Right-click it and rename it to hadoop.
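Equivalently, you can do the rename from the terminal instead of the Explorer:

mv /usr/local/hadoop-2.10.1 /usr/local/hadoop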

Likewise, Hadoop needs its environment variables configured, and again we do it with a separate .sh file:

code /etc/profile.d/hadoop.sh

File content:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

After saving, run source /etc/profile again to apply the environment variables (no output if everything is correct).

Then check the Hadoop version with hadoop version to confirm the environment variables are correct; you should get output roughly like this:

hadoop Version

There are no further environment variable changes after this point, so reboot once to make sure the variables are applied everywhere.

Configuring Hadoop in Pseudo-Distributed Mode

Why single-node pseudo-distributed mode? Because we don't have that much data to process, and it's simpler.

For convenience, switch the workspace into the hadoop folder with code hadoop.

Configuring core-site.xml

In etc/hadoop on the left, find the core-site.xml file and add the following between <configuration> and </configuration>:

<property>
<name>fs.defaultFS</name>
<value>hdfs://shugenatschool:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>

After the change, the whole file should look roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://shugenatschool:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
</configuration>

Save and close the file.

Configuring hdfs-site.xml

In etc/hadoop on the left, find the hdfs-site.xml file and add the following between <configuration> and </configuration>:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/dfs/data</value>
</property>

After the change it should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/dfs/data</value>
</property>
</configuration>

Save and close the file.

Modifying hadoop-env.sh

We did define the JAVA_HOME environment variable, yet Hadoop somehow doesn't pick it up (most likely because the daemon startup scripts run through non-interactive ssh shells, which don't source /etc/profile), so we set it explicitly here.

In etc/hadoop on the left, find the hadoop-env.sh file and locate this line:

export JAVA_HOME=${JAVA_HOME}

and change it to:

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

Save and close the file.
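If you want to double-check the edit from the terminal (run from the /usr/local/hadoop workspace), something like:

grep '^export JAVA_HOME' etc/hadoop/hadoop-env.sh
# expected: export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"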

Formatting the NameNode

hdfs namenode -format

This produces a long wall of messages.

Log:
20/12/13 11:31:39 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = shugenatschool/10.101.3.195
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.10.1
STARTUP_MSG: classpath = /usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/netty-3.10.6.Final.jar:/usr/local/hadoop/share/hadoop/common/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/common/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:/usr/local/hadoop/share/hadoop/common/lib/gson-2.2.4.jar:/usr/local/hadoop/share/hadoop/common/lib/zookeeper-3.4.14.jar:/usr/local/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/json-smart-1.3.1.jar:/usr/local/hadoop/share/hadoop/common/lib/curator-recipes-2.13.0.jar:/usr/local/hadoop/share/hadoop/common/lib/spotbugs-annotations-3.1.9.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-collections-3.2.2.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-math3-3.1.1.jar:/usr/local/hadoop/share/hadoop/common/lib/nimbus-jose-jwt-7.9.jar:/usr/local/hadoop/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/usr/local/hadoop/share/hadoop/common/lib/jets3t-0.9.0.jar:/usr/local/hadoop/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/usr/local/hadoop/share/hadoop/common/lib/snappy-java-1.0.5.jar:/usr/local/hadoop/share/hadoop/common/lib/jetty-util-6.1.26.jar:/usr/local/hadoop/share/hadoop/common/lib/paranamer-2.3.jar:/usr/local/hadoop/share/hadoop/common/lib/audience-annotations-0.5.0.jar:/usr/local/hadoop/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/usr/local/hadoop/share/hadoop/common/lib/jetty-sslengine-6.1.26.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-net-3.1.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-lang3-3.4.jar:/usr/local/hadoop/share/hadoop/common/lib/xmlenc-0.52.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-beanutils-1.9.4.jar:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.10.1.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/local/hadoop/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/usr/local/hadoop/share/hadoop/common/lib/httpcore-4.4.4.jar:/usr/local/hadoop/share/hadoop/common/lib/stax-api-1.0-2.jar:/usr/local/hadoop/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/usr/local/hadoop/share/hadoop/common/lib/hadoop-annotations-2.10.1.jar:/usr/local/hadoop/share/hadoop/common/lib/jettison-1.1.jar:/usr/local/hadoop/share/hadoop/common/lib/curator-framework-2.13.0.jar:/usr/local/hadoop/share/hadoop/common/lib/curator-client-2.13.0.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:/usr/local/hadoop/share/hadoop/common/lib/junit-4.11.jar:/usr/local/hadoop/share/hadoop/common/lib/jsp-api-2.1.jar:/usr/local/hadoop/share/hadoop/common/lib/stax2-api-3.1.4.jar:/usr/local/hadoop/share/hadoop/common/lib/jcip-annotations-1.0-1.jar:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-digester-1.8.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar:/usr/local/hadoop/share/hadoop/common/lib/jsch-0.1.55.jar:/usr/local/hadoop/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/usr/local/hadoop/share/hadoop/common/lib/slf4j-api-1.7.25.jar:/usr/local/hadoop/share/hadoop/common/lib/mockito-all-1.8.5.jar:/usr/local/hadoop/share/hadoop/common/lib/jersey-json-1.9.jar:/usr/local/hadoop/share/hadoop/common/lib/log4j-1.2.17.jar:/usr/local/hadoo
p/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/httpclient-4.5.2.jar:/usr/local/hadoop/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-lang-2.6.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-logging-1.1.3.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-compress-1.19.jar:/usr/local/hadoop/share/hadoop/common/lib/hamcrest-core-1.3.jar:/usr/local/hadoop/share/hadoop/common/lib/jsr305-3.0.2.jar:/usr/local/hadoop/share/hadoop/common/lib/activation-1.1.jar:/usr/local/hadoop/share/hadoop/common/lib/htrace-core4-4.1.0-incubating.jar:/usr/local/hadoop/share/hadoop/common/lib/jetty-6.1.26.jar:/usr/local/hadoop/share/hadoop/common/lib/woodstox-core-5.0.3.jar:/usr/local/hadoop/share/hadoop/common/lib/avro-1.7.7.jar:/usr/local/hadoop/share/hadoop/common/hadoop-common-2.10.1-tests.jar:/usr/local/hadoop/share/hadoop/common/hadoop-nfs-2.10.1.jar:/usr/local/hadoop/share/hadoop/common/hadoop-common-2.10.1.jar:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/netty-3.10.6.Final.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/servlet-api-2.5.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/guava-11.0.2.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jackson-core-2.9.10.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/okhttp-2.7.5.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jetty-util-6.1.26.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/xmlenc-0.52.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/netty-all-4.1.50.Final.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-codec-1.4.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/leveldbjni-all-1.8.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jackson-annotations-2.9.10.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/hadoop-hdfs-client-2.10.1.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/xercesImpl-2.12.0.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jackson-databind-2.9.10.6.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/okio-1.6.0.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/xml-apis-1.4.01.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jsr305-3.0.2.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/htrace-core4-4.1.0-incubating.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jetty-6.1.26.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-rbf-2.10.1-tests.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-nfs-2.10.1.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-native-client-2.10.1.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.10.1.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-native-client-2.10.1-tests.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-client-2.10.1-tests.jar:/usr/local/hadoop/share
/hadoop/hdfs/hadoop-hdfs-client-2.10.1.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.10.1-tests.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-rbf-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/netty-3.10.6.Final.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-client-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/servlet-api-2.5.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-cli-1.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/gson-2.2.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/ehcache-3.3.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.14.jar:/usr/local/hadoop/share/hadoop/yarn/lib/aopalliance-1.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/guava-11.0.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/HikariCP-java7-2.4.12.jar:/usr/local/hadoop/share/hadoop/yarn/lib/json-smart-1.3.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/curator-recipes-2.13.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/spotbugs-annotations-3.1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-collections-3.2.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-math3-3.1.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/nimbus-jose-jwt-7.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jaxb-api-2.2.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jets3t-0.9.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/api-asn1-api-1.0.0-M20.jar:/usr/local/hadoop/share/hadoop/yarn/lib/snappy-java-1.0.5.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jetty-util-6.1.26.jar:/usr/local/hadoop/share/hadoop/yarn/lib/paranamer-2.3.jar:/usr/local/hadoop/share/hadoop/yarn/lib/audience-annotations-0.5.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jaxb-impl-2.2.3-1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jetty-sslengine-6.1.26.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-net-3.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/guice-3.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-lang3-3.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/xmlenc-0.52.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-beanutils-1.9.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-xc-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/local/hadoop/share/hadoop/yarn/lib/fst-2.50.jar:/usr/local/hadoop/share/hadoop/yarn/lib/apacheds-i18n-2.0.0-M15.jar:/usr/local/hadoop/share/hadoop/yarn/lib/httpcore-4.4.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/stax-api-1.0-2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/json-io-2.5.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/api-util-1.0.0-M20.jar:/usr/local/hadoop/share/hadoop/yarn/lib/metrics-core-3.0.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jettison-1.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-guice-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/curator-framework-2.13.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/curator-client-2.13.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-configuration-1.6.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jsp-api-2.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/java-util-1.9.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/stax2-api-3.1.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jcip-annotations-1.0-1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/guice-servlet-3.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-digest
er-1.8.jar:/usr/local/hadoop/share/hadoop/yarn/lib/mssql-jdbc-6.2.1.jre7.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-codec-1.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jsch-0.1.55.jar:/usr/local/hadoop/share/hadoop/yarn/lib/java-xmlbuilder-0.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/leveldbjni-all-1.8.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-json-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/geronimo-jcache_1.0_spec-1.0-alpha-1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/yarn/lib/javax.inject-1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/httpclient-4.5.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-jaxrs-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-lang-2.6.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-logging-1.1.3.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-compress-1.19.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jsr305-3.0.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/activation-1.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/htrace-core4-4.1.0-incubating.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jetty-6.1.26.jar:/usr/local/hadoop/share/hadoop/yarn/lib/woodstox-core-5.0.3.jar:/usr/local/hadoop/share/hadoop/yarn/lib/avro-1.7.7.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-common-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-timeline-pluginstorage-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-sharedcachemanager-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-registry-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-tests-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-router-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-api-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-common-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.10.1.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-client-2.10.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/netty-3.10.6.Final.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/aopalliance-1.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/snappy-java-1.0.5.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/paranamer-2.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/guice-3.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/hadoop-annotations-2.10.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jersey-guice-1.9.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/junit-4.11.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/leveldbjni-all-1.8.jar:/usr/local/h
adoop/share/hadoop/mapreduce/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/javax.inject-1.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/commons-compress-1.19.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/avro-1.7.7.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.10.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.10.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.10.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.10.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.10.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.10.1-tests.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.10.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.10.1.jar:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
STARTUP_MSG: build = https://github.com/apache/hadoop -r 1827467c9a56f133025f28557bfc2c562d78e816; compiled by 'centos' on 2020-09-14T13:17Z
STARTUP_MSG: java = 1.8.0_275
************************************************************/
20/12/13 11:31:39 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
20/12/13 11:31:40 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-17273503-35c6-49ff-a837-e628584a6685
20/12/13 11:31:41 INFO namenode.FSEditLog: Edit logging is async:true
20/12/13 11:31:41 INFO namenode.FSNamesystem: KeyProvider: null
20/12/13 11:31:41 INFO namenode.FSNamesystem: fsLock is fair: true
20/12/13 11:31:41 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
20/12/13 11:31:41 INFO namenode.FSNamesystem: fsOwner = root (auth:SIMPLE)
20/12/13 11:31:41 INFO namenode.FSNamesystem: supergroup = supergroup
20/12/13 11:31:41 INFO namenode.FSNamesystem: isPermissionEnabled = true
20/12/13 11:31:41 INFO namenode.FSNamesystem: HA Enabled: false
20/12/13 11:31:41 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
20/12/13 11:31:41 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit: configured=1000, counted=60, effected=1000
20/12/13 11:31:41 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
20/12/13 11:31:41 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
20/12/13 11:31:41 INFO blockmanagement.BlockManager: The block deletion will start around 2020 Dec 13 11:31:41
20/12/13 11:31:41 INFO util.GSet: Computing capacity for map BlocksMap
20/12/13 11:31:41 INFO util.GSet: VM type = 64-bit
20/12/13 11:31:41 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
20/12/13 11:31:41 INFO util.GSet: capacity = 2^21 = 2097152 entries
20/12/13 11:31:41 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
20/12/13 11:31:41 WARN conf.Configuration: No unit for dfs.heartbeat.interval(3) assuming SECONDS
20/12/13 11:31:41 WARN conf.Configuration: No unit for dfs.namenode.safemode.extension(30000) assuming MILLISECONDS
20/12/13 11:31:41 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
20/12/13 11:31:41 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
20/12/13 11:31:41 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
20/12/13 11:31:41 INFO blockmanagement.BlockManager: defaultReplication = 3
20/12/13 11:31:41 INFO blockmanagement.BlockManager: maxReplication = 512
20/12/13 11:31:41 INFO blockmanagement.BlockManager: minReplication = 1
20/12/13 11:31:41 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
20/12/13 11:31:41 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
20/12/13 11:31:41 INFO blockmanagement.BlockManager: encryptDataTransfer = false
20/12/13 11:31:41 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
20/12/13 11:31:41 INFO namenode.FSNamesystem: Append Enabled: true
20/12/13 11:31:41 INFO namenode.FSDirectory: GLOBAL serial map: bits=24 maxEntries=16777215
20/12/13 11:31:41 INFO util.GSet: Computing capacity for map INodeMap
20/12/13 11:31:41 INFO util.GSet: VM type = 64-bit
20/12/13 11:31:41 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
20/12/13 11:31:41 INFO util.GSet: capacity = 2^20 = 1048576 entries
20/12/13 11:31:41 INFO namenode.FSDirectory: ACLs enabled? false
20/12/13 11:31:41 INFO namenode.FSDirectory: XAttrs enabled? true
20/12/13 11:31:41 INFO namenode.NameNode: Caching file names occurring more than 10 times
20/12/13 11:31:41 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: falseskipCaptureAccessTimeOnlyChange: false
20/12/13 11:31:41 INFO util.GSet: Computing capacity for map cachedBlocks
20/12/13 11:31:41 INFO util.GSet: VM type = 64-bit
20/12/13 11:31:41 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
20/12/13 11:31:41 INFO util.GSet: capacity = 2^18 = 262144 entries
20/12/13 11:31:41 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
20/12/13 11:31:41 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
20/12/13 11:31:41 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
20/12/13 11:31:42 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
20/12/13 11:31:42 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
20/12/13 11:31:42 INFO util.GSet: Computing capacity for map NameNodeRetryCache
20/12/13 11:31:42 INFO util.GSet: VM type = 64-bit
20/12/13 11:31:42 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
20/12/13 11:31:42 INFO util.GSet: capacity = 2^15 = 32768 entries
20/12/13 11:31:42 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1937264665-10.101.3.195-1607830302045
20/12/13 11:31:42 INFO common.Storage: Storage directory /usr/local/hadoop/tmp/dfs/name has been successfully formatted.
20/12/13 11:31:42 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
20/12/13 11:31:42 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds .
20/12/13 11:31:42 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
20/12/13 11:31:42 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid = 0 when meet shutdown.
20/12/13 11:31:42 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at shugenatschool/10.101.3.195
************************************************************/

We only need to see this one line:

20/12/13 11:31:42 INFO common.Storage: Storage directory /usr/local/hadoop/tmp/dfs/name has been successfully formatted.

Starting Hadoop

Run the command start-dfs.sh.

You should get the following output:

Starting namenodes on [shugenatschool]
shugenatschool: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-shugenatschool.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-shugenatschool.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-shugenatschool.out

Then run the command start-yarn.sh.

You should get the following output:

starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-root-resourcemanager-shugenatschool.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-shugenatschool.out

At this point Hadoop should be up and running.

We can use jps to see the running Hadoop processes.

You should get output similar to the following:
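A rough sketch of what that listing looks like for this single-node setup, one line per daemon (the PIDs below are placeholders and will differ on your machine):

12001 NameNode
12002 DataNode
12003 SecondaryNameNode
12004 ResourceManager
12005 NodeManager
12006 Jps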

Running the wordcount Sample Program

Adding the input file

First, create the input folder in HDFS (this produces no output):

hadoop fs -mkdir /input

Then drop in the file you want to count; here we use README.txt as an example (no output):

hadoop fs -put README.txt /input

If you want to make sure it's there, list the files under the input folder:

hadoop fs -ls /input

You should get output similar to the following:

Running the sample program

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /input /output

This produces the following log:

Log:
20/12/13 12:06:32 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
20/12/13 12:06:32 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
20/12/13 12:06:32 INFO input.FileInputFormat: Total input files to process : 1
20/12/13 12:06:32 INFO mapreduce.JobSubmitter: number of splits:1
20/12/13 12:06:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local327845771_0001
20/12/13 12:06:33 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
20/12/13 12:06:33 INFO mapreduce.Job: Running job: job_local327845771_0001
20/12/13 12:06:33 INFO mapred.LocalJobRunner: OutputCommitter set in config null
20/12/13 12:06:33 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
20/12/13 12:06:33 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/12/13 12:06:33 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
20/12/13 12:06:33 INFO mapred.LocalJobRunner: Waiting for map tasks
20/12/13 12:06:33 INFO mapred.LocalJobRunner: Starting task: attempt_local327845771_0001_m_000000_0
20/12/13 12:06:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
20/12/13 12:06:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/12/13 12:06:34 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
20/12/13 12:06:34 INFO mapred.MapTask: Processing split: hdfs://shugenatschool:9000/input/README.txt:0+1366
20/12/13 12:06:34 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
20/12/13 12:06:34 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
20/12/13 12:06:34 INFO mapred.MapTask: soft limit at 83886080
20/12/13 12:06:34 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
20/12/13 12:06:34 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
20/12/13 12:06:34 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
20/12/13 12:06:34 INFO mapred.LocalJobRunner:
20/12/13 12:06:34 INFO mapred.MapTask: Starting flush of map output
20/12/13 12:06:34 INFO mapred.MapTask: Spilling map output
20/12/13 12:06:34 INFO mapred.MapTask: bufstart = 0; bufend = 2055; bufvoid = 104857600
20/12/13 12:06:34 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26213684(104854736); length = 713/6553600
20/12/13 12:06:34 INFO mapred.MapTask: Finished spill 0
20/12/13 12:06:34 INFO mapred.Task: Task:attempt_local327845771_0001_m_000000_0 is done. And is in the process of committing
20/12/13 12:06:34 INFO mapred.LocalJobRunner: map
20/12/13 12:06:34 INFO mapred.Task: Task 'attempt_local327845771_0001_m_000000_0' done.
20/12/13 12:06:34 INFO mapred.Task: Final Counters for attempt_local327845771_0001_m_000000_0: Counters: 23
File System Counters
FILE: Number of bytes read=303493
FILE: Number of bytes written=798348
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1366
HDFS: Number of bytes written=0
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=1
Map-Reduce Framework
Map input records=31
Map output records=179
Map output bytes=2055
Map output materialized bytes=1836
Input split bytes=108
Combine input records=179
Combine output records=131
Spilled Records=131
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=13
Total committed heap usage (bytes)=243269632
File Input Format Counters
Bytes Read=1366
20/12/13 12:06:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local327845771_0001_m_000000_0
20/12/13 12:06:34 INFO mapred.LocalJobRunner: map task executor complete.
20/12/13 12:06:34 INFO mapred.LocalJobRunner: Waiting for reduce tasks
20/12/13 12:06:34 INFO mapred.LocalJobRunner: Starting task: attempt_local327845771_0001_r_000000_0
20/12/13 12:06:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
20/12/13 12:06:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/12/13 12:06:34 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
20/12/13 12:06:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@781b9f43
20/12/13 12:06:34 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
20/12/13 12:06:34 INFO reduce.EventFetcher: attempt_local327845771_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
20/12/13 12:06:34 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local327845771_0001_m_000000_0 decomp: 1832 len: 1836 to MEMORY
20/12/13 12:06:34 INFO reduce.InMemoryMapOutput: Read 1832 bytes from map-output for attempt_local327845771_0001_m_000000_0
20/12/13 12:06:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 1832, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->1832
20/12/13 12:06:34 WARN io.ReadaheadPool: Failed readahead on ifile
EBADF: Bad file descriptor
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:208)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/12/13 12:06:34 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
20/12/13 12:06:34 INFO mapreduce.Job: Job job_local327845771_0001 running in uber mode : false
20/12/13 12:06:34 INFO mapreduce.Job: map 100% reduce 0%
20/12/13 12:06:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
20/12/13 12:06:34 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
20/12/13 12:06:34 INFO mapred.Merger: Merging 1 sorted segments
20/12/13 12:06:34 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1823 bytes
20/12/13 12:06:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 1832 bytes to disk to satisfy reduce memory limit
20/12/13 12:06:34 INFO reduce.MergeManagerImpl: Merging 1 files, 1836 bytes from disk
20/12/13 12:06:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
20/12/13 12:06:34 INFO mapred.Merger: Merging 1 sorted segments
20/12/13 12:06:34 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1823 bytes
20/12/13 12:06:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
20/12/13 12:06:34 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
20/12/13 12:06:35 INFO mapred.Task: Task:attempt_local327845771_0001_r_000000_0 is done. And is in the process of committing
20/12/13 12:06:35 INFO mapred.LocalJobRunner: 1 / 1 copied.
20/12/13 12:06:35 INFO mapred.Task: Task attempt_local327845771_0001_r_000000_0 is allowed to commit now
20/12/13 12:06:35 INFO output.FileOutputCommitter: Saved output of task 'attempt_local327845771_0001_r_000000_0' to hdfs://shugenatschool:9000/output/_temporary/0/task_local327845771_0001_r_000000
20/12/13 12:06:35 INFO mapred.LocalJobRunner: reduce > reduce
20/12/13 12:06:35 INFO mapred.Task: Task 'attempt_local327845771_0001_r_000000_0' done.
20/12/13 12:06:35 INFO mapred.Task: Final Counters for attempt_local327845771_0001_r_000000_0: Counters: 29
File System Counters
FILE: Number of bytes read=307197
FILE: Number of bytes written=800184
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1366
HDFS: Number of bytes written=1306
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Map-Reduce Framework
Combine input records=0
Combine output records=0
Reduce input groups=131
Reduce shuffle bytes=1836
Reduce input records=131
Reduce output records=131
Spilled Records=131
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=243269632
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Output Format Counters
Bytes Written=1306
20/12/13 12:06:35 INFO mapred.LocalJobRunner: Finishing task: attempt_local327845771_0001_r_000000_0
20/12/13 12:06:35 INFO mapred.LocalJobRunner: reduce task executor complete.
20/12/13 12:06:35 INFO mapreduce.Job: map 100% reduce 100%
20/12/13 12:06:35 INFO mapreduce.Job: Job job_local327845771_0001 completed successfully
20/12/13 12:06:35 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=610690
FILE: Number of bytes written=1598532
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2732
HDFS: Number of bytes written=1306
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=31
Map output records=179
Map output bytes=2055
Map output materialized bytes=1836
Input split bytes=108
Combine input records=179
Combine output records=131
Reduce input groups=131
Reduce shuffle bytes=1836
Reduce input records=131
Reduce output records=131
Spilled Records=262
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=13
Total committed heap usage (bytes)=486539264
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1366
File Output Format Counters
Bytes Written=1306

Never mind all that; let's just pull the output down, fetching the whole output folder:

hadoop fs -get /output

In output/part-r-00000 you can see word-count results similar to this:

You can copy the whole thing into Excel; it will automatically split into two columns, one with the word and one with its count.

Some odd tokens may have crept in because of imperfect tokenization; feel free to delete them yourself.
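If you'd rather stay in the terminal than paste into Excel, here is a small sketch for inspecting the result directly (assuming the default /output path used above):

hadoop fs -cat /output/part-r-00000 | sort -k2 -nr | head -n 20
# prints the 20 most frequent words: column 1 is the word, column 2 its count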

Closing Remarks

And with that, this practice exercise is complete.

By the way, you're welcome to use this document as supplementary teaching material; if you modify it and republish it, please make absolutely sure the content remains correct!

Rant

The materials the teacher handed out seem pretty mediocre... Good thing I've spent plenty of time wrestling with Linux-family systems.