Spark Environment Setup (1): Local Mode

| Term | Meaning |
|---|---|
| Application | User program built on Spark. Consists of a driver program and executors on the cluster. |
| Application jar | A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime. |
| Driver | The process running the main() function of the application and creating the SparkContext. |
| Cluster manager | An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). |
| Deploy mode | Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster. |
| Worker node | Any node that can run application code in the cluster. |
| Executor | A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. |
| Task | A unit of work that will be sent to one executor. |
| Job | A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs. |
| Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs. |
Environment Deployment
Official documentation
https://spark.apache.org/docs/2.4.8/
Download
https://archive.apache.org/dist/spark/spark-2.4.8/
Installation directory
/opt/bigdata/spark
Deployment plan
node0: Local mode (Linux file system & HDFS)
node1, node2, node3: configured as a Spark cluster.

Create the bigdata directory under /opt/ and change its owner and group to zhangsan; we will install Spark into this directory.

```shell
# Create the /opt/bigdata/spark directory
```
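The directory setup described above can be sketched as follows (assumptions: the commands are run as root on node0, and the zhangsan user already exists):

```shell
# Run as root (assumption)
mkdir -p /opt/bigdata/spark
# Give ownership to zhangsan so the remaining steps do not need root
chown -R zhangsan:zhangsan /opt/bigdata
```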
Use a file-transfer tool such as Xftp to upload the Spark tarball into /opt/bigdata/spark on the CentOS 7 system.
Running Locally
Reading and Writing the Linux File System
Local mode works out of the box: it only needs a Java environment and no further configuration. In this mode, Spark can only read and write files on the local Linux file system; it cannot access HDFS.
Extract

```shell
[zhangsan@node0 ~]$ cd /opt/bigdata/spark/
```
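The extraction itself might look roughly like this (assumptions: the tarball name matches the 2.4.8 download, and a `default` symlink is created, which matches the `cd default/bin` used later):

```shell
# Unpack the uploaded tarball (exact filename is an assumption)
tar -zxvf spark-2.4.8-bin-hadoop2.7.tgz
# Version-independent entry point; matches "cd default/bin" used below
ln -s spark-2.4.8-bin-hadoop2.7 default
```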
Configuration
No configuration is needed.
Start

```shell
[zhangsan@node0 spark]$ cd default/bin/
```

You can check on jobs through the Spark web UI. For now the page shows no running jobs, because we have not executed any Spark job yet.

http://node0:4040

In local mode only one process is started.

```shell
[zhangsan@node0 ~]$ jps
```
The --master Parameter

| --master value | Meaning |
|---|---|
| local | Run Spark locally with one worker thread |
| local[K] | Run Spark locally with K worker threads |
| local[*] | Run Spark locally with as many worker threads as the machine has CPU cores (the default) |
| spark://HOST:PORT | Connect to the given standalone cluster. HOST is the hostname or IP of the Spark Master; the default port is 7077. |
| mesos://HOST:PORT | Connect to the given Mesos cluster. HOST is the hostname or IP of the Mesos Master; the default port is 5050. |
| yarn | Connect to a YARN cluster, in client mode by default; the cluster location is taken from the YARN_CONF_DIR environment variable. |
Before Spark 2.0, the YARN master was written as yarn-client or yarn-cluster.
Since Spark 2.0, you instead set --deploy-mode client or --deploy-mode cluster to choose how to connect to the YARN cluster.
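For illustration, here is how a few of these values are passed when launching spark-shell (a sketch only; that node1 hosts the standalone Master is an assumption about the cluster layout):

```shell
# Local mode with 2 worker threads
./spark-shell --master local[2]

# Standalone cluster (assumption: the Master runs on node1, default port 7077)
./spark-shell --master spark://node1:7077

# YARN in client mode (requires YARN_CONF_DIR/HADOOP_CONF_DIR to be set)
./spark-shell --master yarn --deploy-mode client
```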
Example: wordcount
Prepare a test file, /home/zhangsan/bigdata.txt:

```
hello
```

Launch spark-shell and run the word count:

```scala
scala> var wordcount = sc.textFile("/home/zhangsan/bigdata.txt").flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
```
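As a sanity check, the same counting logic can be reproduced with plain coreutils (the sample input below is illustrative, not the document's actual file):

```shell
# Build a small sample input (contents are an assumption)
printf 'hello spark\nhello hdfs\n' > /tmp/bigdata.txt

# One word per line, then count duplicates -- the same shape as
# flatMap(_.split(" ")) -> map((_, 1)) -> reduceByKey(_ + _)
tr -s ' ' '\n' < /tmp/bigdata.txt | sort | uniq -c | sort -rn
# top line: "2 hello"
```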
Open the web UI at http://node0:4040 to view the job's progress; this address is served by the driver that executes the tasks. Note that a job appears only after an action (for example wordcount.collect) has run.

You can see that this job consists of two stages.

Unpacked and run as-is, Spark can only read and write the local Linux file system. Next we configure Spark so that it can also read and write HDFS.
Reading and Writing HDFS
In the experiments that follow we need Spark to read and write HDFS, so a Hadoop pseudo-distributed environment must be set up first.
Configuration Files
Add the lines below to Spark's spark-env.sh configuration file.

```shell
# Enter the directory holding the Spark configuration files
```
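The essential spark-env.sh setting is to point Spark at Hadoop's client configuration so that hdfs:// paths resolve; roughly as below (the Hadoop install path is an assumption; adjust it to your environment):

```shell
# Appended to $SPARK_HOME/conf/spark-env.sh
# Path is an assumption -- point it at your Hadoop configuration directory
export HADOOP_CONF_DIR=/opt/bigdata/hadoop/default/etc/hadoop
```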
Test
Format the NameNode

```shell
[zhangsan@node0 ~]$ hadoop namenode -format
```

Start Hadoop

```shell
[zhangsan@node0 ~]$ start-all.sh
```

Create a directory

```shell
[zhangsan@node0 ~]$ hdfs dfs -mkdir /input
```

Upload the test data

```shell
[zhangsan@node0 ~]$ hdfs dfs -put bigdata.txt /input
```
wordcount
Launch spark-shell from $SPARK_HOME/bin and run:

```scala
scala> var wordcount = sc.textFile("hdfs:///input/bigdata.txt").flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
```

Exit

```scala
scala> :quit
```