Background
Spark is a new-generation distributed in-memory computing framework and a top-level Apache open-source project. Compared with the Hadoop MapReduce framework, Spark keeps intermediate results in memory, yielding a 10- to 100-fold speedup. It also offers a richer set of operators and uses Resilient Distributed Datasets (RDDs) for iterative computation, which makes it well suited to data mining and machine learning algorithms and greatly improves development efficiency.
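The in-memory, iterative style is easiest to see in a small example. The sketch below (PySpark; the data and application name are made up for illustration and are not part of this guide's deployment steps) caches an RDD once and reuses it across several passes, so each pass reads from memory instead of re-reading from disk:
from pyspark import SparkContext

sc = SparkContext(appName="iterative-demo")  # hypothetical app name
# Materialize once and keep the partitions in memory for reuse.
data = sc.parallelize(range(1, 1001)).cache()

total = 0
for i in range(10):
    # Each pass reuses the cached partitions instead of rebuilding them.
    total += data.map(lambda x: x * i).sum()
print(total)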
Prerequisites
- Make sure you have created an SCE cluster. For details, see Create an SCE Cluster.
- Make sure the kubectl tool is connected to the target cluster.
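A quick way to confirm that kubectl is indeed talking to the target cluster (an optional sanity check, not required by this guide) is to list the nodes:
$ kubectl get nodes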
Procedure
Step 1: Prepare images and create a namespace
- Pull the Spark-related images from the Docker Hub registry.
docker pull index.docker.io/caicloud/spark:1.5.2
docker pull index.docker.io/caicloud/zeppelin:0.5.6
- Create the namespace.
#namespace-spark-cluster.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: spark-cluster
  labels:
    name: spark-cluster
$ kubectl create -f namespace-spark-cluster.yaml
- View the namespace.
$ kubectl get ns
NAME            LABELS               STATUS
default         <none>               Active
spark-cluster   name=spark-cluster   Active
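All resources below live in the spark-cluster namespace, so most commands in this guide pass -nspark-cluster explicitly. Optionally, you can instead make it the default namespace of your current kubectl context (a standard kubectl command in recent versions):
$ kubectl config set-context --current --namespace=spark-cluster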
Step 2: Start the master service
- Create a stateless workload. spark-master-deployment.yaml can look like the following:
#spark-master-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-master-controller
  namespace: spark-cluster
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
      - name: spark-master
        image: index.****/spark:1.5.2 ## replace with your own Spark image
        imagePullPolicy: Always
        command: ["/start-master"]
        ports:
        - containerPort: 7077
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
$ kubectl create -f spark-master-deployment.yaml
- Create the master Service. spark-master-service.yaml can look like the following:
# spark-master-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: spark-master
  namespace: spark-cluster
spec:
  ports:
  - port: 7077
    targetPort: 7077
  selector:
    component: spark-master
$ kubectl create -f spark-master-service.yaml
service "spark-master"created
- Create the WebUI Service. spark-webui.yaml can look like the following:
# spark-webui.yaml
kind: Service
apiVersion: v1
metadata:
  name: spark-webui
  namespace: spark-cluster
spec:
  ports:
  - port: 8080
    targetPort: 8080
  selector:
    component: spark-master
$ kubectl create -f spark-webui.yaml
service "spark-webui" created
- Check that the master is running and reachable:
$ kubectl get deploy -nspark-cluster
NAME                      DESIRED   CURRENT   AGE
spark-master-controller   1         1         23h
$ kubectl get svc -nspark-cluster
NAME           CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
spark-master   10.254.106.29   <none>        7077/TCP   1d
spark-webui    10.254.66.138   <none>        8080/TCP   18h
$ kubectl get pod -nspark-cluster
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-b3gbf   1/1       Running   0          23h
- After confirming that the master is running normally, connect to the Spark WebUI through the Kubernetes proxy:
$ kubectl proxy --port=8001
Then open http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-webui/ in a browser to view the running status of Spark jobs. Replace localhost with the IP of the host on which kubectl proxy was executed; if kubectl proxy runs on your local machine, simply use localhost in a local browser. Note that kubectl proxy binds to 127.0.0.1 by default, so to reach it from another machine you also need to start it with --address=0.0.0.0 (and a suitable --accept-hosts).
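If you prefer not to run a proxy, forwarding the WebUI Service port directly also works (an alternative to the flow above, using a standard kubectl command):
$ kubectl port-forward -nspark-cluster svc/spark-webui 8080:8080
Then browse to http://localhost:8080/ instead.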
Step 3: Start the Spark workers
Spark workers need the master Service to be running when they start. You can set the number of workers by changing replicas (for example, replicas: 4 creates 4 Spark workers); the Deployment can also be scaled after creation, as shown below. You can set CPU and memory requests for each worker so that the Spark workers do not crowd out other applications in the cluster. spark-worker-deployment.yaml can look like the following:
#spark-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-worker-controller
  namespace: spark-cluster
spec:
  replicas: 4
  selector:
    matchLabels:
      component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
      - name: spark-worker
        image: index.caicloud.io/spark:1.5.2
        imagePullPolicy: Always
        command: ["/start-worker"]
        ports:
        - containerPort: 8081
        resources:
          requests:
            cpu: 100m
$ kubectl create -f spark-worker-deployment.yaml
deployment "spark-worker-controller" created
Check whether the workers are running by querying their status with kubectl (all spark-worker pods should be Running):
$ kubectl get pods -nspark-cluster
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-b3gbf   1/1       Running   0          1d
spark-worker-controller-ill4z   1/1       Running   1          2h
spark-worker-controller-j29sc   1/1       Running   0          2h
spark-worker-controller-siue2   1/1       Running   0          2h
spark-worker-controller-zd5kb   1/1       Running   0          2h
Check via the WebUI: once the workers are ready, they should appear in the UI.
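As an additional, optional check: each worker that joins the cluster is logged by the Spark master with a "Registering worker" line, so grepping the master pod's log (pod name taken from the listing above) also confirms registration:
$ kubectl logs spark-master-controller-b3gbf -nspark-cluster | grep "Registering worker"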
Step 4: Submit Spark jobs
- Through the Spark client, you can use spark-submit to submit complex Python scripts or Java/Scala jar packages.
$ kubectl get pods -nspark-cluster | grep worker
NAME                            READY     STATUS    RESTARTS   AGE
spark-worker-controller-1h0l7   1/1       Running   0          4h
spark-worker-controller-d43wa   1/1       Running   0          4h
spark-worker-controller-ka78h   1/1       Running   0          4h
spark-worker-controller-sucl7   1/1       Running   0          4h
$ kubectl exec -it spark-worker-controller-1h0l7 -- bash
$ cd /opt/spark
# Submit a Python Spark job
./bin/spark-submit \
--executor-memory 4G \
--master spark://spark-master:7077 \
examples/src/main/python/wordcount.py \
"hdfs://hadoop-namenode:9000/caicloud/spark/data"
# Submit a Scala Spark job
./bin/spark-submit \
--executor-memory 4G \
--master spark://spark-master:7077 \
--class io.caicloud.LinearRegression \
/nfs/caicloud/spark-mllib-1.0-SNAPSHOT.jar \
"hdfs://hadoop-namenode:9000/caicloud/spark/data"
- Through Zeppelin, you can write simple Spark code directly on the command line or in the UI.
First create the Zeppelin workload (a sketch of the manifest follows):
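zeppelin-controller.yaml is not listed in this guide. A minimal sketch, assuming the caicloud Zeppelin image pulled in step 1 and the component=zeppelin label that the query below selects on, could look like the following; adjust it to your environment:
# zeppelin-controller.yaml (sketch; image and label taken from this guide)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zeppelin-controller
  namespace: spark-cluster
spec:
  replicas: 1
  selector:
    matchLabels:
      component: zeppelin
  template:
    metadata:
      labels:
        component: zeppelin
    spec:
      containers:
      - name: zeppelin
        image: index.docker.io/caicloud/zeppelin:0.5.6
        ports:
        - containerPort: 8080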
$ kubectl create -f zeppelin-controller.yaml
deployment "zeppelin-controller"created
$ kubectl get pods -nspark-cluster -l component=zeppelin
NAME                        READY     STATUS    RESTARTS   AGE
zeppelin-controller-5g25x   1/1       Running   0          5h
Using the Zeppelin pod created above, set up a port mapping for the WebUI:
$ kubectl port-forward zeppelin-controller-5g25x 8080:8080
Visit http://localhost:8080/ and submit test code.
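For a first test in the notebook UI, a one-line paragraph is enough. The example below uses the %pyspark interpreter (assuming it is bound in this image; the default %spark Scala interpreter works the same way) and should print 5050:
%pyspark
# Sum the numbers 1..100 across the workers.
print(sc.parallelize(range(1, 101)).sum())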