For a more detailed guide on how to use, compose, and work with SparkApplications, please refer to the User Guide.
To get the Spark Operator, run the following commands:
$ mkdir -p $GOPATH/src/k8s.io
$ cd $GOPATH/src/k8s.io
$ git clone [email protected]:GoogleCloudPlatform/spark-on-k8s-operator.git
The Spark Operator uses dep for dependency management. Please install dep following the instructions on its website if you don't have it available locally. To install the dependencies, run the following command:
$ dep ensure
To update the dependencies, run the following command:
$ dep ensure -update
Before building the Spark Operator the first time, run the following commands to get the required Kubernetes code generators:
$ go get -u k8s.io/code-generator/cmd/deepcopy-gen
$ go get -u k8s.io/code-generator/cmd/defaulter-gen
To build the Spark Operator, run the following command:
$ make build
To build a Docker image of the Spark Operator, run the following command:
$ make image-tag=<image tag> image
To push the Docker image to Docker Hub, run the following command:
$ make image-tag=<image tag> push
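For example, with a hypothetical Docker Hub repository myuser/spark-operator, the build and push steps might look like:
$ make image-tag=myuser/spark-operator:latest image
$ make image-tag=myuser/spark-operator:latest push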
To install the Spark Operator on a Kubernetes cluster, run the following command:
$ kubectl apply -f manifest/
This will create a namespace sparkoperator, set up RBAC for the Spark Operator to run in the namespace, and create a Deployment named sparkoperator in the namespace.
Due to a known issue in GKE, you will need to first grant yourself cluster-admin privileges before you can create custom roles and role bindings on a GKE cluster running version 1.6 or later.
$ kubectl create clusterrolebinding <user>-cluster-admin-binding --clusterrole=cluster-admin --user=<user>@<domain>
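For example, a hypothetical user jane authenticated as [email protected] would run:
$ kubectl create clusterrolebinding jane-cluster-admin-binding --clusterrole=cluster-admin [email protected]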
Now you should see the Spark Operator running in the cluster by checking the status of the Deployment.
$ kubectl describe deployment sparkoperator -n sparkoperator
The Spark Operator is typically deployed and run using manifest/spark-operator.yaml through a Kubernetes Deployment. However, users can still run it outside a Kubernetes cluster and have it talk to the Kubernetes API server of a cluster by specifying the path to a kubeconfig file, which can be done using the -kubeconfig flag.
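For example, assuming make build produces a spark-operator binary in the current directory (an assumption; the actual output path may differ), running the operator locally against the cluster configured in your kubeconfig might look like:
$ ./spark-operator -kubeconfig $HOME/.kube/config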
The Spark Operator uses multiple workers in the SparkApplication controller, the initializer, and the submission runner. The numbers of worker threads to use in the three places are controlled by the command-line flags -controller-threads, -initializer-threads, and -submission-threads, respectively. The default values for the flags are 10, 10, and 3, respectively.
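Using the same assumed binary name as above, the following invocation would raise the controller workers to 20 and the submission runners to 5 while keeping the initializer default:
$ ./spark-operator -kubeconfig $HOME/.kube/config -controller-threads=20 -submission-threads=5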
The initializer is an optional component and can be enabled or disabled using the -enable-initializer flag, which defaults to true. Since the initializer is an alpha feature, it won't function in Kubernetes clusters without alpha features enabled. In this case, it can be disabled by adding the argument -enable-initializer=false to spark-operator.yaml.
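For example, a hypothetical excerpt of the operator's container spec in manifest/spark-operator.yaml might look like the following after the change (the container name and overall layout of the actual manifest may differ):
spec:
  containers:
  - name: sparkoperator
    image: <operator image>
    args:
    - -enable-initializer=false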
To upgrade the Spark Operator, e.g., to use a newer container image with a new tag, run the following command with the updated YAML file for the Deployment of the Spark Operator:
$ kubectl apply -f manifest/spark-operator.yaml
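Alternatively, if only the image tag changed, the Deployment can be updated in place; this sketch assumes the Deployment and its container are both named sparkoperator and run in the sparkoperator namespace (check the manifest for the actual names):
$ kubectl set image deployment/sparkoperator sparkoperator=<image>:<new tag> -n sparkoperator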
To run the Spark Pi example, run the following command:
$ kubectl apply -f examples/spark-pi.yaml
This will create a SparkApplication object named spark-pi. Check the object by running the following command:
$ kubectl get sparkapplications spark-pi -o=yaml
This will show something similar to the following:
apiVersion: sparkoperator.k8s.io/v1alpha1
kind: SparkApplication
metadata:
  ...
spec:
  deps: {}
  driver:
    coreLimit: 200m
    cores: 0.1
    labels:
      version: 2.3.0
    memory: 512m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    labels:
      version: 2.3.0
    memory: 512m
  image: gcr.io/ynli-k8s/spark:v2.3.0
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
  mainClass: org.apache.spark.examples.SparkPi
  mode: cluster
  restartPolicy: Never
  type: Scala
status:
  appId: spark-pi-2402118027
  applicationState:
    state: COMPLETED
  completionTime: 2018-02-20T23:33:55Z
  driverInfo:
    podName: spark-pi-83ba921c85ff3f1cb04bef324f9154c9-driver
    webUIAddress: 35.192.234.248:31064
    webUIPort: 31064
    webUIServiceName: spark-pi-2402118027-ui-svc
  executorState:
    spark-pi-83ba921c85ff3f1cb04bef324f9154c9-exec-1: COMPLETED
  submissionTime: 2018-02-20T23:32:27Z
To check events for the SparkApplication object, run the following command:
$ kubectl describe sparkapplications spark-pi
This will show events similar to the following:
Events:
  Type    Reason                      Age   From            Message
  ----    ------                      ----  ----            -------
  Normal  SparkApplicationAdded       5m    spark-operator  SparkApplication spark-pi was added, enqueued it for submission
  Normal  SparkApplicationTerminated  4m    spark-operator  SparkApplication spark-pi terminated with state: COMPLETED
The Spark Operator submits the Spark Pi example to run once it receives an event indicating the SparkApplication object was added.
The Spark Operator comes with an optional initializer for customizing Spark driver and executor pods based on the specification in SparkApplication objects, e.g., mounting user-specified ConfigMaps.
The initializer works independently, with or without the CRD controller. It works by looking for certain custom annotations on Spark driver and executor Pods to perform its tasks. The annotations are added automatically by the CRD controller based on application specifications in the SparkApplication objects. Alternatively, to use the initializer without the controller, the needed annotations can be added manually to the driver and executor Pods using the following Spark configuration properties when submitting your Spark applications with the spark-submit script:
--conf spark.kubernetes.driver.annotations.[AnnotationName]=value
--conf spark.kubernetes.executor.annotations.[AnnotationName]=value
Currently the following annotations are supported:

| Annotation | Value |
| --- | --- |
| sparkoperator.k8s.io/sparkConfigMap | Name of the Kubernetes ConfigMap storing Spark configuration files (to which SPARK_CONF_DIR applies) |
| sparkoperator.k8s.io/hadoopConfigMap | Name of the Kubernetes ConfigMap storing Hadoop configuration files (to which HADOOP_CONF_DIR applies) |
| sparkoperator.k8s.io/configMap.[ConfigMapName] | Mount path of the ConfigMap named ConfigMapName |
| sparkoperator.k8s.io/GCPServiceAccount.[ServiceAccountSecretName] | Mount path of the secret storing GCP service account credentials (typically a JSON key file) named ServiceAccountSecretName |
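For example, a hypothetical spark-submit invocation that asks the initializer to mount a Spark ConfigMap named my-spark-conf (an assumed name) into the driver might look like:
$ bin/spark-submit \
    --master k8s://https://<api server host>:<port> \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.kubernetes.driver.annotations.sparkoperator.k8s.io/sparkConfigMap=my-spark-conf \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar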