
Quick Start Guide

For a more detailed guide on how to use, compose, and work with SparkApplications, please refer to the User Guide.

Table of Contents

  1. Build and Installation
  2. Configuration
  3. Upgrade
  4. Running the Examples
  5. Using the Initializer

Build and Installation

To get the Spark Operator, run the following commands:

$ mkdir -p $GOPATH/src/k8s.io
$ cd $GOPATH/src/k8s.io
$ git clone git@github.com:GoogleCloudPlatform/spark-on-k8s-operator.git

The Spark Operator uses dep for dependency management. Please install dep following the instructions on its website if you don't have it available locally. To install the dependencies, run the following command:

$ dep ensure

To update the dependencies, run the following command:

$ dep ensure -update

Before building the Spark Operator the first time, run the following commands to get the required Kubernetes code generators:

$ go get -u k8s.io/code-generator/cmd/deepcopy-gen
$ go get -u k8s.io/code-generator/cmd/defaulter-gen

To build the Spark Operator, run the following command:

$ make build

To build a Docker image of the Spark Operator, run the following command:

$ make image-tag=<image tag> image

To push the Docker image to Docker Hub, run the following command:

$ make image-tag=<image tag> push

To install the Spark Operator on a Kubernetes cluster, run the following command:

$ kubectl apply -f manifest/

This will create a namespace sparkoperator, set up RBAC for the Spark Operator to run in the namespace, and create a Deployment named sparkoperator in the namespace.

Due to a known issue in GKE, you will first need to grant yourself cluster-admin privileges before you can create custom roles and role bindings on a GKE cluster running Kubernetes 1.6 or above.

$ kubectl create clusterrolebinding <user>-cluster-admin-binding --clusterrole=cluster-admin --user=<user>@<domain>

You can verify that the Spark Operator is up and running by checking the status of the Deployment:

$ kubectl describe deployment sparkoperator -n sparkoperator
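
You can also list the operator's Pods in the sparkoperator namespace:

$ kubectl get pods -n sparkoperator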

Configuration

The Spark Operator is typically deployed and run using manifest/spark-operator.yaml through a Kubernetes Deployment. However, users can still run it outside a Kubernetes cluster and have it talk to the Kubernetes API server of a cluster by specifying the path to a kubeconfig file via the -kubeconfig flag.
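
For example, a minimal sketch of running the operator locally against a remote cluster, assuming make build produced a binary named spark-operator in the current directory (the binary name and path are assumptions):

$ ./spark-operator -kubeconfig=$HOME/.kube/config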

The Spark Operator uses multiple workers in the SparkApplication controller, the initializer, and the submission runner. The number of worker threads used in these three places is controlled by the command-line flags -controller-threads, -initializer-threads, and -submission-threads, respectively. The default values of the flags are 10, 10, and 3, respectively.

The initializer is an optional component and can be enabled or disabled using the -enable-initializer flag, which defaults to true. Since the initializer is an alpha feature, it won't function in Kubernetes clusters without alpha features enabled. In this case, it can be disabled by adding the argument -enable-initializer=false to the container arguments in spark-operator.yaml, as sketched below.
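
A minimal sketch of how these flags might be set in the Deployment spec in manifest/spark-operator.yaml (the container name and image below are illustrative, not the manifest's actual values):

spec:
  template:
    spec:
      containers:
        - name: sparkoperator
          image: <operator image>:<image tag>
          args:
            - -controller-threads=10
            - -submission-threads=3
            - -enable-initializer=false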

Upgrade

To upgrade the Spark Operator, e.g., to use a newer container image with a new tag, run the following command with the updated YAML file for the Deployment of the Spark Operator:

$ kubectl apply -f manifest/spark-operator.yaml
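
Alternatively, if only the image tag changes, you can update the image in place and watch the rollout, assuming the container in the Deployment is named sparkoperator:

$ kubectl set image deployment/sparkoperator sparkoperator=<image>:<new tag> -n sparkoperator
$ kubectl rollout status deployment/sparkoperator -n sparkoperator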

Running the Examples

To run the Spark Pi example, run the following command:

$ kubectl apply -f examples/spark-pi.yaml

This will create a SparkApplication object named spark-pi. Check the object by running the following command:

$ kubectl get sparkapplications spark-pi -o=yaml

This will show something similar to the following:

apiVersion: sparkoperator.k8s.io/v1alpha1
kind: SparkApplication
metadata:
  ...
spec:
  deps: {}
  driver:
    coreLimit: 200m
    cores: 0.1
    labels:
      version: 2.3.0
    memory: 512m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    labels:
      version: 2.3.0
    memory: 512m
  image: gcr.io/ynli-k8s/spark:v2.3.0
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
  mainClass: org.apache.spark.examples.SparkPi
  mode: cluster
  restartPolicy: Never
  type: Scala
status:
  appId: spark-pi-2402118027
  applicationState:
    state: COMPLETED
  completionTime: 2018-02-20T23:33:55Z
  driverInfo:
    podName: spark-pi-83ba921c85ff3f1cb04bef324f9154c9-driver
    webUIAddress: 35.192.234.248:31064
    webUIPort: 31064
    webUIServiceName: spark-pi-2402118027-ui-svc
  executorState:
    spark-pi-83ba921c85ff3f1cb04bef324f9154c9-exec-1: COMPLETED
  submissionTime: 2018-02-20T23:32:27Z

To check events for the SparkApplication object, run the following command:

$ kubectl describe sparkapplications spark-pi

This will show events similar to the following:

Events:
  Type    Reason                      Age   From            Message
  ----    ------                      ----  ----            -------
  Normal  SparkApplicationAdded       5m    spark-operator  SparkApplication spark-pi was added, enqueued it for submission
  Normal  SparkApplicationTerminated  4m    spark-operator  SparkApplication spark-pi terminated with state: COMPLETED

The Spark Operator submits the Spark Pi example to run once it receives an event indicating the SparkApplication object was added.
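
To check the output of the example, you can stream the logs of the driver Pod, whose name appears under status.driverInfo.podName in the SparkApplication status shown above:

$ kubectl logs spark-pi-83ba921c85ff3f1cb04bef324f9154c9-driver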

Using the Initializer

The Spark Operator comes with an optional initializer for customizing Spark driver and executor Pods based on the specification in SparkApplication objects, e.g., mounting user-specified ConfigMaps. The initializer works independently of the CRD controller, i.e., it can be used with or without it. It works by looking for certain custom annotations on Spark driver and executor Pods to perform its tasks. The annotations are added automatically by the CRD controller based on the application specifications in the SparkApplication objects. Alternatively, to use the initializer without the controller, the needed annotations can be added manually to the driver and executor Pods using the following Spark configuration properties when submitting your Spark applications with the spark-submit script:

--conf spark.kubernetes.driver.annotations.[AnnotationName]=value
--conf spark.kubernetes.executor.annotations.[AnnotationName]=value

Currently the following annotations are supported:

| Annotation | Value |
| ---------- | ----- |
| sparkoperator.k8s.io/sparkConfigMap | Name of the Kubernetes ConfigMap storing Spark configuration files (to which SPARK_CONF_DIR applies) |
| sparkoperator.k8s.io/hadoopConfigMap | Name of the Kubernetes ConfigMap storing Hadoop configuration files (to which HADOOP_CONF_DIR applies) |
| sparkoperator.k8s.io/configMap.[ConfigMapName] | Mount path of the ConfigMap named ConfigMapName |
| sparkoperator.k8s.io/GCPServiceAccount.[ServiceAccountSecretName] | Mount path of the secret storing GCP service account credentials (typically a JSON key file) named ServiceAccountSecretName |
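
For example, a hypothetical spark-submit invocation that sets the sparkConfigMap annotation manually, using the property format listed above (the API server address, container image, and ConfigMap name are placeholders):

$ bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --master k8s://https://<api server host>:<port> \
    --conf spark.kubernetes.container.image=<spark image> \
    --conf spark.kubernetes.driver.annotations.sparkoperator.k8s.io/sparkConfigMap=<ConfigMap name> \
    --conf spark.kubernetes.executor.annotations.sparkoperator.k8s.io/sparkConfigMap=<ConfigMap name> \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar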