docs/api.md (+9, -7)
@@ -52,10 +52,8 @@ A `SparkApplicationSpec` has the following top-level fields:
|`Driver`| N/A | A [`DriverSpec`](#driverspec) field. |
|`Executor`| N/A | An [`ExecutorSpec`](#executorspec) field. |
|`Deps`| N/A | A [`Dependencies`](#dependencies) field. |
-|`RestartPolicy`| N/A | The policy regarding if and in which conditions the controller should restart a terminated application. Valid values are `Never`, `Always`, and `OnFailure`. |
+|`RestartPolicy`| N/A | The policy regarding if and in which conditions the controller should restart a terminated application. |
|`NodeSelector`|`spark.kubernetes.node.selector.[labelKey]`| Node selector of the driver pod and executor pods, with key `labelKey` and value as the label's value. |
-|`MaxSubmissionRetries`| N/A | The maximum number of times to retry a failed submission. |
-|`SubmissionRetryInterval`| N/A | The unit of intervals in seconds between submission retries. Depending on the implementation, the actual interval between two submission retries may be a multiple of `SubmissionRetryInterval`, e.g., if linear or exponential backoff is used. |
|`MemoryOverheadFactor`|`spark.kubernetes.memoryOverheadFactor`| This sets the Memory Overhead Factor that will allocate memory to non-JVM memory. For JVM-based jobs this value will default to 0.10, for non-JVM jobs 0.40. Value of this field will be overridden by `Spec.Driver.MemoryOverhead` and `Spec.Executor.MemoryOverhead` if they are set. |
|`Monitoring`| N/A | This specifies how monitoring of the Spark application should be handled, e.g., how driver and executor metrics are to be exposed. Currently only exposing metrics to Prometheus is supported. |
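For illustration, a minimal sketch of how a few of these fields might look in a `SparkApplication` manifest, assuming the usual lowerCamelCase serialization of the Go field names above (the label key/value and factor shown are hypothetical):

```yaml
spec:
  restartPolicy:
    type: OnFailure            # valid types: Never, Always, OnFailure
  nodeSelector:
    disktype: ssd              # hypothetical label key and value
  memoryOverheadFactor: "0.1"  # overridden by driver/executor MemoryOverhead if set
```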
@@ -135,12 +133,14 @@ A `SparkApplicationStatus` captures the status of a Spark application including
| Field | Note |
| ------------- | ------------- |
|`AppID`| A randomly generated ID used to group all Kubernetes resources of an application. |
-|`SubmissionTime`| Time the application is submitted to run. |
+|`LastSubmissionAttemptTime`| Time of the last application submission attempt. |
|`CompletionTime`| Time the application completes (if it does). |
|`DriverInfo`| A [`DriverInfo`](#driverinfo) field. |
|`AppState`| Current state of the application. |
|`ExecutorState`| A map of executor pod names to executor state. |
-|`SubmissionRetries`| The number of submission retries for an application. |
+|`ExecutionAttempts`| The number of execution attempts made for an application. |
+|`SubmissionAttempts`| The number of submission attempts made for an application. |
#### `DriverInfo`
@@ -149,8 +149,10 @@ A `DriverInfo` captures information about the driver pod and the Spark web UI ru
| Field | Note |
| ------------- | ------------- |
|`WebUIServiceName`| Name of the service for the Spark web UI. |
-|`WebUIPort`| Port on which the Spark web UI runs. |
-|`WebUIAddress`| Address to access the web UI from outside the cluster. |
+|`WebUIPort`| Port on which the Spark web UI is exposed on the node. |
+|`WebUIAddress`| Address to access the web UI from outside the cluster via the node. |
+|`WebUIIngressName`| Name of the ingress for the Spark web UI. |
+|`WebUIIngressAddress`| Address to access the web UI via the ingress. |
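As a sketch, the status of a running application with these fields populated might look like the following, again assuming lowerCamelCase serialization; all names and values below are hypothetical:

```yaml
status:
  appId: spark-pi-2557090600          # hypothetical application ID
  lastSubmissionAttemptTime: "2018-11-30T14:48:00Z"
  executionAttempts: 1
  submissionAttempts: 1
  appState:
    state: RUNNING
  executorState:
    spark-pi-exec-1: RUNNING          # hypothetical executor pod name
  driverInfo:
    webUIServiceName: spark-pi-ui-svc
    webUIPort: 31064                  # NodePort
    webUIAddress: 35.192.10.20:31064  # node public IP
    webUIIngressName: spark-pi-ui-ingress
    webUIIngressAddress: spark-pi.ingress.cluster.com
```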
docs/design.md (+6, -10)
@@ -39,24 +39,20 @@ When a `SparkApplication` object gets updated (i.e., when the `UpdateFunc` callb
The controller is also responsible for updating the status of a `SparkApplication` object with the help of the Spark pod monitor, which watches Spark pods and updates the `SparkApplicationStatus` field of the corresponding `SparkApplication` objects based on the status of the pods. The Spark pod monitor watches events of creation, updates, and deletion of Spark pods, creates status update messages based on the status of the pods, and sends the messages to the controller to process. When the controller receives a status update message, it gets the corresponding `SparkApplication` object from the cache store and updates the `Status` accordingly.

-As described in [API Definition](api.md), the `Status` field (of type `SparkApplicationStatus`) records the overall state of the application as well as the state of each executor pod. Note that the overall state of an application is determined by the driver pod state, except when submission fails, in which case no driver pod gets launched. Particularly, the final application state is set to the termination state of the driver pod when applicable, i.e., `COMPLETED` if the driver pod completed or `FAILED` if the driver pod failed. If the driver pod gets deleted while running, the final application state is set to `FAILED`. If submission fails, the application state is set to `FAILED_SUBMISSION`.
+As described in [API Definition](api.md), the `Status` field (of type `SparkApplicationStatus`) records the overall state of the application as well as the state of each executor pod. Note that the overall state of an application is determined by the driver pod state, except when submission fails, in which case no driver pod gets launched. Particularly, the final application state is set to the termination state of the driver pod when applicable, i.e., `COMPLETED` if the driver pod completed or `FAILED` if the driver pod failed. If the driver pod gets deleted while running, the final application state is set to `FAILED`. If submission fails, the application state is set to `FAILED_SUBMISSION`. There are two terminal states, `COMPLETED` and `FAILED`; an application in either of these states will never be retried by the operator. All other states are non-terminal, and an application in such a state may be retried based on its state and the `RestartPolicy` (discussed below).

As part of preparing a submission for a newly created `SparkApplication` object, the controller parses the object and adds configuration options for adding certain annotations to the driver and executor pods of the application. The annotations are later used by the mutating admission webhook to configure the pods before they start to run. For example, if a Spark application needs a certain Kubernetes ConfigMap to be mounted into the driver and executor pods, the controller adds an annotation that specifies the name of the ConfigMap to mount. Later the mutating admission webhook sees the annotation on the pods and mounts the ConfigMap to the pods.
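For instance, the ConfigMap case above could surface on the pods as an annotation along these lines (the key and value here are hypothetical; the actual annotation keys are internal to the operator):

```yaml
metadata:
  annotations:
    sparkoperator.k8s.io/configmap-to-mount: my-app-config  # hypothetical key and value
```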

-## Handling Application Restart
+## Handling Application Restart and Failures

-The operator provides a configurable option through the `RestartPolicy` field of `SparkApplicationSpec` (see the [API Definition](api.md) for more details) for specifying the application restart policy. The operator determines if an application should be restarted based on its termination state and the restart policy. As discussed above, the termination state of an application is based on the termination state of the driver pod. So effectively the decision is based on the termination state of the driver pod and the restart policy. Specifically, one of the following conditions applies:
+The operator provides a configurable option through the `RestartPolicy` field of `SparkApplicationSpec` (see [Configuring Automatic Application Restart and Failure Handling](user-guide.md) for more details) for specifying the application restart policy. The operator determines if an application should be restarted based on its termination state and the restart policy. As discussed above, the termination state of an application is based on the termination state of the driver pod. So effectively the decision is based on the termination state of the driver pod and the restart policy. Specifically, one of the following conditions applies:

-* If the restart policy is `Never`, the application is not restarted upon terminating.
-* If the restart policy is `Always`, the application gets restarted regardless of the termination state of the application.
-* If the restart policy is `OnFailure`, the application gets restarted if and only if the application failed. Note that in case the driver pod gets deleted while running, the application is considered being failed as discussed above. In this case, the application gets restarted if the restart policy is `OnFailure`.
+* If the restart policy type is `Never`, the application is not restarted upon terminating.
+* If the restart policy type is `Always`, the application gets restarted regardless of its termination state. Note that such an application will never end up in the terminal state `COMPLETED` or `FAILED`.
+* If the restart policy type is `OnFailure`, the application gets restarted if and only if it failed and the retry limit has not been reached. Note that if the driver pod gets deleted while running, the application is considered failed, as discussed above, and is therefore restarted under this policy (see the sketch below).
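For illustration, a minimal `restartPolicy` sketch that bounds retries, using the fields documented in the user guide:

```yaml
restartPolicy:
  type: OnFailure
  onFailureRetries: 3          # give up after three failed runs
  onFailureRetryInterval: 10   # seconds between restart attempts
```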
When the operator decides to restart an application, it cleans up the Kubernetes resources associated with the previous terminated run of the application and enqueues the `SparkApplication` object of the application into the internal work queue, from which it gets picked up by a worker who will handle the submission. Note that instead of restarting the driver pod, the operator simply re-submits the application and lets the submission client create a new driver pod.

-## Handling Retries of Failed Submissions
-
-The submission of an application may fail for various reasons. Sometimes a submission may fail due to transient errors and a retry may succeed. The operator supports retries of failed submissions through a combination of the `MaxSubmissionRetries` field of `SparkApplicationSpec` and the `SubmissionRetries` field of `SparkApplicationStatus` (see the [API Definition](api.md) for more details). When the operator decides to retry a failed submission, it simply enqueues the `SparkApplication` object of the application into the internal work queue, from which it gets picked up by a worker who will handle the submission.
-
## Mutating Admission Webhook
The operator comes with an optional mutating admission webhook for customizing Spark driver and executor pods based on certain annotations on the pods added by the CRD controller. The annotations are set by the operator based on the application specifications. All Spark pod customization needs, except for those natively supported by Spark on Kubernetes, are handled by the mutating admission webhook.
docs/quick-start-guide.md (+19, -6)
@@ -77,12 +77,18 @@ A note about `metrics-labels`: In `Prometheus`, every unique combination of key-
Additionally, these metrics are best-effort for the current operator run and will be reset on an operator restart. Also, some of these metrics are generated by listening to pod state updates for the driver/executors,
and deleting the pods outside the operator might lead to incorrect metric values for some of these metrics.

+## UI Access and Ingress
+
+The operator, by default, makes the Spark UI accessible by creating a service of type `NodePort` which exposes the UI via the node running the driver.
+The operator also supports creating an Ingress for the UI. This can be turned on by setting the `ingress-url-format` command-line flag. The `ingress-url-format`
+should be a template like `{{$appName}}.ingress.cluster.com` and the operator will replace the `{{$appName}}` with the appropriate appName.
+
+The operator also sets both `WebUIAddress`, which uses the node's public IP, and `WebUIIngressAddress` as part of the `DriverInfo` field of the `SparkApplication`.
+
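As a sketch, the flag can be set on the operator's container args in its Deployment; the container name and domain below are hypothetical:

```yaml
# Excerpt from the operator Deployment's pod template
containers:
- name: sparkoperator
  args:
  - -ingress-url-format={{$appName}}.ingress.mycompany.com
```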
## Configuration
The operator is typically deployed and run using the Helm chart. However, users can still run it outside a Kubernetes cluster and make it talk to the Kubernetes API server of a cluster by specifying the path to a `kubeconfig` file, which can be done using the `-kubeconfig` flag.

-The operator uses multiple workers in the `SparkApplication` controller and the submission runner.
-The number of worker threads to use in the three places are controlled using command-line flags `-controller-threads` and `-submission-threads`, respectively. The default values for the flags are 10 and 3, respectively.
+The operator uses multiple workers in the `SparkApplication` controller. The number of worker threads is controlled using the command-line flag `-controller-threads`, which has a default value of 10.

The operator enables cache resynchronization, so the informers used by the operator periodically re-list existing objects they manage and re-trigger resource events. The resynchronization interval in seconds can be configured using the flag `-resync-interval`, with a default value of 30 seconds.
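For example, assuming the same Deployment-args style as the ingress sketch earlier, these flags might be tuned like so:

```yaml
args:
- -controller-threads=20   # default is 10
- -resync-interval=60      # default is 30 (seconds)
```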
docs/user-guide.md (+25, -11)
@@ -357,15 +357,28 @@ A `SparkApplication` can be updated using the `kubectl apply -f <updated YAML fi
A `SparkApplication` can be checked using the `kubectl describe sparkapplications <name>` command. The output of the command shows the specification and status of the `SparkApplication` as well as events associated with it. The events communicate the overall process and errors of the `SparkApplication`.

-### Configuring Automatic Application Restart
-
-The operator supports automatic application restart with a configurable `RestartPolicy` using the optional field `.spec.restartPolicy`, whose valid values include `Never`, `OnFailure`, and `Always`. Upon termination of an application, the operator determines if the application is subject to restart based on its termination state and the `RestartPolicy` in the specification. If the application is subject to restart, the operator restarts it by submitting a new run of it. The old driver pod is deleted if it still exists before submitting the new run, and a new driver pod is created by the submission client so effectively the driver gets restarted.
-
-### Configuring Automatic Application Re-submission on Submission Failures
-
-The operator supports automatically retrying failed submissions. When the operator failed to submit an
-application, it determines if the application is subject to a submission retry based on if the optional field
-`.spec.maxSubmissionRetries` is set and has a positive value and the number of times it has already retried. If the maximum submission retries has not been reached, the operator retries submitting the application using a linear backoff with the interval specified by `.spec.submissionRetryInterval`. If `.spec.submissionRetryInterval` is not set, the operator retries submitting the application immediately.
+### Configuring Automatic Application Restart and Failure Handling
+
+The operator supports automatic application restart with a configurable `RestartPolicy` using the optional field
+`.spec.restartPolicy`. The following is an example `RestartPolicy`:
+
+```yaml
+restartPolicy:
+  type: OnFailure
+  onFailureRetries: 3
+  onFailureRetryInterval: 10
+  onSubmissionFailureRetries: 5
+  onSubmissionFailureRetryInterval: 20
+```
+The valid types of `restartPolicy` include `Never`, `OnFailure`, and `Always`. Upon termination of an application,
+the operator determines if the application is subject to restart based on its termination state and the
+`RestartPolicy` in the specification. If the application is subject to restart, the operator restarts it by
+submitting a new run of it. For `OnFailure`, the operator further supports setting limits on the number of retries
+via the `onFailureRetries` and `onSubmissionFailureRetries` fields. Additionally, if the submission retry limit has not been reached,
+the operator retries submitting the application using a linear backoff with the interval specified by
+`onFailureRetryInterval` and `onSubmissionFailureRetryInterval`, which are required for both the `OnFailure` and `Always` `RestartPolicy` types.
+Old resources like the driver pod, UI service/ingress, etc. are deleted if they still exist before submitting the new run, and a new driver pod is created by the submission
+client so effectively the driver gets restarted.

## Running Spark Applications on a Schedule using a ScheduledSparkApplication
@@ -395,7 +408,8 @@ spec:
      cores: 1
      instances: 1
      memory: 512m
-    restartPolicy: Never
+    restartPolicy:
+      type: Never
```

The concurrency of runs of an application is controlled by `.spec.concurrencyPolicy`, whose valid values are `Allow`, `Forbid`, and `Replace`, with `Allow` being the default. The meaning of each value is described below:
@@ -417,4 +431,4 @@ To customize the operator, you can follow the steps below:
2. Create docker images to be used for Spark with [docker-image tool](https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images).
3. Create a new operator image based on the above image. You need to modify the `FROM` tag in the [Dockerfile](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/Dockerfile) with your Spark image.
4. Build and push your operator image built above.
-5. Deploy the new image by modifying the [/manifest/spark-operator.yaml](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/manifest/spark-operator.yaml) file and specfiying your operator image.
+5. Deploy the new image by modifying the [/manifest/spark-operator.yaml](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/manifest/spark-operator.yaml) file and specifying your operator image.
0 commit comments