
Commit 8c7fdbb

akhurana001 authored and liyinan926 committed
Operator State Management + Ingress Creation (kubeflow#291)
* SparkOperator: Prometheus Metrics Integration
* Prometheus Metric Update
* Spark Operator: Prometheus Metric Integration
* PositiveGauge rework
* remove unwanted dependencies
* Propagating ScheduledSpark App Labels
* Doc update
* Metric Description update
* fix app wait
* SparkOperator: Prometheus Metrics Integration
* Spark Operator metrics: PR Comments
* SparkOperator: Set completion time for Failed App
* Operator Metrics: PR comments
* Spark Operator: PR Comments
* Controller Update
* PR Comments
* Docs Update
* Driver State Transition Check Update
* Operator State Management
* Clean-up
* Exposing Spark Application Id in Operator
* SparkAppId updates
* Add Lyft as a user and contributor to operator
* Spark Operator Rework
* Reworking Restart-Policy
* Documentation update
* PR comments
* PR comments
* Ingress impl
* Ingress Tests + Updates
* go fmt
* PR Comments
* missing files
* AppId removal: Doc Update
* Doc update
* Delete UI/Ingress + Other minor changes
* Add PENDING_RETRY State
* PR comments
* PR comments
* Clean-up
* Update controller.go
* Add Terminal State
* Terminal State
* Spark improvements
* event type
* Events update
* Update controller.go
* Update controller.go
* PR Comments
* PR comments
* Support Best-effort Spec updates
* New State
* PR comments
* PR comments
* go fmt
* Docs update
* PR feedback
* PR Feedback
* PR comments
1 parent 4b77040 commit 8c7fdbb

33 files changed: +2284 -2200

docs/api.md

+9-7
@@ -52,10 +52,8 @@ A `SparkApplicationSpec` has the following top-level fields:
 | `Driver` | N/A | A [`DriverSpec`](#driverspec) field. |
 | `Executor` | N/A | An [`ExecutorSpec`](#executorspec) field. |
 | `Deps` | N/A | A [`Dependencies`](#dependencies) field. |
-| `RestartPolicy` | N/A | The policy regarding if and in which conditions the controller should restart a terminated application. Valid values are `Never`, `Always`, and `OnFailure`. |
+| `RestartPolicy` | N/A | The policy regarding if and in which conditions the controller should restart a terminated application. |
 | `NodeSelector` | `spark.kubernetes.node.selector.[labelKey]` | Node selector of the driver pod and executor pods, with key `labelKey` and value as the label's value. |
-| `MaxSubmissionRetries` | N/A | The maximum number of times to retry a failed submission. |
-| `SubmissionRetryInterval` | N/A | The unit of intervals in seconds between submission retries. Depending on the implementation, the actual interval between two submission retries may be a multiple of `SubmissionRetryInterval`, e.g., if linear or exponential backoff is used. |
 | `MemoryOverheadFactor` | `spark.kubernetes.memoryOverheadFactor` | This sets the Memory Overhead Factor that will allocate memory to non-JVM memory. For JVM-based jobs this value will default to 0.10, for non-JVM jobs 0.40. Value of this field will be overridden by `Spec.Driver.MemoryOverhead` and `Spec.Executor.MemoryOverhead` if they are set. |
 | `Monitoring` | N/A | This specifies how monitoring of the Spark application should be handled, e.g., how driver and executor metrics are to be exposed. Currently only exposing metrics to Prometheus is supported. |

@@ -135,12 +133,14 @@ A `SparkApplicationStatus` captures the status of a Spark application including
 | Field | Note |
 | ------------- | ------------- |
 | `AppID` | A randomly generated ID used to group all Kubernetes resources of an application. |
-| `SubmissionTime` | Time the application is submitted to run. |
+| `LastSubmissionAttemptTime` | Time of the last application submission attempt. |
 | `CompletionTime` | Time the application completes (if it does). |
 | `DriverInfo` | A [`DriverInfo`](#driverinfo) field. |
 | `AppState` | Current state of the application. |
 | `ExecutorState` | A map of executor pod names to executor state. |
-| `SubmissionRetries` | The number of submission retries for an application. |
+| `ExecutionAttempts` | The number of execution attempts made for an application. |
+| `SubmissionAttempts` | The number of submission attempts made for an application. |
 
 #### `DriverInfo`
 
@@ -149,8 +149,10 @@ A `DriverInfo` captures information about the driver pod and the Spark web UI ru
 | Field | Note |
 | ------------- | ------------- |
 | `WebUIServiceName` | Name of the service for the Spark web UI. |
-| `WebUIPort` | Port on which the Spark web UI runs. |
-| `WebUIAddress` | Address to access the web UI from outside the cluster. |
+| `WebUIPort` | Port on which the Spark web UI runs on the Node. |
+| `WebUIAddress` | Address to access the web UI from outside the cluster via the Node. |
+| `WebUIIngressName` | Name of the ingress for the Spark web UI. |
+| `WebUIIngressAddress` | Address to access the web UI via the Ingress. |
 | `PodName` | Name of the driver pod. |
 
 ### `ScheduledSparkApplicationSpec`

docs/design.md

+6-10
@@ -39,24 +39,20 @@ When a `SparkApplication` object gets updated (i.e., when the `UpdateFunc` callb
 
 The controller is also responsible for updating the status of a `SparkApplication` object with the help of the Spark pod monitor, which watches Spark pods and updates the `SparkApplicationStatus` field of corresponding `SparkApplication` objects based on the status of the pods. The Spark pod monitor watches events of creation, updates, and deletion of Spark pods, creates status update messages based on the status of the pods, and sends the messages to the controller to process. When the controller receives a status update message, it gets the corresponding `SparkApplication` object from the cache store and updates the `Status` accordingly.
 
-As described in [API Definition](api.md), the `Status` field (of type `SparkApplicationStatus`) records the overall state of the application as well as the state of each executor pod. Note that the overall state of an application is determined by the driver pod state, except when submission fails, in which case no driver pod gets launched. Particulrly, the final application state is set to the termination state of the driver pod when applicable, i.e., `COMPLETED` if the driver pod completed or `FAILED` if the driver pod failed. If the driver pod gets deleted while running, the final application state is set to `FAILED`. If submission fails, the application state is set to `FAILED_SUBMISSION`.
+As described in [API Definition](api.md), the `Status` field (of type `SparkApplicationStatus`) records the overall state of the application as well as the state of each executor pod. Note that the overall state of an application is determined by the driver pod state, except when submission fails, in which case no driver pod gets launched. Particularly, the final application state is set to the termination state of the driver pod when applicable, i.e., `COMPLETED` if the driver pod completed or `FAILED` if the driver pod failed. If the driver pod gets deleted while running, the final application state is set to `FAILED`. If submission fails, the application state is set to `FAILED_SUBMISSION`. There are two terminal states, `COMPLETED` and `FAILED`, which means that an application in either of these states will never be retried by the operator. All other states are non-terminal and, based on the state as well as the `RestartPolicy` (discussed below), can be retried.
 
 As part of preparing a submission for a newly created `SparkApplication` object, the controller parses the object and adds configuration options for adding certain annotations to the driver and executor pods of the application. The annotations are later used by the mutating admission webhook to configure the pods before they start to run. For example, if a Spark application needs a certain Kubernetes ConfigMap to be mounted into the driver and executor pods, the controller adds an annotation that specifies the name of the ConfigMap to mount. Later the mutating admission webhook sees the annotation on the pods and mounts the ConfigMap to the pods.
 
-## Handling Application Restart
+## Handling Application Restart And Failures
 
-The operator provides a configurable option through the `RestartPolicy` field of `SparkApplicationSpec` (see the [API Definition](api.md) for more details) for specifying the application restart policy. The operator determines if an application should be restarted based on its termination state and the restart policy. As discussed above, the termination state of an application is based on the termination state of the driver pod. So effectively the decision is based on the termination state of the driver pod and the restart policy. Specifically, one of the following conditions applies:
+The operator provides a configurable option through the `RestartPolicy` field of `SparkApplicationSpec` (see [Configuring Automatic Application Restart and Failure Handling](user-guide.md) for more details) for specifying the application restart policy. The operator determines if an application should be restarted based on its termination state and the restart policy. As discussed above, the termination state of an application is based on the termination state of the driver pod. So effectively the decision is based on the termination state of the driver pod and the restart policy. Specifically, one of the following conditions applies:
 
-* If the restart policy is `Never`, the application is not restarted upon terminating.
-* If the restart policy is `Always`, the application gets restarted regardless of the termination state of the application.
-* If the restart policy is `OnFailure`, the application gets restarted if and only if the application failed. Note that in case the driver pod gets deleted while running, the application is considered being failed as discussed above. In this case, the application gets restarted if the restart policy is `OnFailure`.
+* If the restart policy type is `Never`, the application is not restarted upon terminating.
+* If the restart policy type is `Always`, the application gets restarted regardless of the termination state of the application. Please note that such an application will never end up in the terminal state `COMPLETED` or `FAILED`.
+* If the restart policy type is `OnFailure`, the application gets restarted if and only if the application failed and the retry limit has not been reached. Note that if the driver pod gets deleted while running, the application is considered failed as discussed above. In this case, the application gets restarted if the restart policy is `OnFailure`.
 
 When the operator decides to restart an application, it cleans up the Kubernetes resources associated with the previous terminated run of the application and enqueues the `SparkApplication` object of the application into the internal work queue, from which it gets picked up by a worker who will handle the submission. Note that instead of restarting the driver pod, the operator simply re-submits the application and lets the submission client create a new driver pod.
 
-## Handling Retries of Failed Submissions
-
-The submission of an application may fail for various reasons. Sometimes a submission may fail due to transient errors and a retry may succeed. The operator supports retries of failed submissions through a combination of the `MaxSubmissionRetries` field of `SparkApplicationSpec` and the `SubmissionRetries` field of `SparkApplicationStatus` (see the [API Definition](api.md) for more details). When the operator decides to retry a failed submission, it simply enqueues the `SparkApplication` object of the application into the internal work queue, from which it gets picked up by a worker who will handle the submission.
-
 ## Mutating Admission Webhook
 
 The operator comes with an optional mutating admission webhook for customizing Spark driver and executor pods based on certain annotations on the pods added by the CRD controller. The annotations are set by the operator based on the application specifications. All Spark pod customization needs except for those natively supported by Spark on Kubernetes are handled by the mutating admission webhook.
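
To make the annotation hand-off described above concrete, here is a purely illustrative sketch of a driver pod carrying such an annotation. The annotation key, ConfigMap name, and image are hypothetical placeholders, not the operator's actual identifiers:

```yaml
# Purely illustrative: a driver pod stamped with a hypothetical annotation that
# the mutating admission webhook would read to mount a ConfigMap into the pod.
apiVersion: v1
kind: Pod
metadata:
  name: spark-pi-driver                                     # hypothetical pod name
  annotations:
    sparkoperator.k8s.io/configmap-to-mount: my-spark-conf  # hypothetical key and value
spec:
  containers:
    - name: spark-kubernetes-driver
      image: gcr.io/spark-operator/spark:v2.4.0             # placeholder image
```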

docs/quick-start-guide.md

+19-6
@@ -77,12 +77,18 @@ A note about `metrics-labels`: In `Prometheus`, every unique combination of key-
 Additionally, these metrics are best-effort for the current operator run and will be reset on an operator restart. Also, some of these metrics are generated by listening to pod state updates for the driver/executors,
 and deleting the pods outside the operator might lead to incorrect metric values for some of these metrics.
 
+## UI Access and Ingress
+The operator, by default, makes the Spark UI accessible by creating a service of type `NodePort` which exposes the UI via the node running the driver.
+The operator also supports creating an Ingress for the UI. This can be turned on by setting the `ingress-url-format` command-line flag. The `ingress-url-format`
+should be a template like `{{$appName}}.ingress.cluster.com` and the operator will replace `{{$appName}}` with the appropriate application name.
+
+The operator also sets both `WebUIAddress`, which uses the Node's public IP, and `WebUIIngressAddress` as part of the `DriverInfo` field of the `SparkApplication`.
+
 ## Configuration
 
 The operator is typically deployed and run using the Helm chart. However, users can still run it outside a Kubernetes cluster and make it talk to the Kubernetes API server of a cluster by specifying the path to `kubeconfig`, which can be done using the `-kubeconfig` flag.
 
-The operator uses multiple workers in the `SparkApplication` controller and and the submission runner.
-The number of worker threads to use in the three places are controlled using command-line flags `-controller-threads` and `-submission-threads`, respectively. The default values for the flags are 10 and 3, respectively.
+The operator uses multiple workers in the `SparkApplication` controller. The number of worker threads is controlled using the command-line flag `-controller-threads`, which has a default value of 10.
 
 The operator enables cache resynchronization so periodically the informers used by the operator will re-list existing objects it manages and re-trigger resource events. The resynchronization interval in seconds can be configured using the flag `-resync-interval`, with a default value of 30 seconds.
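
These options are plain command-line flags on the operator binary. Below is a minimal, hypothetical sketch of how they might be wired into the operator Deployment's container args; the flag names come from the docs above, while the image tag and ingress domain are placeholder assumptions, not values from this commit:

```yaml
# Hypothetical operator Deployment fragment; image tag and domain are placeholders.
spec:
  template:
    spec:
      containers:
        - name: sparkoperator
          image: gcr.io/spark-operator/spark-operator:latest    # placeholder tag
          args:
            - -controller-threads=10   # number of controller workers (default 10)
            - -resync-interval=30      # informer resync interval in seconds (default 30)
            - "-ingress-url-format={{$appName}}.ingress.cluster.com"  # enables UI Ingress creation
```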

@@ -150,10 +156,15 @@ spec:
   mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
   mainClass: org.apache.spark.examples.SparkPi
   mode: cluster
-  restartPolicy: Never
+  restartPolicy:
+    type: OnFailure
+    onFailureRetries: 3
+    onFailureRetryInterval: 10
+    onSubmissionFailureRetries: 5
+    onSubmissionFailureRetryInterval: 20
   type: Scala
 status:
-  appId: spark-pi-2402118027
+  sparkApplicationId: spark-5f4ba921c85ff3f1cb04bef324f9154c9
   applicationState:
     state: COMPLETED
   completionTime: 2018-02-20T23:33:55Z
@@ -162,9 +173,11 @@ status:
     webUIAddress: 35.192.234.248:31064
     webUIPort: 31064
     webUIServiceName: spark-pi-2402118027-ui-svc
+    webUIIngressName: spark-pi-ui-ingress
+    webUIIngressAddress: spark-pi.ingress.cluster.com
   executorState:
     spark-pi-83ba921c85ff3f1cb04bef324f9154c9-exec-1: COMPLETED
-  submissionTime: 2018-02-20T23:32:27Z
+  LastSubmissionAttemptTime: 2018-02-20T23:32:27Z
 ```
 
 To check events for the `SparkApplication` object, run the following command:
@@ -200,4 +213,4 @@ $ kubectl apply -f manifest/spark-operator-with-webhook.yaml
 
 This will create a Deployment named `sparkoperator` and a Service named `spark-webhook` for the webhook in namespace `spark-operator`.
 
-If the operator is installed via the Helm chart using the default settings (i.e. with webhook enabled), the above steps are all automated for you.
\ No newline at end of file
+If the operator is installed via the Helm chart using the default settings (i.e. with webhook enabled), the above steps are all automated for you.

docs/user-guide.md

+25-11
@@ -357,15 +357,28 @@ A `SparkApplication` can be updated using the `kubectl apply -f <updated YAML fi
 
 A `SparkApplication` can be checked using the `kubectl describe sparkapplications <name>` command. The output of the command shows the specification and status of the `SparkApplication` as well as events associated with it. The events communicate the overall process and errors of the `SparkApplication`.
 
-### Configuring Automatic Application Restart
-
-The operator supports automatic application restart with a configurable `RestartPolicy` using the optional field `.spec.restartPolicy`, whose valid values include `Never`, `OnFailure`, and `Always`. Upon termination of an application, the operator determines if the application is subject to restart based on its termination state and the `RestartPolicy` in the specification. If the application is subject to restart, the operator restarts it by submitting a new run of it. The old driver pod is deleted if it still exists before submitting the new run, and a new driver pod is created by the submission client so effectively the driver gets restarted.
-
-### Configuring Automatic Application Re-submission on Submission Failures
-
-The operator supports automatically retrying failed submissions. When the operator failed to submit an
-application, it determines if the application is subject to a submission retry based on if the optional field
-`.spec.maxSubmissionRetries` is set and has a positive value and the number of times it has already retried. If the maximum submission retries has not been reached, the operator retries submitting the application using a linear backoff with the interval specified by `.spec.submissionRetryInterval`. If `.spec.submissionRetryInterval` is not set, the operator retries submitting the application immediately.
+### Configuring Automatic Application Restart and Failure Handling
+
+The operator supports automatic application restart with a configurable `RestartPolicy` using the optional field
+`.spec.restartPolicy`. The following is a sample `RestartPolicy`:
+
+```yaml
+restartPolicy:
+  type: OnFailure
+  onFailureRetries: 3
+  onFailureRetryInterval: 10
+  onSubmissionFailureRetries: 5
+  onSubmissionFailureRetryInterval: 20
+```
+The valid types of restartPolicy include `Never`, `OnFailure`, and `Always`. Upon termination of an application,
+the operator determines if the application is subject to restart based on its termination state and the
+`RestartPolicy` in the specification. If the application is subject to restart, the operator restarts it by
+submitting a new run of it. For `OnFailure`, the operator further supports setting limits on the number of retries
+via the `onFailureRetries` and `onSubmissionFailureRetries` fields. Additionally, if the number of submission retries has not been reached,
+the operator retries submitting the application using a linear backoff with the interval specified by
+`onFailureRetryInterval` and `onSubmissionFailureRetryInterval`, which are required for both the `OnFailure` and `Always` `RestartPolicy` types.
+Old resources like the driver pod, UI service/ingress, etc. are deleted if they still exist before submitting the new run, and a new driver pod is created by the submission
+client so effectively the driver gets restarted.
 
 ## Running Spark Applications on a Schedule using a ScheduledSparkApplication
 
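One more variant of the restart policy described above: since the retry interval fields are required for both `OnFailure` and `Always`, an `Always` policy might look like the minimal sketch below; the interval values are illustrative:

```yaml
# Illustrative sketch of an Always policy. An application with this policy is
# restarted regardless of termination state and never reaches COMPLETED or FAILED.
restartPolicy:
  type: Always
  onFailureRetryInterval: 10            # seconds between retries after a failure
  onSubmissionFailureRetryInterval: 20  # seconds between retries after a failed submission
```
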
@@ -395,7 +408,8 @@ spec:
     cores: 1
     instances: 1
     memory: 512m
-  restartPolicy: Never
+  restartPolicy:
+    type: Never
 ```
 
 The concurrency of runs of an application is controlled by `.spec.concurrencyPolicy`, whose valid values are `Allow`, `Forbid`, and `Replace`, with `Allow` being the default. The meaning of each value is described below:
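
The per-value descriptions are elided by this diff hunk, but as a concrete illustration of the field, a minimal `ScheduledSparkApplication` fragment might look like the following; the `schedule` value is a hypothetical example:

```yaml
# Hypothetical fragment: only the fields relevant to concurrency are shown.
spec:
  schedule: "@every 10m"      # example schedule; runs the application every 10 minutes
  concurrencyPolicy: Forbid   # do not start a new run while a previous run is still active
```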
@@ -417,4 +431,4 @@ To customize the operator, you can follow the steps below:
 2. Create docker images to be used for Spark with [docker-image tool](https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images).
 3. Create a new operator image based on the above image. You need to modify the `FROM` tag in the [Dockerfile](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/Dockerfile) with your Spark image.
 4. Build and push your operator image built above.
-5. Deploy the new image by modifying the [/manifest/spark-operator.yaml](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/manifest/spark-operator.yaml) file and specfiying your operator image.
\ No newline at end of file
+5. Deploy the new image by modifying the [/manifest/spark-operator.yaml](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/manifest/spark-operator.yaml) file and specifying your operator image.

docs/who-is-using.md

+1
@@ -5,3 +5,4 @@
 | Microsoft (MileIQ) |@dharmeshkakadia| Production | AI & Analytics |
 | CERN|@mrow4a| Evaluation | Data Mining & Analytics |
 | Lightbend |@yuchaoran2011| Evaluation | Data Infrastructure & Operations |
+| Lyft |@kumare3| Evaluation | ML & Data Infrastructure |

examples/spark-pi.yaml

-1
@@ -50,4 +50,3 @@ spec:
     volumeMounts:
       - name: "test-volume"
         mountPath: "/tmp"
-  restartPolicy: Never

examples/spark-pyfiles.yaml

+6-2
@@ -28,6 +28,12 @@ spec:
   image: "gcr.io/spark-operator/spark-py:v2.4.0"
   imagePullPolicy: Always
   mainApplicationFile: local:///opt/spark/examples/src/main/python/pyfiles.py
+  restartPolicy:
+    type: OnFailure
+    onFailureRetries: 3
+    onFailureRetryInterval: 10
+    onSubmissionFailureRetries: 5
+    onSubmissionFailureRetryInterval: 20
   arguments:
     - python2.7
   deps:
@@ -46,5 +52,3 @@ spec:
     memory: "512m"
     labels:
       version: 2.4.0
-  restartPolicy: Never
-
