From c07b040de00ea422f59284a7978a9ef04fa1e580 Mon Sep 17 00:00:00 2001
From: Yinan Li
Date: Mon, 20 Aug 2018 15:11:11 -0700
Subject: [PATCH] Added documentation for #255

---
 README.md          |  2 +-
 docs/api.md        | 25 +++++++++++++++++++++++++
 docs/user-guide.md | 27 ++++++++++++++++++++++-----
 3 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 1eb28e843..155049708 100644
--- a/README.md
+++ b/README.md
@@ -61,7 +61,7 @@ Spark Operator currently supports the following list of features:
 * Supports automatic retries of failed submissions with optional linear back-off.
 * Supports mounting local Hadoop configuration as a Kubernetes ConfigMap automatically via `sparkctl`.
 * Supports automatically staging local application dependencies to Google Cloud Storage (GCS) via `sparkctl`.
-* Supports collecting and exporting application metrics to Prometheus.
+* Supports collecting and exporting application-level metrics and driver/executor metrics to Prometheus.
 
 ## Motivations
 
diff --git a/docs/api.md b/docs/api.md
index 04ddb8120..11473f549 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -20,6 +20,8 @@ ScheduledSparkApplication
     |__ ExecutorSpec
         |__ SparkPodSpec
     |__ Dependencies
+    |__ MonitoringSpec
+        |__ PrometheusSpec
 |__ SparkApplicationStatus
     |__ DriverInfo
 ```
@@ -55,6 +57,8 @@ A `SparkApplicationSpec` has the following top-level fields:
 | `MaxSubmissionRetries` | N/A | The maximum number of times to retry a failed submission. |
 | `SubmissionRetryInterval` | N/A | The unit of intervals in seconds between submission retries. Depending on the implementation, the actual interval between two submission retries may be a multiple of `SubmissionRetryInterval`, e.g., if linear or exponential backoff is used. |
 | `MemoryOverheadFactor` | `spark.kubernetes.memoryOverheadFactor` | This sets the Memory Overhead Factor that will allocate memory to non-JVM memory. For JVM-based jobs this value will default to 0.10, for non-JVM jobs 0.40. Value of this field will be overridden by `Spec.Driver.MemoryOverhead` and `Spec.Executor.MemoryOverhead` if they are set. |
+| `Monitoring` | N/A | This specifies how monitoring of the Spark application should be handled, e.g., how driver and executor metrics are to be exposed. Currently only exposing metrics to Prometheus is supported. |
+
 
 #### `DriverSpec`
 
@@ -102,6 +106,27 @@ A `Dependencies` specifies the various types of dependencies of a Spark applicat
 | Field | Spark configuration property or `spark-submit` option | Note |
 | ------------- | ------------- | ------------- |
 | `Jars` | `spark.jars` or `--jars` | List of jars the application depends on. |
 | `Files` | `spark.files` or `--files` | List of files the application depends on. |
 
+#### `MonitoringSpec`
+
+A `MonitoringSpec` specifies how monitoring of the Spark application should be handled, e.g., how driver and executor metrics are to be exposed. Currently only exposing metrics to Prometheus is supported.
+
+| Field | Spark configuration property or `spark-submit` option | Note |
+| ------------- | ------------- | ------------- |
+| `ExposeDriverMetrics` | N/A | This specifies if driver metrics should be exposed. Defaults to `false`. |
+| `ExposeExecutorMetrics` | N/A | This specifies if executor metrics should be exposed. Defaults to `false`. |
+| `MetricsProperties` | N/A | If specified, this contains the content of a custom `metrics.properties` that configures the Spark metrics system. Otherwise, the content of `spark-docker/conf/metrics.properties` will be used. |
+| `PrometheusSpec` | N/A | If specified, this configures how metrics are exposed to Prometheus. |
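+
+For illustration, the corresponding `monitoring` section of a `SparkApplication` specification might look like the following in YAML, using the field names as they appear in specifications (see the [user guide](user-guide.md)). The values, including the `metrics.properties` line, are examples only; see [examples/spark-pi-prometheus.yaml](../examples/spark-pi-prometheus.yaml) for a complete example.
+
+```yaml
+spec:
+  monitoring:
+    exposeDriverMetrics: true
+    exposeExecutorMetrics: true
+    # Optional: inline content of a custom metrics.properties for the Spark metric system.
+    metricsProperties: |
+      *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
+    prometheus:
+      jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.3.1.jar"
+```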
+
+#### `PrometheusSpec`
+
+A `PrometheusSpec` configures how metrics are exposed to Prometheus.
+
+| Field | Spark configuration property or `spark-submit` option | Note |
+| ------------- | ------------- | ------------- |
+| `JmxExporterJar` | N/A | This specifies the path to the [Prometheus JMX exporter](https://github.com/prometheus/jmx_exporter) jar. |
+| `Port` | N/A | If specified, the port to which the Java agent of the Prometheus JMX exporter binds. Defaults to `8090` if not specified. |
+| `Configuration` | N/A | If specified, this contains the content of a custom Prometheus configuration used by the Prometheus JMX exporter. Otherwise, the content of `spark-docker/conf/prometheus.yaml` will be used. |
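+
+As a further illustration, the `prometheus` section itself might look like the following in YAML (field names as referenced in the [user guide](user-guide.md)). The `configuration` value shown is a minimal, illustrative JMX exporter configuration rather than the default `spark-docker/conf/prometheus.yaml`.
+
+```yaml
+spec:
+  monitoring:
+    prometheus:
+      jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.3.1.jar"
+      # Optional: port the JMX exporter Java agent binds to; defaults to 8090.
+      port: 8090
+      # Optional: inline content of a custom JMX exporter configuration.
+      configuration: |
+        lowercaseOutputName: true
+        rules:
+        - pattern: ".*"
+```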
+

 ### `SparkApplicationStatus`

 A `SparkApplicationStatus` captures the status of a Spark application including the state of every executors.
diff --git a/docs/user-guide.md b/docs/user-guide.md
index 95c1ed6ab..afbe76fb3 100644
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -365,7 +365,7 @@ Note that the mutating admission webhook is needed to use this feature. Please r
 ### Python Support

 Python support can be enabled by setting `.spec.mainApplicationFile` with path to your python application.
-Optionaly, `.spec.pythonVersion` parameter can be used to set the major Python version of the docker image used
+Optionally, the `.spec.pythonVersion` field can be used to set the major Python version of the docker image used
 to run the driver and executor containers. Below is an example showing part of a `SparkApplication` specification:

 ```yaml
@@ -398,6 +398,27 @@ spark.sparkContext.addPyFile(dep_file_path)
 Note that Python binding for PySpark will be available in Apache Spark 2.4, and currently requires building a custom Docker image from the Spark master branch.

+### Monitoring
+
+The Spark Operator supports using the Spark metric system to expose metrics to a variety of sinks. In particular, it is able to automatically configure the metric system to expose metrics to [Prometheus](https://prometheus.io/). The field `.spec.monitoring` specifies how application monitoring is handled, in particular how metrics are to be reported. The metric system is configured through the configuration file `metrics.properties`, which gets its content from the field `.spec.monitoring.metricsProperties`. The content of [metrics.properties](../spark-docker/conf/metrics.properties) will be used by default if `.spec.monitoring.metricsProperties` is not specified. You can enable or disable reporting of driver and executor metrics using the fields `.spec.monitoring.exposeDriverMetrics` and `.spec.monitoring.exposeExecutorMetrics`, respectively.
+
+Further, the field `.spec.monitoring.prometheus` specifies how metrics are exposed to Prometheus using the [Prometheus JMX exporter](https://github.com/prometheus/jmx_exporter). When `.spec.monitoring.prometheus` is specified, the operator automatically configures the JMX exporter to run as a Java agent. The only required field of `.spec.monitoring.prometheus` is `jmxExporterJar`, which specifies the path to the Prometheus JMX exporter Java agent jar in the container. If you use the image `gcr.io/spark-operator/spark:v2.3.1-gcs-prometheus`, the jar is located at `/prometheus/jmx_prometheus_javaagent-0.3.1.jar`. The field `.spec.monitoring.prometheus.port` specifies the port the JMX exporter Java agent binds to and defaults to `8090` if not specified. The field `.spec.monitoring.prometheus.configuration` specifies the content of the configuration to be used with the JMX exporter. The content of [prometheus.yaml](../spark-docker/conf/prometheus.yaml) will be used by default if `.spec.monitoring.prometheus.configuration` is not specified.
+
+Below is an example that shows how to configure the metric system to expose metrics to Prometheus using the Prometheus JMX exporter. Note that the JMX exporter Java agent jar is listed as a dependency and, in Spark 2.3.x, will be downloaded to the location `.spec.dep.jarsDownloadDir` points to, which defaults to `/var/spark-data/spark-jars`. This will change in Spark 2.4, where dependencies are downloaded to the local working directory instead. A complete example can be found in [examples/spark-pi-prometheus.yaml](../examples/spark-pi-prometheus.yaml).
+
+```yaml
+spec:
+  dep:
+    jars:
+    - http://central.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.3.1/jmx_prometheus_javaagent-0.3.1.jar
+  monitoring:
+    exposeDriverMetrics: true
+    prometheus:
+      jmxExporterJar: "/var/spark-data/spark-jars/jmx_prometheus_javaagent-0.3.1.jar"
+```
+
+The operator automatically adds annotations such as `prometheus.io/scrape=true` to the driver and/or executor pods so that the metrics exposed on the pods can be scraped by the Prometheus server in the same cluster.
+
 ## Working with SparkApplications

 ### Creating a New SparkApplication

@@ -520,11 +541,7 @@ a restart policy of `Never` as the example above shows.
 To customize the Spark Operator, you can follow the steps below:

 1. Compile Spark distribution with Kubernetes support as per [Spark documentation](https://spark.apache.org/docs/latest/building-spark.html#building-with-kubernetes-support).
-
 2. Create docker images to be used for Spark with [docker-image tool](https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images).
-
 3. Create a new Spark Operator image based on the above image. You need to modify the `FROM` tag in the [Dockerfile](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/Dockerfile) with your Spark image.
-
 4. Build and push your Spark Operator image built above.
-
 5. Deploy the new image by modifying the [/manifest/spark-operator.yaml](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/manifest/spark-operator.yaml) file and specfiying your Spark Operator image.
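+
+As an illustrative sketch for step 5, the change is typically limited to the container image specified in the manifest. The container name and image value below are placeholders and may differ from the actual manifest:
+
+```yaml
+# Illustrative fragment of the pod template in manifest/spark-operator.yaml.
+    containers:
+    - name: sparkoperator                              # placeholder; keep the name used in the actual manifest
+      image: gcr.io/my-project/spark-operator:my-tag   # the image built and pushed in step 4
+```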