Commit

add spark example
adbreind committed Nov 26, 2020
1 parent 5638d08 commit 4b9793e
Showing 15 changed files with 1 addition and 95 deletions.
96 changes: 1 addition & 95 deletions 02-Extract.ipynb
@@ -1,95 +1 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Acquiring data (extraction)\n",
"\n",
"<img src='images/flow-extract.png' width=800>\n",
"\n",
"> Note: in some organizations, there is a data discovery system, like https://www.amundsen.io/amundsen/ upstream from this step. We're not covering that area due to scope constraints\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Goal: use SQL to efficiently retrieve data for further work\n",
"\n",
"### Legacy Tools\n",
"\n",
"Mostly: Apache Hive\n",
"\n",
"### Current Tools\n",
"\n",
"* SparkSQL\n",
"* Presto\n",
"* *Hive Metastore*\n",
"\n",
"### Rising/Future Tools\n",
"\n",
"* Kartothek, Intake\n",
"* BlazingSQL\n",
"* Dask-SQL\n",
"\n",
"*There are more non-SQL options, but support for SQL is a requirement in most large organizations, so we're sticking with SQL-capable tools for now*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pyspark"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spark = pyspark.sql.SparkSession.builder.appName(\"demo\").getOrCreate()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spark.sql(\"SELECT * FROM parquet.`data/california`\").show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat_minor": 4,
 "nbformat": 4,
 "cells": [
  {
   "cell_type": "markdown",
   "source": "# Acquiring data (extraction)\n\n<img src='images/flow-extract.png' width=800>\n\n> Note: in some organizations, there is a data discovery system, like https://www.amundsen.io/amundsen/ upstream from this step. We're not covering that area due to scope constraints\n",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": "## Goal: use SQL to efficiently retrieve data for further work\n\n### Legacy Tools\n\nMostly: Apache Hive\n\n### Current Tools\n\n* SparkSQL\n* Presto\n* *Hive Metastore*\n\n### Rising/Future Tools\n\n* Kartothek, Intake\n* BlazingSQL\n* Dask-SQL\n\n*There are more non-SQL options, but support for SQL is a requirement in most large organizations, so we're sticking with SQL-capable tools for now*\n",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "source": "import pyspark",
   "metadata": {"trusted": true},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": "spark = pyspark.sql.SparkSession.builder.appName(\"demo\").getOrCreate()",
   "metadata": {"trusted": true},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": "spark.sql(\"SELECT * FROM parquet.`data/california`\").show()",
   "metadata": {"trusted": true},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": "query = \"\"\"\nSELECT origin, mean(delay) as delay, count(1) \nFROM parquet.`data/california` \nGROUP BY origin\nHAVING count(1) > 500\nORDER BY delay DESC\n\"\"\"\nspark.sql(query).show()",
   "metadata": {"trusted": true},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": "query = \"\"\"\nSELECT *\nFROM parquet.`data/california` \nWHERE origin in (\n    SELECT origin \n    FROM parquet.`data/california` \n    GROUP BY origin \n    HAVING count(1) > 500\n)\n\"\"\"\nspark.sql(query).write.mode('overwrite').option('header', 'true').csv('data/refined_flights/')",
   "metadata": {"trusted": true},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": "! head data/refined_flights/*.csv",
   "metadata": {"trusted": true},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": "",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  }
 ]
}
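The cells this commit adds aggregate delays per origin and keep only high-traffic airports via `GROUP BY ... HAVING count(1) > 500`. The same SQL shape can be illustrated without a Spark cluster using the stdlib `sqlite3` module; this is a toy stand-in (SQLite's `AVG` in place of Spark SQL's `mean`, and the 500-flight threshold lowered to 2 so a tiny sample exercises the clause):

```python
import sqlite3

# Toy stand-in for the flights data in data/california.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, delay REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("SFO", 10), ("SFO", 20), ("SFO", 30), ("LAX", 5)],
)

# Same shape as the notebook's query: aggregate per origin, keep only
# origins over the count threshold, sort by mean delay descending.
query = """
SELECT origin, AVG(delay) AS delay, COUNT(1) AS n
FROM flights
GROUP BY origin
HAVING COUNT(1) > 2
ORDER BY delay DESC
"""
result = conn.execute(query).fetchall()
print(result)  # [('SFO', 20.0, 3)]
```

`HAVING` filters after aggregation (unlike `WHERE`), which is why the notebook's second query wraps the same condition in a subquery to recover the individual rows for the busy airports.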
Binary file removed data/california/._SUCCESS.crc
Binary file removed data/california/._committed_2595799468439767928.crc
Binary file removed data/california/._started_2595799468439767928.crc
Empty file removed data/california/SUCCESS.crc

0 comments on commit 4b9793e
