Apache Spark Tutorial, With Deep-Dives On SparkR and Data Sources API
Jacky Li
jacky.likun@huawei.com
2015-7-20 (OSCON)
SparkSQL Overview
SparkSQL: a module for structured data processing
• DataFrame API: write less code
• DataSource API: read less data
• Catalyst: let the optimizer do the hard work
DataFrame API: write less code
From MapReduce to Spark: the verbose MapReduce API (e.g. public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { … }) shrinks to a handful of lines with the Spark RDD API, and to even fewer with the DataFrame API.
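A minimal sketch of the contrast, assuming a Spark 1.x SQLContext; the input path and column name are illustrative, not from the slides:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // RDD API: a few lines instead of a full Mapper/Reducer class pair
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // DataFrame API: declarative groupBy/count, optimized by Catalyst
    val wordsDF = sc.textFile("input.txt").flatMap(_.split("\\s+")).toDF("word")
    val countsDF = wordsDF.groupBy("word").count()

    countsDF.show()
  }
}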
DataSource API: connect to more data
[Diagram: built-in and external data sources — JDBC, { JSON }, and more … (source: Databricks)]
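As a hedged illustration of this uniform entry point (a sketch assuming a Spark 1.4+ SQLContext; the file names and JDBC URL are made up):

// Built-in JSON source: the format name selects the data source implementation
val people = sqlContext.read.format("json").load("people.json")

// JDBC source: the same read path, parameterized through options
val orders = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://dbhost/sales",   // illustrative connection string
  "dbtable" -> "orders")).load()

// External packages plug in the same way, addressed by their package name
val avro = sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")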
Catalyst: Query optimization
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")
[Diagram: logical plan before and after optimization — the filter on events is pushed below the join, turning "filter over join(scan users, scan events)" into "join(scan users, filter over scan events)".]
[Chart: runtime of aggregating 10 million int pairs (secs) for Spark Python DF, Spark Scala DF, RDD Python and RDD Scala; source: Databricks]
• ~2X performance improvement over RDDs
• All languages achieve the same performance!
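A hedged Scala sketch of how to observe this optimization yourself, assuming two DataFrames named users and events with the columns used in the snippet above:

// Build the same query and ask Catalyst for its plans.
val joined = users.join(events, users("id") === events("uid"))
val filtered = joined.filter(events("date") > "2015-01-01")

// explain(true) prints the parsed, analyzed, optimized and physical plans;
// in the optimized plan the date predicate sits next to the scan of events,
// below the join, instead of on top of the join result.
filtered.explain(true)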
Learn more
Programming Guide: http://spark.apache.org/docs/latest/sql-programming-guide.html
Spark Meetup: http://www.meetup.com/spark-users/
Write your own Data Source Lib
Why use the Data Source API
Use cases:
• leverage the Spark ecosystem on top of a legacy data source
• improve the scalability and performance of a standalone data source
• a unified access point to all data
How to write a Data Source Library
Implement 3 interfaces (a minimal example follows below):
• BaseRelation: the base class to extend; every relation is associated with a schema definition
• RelationProvider: the factory for creating the concrete relation — createRelation()
• Scan: several TableScan variants are provided; choose one to implement
  • TableScan: supports full record scans — buildScan(): RDD[Row]
  • PrunedScan: supports column pruning — buildScan(cols): RDD[Row]
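Here is a minimal, hedged sketch of a toy data source; the package, class and option names are invented for illustration, while the traits and signatures come from the Spark 1.x org.apache.spark.sql.sources package:

package example.datasource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// A relation exposing a fixed two-column schema and a full scan.
class DummyRelation(val sqlContext: SQLContext, rows: Int)
  extends BaseRelation with TableScan {

  // The schema definition associated with this relation
  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)))

  // TableScan: produce every record as a Row
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to rows).map(i => Row(i, s"name_$i"))
}

// The factory SparkSQL instantiates when the source is referenced via USING / format()
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext, parameters.getOrElse("rows", "10").toInt)
}

With this library on the classpath, the source can be used as sqlContext.read.format("example.datasource").option("rows", "100").load(), or from SQL via CREATE TEMPORARY TABLE … USING example.datasource.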
Example: Apache Avro
CREATE TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "episodes.avro")

SELECT episodes FROM … WHERE …
In package com.databricks.spark.avro:
• AvroRelation — extends BaseRelation with PrunedFilteredScan: provides the schema and buildScan(cols, filters): RDD[Row]
• AvroRelationProvider — the RelationProvider: provides createRelation()
AvroRelation derives from PrunedFilteredScan, which supports both column pruning and filter pushdown.
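A hedged sketch of what such a scan method can look like (simplified and not the actual spark-avro code; readAvro is a hypothetical helper):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{Filter, GreaterThan, PrunedFilteredScan}

trait ExampleAvroScan extends PrunedFilteredScan {
  // requiredColumns: only these columns have to be materialized (column pruning).
  // filters: predicates SparkSQL offers for pushdown; anything not handled here
  // is re-evaluated by Spark afterwards, so handling them is purely an optimization.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    filters.foreach {
      case GreaterThan(attr, value) => // e.g. translate into a predicate on the file reader
      case _                        => // unhandled filters are simply ignored
    }
    readAvro(requiredColumns) // hypothetical helper that reads only the pruned columns
  }

  protected def readAvro(columns: Array[String]): RDD[Row]
}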
Advanced data source: more syntax and optimization
• Add new DDL (e.g. bulk loading)
• Add more query optimization rules (e.g. predicate pushdown)
• Add a new RDD type to access your data source
• Add a new catalog
Big Data in Huawei
Spark in Huawei
[Architecture diagram: Stream Analytics, Interactive Query, Deep Analytics and Batch Job Analytics workloads running on a Stream Engine, Astro (HBase) and Carbon (Cube on HDFS), on top of a Data Ingest layer.]
Astro: SparkSQL on HBase
What is HBase
A big sorted table, which is split into many parts (regions) stored in an HDFS cluster.
[Diagram: regions R1–R5 distributed across the cluster.]
Existing Solutions
• Solution 1: Native Application — talks to HBase directly through the native API
• Solution 2: MR Application — accesses HBase through the MR API
• Solution 3: Purpose-built engine — a SQL engine layered on the HBase native API
All of them ultimately run against the HBase region servers.
Astro: https://github.com/Huawei-Spark/Spark-SQL-on-HBase
Astro Logical Architecture
[Diagram: the Spark driver runs Catalyst with the Astro extension and obtains region info from the HBase master; RDD partitions P1–P5 map one-to-one onto HBase regions R1–R5.]
Features
• Scala/Java/Python multi-language support
• SQL and DataFrame API compatible
• Query optimization: predicate pushdown, aggregation pushdown, region pruning, rowkey jumping, …
• More SQL capabilities: insert, update, bulk load, …
• Join HBase tables with other data such as Parquet
• CLI tool to execute SQL commands
Data Models
CREATE TABLE table_name (col1 TYPE1, col2 TYPE2, …, PRIMARY KEY (col1, col2))
MAPPED BY (hbase_tablename, COLS=[col3=cf1.cq2, col4=cf2.cq1, col5=cf2.cq2])
[Diagram: the underlying HBase table — a row key plus column family cf1 (Qualifier1–3) and column family cf2 (Qualifier1–2); the primary-key columns map to the row key and the remaining columns map to family.qualifier pairs.]
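A hedged sketch of issuing such a mapping from Scala; the exact context class Astro exposes, and the key column names below, are illustrative (the teacher/name/age family layout follows the Demo 2 slide):

// Assuming an Astro-aware SQLContext is available as `sqlContext`
// (the Huawei-Spark repo provides an HBase-specific SQLContext; its name may vary by version).
sqlContext.sql(
  """CREATE TABLE teacher (grade INT, class INT, subject STRING,
    |                      teacher_name STRING, teacher_age INT,
    |                      PRIMARY KEY (grade, class, subject))
    |MAPPED BY (hbase_teacher, COLS=[teacher_name=teacher.name, teacher_age=teacher.age])
  """.stripMargin)

// Once mapped, it is queried like any other SparkSQL table:
sqlContext.sql("SELECT teacher_name, teacher_age FROM teacher WHERE grade = 5").show()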
Query Optimization
1. Region pruning by analyzing the rowkey predicate
   e.g. select … where key < 3 or key > 10
2. Use HBase filters to push predicates down into the scan, so data transfer is minimized (see the sketch after this list)
   e.g. select … where value > 100
3. Implement an HBase custom filter to jump directly to the required full/partial key
   e.g. select … where (key1 < 3 AND key2 > 5) OR (key1 = 8 AND key2 < 4)
4. Use HBase coprocessors to push computation down and minimize data transfer
   e.g. select … sum(col) … group by col
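A hedged, HBase-client-level sketch of what optimizations 1 and 2 amount to; the family and qualifier names are made up, and Astro generates the equivalent internally rather than exposing this code:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.util.Bytes

// 1. Region pruning: restrict the scan to the key range implied by the predicate,
//    so regions whose key ranges cannot match are never contacted.
val scan = new Scan()
scan.setStartRow(Bytes.toBytes("row010"))
scan.setStopRow(Bytes.toBytes("row999"))

// 2. Filter pushdown: evaluate "value > 100" inside the region server,
//    so only matching rows travel over the network.
scan.setFilter(new SingleColumnValueFilter(
  Bytes.toBytes("cf1"), Bytes.toBytes("value"),
  CompareOp.GREATER, Bytes.toBytes(100L)))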
Query Optimization Example
[Diagram: the Spark driver obtains region info from the HBase master, scans only the regions that can match the query, and skips the rest.]
Demo
• Demo1: Create and query table with existing HBase table
• Demo2: Create and query table with new HBase table
Demo1: create table on an existing HBase table
• Create a SparkSQL table mapped to an existing HBase table
• A single column maps to the HBase rowkey
[Diagram: HBase table with a row key and one column family f containing qualifiers c1 and c2.]
Demo2: create table on a new HBase table
• Create and query a SparkSQL table mapped to a new HBase table
• Multiple columns map to the HBase rowkey
• Bulk-load sample data into the SparkSQL table, which is stored in the HBase table
[Diagram: HBase table with a row key and a column family teacher containing qualifiers name and age.]
Carbon: SparkSQL on Cube
Before Spark: Standalone Cube Engine
Data Model
[Diagram: 1. create cube from the original data, 2. load data — the cube, holding dimensions D1, D2, D3, … and measures M1, M2, M3, …, is stored in HDFS and queried with SparkSQL.]
Cube File Format
Parquet, ORC: store and query complex nested structured data
• Columnar format, supports nested data structures
• Predicate pushdown: scan only the required columns and partitions, min/max index
• Schema evolution
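A hedged illustration of these properties using the built-in Parquet source (a sketch; the path is made up and the column names follow the user_cube query on the next slide):

import org.apache.spark.sql.functions.sum
import sqlContext.implicits._   // for the $"col" syntax

// Columnar storage: persist a DataFrame as Parquet
df.write.parquet("/data/user_cube")

// Column pruning + predicate pushdown: only the referenced columns are read, and
// the filters can be checked against per-row-group min/max statistics, so
// non-matching row groups and partitions are skipped entirely.
val result = sqlContext.read.parquet("/data/user_cube")
  .filter($"plan" === "4G" && $"state" === "CA")
  .groupBy("terminal")
  .agg(sum("traffic"))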
Push Down Example: Multi-dimension filter and agg
SELECT state, plan, terminal, sum(traffic) FROM user_cube WHERE plan = '4G' AND state = 'CA' GROUP BY terminal
[Chart: query runtimes for Q1–Q15 comparing Impala, Carbon and the Spark OLAP Planner; the queries include one/many-dimension filters, one/many-dimension group-by with aggregation, and distinct count.]
all ! "thanks"
questions.foreach( answer(_) )