SparkSQL Data Source

Extension Practice in Huawei

 Jacky Li
jacky.likun@huawei.com
2015-7-20 (OSCON)

HUAWEI TECHNOLOGIES CO., LTD.


Agenda

• SparkSQL and Data Source API
• Write your own Data Source Lib
• Big Data in Huawei
• SparkSQL extension in Huawei
  • Astro: SparkSQL on HBase
  • Carbon: SparkSQL on Cube
SparkSQL Overview

SparkSQL: a module for structured data processing
• DataFrame API: write less code
• DataSource API: read less data
• Catalyst: let the optimizer do the hard work
DataFrame API: write less code

From MapReduce to Spark.

Using the MapReduce API:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Using the Spark RDD API:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Using the Spark DataFrame API:

data.groupBy("name")
    .agg(avg("age"))

From data engineers to data scientists: not everyone is familiar with functional programming.
DataSource API: connect to more data

• Easy-to-use API for loading/saving DataFrames
• Works together with the SparkSQL query optimizer to enable efficient execution,
  e.g. avoid reading unnecessary data by pushing filters down to the source

[Figure: data sources supported by DataFrames — built-in (e.g. JDBC, { JSON }) and external packages, and more …]
Source: Databricks
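As a minimal illustration of the load/save side of this API, here is a sketch using the DataFrameReader/Writer (assuming Spark 1.4+; the paths and format names are placeholders, not taken from the talk):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("datasource-demo"))
val sqlContext = new SQLContext(sc)

// Load a DataFrame through a data source; the format name selects the library.
val people = sqlContext.read.format("json").load("hdfs://.../people.json")

// Save it back out through another built-in data source.
people.write.format("parquet").save("hdfs://.../people.parquet")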
Catalyst: Query optimization

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")

[Figure: the logical plan applies the filter after joining the users and events scans; the optimized plan pushes the filter below the join, onto the events scan; with intelligent data sources the filter is pushed into the events data source scan itself]
Source: Databricks
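To see what Catalyst produces for such a query, the same join and filter can be written against the Scala DataFrame API and inspected with explain(); a minimal sketch, assuming users and events are existing DataFrames with the id, uid and date columns used above:

// Sketch only: `users` and `events` are assumed to be existing DataFrames.
val joined   = users.join(events, users("id") === events("uid"))
val filtered = joined.filter(events("date") > "2015-01-01")

// Prints the parsed, analyzed, optimized and physical plans,
// making the filter pushdown performed by Catalyst visible.
filtered.explain(true)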
Performance

[Chart: runtime of aggregating 10 million int pairs (secs), comparing Spark Python DF, Spark Scala DF, RDD Python and RDD Scala on a 0–10 s scale]
Source: Databricks

• ~2x performance improvement over the RDD API
• All languages achieve the same performance with DataFrames
Learn more

Programming Guide:
http://spark.apache.org/docs/latest/sql-programming-guide.html

Spark Meetup: http://www.meetup.com/spark-users/
Write your own Data Source Lib
Why use the Data Source API

Data Source API
• Uniform way to access data sources
• Pluggable data sources: http://spark-packages.org/
• API is still young

Use cases
• Leverage the Spark ecosystem on a legacy data source
• Improve scalability and performance of a standalone data source
• Unified access point to all data
How to write a Data Source Library

Implement 3 interfaces:
• BaseRelation: base class to extend; every relation is associated with a schema definition (schema)
• RelationProvider: factory for creating the concrete relation (createRelation())
• Scan: several scan interfaces are provided; choose one to implement
  - TableScan: supports full record scan — RDD[Row] buildScan()
  - PrunedScan: supports column pruning — RDD[Row] buildScan(cols)
  - PrunedFilteredScan: supports column pruning and filter pushdown — RDD[Row] buildScan(cols, filters)
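A minimal sketch of these three interfaces in Scala (Spark 1.x data sources API). The package name, the relation and its integer-range data are made up for illustration; a real library would read an external system instead.

package com.example.rangesource  // hypothetical package name

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// BaseRelation + TableScan: defines the schema and a full scan of the data.
class RangeRelation(from: Int, to: Int)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("value", IntegerType, nullable = false) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(from to to).map(Row(_))
}

// RelationProvider: factory that builds the relation from the OPTIONS map.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation = {
    val from = parameters.getOrElse("from", "0").toInt
    val to   = parameters.getOrElse("to", "10").toInt
    new RangeRelation(from, to)(sqlContext)
  }
}

It can then be loaded with sqlContext.read.format("com.example.rangesource").option("to", "100").load().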
Example: Apache Avro

CREATE TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "episodes.avro")

SELECT … FROM episodes WHERE …

package com.databricks.spark.avro
• AvroRelationProvider (RelationProvider): createRelation()
• AvroRelation (BaseRelation): schema
• AvroRelation (PrunedFilteredScan): RDD[Row] buildScan(cols, filters)

AvroRelation derives from PrunedFilteredScan, which supports column pruning and filter pushdown.
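For comparison, a sketch of loading the same Avro data directly through the DataFrame reader (assuming the spark-avro package is on the classpath and Spark 1.4+; the column names below are illustrative, not taken from the file):

// Load Avro data through the spark-avro data source library.
val episodes = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("episodes.avro")

// Column pruning and filter pushdown go through PrunedFilteredScan;
// "title" and "air_date" are placeholder column names.
episodes.select("title").filter(episodes("air_date") > "2010-01-01").show()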
Advanced data source: more syntax and optimization

SparkSQL Catalyst framework: Parser/Analyzer → Optimizer → Execution
(SQL or DataFrame → Parser → Catalyst Optimizer → RDD compute)

Extension points:
• Parser/Analyzer: add new DDL (e.g. bulk loading), add a new catalog, resolve relations
• Optimizer: add more query optimization rules (e.g. predicate pushdown)
• Execution: add a new RDD type to access your data source
Big Data in Huawei

Big Data in Huawei

• Network Carrier: Resource Utilization, Customer Care, Market Insight, Data Monetization
• Consumer: Campaign, Realtime Recommendation, Community Analysis
• Enterprise: Hadoop/Spark Distribution, Cloud Service
Spark in Huawei

[Architecture: stream analytics, interactive query, deep analytics and batch jobs run on Spark; the Stream Engine handles realtime events, while Astro (HBase) and Carbon (Cube) sit over HDFS for near-realtime data; a data ingest layer feeds them all]
Astro: SparkSQL on HBase
What is HBase

A big sorted table, which is split into many parts (regions) stored across an HDFS cluster.

[Diagram: three Region Servers hosting regions 1–5]
How to access HBase data

Native API: get, put, scan, Filter, coprocessor
Hadoop API: InputFormat, OutputFormat

Users have to choose:
1. Native: write a complex program
2. Hadoop: easy, but sacrifices performance

[Diagram: the HBase client talks to ZooKeeper and the HBase Master, then reads regions R1–R5 from the Region Servers]
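As a rough sketch of what the native path looks like (HBase 0.98-era client API in Scala; the table, family and qualifier names are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable, Put, Scan}
import org.apache.hadoop.hbase.util.Bytes

val conf  = HBaseConfiguration.create()
val table = new HTable(conf, "user_table")   // placeholder table name

// put: write one cell
val put = new Put(Bytes.toBytes("row1"))
put.add(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("jacky"))
table.put(put)

// get: read the row back
val result = table.get(new Get(Bytes.toBytes("row1")))
val name = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))

// scan: iterate over a rowkey range
val scanner = table.getScanner(new Scan(Bytes.toBytes("row0"), Bytes.toBytes("row9")))
val first = scanner.next()   // returns null when the range is exhausted
scanner.close()
table.close()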
Existing Solutions

Solution 1: Native application (HBase Application → HBase Native API → Region Servers)
• Pros: flexible; high performance if done right
• Cons: low productivity

Solution 2: MR application (MR Application → HBase MR API → Region Servers)
• Pros: Hadoop friendly; supports SQL through Hive/Impala
• Cons: low performance

Solution 3: Purpose-built engine (SQL Application → SQL Engine → HBase Native API → Region Servers)
• Pros: supports SQL; high performance
• Cons: only partially distributed
Introducing Astro

Astro = SQL on HBase + fully distributed + Spark ecosystem

Spark community package:
https://github.com/Huawei-Spark/Spark-SQL-on-HBase

[Stack: Spark applications (SQL, ML, Graph, Stream) → Spark → Astro → HBase]
Astro Logical Architecture

[Diagram: the Spark Driver, running Catalyst with the Astro extension, obtains region info from the HBase Master; Spark Executors hold RDD partitions P1–P5, each using an HBase client to read the corresponding region (R1–R5) from the HBase Region Servers]
Features

• Scala/Java/Python multi-language support
• SQL and DataFrame API compatible
• Query optimization: predicate pushdown, aggregation pushdown, region pruning, rowkey jumping, …
• More SQL capabilities: insert, update, bulk load, …
• Join HBase tables with other data like Parquet
• CLI tool to execute SQL commands
Data Models

CREATE TABLE table_name (col1 TYPE1, col2 TYPE2, …, PRIMARY KEY (col1, col2))
MAPPED BY (hbase_tablename, COLS=[col3=cf1.cq2, col4=cf2.cq1, col5=cf2.cq2])

[Mapping: SparkSQL columns col1–col5 map onto the HBase table — the primary key columns form the row key, and the remaining columns map to qualifiers in column families cf1 and cf2]
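A hypothetical sketch of issuing this DDL from SparkSQL, using only the placeholder names from the template above (the concrete column types and option spelling are assumptions, not taken from the Astro docs):

// Sketch only: table_name, col1–col5, cf1/cf2 and hbase_tablename are the
// placeholders used on this slide, not a real schema.
sqlContext.sql(
  """CREATE TABLE table_name (col1 STRING, col2 INT, col3 STRING, col4 INT, col5 STRING,
    |PRIMARY KEY (col1, col2))
    |MAPPED BY (hbase_tablename, COLS=[col3=cf1.cq2, col4=cf2.cq1, col5=cf2.cq2])""".stripMargin)

// The mapped table can then be queried like any other SparkSQL table;
// predicates on the key columns (col1, col2) are candidates for region pruning.
sqlContext.sql("SELECT col3, col5 FROM table_name WHERE col1 = 'x'").show()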
Query Optimization

1. Region pruning by analyzing the rowkey
   e.g. select … where key < 3 or key > 10

2. Use HBase Filters to push filters down into the scan, so data transfer is minimized
   e.g. select … where value > 100

3. Implement a custom HBase Filter to jump directly to the required full/partial key
   e.g. select … where (key1 < 3 AND key2 > 5) OR (key1 = 8 AND key2 < 4)

4. Use HBase coprocessors to push computation down and minimize data transfer
   e.g. select … sum(col) … group by col
Query Optimization Example

[Diagram: the Spark Driver gets region info from the HBase Master and prunes/skips regions; Spark Executors scan the remaining regions R1–R5 on the HBase Region Servers through coprocessors, pushing filters and partial aggregations down and jumping over unneeded rowkeys]

Techniques illustrated: aggregation push down, rowkey jumping, region pruning, filter push down.
Project Info

§ Open source project
§ Spark external package (WIP)
§ GitHub repo: https://github.com/Huawei-Spark/Spark-SQL-on-HBase
  Includes: design doc, source tree, test cases, CLI tool
§ Project leads:
  § Yan Zhou (yan.zhou.sc@huawei.com)
  § Bing Xiao (bing@huawei.com)
Demo

• Demo 1: create and query a table with an existing HBase table
• Demo 2: create and query a table with a new HBase table

Code can be found at:
https://github.com/Huawei-Spark/Spark-SQL-on-Hbase/tree/master/examples
Demo 1: create table with existing HBase table

• Create a SparkSQL table mapped to an existing HBase table
• A single column maps to the HBase row key

[Mapping: SparkSQL columns rowkey, a, b → HBase row key plus qualifiers c1 and c2 in column family f]
Demo 2: create table with new HBase table

• Create and query a SparkSQL table mapped to a new HBase table
• Multiple columns map to the HBase row key
• Bulk-load sample data into the SparkSQL table, which is stored in the HBase table

[Mapping: SparkSQL columns grade, class, subject, teacher_name, teacher_age → a composite HBase row key plus qualifiers name and age in column family teacher]
Carbon: SparkSQL on Cube
Before Spark: Standalone Cube Engine

§ In-house built storage and query engine
§ Runs on a single machine
§ Highly optimized
§ Data load engine loads data into the file system in a binary format
§ Supports in-memory and file modes
§ MDX, SQL and API based query interfaces

[Diagram: GUI and JDBC/ODBC clients query the Cube Engine (with a cache) over the file system; a load engine converts input raw data into the binary format described by a schema]
Motivations

§ Functionality
  § OLAP-style analytics over Big Data, like slicing and dicing
  § BI tool integration
§ Scalability
  § Re-use the existing high-performance engine, but utilize a distributed computing framework to scale out
§ Reliability
  § Utilize an industry-proven storage layer: HDFS

Solution: leverage Spark to make it distributed and ecosystem friendly
Carbon Logical Architecture

Language:
• Uses SparkSQL (JDBC access)
• New DDL: CREATE CUBE, LOAD INTO CUBE

Compute:
• OLAP planner on top of SparkSQL and the DataSource API
• Customized optimization rules based on Catalyst
• OLAP RDD partitions backed by Cube Processors: cube data scan, jump, aggregation, etc.

Storage:
• Cube files with a built-in multi-dimensional index, on a distributed data store (HDFS)
• Schema stored in HDFS / Hive Metastore
• Cube loader encodes input data into cube files
Data Model

Original data (star schema) → create cube and load data → cube stored in HDFS

Use SparkSQL to:
1. Create cube …
2. Load data …

[Diagram: source columns (Column1, C2, C3, C4, …) are organized by the cube metadata into dimensions (D1, D2, D3, …) and measures (M1, M2, M3, …) in distributed cube files]
Cube File Format

Parquet, ORC: store and query complex nested data
• Columnar format, supports nested data structures
• Predicate pushdown: scan only the required columns/partitions, min/max index
• Schema evolution

Cube File: store and query multi-dimensional tabular data (star schema)
• Columnar format, with native multi-dimensional key support
• More pushdown: multi-dimension filter, group by, distinct count, …
• Trades extra pre-processing time for faster queries over more organized data
Push Down Example: multi-dimension filter and agg

SELECT state, plan, terminal, sum(traffic)
FROM user_cube
WHERE plan = '4G' AND state = 'CA'
GROUP BY terminal

[Diagram: before optimization, Spark scans the cube files and performs projection, filtering and aggregation in Spark Core; after optimization, the OLAP planner pushes the filter, the projection of the 4 columns and a partial aggregation down to the Cube Processors through the DataSource API, so only a small intermediate result is aggregated in Spark]
Performance

12 billion records, 20 dimensions, 4 measures, 1.5 TB in total
• Carbon: cube file, 380 GB
• Impala: Parquet file, 336 GB

[Chart: response times for queries Q1–Q15, comparing Impala (Parquet) against Carbon (SparkOLAP) on a 0–600 scale]

Queries include: single/multi-dimension filters, single/multi-dimension group by with aggregation, distinct count
all  !  "thanks"  
questions.foreach(  answer(_)  )  
