SparkSQL Data Source

Extension Practice in Huawei

 Jacky Li
jacky.likun@huawei.com
2015-7-20 (OSCON)

HUAWEI TECHNOLOGIES CO., LTD.


Agenda

• SparkSQL and Data Source API
• Write your own Data Source Lib
• Big Data in Huawei
• SparkSQL extension in Huawei
  • Astro: SparkSQL on HBase
  • Carbon: SparkSQL on Cube
SparkSQL Overview

SparkSQL: a module for structured data processing
• DataFrame API: write less code
• DataSource API: read less data
• Catalyst: let the optimizer do the hard work
DataFrame API: write less code

From MapReduce to Spark.

Using the MapReduce API:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Using the Spark RDD API:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Using the Spark DataFrame API:

data.groupBy("name")
    .agg(avg("age"))

From data engineers to data scientists: not everyone is familiar with functional programming.
DataSource API: connect to more data

• Easy-to-use API for loading/saving DataFrames
• Works together with the SparkSQL query optimizer to enable efficient execution,
  e.g. avoid reading unnecessary data by pushing filters down to the source

[Figure: data sources supported by DataFrames — built-in (e.g. JDBC, { JSON }) and external packages, and more …]
Source: Databricks
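As a minimal illustration of the load/save side of this API, here is a sketch using the DataFrameReader/Writer (assuming Spark 1.4+; the paths and format names are placeholders, not taken from the talk):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("datasource-demo"))
val sqlContext = new SQLContext(sc)

// Load a DataFrame through a data source; the format name selects the library.
val people = sqlContext.read.format("json").load("hdfs://.../people.json")

// Save it back out through another built-in data source.
people.write.format("parquet").save("hdfs://.../people.parquet")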
Catalyst: Query optimization

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")

[Figure: the logical plan applies the filter after joining the users and events scans; the optimized plan pushes the filter below the join, onto the events scan; with intelligent data sources the filter is pushed into the events data source scan itself]
Source: Databricks
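To see what Catalyst produces for such a query, the same join and filter can be written against the Scala DataFrame API and inspected with explain(); a minimal sketch, assuming users and events are existing DataFrames with the id, uid and date columns used above:

// Sketch only: `users` and `events` are assumed to be existing DataFrames.
val joined   = users.join(events, users("id") === events("uid"))
val filtered = joined.filter(events("date") > "2015-01-01")

// Prints the parsed, analyzed, optimized and physical plans,
// making the filter pushdown performed by Catalyst visible.
filtered.explain(true)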
Performance

[Chart: runtime of aggregating 10 million int pairs (secs), comparing Spark Python DF, Spark Scala DF, RDD Python and RDD Scala on a 0–10 s scale]
Source: Databricks

• ~2x performance improvement over the RDD API
• All languages achieve the same performance with DataFrames
Learn more

Programming Guide:
http://spark.apache.org/docs/latest/sql-programming-guide.html

Spark Meetup: http://www.meetup.com/spark-users/
Write your own Data Source Lib
Why use the Data Source API

Data Source API
• Uniform way to access data sources
• Pluggable data sources: http://spark-packages.org/
• API is still young

Use cases
• Leverage the Spark ecosystem on a legacy data source
• Improve scalability and performance of a standalone data source
• Unified access point to all data
How to write a Data Source Library

Implement 3 interfaces:
• BaseRelation: base class to extend; every relation is associated with a schema definition (schema)
• RelationProvider: factory for creating the concrete relation (createRelation())
• Scan: several scan interfaces are provided; choose one to implement
  - TableScan: supports full record scan — RDD[Row] buildScan()
  - PrunedScan: supports column pruning — RDD[Row] buildScan(cols)
  - PrunedFilteredScan: supports column pruning and filter pushdown — RDD[Row] buildScan(cols, filters)
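A minimal sketch of these three interfaces in Scala (Spark 1.x data sources API). The package name, the relation and its integer-range data are made up for illustration; a real library would read an external system instead.

package com.example.rangesource  // hypothetical package name

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// BaseRelation + TableScan: defines the schema and a full scan of the data.
class RangeRelation(from: Int, to: Int)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("value", IntegerType, nullable = false) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(from to to).map(Row(_))
}

// RelationProvider: factory that builds the relation from the OPTIONS map.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation = {
    val from = parameters.getOrElse("from", "0").toInt
    val to   = parameters.getOrElse("to", "10").toInt
    new RangeRelation(from, to)(sqlContext)
  }
}

It can then be loaded with sqlContext.read.format("com.example.rangesource").option("to", "100").load().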
Example: Apache Avro

CREATE TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "episodes.avro")

SELECT … FROM episodes WHERE …

package com.databricks.spark.avro
• AvroRelationProvider (RelationProvider): createRelation()
• AvroRelation (BaseRelation): schema
• AvroRelation (PrunedFilteredScan): RDD[Row] buildScan(cols, filters)

AvroRelation derives from PrunedFilteredScan, which supports column pruning and filter pushdown.
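For comparison, a sketch of loading the same Avro data directly through the DataFrame reader (assuming the spark-avro package is on the classpath and Spark 1.4+; the column names below are illustrative, not taken from the file):

// Load Avro data through the spark-avro data source library.
val episodes = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("episodes.avro")

// Column pruning and filter pushdown go through PrunedFilteredScan;
// "title" and "air_date" are placeholder column names.
episodes.select("title").filter(episodes("air_date") > "2010-01-01").show()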
Advanced data source: more syntax and optimization

SparkSQL Catalyst framework: Parser/Analyzer → Optimizer → Execution
(SQL or DataFrame → Parser → Catalyst Optimizer → RDD compute)

Extension points:
• Parser/Analyzer: add new DDL (e.g. bulk loading), add a new catalog, resolve relations
• Optimizer: add more query optimization rules (e.g. predicate pushdown)
• Execution: add a new RDD type to access your data source
Big Data in Huawei

Big Data in Huawei

• Network Carrier: Resource Utilization, Customer Care, Market Insight, Data Monetization
• Consumer: Campaign, Realtime Recommendation, Community Analysis
• Enterprise: Hadoop/Spark Distribution, Cloud Service
Spark in Huawei

[Architecture: stream analytics, interactive query, deep analytics and batch jobs run on Spark; the Stream Engine handles realtime events, while Astro (HBase) and Carbon (Cube) sit over HDFS for near-realtime data; a data ingest layer feeds them all]
Astro: SparkSQL on HBase
What is HBase

A big sorted table, which is split into many parts (regions) stored across an HDFS cluster.

[Diagram: three Region Servers hosting regions 1–5]
How to access HBase data

Native API: get, put, scan, Filter, coprocessor
Hadoop API: InputFormat, OutputFormat

Users have to choose:
1. Native: write a complex program
2. Hadoop: easy, but sacrifices performance

[Diagram: the HBase client talks to ZooKeeper and the HBase Master, then reads regions R1–R5 from the Region Servers]
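As a rough sketch of what the native path looks like (HBase 0.98-era client API in Scala; the table, family and qualifier names are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable, Put, Scan}
import org.apache.hadoop.hbase.util.Bytes

val conf  = HBaseConfiguration.create()
val table = new HTable(conf, "user_table")   // placeholder table name

// put: write one cell
val put = new Put(Bytes.toBytes("row1"))
put.add(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("jacky"))
table.put(put)

// get: read the row back
val result = table.get(new Get(Bytes.toBytes("row1")))
val name = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))

// scan: iterate over a rowkey range
val scanner = table.getScanner(new Scan(Bytes.toBytes("row0"), Bytes.toBytes("row9")))
val first = scanner.next()   // returns null when the range is exhausted
scanner.close()
table.close()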
Existing Solutions

Solution 1: Native application (HBase Application → HBase Native API → Region Servers)
• Pros: flexible; high performance if done right
• Cons: low productivity

Solution 2: MR application (MR Application → HBase MR API → Region Servers)
• Pros: Hadoop friendly; supports SQL through Hive/Impala
• Cons: low performance

Solution 3: Purpose-built engine (SQL Application → SQL Engine → HBase Native API → Region Servers)
• Pros: supports SQL; high performance
• Cons: only partially distributed
Introducing Astro

Astro = SQL on HBase + fully distributed + Spark ecosystem

Spark community package:
https://github.com/Huawei-Spark/Spark-SQL-on-HBase

[Stack: Spark applications (SQL, ML, Graph, Stream) → Spark → Astro → HBase]
Astro Logical Architecture

[Diagram: the Spark Driver, running Catalyst with the Astro extension, obtains region info from the HBase Master; Spark Executors hold RDD partitions P1–P5, each using an HBase client to read the corresponding region (R1–R5) from the HBase Region Servers]
Features

• Scala/Java/Python multi-language support
• SQL and DataFrame API compatible
• Query optimization: predicate pushdown, aggregation pushdown, region pruning, rowkey jumping, …
• More SQL capabilities: insert, update, bulk load, …
• Join HBase tables with other data like Parquet
• CLI tool to execute SQL commands
Data Models

CREATE TABLE table_name (col1 TYPE1, col2 TYPE2, …, PRIMARY KEY (col1, col2))
MAPPED BY (hbase_tablename, COLS=[col3=cf1.cq2, col4=cf2.cq1, col5=cf2.cq2])

[Mapping: SparkSQL columns col1–col5 map onto the HBase table — the primary key columns form the row key, and the remaining columns map to qualifiers in column families cf1 and cf2]
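A hypothetical sketch of issuing this DDL from SparkSQL, using only the placeholder names from the template above (the concrete column types and option spelling are assumptions, not taken from the Astro docs):

// Sketch only: table_name, col1–col5, cf1/cf2 and hbase_tablename are the
// placeholders used on this slide, not a real schema.
sqlContext.sql(
  """CREATE TABLE table_name (col1 STRING, col2 INT, col3 STRING, col4 INT, col5 STRING,
    |PRIMARY KEY (col1, col2))
    |MAPPED BY (hbase_tablename, COLS=[col3=cf1.cq2, col4=cf2.cq1, col5=cf2.cq2])""".stripMargin)

// The mapped table can then be queried like any other SparkSQL table;
// predicates on the key columns (col1, col2) are candidates for region pruning.
sqlContext.sql("SELECT col3, col5 FROM table_name WHERE col1 = 'x'").show()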
Query Optimization

1. Region pruning by analyzing the rowkey
   e.g. select … where key < 3 or key > 10

2. Use HBase Filters to push filters down into the scan, so data transfer is minimized
   e.g. select … where value > 100

3. Implement a custom HBase Filter to jump directly to the required full/partial key
   e.g. select … where (key1 < 3 AND key2 > 5) OR (key1 = 8 AND key2 < 4)

4. Use HBase coprocessors to push computation down and minimize data transfer
   e.g. select … sum(col) … group by col
Query Optimization Example

[Diagram: the Spark Driver gets region info from the HBase Master and prunes/skips regions; Spark Executors scan the remaining regions R1–R5 on the HBase Region Servers through coprocessors, pushing filters and partial aggregations down and jumping over unneeded rowkeys]

Techniques illustrated: aggregation push down, rowkey jumping, region pruning, filter push down.
Project Info

§ Open source project
§ Spark external package (WIP)
§ GitHub repo: https://github.com/Huawei-Spark/Spark-SQL-on-HBase
  Includes: design doc, source tree, test cases, CLI tool
§ Project leads:
  § Yan Zhou (yan.zhou.sc@huawei.com)
  § Bing Xiao (bing@huawei.com)
Demo

• Demo 1: create and query a table with an existing HBase table
• Demo 2: create and query a table with a new HBase table

Code can be found at:
https://github.com/Huawei-Spark/Spark-SQL-on-Hbase/tree/master/examples
Demo 1: create table with existing HBase table

• Create a SparkSQL table mapped to an existing HBase table
• A single column maps to the HBase row key

[Mapping: SparkSQL columns rowkey, a, b → HBase row key plus qualifiers c1 and c2 in column family f]
Demo 2: create table with new HBase table

• Create and query a SparkSQL table mapped to a new HBase table
• Multiple columns map to the HBase row key
• Bulk-load sample data into the SparkSQL table, which is stored in the HBase table

[Mapping: SparkSQL columns grade, class, subject, teacher_name, teacher_age → a composite HBase row key plus qualifiers name and age in column family teacher]
Carbon: SparkSQL on Cube
Before Spark: Standalone Cube Engine

§ In-house built storage and query engine
§ Runs on a single machine
§ Highly optimized
§ Data load engine loads data into the file system in a binary format
§ Supports in-memory and file modes
§ MDX, SQL and API based query interfaces

[Diagram: GUI and JDBC/ODBC clients query the Cube Engine (with a cache) over the file system; a load engine converts input raw data into the binary format described by a schema]
Motivations

§ Functionality
  § OLAP-style analytics over Big Data, like slicing and dicing
  § BI tool integration
§ Scalability
  § Re-use the existing high-performance engine, but utilize a distributed computing framework to scale out
§ Reliability
  § Utilize an industry-proven storage layer: HDFS

Solution: leverage Spark to make it distributed and ecosystem friendly
Carbon Logical Architecture

Language:
• Uses SparkSQL (JDBC access)
• New DDL: CREATE CUBE, LOAD INTO CUBE

Compute:
• OLAP planner on top of SparkSQL and the DataSource API
• Customized optimization rules based on Catalyst
• OLAP RDD partitions backed by Cube Processors: cube data scan, jump, aggregation, etc.

Storage:
• Cube files with a built-in multi-dimensional index, on a distributed data store (HDFS)
• Schema stored in HDFS / Hive Metastore
• Cube loader encodes input data into cube files
Data Model

Original data (star schema) → create cube and load data → cube stored in HDFS

Use SparkSQL to:
1. Create cube …
2. Load data …

[Diagram: source columns (Column1, C2, C3, C4, …) are organized by the cube metadata into dimensions (D1, D2, D3, …) and measures (M1, M2, M3, …) in distributed cube files]
Cube File Format

Parquet, ORC: store and query complex nested data
• Columnar format, supports nested data structures
• Predicate pushdown: scan only the required columns/partitions, min/max index
• Schema evolution

Cube File: store and query multi-dimensional tabular data (star schema)
• Columnar format, with native multi-dimensional key support
• More pushdown: multi-dimension filter, group by, distinct count, …
• Trades extra pre-processing time for faster queries over more organized data
Push Down Example: multi-dimension filter and agg

SELECT state, plan, terminal, sum(traffic)
FROM user_cube
WHERE plan = '4G' AND state = 'CA'
GROUP BY terminal

[Diagram: before optimization, Spark scans the cube files and performs projection, filtering and aggregation in Spark Core; after optimization, the OLAP planner pushes the filter, the projection of the 4 columns and a partial aggregation down to the Cube Processors through the DataSource API, so only a small intermediate result is aggregated in Spark]
Performance

12 billion records, 20 dimensions, 4 measures, 1.5 TB in total
• Carbon: cube file, 380 GB
• Impala: Parquet file, 336 GB

[Chart: response times for queries Q1–Q15, comparing Impala (Parquet) against Carbon (SparkOLAP) on a 0–600 scale]

Queries include: single/multi-dimension filters, single/multi-dimension group by with aggregation, distinct count
all  !  "thanks"  
questions.foreach(  answer(_)  )  
