Class 1 – Data Management in a Big Data Environment
- big data: more data than a single computer can hold or process
Vs of Big Data (the classic 3 Vs are Volume, Velocity, Variety; Veracity is often added as a 4th)
Veracity (truthfulness/accuracy of the data)
Volume
- cloud storage solutions
- partitioning/sharding
- parallel processing
Velocity
- identify KPI, pick the important pieces out of the flowing data
Variety
- putting an image in a relational database does not make sense; want to store it somewhere it can
still be processed
- graph database, LinkedIn, stores network of relationships as it is
- JSON file, the most common format for transferring data on the Internet -> document databases
Two Approaches for Big Data
- 8 GB csv file, but 4 GB RAM
- Scale OUT solution: borrow a few laptops and connect them together, so that collectively they
behave like one laptop with much more RAM, with the computation spread across the machines;
this is the idea behind cloud computing
- Scale UP solution: buy a new laptop with more RAM so the analysis fits; a waste of money
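A third way around the 8 GB file / 4 GB RAM problem is to stream the file: process one row at a time so the whole CSV is never in memory. A minimal sketch using only the standard library; the column names and data are made up, and the in-memory `StringIO` stands in for a large file on disk:

```python
import csv
import io

# Stand-in for a big CSV on disk; in practice this would be open("sales.csv").
data = io.StringIO("region,amount\neast,10\nwest,5\neast,7\n")

totals = {}
reader = csv.DictReader(data)
for row in reader:  # only one row is in memory at a time, never the whole file
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])

print(totals)  # {'east': 17, 'west': 5}
```

This is the same streaming idea that scale-out systems apply across many machines at once.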
(Hadoop and Spark will not be in final exam)
Intro to Cloud Computing
- only need a modest local machine; connect to a cloud server for the RAM and compute
- costs of client-server model
- setup, upgrade, licensing, security, internet, connectivity, operational cost
- buying Infrastructure as a Service (IaaS), instead of taking the pain of installing everything on
your own computer
Class 2 – Setting Up RDS and Connection to Cloud
- uncheck enable storage autoscaling
- uncheck enable enhanced monitoring
- host: Endpoint
- port: 5432
- username: postgres
- password: postgres
- edit inbound rules
- add rule, PostgreSQL, Anywhere IPv4
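Those console settings map directly onto a connection call. A sketch of how they would be used with psycopg2; the endpoint string below is a placeholder (the real one comes from the RDS console), and the `dbname` of `postgres` is an assumed default:

```python
# Connection parameters from the RDS console (endpoint is a placeholder).
params = {
    "host": "mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",  # the "Endpoint" field
    "port": 5432,
    "user": "postgres",
    "password": "postgres",
    "dbname": "postgres",  # assumed default database name
}

# Equivalent libpq-style DSN string built from the same values:
dsn = " ".join(f"{k}={v}" for k, v in params.items())
print(dsn)

# With a live instance and the inbound rule open:
#   import psycopg2
#   conn = psycopg2.connect(**params)
```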
Class 3: Disk Management and Indexes
- yml file, create an environment full of needed packages
- Bugs: %%sql, py2neo, issues with updates
How does Postgres store/search
- stores data in pages, like the pages of a book; each page holds a couple of tuples
- without an index Postgres has to go through all the pages; an index lets it go straight to the
page(s) that mention what we are searching for
- compare a Parallel Seq Scan (going through all the pages) vs. an Index Only Scan, and the
execution times
Use of explain analyze
Experiment query without index
- 1256 milliseconds
Experiment query with index (syntax to create index)
- 0.055 milliseconds, saves a lot of time
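In Postgres the pattern is `CREATE INDEX idx_name ON table (column);` followed by `EXPLAIN ANALYZE SELECT ...` to compare plans. The same before/after effect can be shown self-contained with sqlite3 from the standard library (sqlite's plan wording differs from Postgres's, and the table here is made up, but the scan-vs-index distinction is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN is sqlite's analogue of Postgres's EXPLAIN
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

query = "SELECT id FROM trips WHERE fare = 250.0"
before = plan(query)  # full scan of all rows, e.g. 'SCAN trips'
conn.execute("CREATE INDEX idx_fare ON trips (fare)")
after = plan(query)   # e.g. 'SEARCH trips USING INDEX idx_fare (fare=?)'
print(before)
print(after)
```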
When do you need to care about indexing as a data scientist/analyst?
- want to do everything you can to make the query run fast
- data retrieval
- time series application
Why don’t I index everything?
- takes up disk space, could be a lot in real life
- only index when it is needed
Disk Space it takes (syntax for checking that)
• Experiment: a broad search query (need for different indexes; the default btree does not always work)
Different kinds of index
• Btree
- based on the sorted values in that column
- if searching for value 15, start at the topmost node; if the target is less than the node's value
go left, if greater go right
- whole branches get pulled out and thrown away, because the value being searched for cannot
possibly be in them
• Syntax
• What the operator class is about: needed when the column is text rather than a numeric type.
• Disk Space it takes
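The branch-throwing idea can be sketched in plain Python: at each node, one comparison with the node's key rules out an entire subtree. This is a toy binary search tree, not Postgres's actual on-disk B-tree (which holds many keys per page), but the pruning logic is the same:

```python
# Toy binary search tree: a node is (key, left_subtree, right_subtree) or None.
def insert(node, key):
    if node is None:
        return (key, None, None)
    k, left, right = node
    if key < k:
        return (k, insert(left, key), right)
    return (k, left, insert(right, key))

def search(node, key, visited=0):
    """Return (found, nodes_visited); each step discards one whole subtree."""
    if node is None:
        return False, visited
    k, left, right = node
    if key == k:
        return True, visited + 1
    return search(left if key < k else right, key, visited + 1)

root = None
for v in [8, 4, 12, 2, 6, 10, 15]:
    root = insert(root, v)

found, steps = search(root, 15)
print(found, steps)  # True 3 -- only 3 of the 7 nodes visited
```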
• Hash
- a hash function maps each value to a number; different strings get different hash values
- each hash value is assigned to a bucket
- when searching for a string, all the other buckets can be skipped completely, because the value
can only be in its own bucket
• When to use
- HASH ONLY SUPPORTS EQUALITY
- Bitmap Index Scan
- no performance improvement for pattern matching
- best for columns with few unique values (e.g. a traffic-light column), provided queries on that
index always use equality
• Experiment index on equality (Syntax for that)
• Experiment index on the pattern (not useful)
• Disk Space it takes
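The bucket idea in a plain-Python sketch: hash the value, take it modulo the number of buckets, and only that one bucket ever has to be searched. Postgres's real hash index is more elaborate; the data here is made up:

```python
NUM_BUCKETS = 4

def bucket_of(value):
    # The same value always lands in the same bucket -> equality lookups only.
    return hash(value) % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]
for row_id, colour in enumerate(["red", "green", "amber", "red", "green"]):
    buckets[bucket_of(colour)].append((row_id, colour))

# Searching for "red": look in exactly one bucket, skip the other three.
# The final equality check handles hash collisions inside the bucket.
matches = [rid for rid, c in buckets[bucket_of("red")] if c == "red"]
print(matches)  # [0, 3]
```

This also shows why hash indexes cannot help with `<`, `>`, or pattern matching: the bucket number says nothing about ordering.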
• Gin
• When to use
- full-text search
- convert the text to a tsvector (supports different languages); a tsquery picks out the word that
is needed
- can completely ignore the other rows or pages
• Without index - search for word new
• full-text search functionality (syntax for that )
• Idea on how gin index works
• Query without index
• Query with index ( index syntax)
• Query with index and after view (view syntax)
• Query on a search pattern - doesn't speed things up.
• That is when to use trigrams
• Use Trigrams index ( syntax for that ) (look for patterns)
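Both ideas can be sketched in plain Python: a GIN index is essentially an inverted index (word -> rows containing it), and a trigram index does the same with 3-character fragments so `%pattern%` searches can also skip rows. A toy version, not Postgres internals, with made-up documents:

```python
from collections import defaultdict

docs = {1: "breaking news today", 2: "old story", 3: "new year party"}

# GIN-style inverted index: word -> set of doc ids containing it.
word_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        word_index[word].add(doc_id)

print(sorted(word_index["new"]))  # [3] -- full-text search on whole words

# Trigram index: every 3-character slice -> doc ids, for %pattern% searches.
trigram_index = defaultdict(set)
for doc_id, text in docs.items():
    for i in range(len(text) - 2):
        trigram_index[text[i:i + 3]].add(doc_id)

print(sorted(trigram_index["new"]))  # [1, 3] -- now 'news' matches too
```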
• Gist
• Brin (is not important for finals) (for time stamp)
Class 4: (de)Normalization & Data Warehousing
What is a Data Warehouse?
- why do normalization?
- want to build a database in a standard fashion with little to no redundancy
- people usually don't go for 4NF or 5NF; those are more theoretical
- avoid anomalies by not having redundancy
- split data into small tables
- data analysts need to bring in different sources of files and databases to do analysis
- analysts don't work on a database of many small tables, but on a data warehouse full of the
information they need for their analysis
- need to do a lot of joins (very expensive)
- ETL, extracting, transforming, and loading the information
- OLTP, online transaction processing
- OLAP, online analytical processing
- databases are designed for transaction processing; data warehouses are designed for analytical
processing
- star schema/snowflake schema
- a fact table associated with several dimension tables
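A star schema in miniature: one fact table of measurements, keyed to dimension tables that describe them. A plain-Python sketch with made-up tables, answering a typical OLAP question (revenue per city) by joining fact rows to a dimension:

```python
# Dimension tables: descriptive attributes, keyed by id.
dim_product = {1: {"name": "coffee"}, 2: {"name": "tea"}}
dim_store = {10: {"city": "Halifax"}, 11: {"city": "Toronto"}}

# Fact table: one row per sale, holding a measure plus dimension keys.
fact_sales = [
    {"product_id": 1, "store_id": 10, "amount": 4.50},
    {"product_id": 2, "store_id": 10, "amount": 3.00},
    {"product_id": 1, "store_id": 11, "amount": 4.50},
]

# Revenue per city: follow each fact row's key into the store dimension.
revenue = {}
for sale in fact_sales:
    city = dim_store[sale["store_id"]]["city"]
    revenue[city] = revenue.get(city, 0) + sale["amount"]

print(revenue)  # {'Halifax': 7.5, 'Toronto': 4.5}
```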
WHY Data Warehouse?
HOW do we build a data warehouse?
- Twitter schema
- dump in assignment 1
- understanding relationships between the tweets and changes in price
- build a mini warehouse first
- in simple terms, creating a materialized view
- the entire query runs in the back end and the result is stored on disk as a physical table;
queries against it come back quickly and indexing can be applied (which a simple view cannot do)
- materialized view’s downside: if information changes next day, view is not up-to-date
- so, make sure to update materialized view everyday
- or to use view
- go for a materialized view if customers are not concerned about getting the most up-to-date
information
- go for a view if speed/performance is not a worry
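The SQL forms are `CREATE VIEW v AS ...` (recomputed on every access) versus `CREATE MATERIALIZED VIEW mv AS ...` plus `REFRESH MATERIALIZED VIEW mv` (stored result, refreshed on demand). The trade-off in a plain-Python sketch:

```python
table = [1, 2, 3]  # base data

def view():
    # Plain view: recomputed from base data on every access -> fresh but slower.
    return sum(table)

materialized = sum(table)  # materialized view: result computed once and stored

table.append(4)            # base data changes the next day

print(view())              # 10 -- always up to date
print(materialized)        # 6  -- stale until refreshed
materialized = sum(table)  # REFRESH MATERIALIZED VIEW
print(materialized)        # 10
```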
Differences
- DW vs RDBMS
- ELT: mainly for data lakes; no need to know it for now
Class 5: Beyond RDBMS
Where are we with RDBMS?
- most companies rely on a hybrid solution (Postgres and Cassandra)
Advantages & Disadvantages of RDBMS
- pros: ACID
- cons: scaling is generally difficult
What is NoSQL?
- Not Only SQL
- schema-free, no foreign key constraints, can enter a column value that is not defined
- BASE (Basically Available, Soft state, Eventually consistent): you won't be in a situation
where you enter a query and get no answer at all
- in plain English: a NoSQL store is made by connecting multiple servers, so e.g. comments may be
registered on different servers
- no guarantee the data is consistent at every moment, but it will eventually become consistent
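Eventual consistency in a toy sketch: a write lands on one server, a read from another server can briefly return stale data, and a background sync eventually brings every replica to the same value. This is a drastic simplification of real replication, purely to illustrate the timeline:

```python
# Two replicas of the same record, as in a multi-server NoSQL store.
replicas = [{"likes": 0}, {"likes": 0}]

replicas[0]["likes"] = 1           # write goes to server 0

stale_read = replicas[1]["likes"]  # read from server 1: still the old value
print(stale_read)  # 0

# Background replication syncs the replicas ("eventually consistent").
replicas[1]["likes"] = replicas[0]["likes"]
print(replicas[0] == replicas[1])  # True
```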
Different types of NoSQL?
- key value store
- document store
- column based
- graph based
Introduction to graph database
- relation network
- GQL, the standard graph query language, is not mature enough yet
- use Cypher (CQL) with Neo4j: you write out the pattern you are thinking of
Understanding the graph model
- nodes: e.g. a person & an animal
- labels: categorize nodes (Person, Animal)
- name is a property key, Gittu is its property value; percentage is another property
- LIKE, HATE: relationship types
- a node can hold multiple key-value pairs
- any property can be added; whatever is added belongs to that node
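In Cypher this model would be written as e.g. `CREATE (:Person {name: 'Gittu'})-[:LIKES]->(:Animal {name: 'Dog'})`. A plain-Python sketch of the same structure; the Animal node, its `percentage` value, and the node ids are made up for illustration:

```python
# Nodes: each has a label and a dict of property key/value pairs.
nodes = {
    "n1": {"label": "Person", "props": {"name": "Gittu"}},
    "n2": {"label": "Animal", "props": {"name": "Dog", "percentage": 90}},
}

# Relationships: (start node, relationship type, end node) -- the edges.
relationships = [("n1", "LIKES", "n2")]

# Tiny "query": what does Gittu like?
liked = [nodes[end]["props"]["name"]
         for start, rel, end in relationships
         if nodes[start]["props"]["name"] == "Gittu" and rel == "LIKES"]
print(liked)  # ['Dog']
```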
Setting up and connecting to graph database
Class 6: CQL
- don’t use multiple match clauses if you don’t need subqueries
Class 8 MQL 1
- every record is stored as a separate document in MongoDB ("Documents")
- may contain nested fields
- “Schema”
- projection is more like selecting the column from the table
- only select the values I am interested in
- the _id field will be there by default; suppress it with '_id': 0
- sort = order by, 1 is ascending, -1 is descending
- count_documents ≈ COUNT(*), counts all the matching documents
- find
- projection
- sort (1, ascending, -1, descending)
- limit
- count_documents
- skip
- distinct
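With pymongo these chain as e.g. `db.coll.find(filter, projection).sort('age', -1).limit(2)`. A sketch of what each stage does, applied to an in-memory list of documents so no MongoDB server is needed (collection contents are made up):

```python
docs = [
    {"_id": 1, "name": "alice", "age": 30},
    {"_id": 2, "name": "bob", "age": 25},
    {"_id": 3, "name": "carol", "age": 35},
]

# projection {'_id': 0, 'name': 1, 'age': 1}: keep only the listed fields.
projected = [{"name": d["name"], "age": d["age"]} for d in docs]

# sort('age', -1): -1 is descending, 1 would be ascending.
ordered = sorted(projected, key=lambda d: d["age"], reverse=True)

# limit(2): first two documents only.
top_two = ordered[:2]
print(top_two)  # carol (35) first, then alice (30)
```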
Comparison
$gt
$gte
$lt
$lte
$ne
$eq
$in
$nin
Logical
$and (the implicit default when combining conditions - be careful)
$or (must be written explicitly); $in can often simplify an $or
$exists
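These operators are just keys inside the query document, e.g. `{'age': {'$gte': 18, '$lt': 65}}` or `{'$or': [...]}`. A toy evaluator for a subset of them shows how such a document is matched against records; pymongo itself sends the document to the server, so this is only an illustration:

```python
# Operator table: each MQL comparison operator maps to a plain comparison.
OPS = {"$gt": lambda a, b: a > b, "$gte": lambda a, b: a >= b,
       "$lt": lambda a, b: a < b, "$lte": lambda a, b: a <= b,
       "$ne": lambda a, b: a != b, "$eq": lambda a, b: a == b,
       "$in": lambda a, b: a in b, "$nin": lambda a, b: a not in b}

def matches(doc, query):
    """Evaluate a small subset of MQL against one document."""
    for field, cond in query.items():
        if field == "$or":               # explicit $or over sub-queries
            if not any(matches(doc, q) for q in cond):
                return False
        elif isinstance(cond, dict):     # operator doc, e.g. {'$gte': 18}
            if not all(OPS[op](doc[field], val) for op, val in cond.items()):
                return False
        elif doc.get(field) != cond:     # implicit equality / implicit $and
            return False
    return True

people = [{"name": "alice", "age": 30}, {"name": "bob", "age": 17}]
adults = [p["name"] for p in people if matches(p, {"age": {"$gte": 18}})]
print(adults)  # ['alice']
print(matches(people[1], {"$or": [{"age": {"$lt": 18}}, {"name": "zed"}]}))  # True
```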
Don’t have to know about insert_many
[{},{}], an array of documents; how to deal with subdocuments
$all
Full-text search
Don’t look into the last indexes (optional)
Won't need to write a big SQL query
Make sure to connect to Neo4j Browser
Connect to MongoDB from Jupyter
Put the schema in a pdf file