Class 1 – Data Management in a Big Data Environment
- big data: more data than a single computer can hold or process
Vs of Big Data (the classic 3 Vs are Volume, Velocity, Variety; Veracity is often added as a 4th)
Veracity (truthfulness/accuracy of the data)
Volume
- cloud storage solutions
- partitioning/sharding
- parallel processing
Velocity
- identify KPI, pick the important pieces out of the flowing data
Variety
- putting an image in a relational database does not make sense; want to store it somewhere it can
still be processed
- graph database, LinkedIn, stores network of relationships as it is
- JSON file, the most common format for transferring data on the Internet -> document databases
Two Approaches for Big Data
- 8 GB csv file, but 4 GB RAM
- Scale OUT solution: borrow a few laptops and connect them together, so that collectively they
behave like one laptop with much more RAM, with the computation spread across the machines;
this is the idea behind cloud computing
- Scale UP solution: buy a new laptop with more RAM so the analysis fits; a waste of money
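A third way around the 8 GB file / 4 GB RAM problem is to stream the file: process one row at a time so the whole CSV is never in memory. A minimal sketch using only the standard library; the column names and data are made up, and the in-memory `StringIO` stands in for a large file on disk:

```python
import csv
import io

# Stand-in for a big CSV on disk; in practice this would be open("sales.csv").
data = io.StringIO("region,amount\neast,10\nwest,5\neast,7\n")

totals = {}
reader = csv.DictReader(data)
for row in reader:  # only one row is in memory at a time, never the whole file
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])

print(totals)  # {'east': 17, 'west': 5}
```

This is the same streaming idea that scale-out systems apply across many machines at once.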
(Hadoop and Spark will not be in final exam)
Intro to Cloud Computing
- only need a modest local machine; connect to a cloud server for the RAM and compute
- costs of client-server model
- setup, upgrade, licensing, security, internet, connectivity, operational cost
- buying Infrastructure as a Service (IaaS), instead of taking the pain of installing everything on
your own computer
Class 2 – Setting Up RDS and Connection to Cloud
- uncheck enable storage autoscaling
- uncheck enable enhanced monitoring
- host: Endpoint
- port: 5432
- username: postgres
- password: postgres
- edit inbound rules
- add rule, PostgreSQL, Anywhere IPv4
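Those console settings map directly onto a connection call. A sketch of how they would be used with psycopg2; the endpoint string below is a placeholder (the real one comes from the RDS console), and the `dbname` of `postgres` is an assumed default:

```python
# Connection parameters from the RDS console (endpoint is a placeholder).
params = {
    "host": "mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",  # the "Endpoint" field
    "port": 5432,
    "user": "postgres",
    "password": "postgres",
    "dbname": "postgres",  # assumed default database name
}

# Equivalent libpq-style DSN string built from the same values:
dsn = " ".join(f"{k}={v}" for k, v in params.items())
print(dsn)

# With a live instance and the inbound rule open:
#   import psycopg2
#   conn = psycopg2.connect(**params)
```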
Class 3: Disk Management and Indexes
- yml file, create an environment full of needed packages
- Bugs: %%sql, py2neo, issues with updates
How does Postgres store/search
- stores data in pages, like the pages of a book; each page holds a couple of tuples
- without an index Postgres has to go through all the pages; an index lets it go straight to the
page(s) that mention what we are searching for
- compare a Parallel Seq Scan (going through all the pages) vs. an Index Only Scan, and the
execution times
Use of explain analyze
Experiment query without index
- 1256 milliseconds
Experiment query with index (syntax to create index)
- 0.055 milliseconds, saves a lot of time
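In Postgres the pattern is `CREATE INDEX idx_name ON table (column);` followed by `EXPLAIN ANALYZE SELECT ...` to compare plans. The same before/after effect can be shown self-contained with sqlite3 from the standard library (sqlite's plan wording differs from Postgres's, and the table here is made up, but the scan-vs-index distinction is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN is sqlite's analogue of Postgres's EXPLAIN
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

query = "SELECT id FROM trips WHERE fare = 250.0"
before = plan(query)  # full scan of all rows, e.g. 'SCAN trips'
conn.execute("CREATE INDEX idx_fare ON trips (fare)")
after = plan(query)   # e.g. 'SEARCH trips USING INDEX idx_fare (fare=?)'
print(before)
print(after)
```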
When do you need to care about indexing as a data scientist/analyst?
- want to do everything you can to make the query run fast
- data retrieval
- time series application
Why don’t I index everything?
- takes up disk space, could be a lot in real life
- only index when it is needed
Disk Space it takes (syntax for checking that)
• Experiment: a broad search query (need for different indexes; the default btree does not always work)
Different kinds of index
• Btree
- based on the sorted values in that column
- if searching for value 15, start at the topmost node; if the target is less than the node's value
go left, if greater go right
- whole branches get pulled out and thrown away, because the value being searched for cannot
possibly be in them
• Syntax
• What the operator class is about: needed when the column is text rather than a numeric type.
• Disk Space it takes
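The branch-throwing idea can be sketched in plain Python: at each node, one comparison with the node's key rules out an entire subtree. This is a toy binary search tree, not Postgres's actual on-disk B-tree (which holds many keys per page), but the pruning logic is the same:

```python
# Toy binary search tree: a node is (key, left_subtree, right_subtree) or None.
def insert(node, key):
    if node is None:
        return (key, None, None)
    k, left, right = node
    if key < k:
        return (k, insert(left, key), right)
    return (k, left, insert(right, key))

def search(node, key, visited=0):
    """Return (found, nodes_visited); each step discards one whole subtree."""
    if node is None:
        return False, visited
    k, left, right = node
    if key == k:
        return True, visited + 1
    return search(left if key < k else right, key, visited + 1)

root = None
for v in [8, 4, 12, 2, 6, 10, 15]:
    root = insert(root, v)

found, steps = search(root, 15)
print(found, steps)  # True 3 -- only 3 of the 7 nodes visited
```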
• Hash
- a hash function maps each value to a number; different strings get different hash values
- each hash value is assigned to a bucket
- when searching for a string, all the other buckets can be skipped completely, because the value
can only be in its own bucket
• When to use
- HASH ONLY SUPPORTS EQUALITY
- Bitmap Index Scan
- no performance improvement for pattern matching
- best for columns with few unique values (e.g. a traffic-light column), provided queries on that
index always use equality
• Experiment index on equality (Syntax for that)
• Experiment index on the pattern (not useful)
• Disk Space it takes
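The bucket idea in a plain-Python sketch: hash the value, take it modulo the number of buckets, and only that one bucket ever has to be searched. Postgres's real hash index is more elaborate; the data here is made up:

```python
NUM_BUCKETS = 4

def bucket_of(value):
    # The same value always lands in the same bucket -> equality lookups only.
    return hash(value) % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]
for row_id, colour in enumerate(["red", "green", "amber", "red", "green"]):
    buckets[bucket_of(colour)].append((row_id, colour))

# Searching for "red": look in exactly one bucket, skip the other three.
# The final equality check handles hash collisions inside the bucket.
matches = [rid for rid, c in buckets[bucket_of("red")] if c == "red"]
print(matches)  # [0, 3]
```

This also shows why hash indexes cannot help with `<`, `>`, or pattern matching: the bucket number says nothing about ordering.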
• Gin
• When to use
- full-text search
- convert the text to a tsvector (supports different languages); a tsquery picks out the word that
is needed
- can completely ignore the other rows or pages
• Without index - search for word new
• full-text search functionality (syntax for that )
• Idea on how gin index works
• Query without index
• Query with index ( index syntax)
• Query with index and after view (view syntax)
• Query on a search pattern - doesn't speed things up.
• That is when to use trigrams
• Use Trigrams index ( syntax for that ) (look for patterns)
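Both ideas can be sketched in plain Python: a GIN index is essentially an inverted index (word -> rows containing it), and a trigram index does the same with 3-character fragments so `%pattern%` searches can also skip rows. A toy version, not Postgres internals, with made-up documents:

```python
from collections import defaultdict

docs = {1: "breaking news today", 2: "old story", 3: "new year party"}

# GIN-style inverted index: word -> set of doc ids containing it.
word_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        word_index[word].add(doc_id)

print(sorted(word_index["new"]))  # [3] -- full-text search on whole words

# Trigram index: every 3-character slice -> doc ids, for %pattern% searches.
trigram_index = defaultdict(set)
for doc_id, text in docs.items():
    for i in range(len(text) - 2):
        trigram_index[text[i:i + 3]].add(doc_id)

print(sorted(trigram_index["new"]))  # [1, 3] -- now 'news' matches too
```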
• Gist
• Brin (is not important for finals) (for time stamp)
Class 4: (de)Normalization & Data Warehousing
What is a Data Warehouse?
- why do normalization?
- want to build a database in a standard fashion with little to no redundancy
- people usually don't go for 4NF or 5NF; those are more theoretical
- avoid anomalies by not having redundancy
- split data into small tables
- data analysts need to bring in different sources of files and databases to do analysis
- analysts don't work on a database of many small tables, but on a data warehouse full of the
information they need for their analysis
- need to do a lot of joins (very expensive)
- ETL, extracting, transforming, and loading the information
- OLTP, online transaction processing
- OLAP, online analytical processing
- databases are designed for transaction processing; data warehouses are designed for analytical
processing
- star schema/snowflake schema
- a fact table associated with several dimension tables
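A star schema in miniature: one fact table of measurements, keyed to dimension tables that describe them. A plain-Python sketch with made-up tables, answering a typical OLAP question (revenue per city) by joining fact rows to a dimension:

```python
# Dimension tables: descriptive attributes, keyed by id.
dim_product = {1: {"name": "coffee"}, 2: {"name": "tea"}}
dim_store = {10: {"city": "Halifax"}, 11: {"city": "Toronto"}}

# Fact table: one row per sale, holding a measure plus dimension keys.
fact_sales = [
    {"product_id": 1, "store_id": 10, "amount": 4.50},
    {"product_id": 2, "store_id": 10, "amount": 3.00},
    {"product_id": 1, "store_id": 11, "amount": 4.50},
]

# Revenue per city: follow each fact row's key into the store dimension.
revenue = {}
for sale in fact_sales:
    city = dim_store[sale["store_id"]]["city"]
    revenue[city] = revenue.get(city, 0) + sale["amount"]

print(revenue)  # {'Halifax': 7.5, 'Toronto': 4.5}
```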
WHY Data Warehouse?
HOW do we build a data warehouse?
- Twitter schema
- dump in assignment 1
- understanding relationships between the tweets and changes in price
- build a mini warehouse first
- in simple terms, creating a materialized view
- the entire query runs in the back end and the result is stored on disk as a physical table;
queries against it come back quickly and indexing can be applied (which a simple view cannot do)
- materialized view’s downside: if information changes next day, view is not up-to-date
- so, make sure to update materialized view everyday
- or to use view
- go for a materialized view if customers are not concerned about getting the most up-to-date
information
- go for a view if speed/performance is not a worry
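The SQL forms are `CREATE VIEW v AS ...` (recomputed on every access) versus `CREATE MATERIALIZED VIEW mv AS ...` plus `REFRESH MATERIALIZED VIEW mv` (stored result, refreshed on demand). The trade-off in a plain-Python sketch:

```python
table = [1, 2, 3]  # base data

def view():
    # Plain view: recomputed from base data on every access -> fresh but slower.
    return sum(table)

materialized = sum(table)  # materialized view: result computed once and stored

table.append(4)            # base data changes the next day

print(view())              # 10 -- always up to date
print(materialized)        # 6  -- stale until refreshed
materialized = sum(table)  # REFRESH MATERIALIZED VIEW
print(materialized)        # 10
```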
Differences
- DW vs RDBMS
- ELT: mainly for data lakes; no need to know it for now
Class 5: Beyond RDBMS
Where are we with RDBMS?
- most companies rely on a hybrid solution (Postgres and Cassandra)
Advantages & Disadvantages of RDBMS
- pros: ACID
- cons: scaling is generally difficult
What is NoSQL?
- Not Only SQL
- schema-free, no foreign key constraints, can enter a column value that is not defined
- BASE (Basically Available, Soft state, Eventually consistent): you won't be in a situation
where you enter a query and get no answer at all
- in plain English: a NoSQL store is made by connecting multiple servers, so e.g. comments may be
registered on different servers
- no guarantee the data is consistent at every moment, but it will eventually become consistent
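Eventual consistency in a toy sketch: a write lands on one server, a read from another server can briefly return stale data, and a background sync eventually brings every replica to the same value. This is a drastic simplification of real replication, purely to illustrate the timeline:

```python
# Two replicas of the same record, as in a multi-server NoSQL store.
replicas = [{"likes": 0}, {"likes": 0}]

replicas[0]["likes"] = 1           # write goes to server 0

stale_read = replicas[1]["likes"]  # read from server 1: still the old value
print(stale_read)  # 0

# Background replication syncs the replicas ("eventually consistent").
replicas[1]["likes"] = replicas[0]["likes"]
print(replicas[0] == replicas[1])  # True
```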
Different types of NoSQL?
- key value store
- document store
- column based
- graph based
Introduction to graph database
- relation network
- GQL, the standard graph query language, is not mature enough yet
- use Cypher (CQL) with Neo4j: you write out the pattern you are thinking of
Understanding the graph model
- nodes: e.g. a person & an animal
- labels: categorize nodes (Person, Animal)
- name is a property key, Gittu is its property value; percentage is another property
- LIKE, HATE: relationship types
- a node can hold multiple key-value pairs
- any property can be added; whatever is added belongs to that node
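In Cypher this model would be written as e.g. `CREATE (:Person {name: 'Gittu'})-[:LIKES]->(:Animal {name: 'Dog'})`. A plain-Python sketch of the same structure; the Animal node, its `percentage` value, and the node ids are made up for illustration:

```python
# Nodes: each has a label and a dict of property key/value pairs.
nodes = {
    "n1": {"label": "Person", "props": {"name": "Gittu"}},
    "n2": {"label": "Animal", "props": {"name": "Dog", "percentage": 90}},
}

# Relationships: (start node, relationship type, end node) -- the edges.
relationships = [("n1", "LIKES", "n2")]

# Tiny "query": what does Gittu like?
liked = [nodes[end]["props"]["name"]
         for start, rel, end in relationships
         if nodes[start]["props"]["name"] == "Gittu" and rel == "LIKES"]
print(liked)  # ['Dog']
```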
Setting up and connecting to graph database
Class 6: CQL
- don’t use multiple match clauses if you don’t need subqueries
Class 8 MQL 1
- every record is stored as a separate document in MongoDB ("Documents")
- may contain nested fields
- “Schema”
- projection is more like selecting the column from the table
- only select the values I am interested in
- the _id field will be there by default; suppress it with '_id': 0
- sort = order by, 1 is ascending, -1 is descending
- count_documents ≈ COUNT(*), counts all the matching documents
- find
- projection
- sort (1, ascending, -1, descending)
- limit
- count_documents
- skip
- distinct
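With pymongo these chain as e.g. `db.coll.find(filter, projection).sort('age', -1).limit(2)`. A sketch of what each stage does, applied to an in-memory list of documents so no MongoDB server is needed (collection contents are made up):

```python
docs = [
    {"_id": 1, "name": "alice", "age": 30},
    {"_id": 2, "name": "bob", "age": 25},
    {"_id": 3, "name": "carol", "age": 35},
]

# projection {'_id': 0, 'name': 1, 'age': 1}: keep only the listed fields.
projected = [{"name": d["name"], "age": d["age"]} for d in docs]

# sort('age', -1): -1 is descending, 1 would be ascending.
ordered = sorted(projected, key=lambda d: d["age"], reverse=True)

# limit(2): first two documents only.
top_two = ordered[:2]
print(top_two)  # carol (35) first, then alice (30)
```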
Comparison
$gt
$gte
$lt
$lte
$ne
$eq
$in
$nin
Logical
$and (the implicit default when combining conditions - be careful)
$or (must be written explicitly); $in can often simplify an $or
$exists
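These operators are just keys inside the query document, e.g. `{'age': {'$gte': 18, '$lt': 65}}` or `{'$or': [...]}`. A toy evaluator for a subset of them shows how such a document is matched against records; pymongo itself sends the document to the server, so this is only an illustration:

```python
# Operator table: each MQL comparison operator maps to a plain comparison.
OPS = {"$gt": lambda a, b: a > b, "$gte": lambda a, b: a >= b,
       "$lt": lambda a, b: a < b, "$lte": lambda a, b: a <= b,
       "$ne": lambda a, b: a != b, "$eq": lambda a, b: a == b,
       "$in": lambda a, b: a in b, "$nin": lambda a, b: a not in b}

def matches(doc, query):
    """Evaluate a small subset of MQL against one document."""
    for field, cond in query.items():
        if field == "$or":               # explicit $or over sub-queries
            if not any(matches(doc, q) for q in cond):
                return False
        elif isinstance(cond, dict):     # operator doc, e.g. {'$gte': 18}
            if not all(OPS[op](doc[field], val) for op, val in cond.items()):
                return False
        elif doc.get(field) != cond:     # implicit equality / implicit $and
            return False
    return True

people = [{"name": "alice", "age": 30}, {"name": "bob", "age": 17}]
adults = [p["name"] for p in people if matches(p, {"age": {"$gte": 18}})]
print(adults)  # ['alice']
print(matches(people[1], {"$or": [{"age": {"$lt": 18}}, {"name": "zed"}]}))  # True
```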
Don’t have to know about insert_many
[{},{}], an array of documents; how to deal with subdocuments
$all
Full-text search
Don’t look into the last indexes (optional)
Won't need to write a big SQL query
Make sure to connect to Neo4j Browser
Connect to MongoDB from Jupyter
Put the schema in a pdf file