Introduction To Data Engineering
Week 1
Key players in the data ecosystem
Data Engineer
Develop and maintain data architectures and provide data to business analysts
Data Analyst
Translates data and numbers into plain language so organizations can make decisions
They answer the questions that can be answered from the data
Data Scientist
Analyze data for actionable insights
They answer predictive questions, like: how many followers am I likely to get next month?
Processing data
Storing data
———
We don't spend as much time playing with the data as organizing it. Data Analysts and Data Scientists use the data that comes
from data engineering.
The goal of Data Engineering is to make quality data available for analytics and decision-making. And it does this by collecting
raw source data, processing data so it becomes usable, storing data, and making quality data available to users securely.
reliable
complies with regulations
Technical skills:
operating systems
Data warehouses: Oracle Exadata, IBM Db2 Warehouse on Cloud, IBM Netezza Performance Server, Amazon Redshift
Data pipelines:
apache beam
airflow
dataflow
AWS
improvado
Languages:
query languages
programming languages
hive
apache spark
Functional skills:
convert business requirements into technical specifications
work with the complete software development lifecycle: ideation → architecture → design → prototyping → testing →
deployment → monitoring
Week 2
Overview of the data engineering ecosystem
Infrastructure, tools, frameworks and processes.
Semi-structured
Unstructured
Complex, qualitative information, impossible to reduce to rows and columns. Examples: photos, social media posts
relational database
non-relational database
apis
web services
data streams
social platforms
sensor devices
Data Repositories
transactional or OLTP (online transactional processing system)
include relational and non-relational databases, data warehouses, data marts, data lakes and big data stores
Data Integration
Collated → Processed → Cleansed → Integrated → Access for users
Data Pipelines
A set of tools and processes that cover the entire journey of data from source to destination systems.
Languages
query
programming
images
structured data
sources: SQL databases, OLTP Systems, spreadsheets, online forms, sensors, network and web server logs
semi-structured data
ex: e-mails, xml, binary executables, TCP/IP packets, zipped files, integration of data
unstructured data
no rules
sources: web pages, social media feeds, images in varied file formats, video and audio files, documents and pdf files,
ppt, media logs, surveys
stored in files and docs for manual analysis or in NoSQL with analysis tools
CSV, TSV
Delimited text files are text files used to store data as text in which each line, or row, has values separated by a delimiter; where
a delimiter is a sequence of one or more characters for specifying the boundary between independent entities or values.
Any character can be used to separate the values, but most common delimiters are the comma, tab, colon, vertical bar, and
space.
Comma-separated values (or CSVs) and tab-separated values (or TSVs) are the most commonly used file types in this
category.
Each row, or horizontal line, in the text file has a set of values separated by the delimiter, and represents a record.
The first row works as a column header, where each column can have a different type of data.
For example, a column can be of date type, while another can be a string or integer type data.
Delimited files allow field values of any length and are considered a standard format for providing straightforward information
schema.
They can be processed by almost all existing applications.
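For illustration, a minimal Python sketch of reading delimited files (the file names and column names here are hypothetical, and a header row is assumed):

import csv

# Hypothetical file sales.csv with a header row such as: dealer,amount
with open("sales.csv", newline="") as f:
    reader = csv.DictReader(f)  # the first row is used as the column headers
    for row in reader:
        print(row["dealer"], row["amount"])

# The same approach works for TSV files by changing the delimiter
with open("sales.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        print(row)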
XLSX
XLSX uses the open file format, which means it is generally accessible to most other applications.
It can use and save all functions available in Excel and is also known to be one of the more secure file formats as it cannot
save malicious code.
XML
Extensible Markup Language, or XML, is a markup language with set rules for encoding data.
XML is platform independent and programming language independent and therefore simplifies data sharing between various
systems.
PDF
Portable Document Format, or PDF, is a file format developed by Adobe to present documents independent of application software,
hardware, and operating systems, which means it can be viewed the same way on any device.
This format is frequently used in legal and financial documents and can also be used to fill in data such as forms.
JSON
JavaScript Object Notation, or JSON, is a text-based open standard designed for transmitting structured data over the web.
The file format is a language-independent data format that can be read in any programming language.
JSON is easy to use, is compatible with a wide range of browsers, and is considered one of the best tools for sharing data of
any size and type, even audio and video.
That is one reason many APIs and web services return data as JSON.
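As a small illustration (the payload below is made up), Python's built-in json module can parse JSON text into native data structures and serialize them back:

import json

# Hypothetical payload, similar to what a web API might return
payload = '{"user": "data_eng_fan", "followers": 1024, "topics": ["etl", "spark"]}'

record = json.loads(payload)  # parse JSON text into a Python dict
print(record["user"], record["followers"])

# Serialize a Python structure back into JSON text
print(json.dumps({"status": "ok", "topic_count": len(record["topics"])}, indent=2))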
Sources of data
Relational databases
can be used as source for analysis
APIS
Examples:
Data lookup and validation APIs for cleaning and correlating data
Web scraping
extract relevant data from unstructured sources
popular tools:
beautiful soup
scrapy
pandas
selenium
advantages:
its syntax allows developers to write programs with fewer lines of code using basic keywords
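A minimal web-scraping sketch using requests and Beautiful Soup (assumes both packages are installed; the URL is a placeholder and should be a page you are allowed to scrape):

import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder URL

html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Extract the text and target of every link on the page
for link in soup.find_all("a"):
    print(link.get_text(strip=True), link.get("href"))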
Programming languages
→ designed for developing applications and controlling application behavior (python, R, java)
Python
multi platform
libraries:
R
open-source programming language and environment for data analysis, data visualization, machine learning and statistics
platform independent
highly extensible
Java
platform independent
data analysis, cleaning importing and exporting data, statistical analysis, data visualization
Shell scripting
→ ideal for repetitive and time-consuming operational tasks (unix/linux shell, power shell)
Unix/Linux shell
series of unix commands written in a plain text file to accomplish a specific task
file manipulation
program execution
installation scripts
running batches
PowerShell
microsoft
optimized for structured data formats such as json, csv, xml and rest apis, websites and office apps
object-based → filter, sort, measure, group and compare objects as they pass through a data pipeline
data mining, building GUIs, creating charts, dashboards and interactive reports
databases
data warehouses
Databases
→ Collection of data for input, storage, search, retrieval and modification of data
relational
flat files
sql
non-relational
nosql
volume diversity
big data
Data warehouse
→ A data warehouse works as a central repository that merges information coming from disparate sources and consolidates it
through the extract, transform, and load process, also known as the ETL process, into one comprehensive database for analytics and business intelligence.
data marts
relational
data lakes
RDBMS
A relational database is a collection of data organized into a table structure, where the tables can be linked, or related, based
on data common to each.
Tables are made of rows and columns, where rows are the “records”, and the columns the “attributes”.
(Like in GeneXus :D)
Relationship between tables minimizes data redundancy
Use cases:
oltp applications
iot solutions
Limitations:
migration between two RDBMS (source and destination must have identical schemas)
NoSQL
→ non-relational database design that provides flexible schemas for the storage and retrieval of data
efficient and cost-effective scale-out architecture that provides additional capacity and performance with the addition of new
nodes
key-value store
document based
column based
graph based
Key-Value Store
the key represents an attribute of the data and is a unique identifier
Examples:
redis
memcached
dynamoDB
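A minimal key-value sketch using the redis-py client (assumes the redis package is installed and a Redis server is running on localhost:6379; the key and value are made up):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The key is the unique identifier; the value can be any blob of data
r.set("session:1001", "user=alice;cart=3_items")
print(r.get("session:1001"))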
Document-based
store each record within a single document
great for eCommerce platforms, medical records storage, CRM platforms, analytics platforms
NOT GREAT:
multi-operation transactions
Examples:
mongoDB
DocumentDB
CouchDB
Cloudant
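A minimal document-store sketch using pymongo (assumes the pymongo package and a MongoDB server on localhost:27017; the database, collection and fields are hypothetical):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
orders = client["shop"]["orders"]  # hypothetical database and collection

# Each record is stored as a single document
orders.insert_one({"order_id": 1, "customer": "alice", "items": ["book", "pen"]})
print(orders.find_one({"customer": "alice"}))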
Column-based
stored in cells grouped as columns of data instead of rows
all cells corresponding to a column are saved as a continuous disk entry, making access and search easier and faster
great for systems that require heavy write requests, storing time-series data, weather data and iot data
NOT GREAT:
complex queries
Examples:
cassandra
hbase
Graph-based
use a graphical model to represent and store data
useful for visualizing, analyzing and finding connections between different pieces of data
Nodes = data
Arrows = relationships
NOT GREAT:
Examples:
neo4j
cosmosDB
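A minimal graph-database sketch using the official Neo4j Python driver (assumes the neo4j package and a local Neo4j server; the URI, credentials and data are placeholders):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes hold the data; relationships (the arrows) connect them
    session.run("MERGE (a:Person {name: $a})-[:FOLLOWS]->(b:Person {name: $b})",
                a="alice", b="bob")
    for record in session.run("MATCH (p:Person)-[:FOLLOWS]->(f) RETURN p.name, f.name"):
        print(record["p.name"], "follows", record["f.name"])

driver.close()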
crm
hr
erp
non-relational
3 tier architecture
OLAP Server (process and analyze information coming from database servers)
lower costs
examples:
teradata
oracle exadata
ibm db2
netezza
amazon redshift
google bigquery
cloudera
snowflake
Three types:
Dependent
subsection of an enterprise data warehouse where data has already been cleaned and transformed
Independent
created from sources other than an enterprise data warehouse, such as internal operational systems and external
systems
Hybrid
combine inputs from data warehouses, operational systems and external systems
Data lake
→ store large amounts of structured, semi-structured and unstructured data in their native format
data is classified
can be deployed using cloud object storage or large-scale distributed systems and relational and non-relational database
systems
ability to repurpose data in several different ways and wide-ranging use cases
Examples:
amazon
cloudera
ibm
informatica
microsoft
oracle
Steps:
batch processing: large chunks of data moved from source to destination at scheduled intervals
stream processing: data pulled in real-time from source, transformed in transit and loaded into data repository
Transforming data:
enriching data
Loading data:
Load verification:
server performance
load failures
The destination system for an ELT pipeline is most likely a data lake, though it can also be a data warehouse.
ELT is useful for processing large sets of unstructured and non-relational data. It is ideal for data lakes where
transformations on the data are applied once the raw data is loaded into the data lake.
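To make the batch ETL steps concrete, a minimal sketch using only the Python standard library (the file, table and column names are hypothetical):

import csv
import sqlite3

# Extract: read raw rows from a hypothetical source file
with open("raw_sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and enrich each record
for row in rows:
    row["dealer"] = row["dealer"].strip().title()
    row["sale_amount"] = float(row["sale_amount"])

# Load: write the cleaned records into a destination table
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (dealer TEXT, sale_amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(r["dealer"], r["sale_amount"]) for r in rows])
conn.commit()
conn.close()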
Advantages:
Data pipeline
Encompasses the entire journey of moving data from one system to another, including the ETL process
TOOLS:
IBM offers a host of data integration tools, such as IBM InfoSphere DataStage, that target a range of enterprise data integration scenarios.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/7f635a7e-3fdd-422a-8744-4f0e26240b21/1.2.2.9_Hands-on_Lab_Provision_an_instance_of_IBM_Db2_Lite_plan.md.html
The 5 V's of big data: velocity, volume, variety, veracity and value
Apache Hadoop
is a collection of tools that provides distributed storage and processing of big data
node = 1 computer
reliable, scalable and cost-effective solution for storing data with no format requirements
data offload and consolidation (move the computation closer to the node in which the data resides)
HDFS:
Hadoop Distributed File System: is a storage system for big data that runs on clusters of commodity hardware connected
through a network.
Splits large files across multiple computers, allowing parallel access to them
Higher availability
Better scalability
Data locality
Portability
Apache Hive
is a data warehouse for data query and analysis built on top of Hadoop. This software is used for reading, writing, and
managing large data set files that are stored directly in either HDFS or other data storage systems such as Apache HBase
Queries have high latency → less appropriate for apps that need very fast response times
Read-based → not suitable for transaction processing that involves write operations
Apache Spark
is a distributed analytics framework for complex, real-time data analytics. This general-purpose data processing engine is
designed to extract and process large volumes of data for a wide range of applications
Applications include:
streams processing
machine learning
data integration
ETL
interfaces for major programming languages such as java, python, R and SQL
can access data in a large variety of data sources, including HDFS and Hive
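A minimal PySpark sketch (assumes the pyspark package; the input path and column names mirror the lab data but are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("simple-analytics").getOrCreate()

# Read a CSV file; the source could equally be data in HDFS or a Hive table
df = spark.read.csv("data/car_sales.csv", header=True, inferSchema=True)

# A simple distributed aggregation
df.groupBy("Dealer").sum("Sale_Amount").show()

spark.stop()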
Week 3
Architecting the Data Platform (5 layers)
Data ingestion or data collection layer
Responsible for connecting to the source systems and bringing the data from those systems into the data platform.
Tools:
Data flow, IBM Streams, IBM Streaming Analytics on Cloud, amazon Kinesis, Kafka
Make data available for processing in both streaming and batch modes
Needs to be:
Tools (databases):
relational databases: IBM Db2, Microsoft SQL Server, MySQL, Oracle, PostgreSQL
Integration tools:
Open Studio
SnapLogic
provide a way for analysts and data scientists to work with data in the data platform
Transformation tasks:
Structuring
Normalization: cleaning the database of unused data and reducing redundancy and inconsistency
Denormalization: combining data from multiple tables into a single table so that it can be queried more efficiently
Data cleaning
→ In big data systems, data can first be stored in Hadoop and then processed in a data processing engine like Spark
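A small pandas sketch of the denormalization task described above, combining two hypothetical normalized tables into one wide table that is easier to query:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Alice", "Bob"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "amount": [25.0, 40.0, 15.5]})

# Denormalization: merge the tables into a single table
flat = orders.merge(customers, on="customer_id", how="left")
print(flat)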
consumers = BI analysts, business stakeholders, data scientists, data analysts, other apps and services
querying tools and programming languages, apis, dashboards, jupyter notebooks, microsoft power bi
input
storage
modification
two types
RDBMS
NoSQL
data warehouse
data mart
high-volume
high-velocity
diverse types
split large files across multiple computers. Computations run in parallel on each node where data is stored
data lake
analytical systems: need complex queries to be applied to large amounts of historical data aggregated from transactional
systems. They need faster response times to complex queries
schema design, indexing, and partitioning strategies have a big role to play in performance of systems based on how data
is getting used
scalability
normalization
Storage considerations:
performance: throughput and latency
integrity: data must be safe from corruption, loss, and outside attack
recoverability: storage solution should ensure you can recover your data in the event of failures and natural disasters
access control
multizone encryption
data management
monitoring systems
USA:
Security
Levels:
physical infrastructure
network
application
data
Integrity through validating that your resources are trustworthy and have not been tampered with
Availability by ensuring authorized users have access to resources when they need it
Network Security
firewalls
security protocols
Application Security
needs to be built into the foundation of the app
secure design
security testing
Data Security
Data is either at rest in storage or in transit.
authentication systems
data at rest:
stored physically
data in transit:
provide reports and alerts that help enterprises react to security violations in time
commands to specify:
non-relational DBs can be queried using SQL or SQL-like query tools. Some non-relational databases come with their
own querying tools, such as CQL for Cassandra and Cypher for Neo4j
Web scraping
downloading specific data from web pages based on defined parameters
Data streams
instruments, iot devices, sensors, GPS data from cars, apps
data streams and feeds are also used for extracting data from social media sites and interactive platforms
data streams are a popular source for aggregating constant streams of data flowing from sources
Data exchange
allow the exchange of data between data providers and data consumers
Importing data
structured data
relational databases
NoSQL
NoSQL
unstructured data
NoSQL
data lakes
data exploration
transformation
validation
Transformation
actions that change the form and schema of your data
visualization tools
missing values
duplicate data
need to be removed
irrelevant data
data type conversion to ensure values in a field are stored as the data type of that field
outliers are values that are very different from the others and need to be examined
in-built formulae
openRefine
open-source
import and export in a wide variety of formats: TSV, CSV, XLS, XML, JSON
clean data, transform its format, extend data with web services and external data
google dataPrep
visually explore, clean and prepare both structured and unstructured data for analysis
transforms large amounts of raw data into consumable, quality information that is ready for analytics
trifacta wrangler
takes messy, real-world data and cleans and rearranges it into data tables
collaboration features
python
jupyter notebook: data cleaning and transformation, statistical modeling and data visualization
numpy: fast, versatile, easy to use. Large multi-dimensional arrays and matrices, high-level math functions
pandas: fast and easy data analysis. Complex operations such as merging, joining and transforming huge chunks of
data. Helps prevent common errors that result from misaligned data coming from different sources
R
dplyr: powerful library for data wrangling with a precise and straightforward syntax
jsonlite: a robust JSON parsing tool, great for interacting with web APIs
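A small pandas sketch showing the cleaning steps above on made-up data: removing duplicates, handling missing values, converting data types, and flagging possible outliers:

import pandas as pd

df = pd.DataFrame({
    "dealer": ["A", "A", "B", None, "C"],
    "sale_amount": ["100", "100", "250", "80", "9999999"],
})

df = df.drop_duplicates()                 # remove duplicate records
df = df.dropna(subset=["dealer"])         # drop rows with a missing dealer
df["sale_amount"] = df["sale_amount"].astype(float)  # data type conversion

# Simple outlier check: distance from the mean in standard deviations
z = (df["sale_amount"] - df["sale_amount"].mean()) / df["sale_amount"].std()
print(df.assign(z_score=z))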
Lab: Load data into the IBM Db2 database from a CSV file
Now that you have learned about the process of importing data into a data repository from varied sources, you will load data from a CSV file
into the IBM Db2 database instance you created in the previous lab.
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0100EN-SkillsNetwork/labs/1.3.2.5_Hands-on_Lab_Load_data_into_Db2_Database3_from_CSV_file.md.html?origin=www.coursera.org
Congratulations! You have successfully loaded data from a CSV file into the IBM Db2 instance you created in the previous lab.
select distinct Dealer from CarSaleDetails; /*it shows only the Dealer column*/
select count(distinct Dealer) from CarSaleDetails; /*total number of unique, or distinct, car dealers*/
Aggregation
select sum(Sale_Amount) as sum_all from CarSaleDetails; /* calculates the sum of a numeric column */
select stddev(Sale_Amount) as std_all from CarSaleDetails; /* standard deviation to see how spread out the cost of a used car is */
Slicing data
Sorting data
Filtering patterns
select * from CarSaleDetails where Pin like '871%'; /* returns records that match a data value PARTIALLY. This can return 8710, 8711, ... */
Grouping data
select sum(Sale_Amount) as area_sum, Pin from CarSaleDetails group by Pin; /* Total amount spent by customers, pincode-wise */
app failures
tool incompatibilities
Performance metrics:
latency
failures
traffic
Troubleshooting:
collect information
check if we're working with all the right versions of software and source codes
system outages
capacity utilization
application slowdown
performance of queries
Best practices:
capacity planning: determining the optimal hardware and software resources required for performance
database indexing: locating data without searching each row in a database resulting in faster querying
database partitioning: dividing large tables into smaller, individual tables, improving performance and data manageability.
Queries run faster.
database normalization
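As an illustration of indexing, a sketch using Python's built-in sqlite3 module (the table mirrors the lab's CarSaleDetails, but the data here is made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CarSaleDetails (Dealer TEXT, Pin TEXT, Sale_Amount REAL)")
conn.executemany("INSERT INTO CarSaleDetails VALUES (?, ?, ?)",
                 [("D1", "8710", 100.0), ("D2", "8711", 250.0)])

# Indexing lets the database locate matching rows without scanning every row
conn.execute("CREATE INDEX idx_pin ON CarSaleDetails (Pin)")

# EXPLAIN QUERY PLAN shows that the filter below uses the index
for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM CarSaleDetails WHERE Pin = '8710'"):
    print(row)
conn.close()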
Monitoring systems
help collect quantitative data about our systems and apps in real time
visibility to the performance of data pipelines, data platforms, databases, apps, tools, queries, scheduled jobs, and more
Database monitoring tools: take frequent snapshots of the performance indicators of a database
Application performance management tools: measure and monitor the performance of apps and amount of resources
utilized by each process
Tools for monitoring query performance: gather statistics about query throughput, execution, performance, resource
utilization and utilization patterns for better planning and allocation of resources
Job-level runtime monitoring: breaks up the job into a series of logical steps, so that errors can be detected before
completion
Monitoring amount of data being processed: tracking volume through a data pipeline helps to assess whether the size of the
workload is slowing down the system
Maintenance schedules
Preventive maintenance routines generate data that we can use to identify systems and procedures responsible for faults and
low availability.
Time-based. That is, they could be planned as scheduled activities at pre-fixed time intervals.
Condition-based, which means they are performed when there is a specific issue or when a decrease in performance has
been noted or flagged.
Congratulations! You have successfully run some SQL queries to help you explore and understand your dataset.
Now that you have learned how querying techniques can help you to explore and analyze your data, you will run some basic
SQL queries on the data you loaded into your database instance in the previous lab. For this, you will use the in-built SQL
editor available in your Lite account.
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0100EN-SkillsNetwork/labs/1.3.3.3_Hands-on_Lab_Explore_your_dataset_using_SQL_queries.md.html?origin=www.coursera.org
CCPA (california)
industry-specific regulations
HIPAA
SOX (finance)
Compliance
Compliance covers the processes and procedures through which an organization adheres to regulations and conducts its
operations in a legal and ethical manner. Organizations need to establish controls and checks in order to comply with
regulations
Data Lifecycle
Data acquisition stage
identify data that needs to be collected and the legal basis for procuring the data
identify the amount of data you need to meet your defined purposes
flesh out details of how exactly you are going to process personal data
establish specific measures you will take to prevent internal and external security breaches
identify third-party vendors in your supply chain that will have access to the collected data
establish how you will hold third-party vendors contractually accountable to regulations
define how you will ensure deleted data is removed from all locations, including third-party systems
Technology as an enabler
authentication and access control
the process of anonymization abstracts the presentation layer without changing the data in the database physically
hosting options
data erasure → is a software-based method of permanently clearing data from a system by overwriting. This is different
from a simple deletion of data since deleted data can still be retrieved.
(Source: https://blogs.gartner.com/nick-heudecker/hyping-dataops/ )
A small team working on a simpler or limited number of use cases can meet business requirements efficiently. As data
pipelines and data infrastructures get more complex, and data teams and consumers grow in size, you need development
processes and efficient collaboration between teams to govern the data and analytics lifecycle. From data ingestion and data
processing to analytics and reporting, you need to reduce data defects, ensure shorter cycle times, and ensure 360-degree
access to quality data for all stakeholders.
DataOps helps you achieve this through metadata management, workflow and test automation, code repositories,
collaboration tools, and orchestration to help manage complex tasks and workflows. Using the DataOps methodology ensures
all activities occur in the right order using the right security permissions. It helps set up a continual process that allows
you to cut waste, streamline steps, automate processes, increase throughput, and improve continually.
Several DataOps Platforms are available in the market, some of the popular ones being IBM DataOps, Nexla, Switchboard,
Streamsets, and Infoworks.
DataOps Methodology
The purpose of the DataOps Methodology is to enable an organization to utilize a repeatable process to build and deploy
analytics and data pipelines. Successful implementation of this methodology allows an organization to know, trust, and use
data to drive value.
It ensures that the data used in problem-solving and decision making is relevant, reliable, and traceable and improves the
probability of achieving desired business outcomes. And it does so by tackling the challenges associated with inefficiencies in
accessing, preparing, integrating, and making data available.
The Establish DataOps Phase provides guidance on how to set up the organization for success in managing data.
The Iterate DataOps Phase delivers the data for one defined sprint.
The Improve DataOps Phase ensures learnings from each sprint are channeled back to continually improve the DataOps
process.
The figure below presents a high-level overview of these phases and the key activities within each of these phases.
Automate metadata management and catalog data assets, making them easy to access.
Trace data lineage to establish its credibility and for compliance and audit purposes.
Automate workflows and jobs in the data lifecycle to ensure data integrity, relevancy, and security.
Streamline the workflow and processes to ensure data access and delivery needs can be met at optimal speed.
Ensure a business-ready data pipeline that is always available for all data consumers and business stakeholders.
Build a data-driven culture in the organization through automation, data quality, and governance.
As a data practitioner, using the methodology can help you reduce development time, cut wastages and duplication of effort,
increase your productivity and throughput, and ensure that your actions produce the best possible quality of data.
With DataOps, data professionals, consumers, and stakeholders can collaborate more effectively towards the shared goal of
creating valuable insights for business. While implementing the methodology requires systemic change, time, and resources,
in the end it makes data and analytics more efficient and reliable.
Interestingly, it also opens up additional career opportunities for you as a data engineer. DataOps Engineers are technical
professionals that focus on the development and deployment lifecycle rather than the product itself. And as you grow in
experience, you can move into more specialist roles within DataOps, contributing to defining the data strategy, developing and
deploying business processes, establishing performance metrics, and measuring performance.