
UNIT – 1

Basics of Cloud computing


Introduction to cloud computing: Introduction, Characteristics of cloud computing, Cloud Models,
Cloud Services Examples, Cloud Based services and applications
Cloud concepts and Technologies: Virtualization, Load balancing, Scalability and Elasticity,
Deployment, Replication, Monitoring, Software Defined Networking, Network Function Virtualization,
MapReduce, Identity and Access Management, Service Level Agreements, Billing.
Cloud Services and Platforms: Compute Services, Storage Services, Database Services, Application
services, Content delivery services, Analytics Services, Deployment and Management Services,
Identity and Access Management services, Open Source Private Cloud software.

***************

Cloud computing provides a means of accessing applications as utilities over the Internet. It allows us to create, configure, and customize applications online.
What is Cloud?
The term Cloud refers to a network or, more broadly, the Internet. In other words, the cloud is something which is present at a remote location. The cloud can provide services over public and private networks, i.e., WAN, LAN or VPN.
Applications such as e-mail, web conferencing, and customer relationship management (CRM) execute on the cloud.
What is Cloud Computing?
Cloud computing refers to manipulating, configuring, and accessing hardware and software resources remotely. It offers online data storage, infrastructure, and applications.

Cloud computing offers platform independence, as the software is not required to be installed locally on the PC. Hence, cloud computing makes our business applications mobile and collaborative.

History of Cloud Computing


The concept of cloud computing came into existence in the 1950s with the implementation of mainframe computers, accessible via thin/static clients. Since then, cloud computing has evolved from static clients to dynamic ones and from software to services. The following diagram explains the evolution of cloud computing:

Benefits
Cloud Computing has numerous advantages. Some of them are listed below -
• One can access applications as utilities, over the Internet.
• One can manipulate and configure applications online at any time.
• No software installation is required to access or manipulate cloud applications.
• Cloud computing offers online development and deployment tools and programming runtime environments through the PaaS model.
• Cloud resources are available over the network in a manner that provides platform-independent access to any type of client.
• Cloud computing offers on-demand self-service. The resources can be used without interaction with the cloud service provider.
• Cloud computing is highly cost-effective because it operates at high efficiency with optimum utilization. It just requires an Internet connection.
• Cloud computing offers load balancing that makes it more reliable.

Characteristics of Cloud Computing
There are five key characteristics of cloud computing. They are shown in the following diagram:

On Demand Self Service


Cloud computing allows users to use web services and resources on demand. One can log on to a website at any time and use them.
Broad Network Access
Since cloud computing is completely web based, it can be accessed from anywhere
and at any time.
Resource Pooling
Cloud computing allows multiple tenants to share a pool of resources. One can share a single physical instance of hardware, databases and basic infrastructure.
Rapid Elasticity
It is very easy to scale resources vertically or horizontally at any time. Scaling of resources means the ability of resources to deal with increasing or decreasing demand. The resources being used by customers at any given point of time are automatically monitored.
Measured Service
In this model, the cloud provider controls and monitors all aspects of the cloud service. Resource optimization, billing, capacity planning, etc. depend on it.

Cloud Models:
There are certain services and models working behind the scenes to make cloud computing feasible and accessible to end users. The following are the working models for cloud computing:
• Deployment Models
• Service Models
Deployment Models
Deployment models define the type of access to the cloud, i.e., how the cloud is located. A cloud can have any of the following four types of access: Public, Private, Hybrid, and Community.

Public Cloud
The public cloud allows systems and services to be easily accessible to the general
public. Public cloud may be less secure because of its openness.
Private Cloud
The private cloud allows systems and services to be accessible within an organization. It is more secure because of its private nature.
Community Cloud
The community cloud allows systems and services to be accessible by a group of
organizations.
Hybrid Cloud
The hybrid cloud is a mixture of public and private cloud, in which the critical
activities are performed using private cloud while the non-critical activities are performed
using public cloud.
Service Models
Cloud computing is based on service models. These are categorized into three
basic service models which are -
• Infrastructure-as–a-Service (IaaS)
• Platform-as-a-Service (PaaS)
• Software-as-a-Service (SaaS)
Anything-as-a-Service (XaaS) is yet another service model, which includes Network-
as-a-Service, Business-as-a-Service, Identity-as-a-Service, Database-as-a-Service or Strategy-
as-a-Service.
The Infrastructure-as-a-Service (IaaS) is the most basic level of service. Each of the service models inherits the security and management mechanisms of the underlying model, as shown in the following diagram:

Infrastructure-as-a-Service (IaaS)
IaaS provides access to fundamental resources such as physical machines, virtual
machines, virtual storage, etc.
Platform-as-a-Service (PaaS)
PaaS provides the runtime environment for applications, development and
deployment tools, etc.
Software-as-a-Service (SaaS)
The SaaS model provides software applications as a service to end users.

Risks related to Cloud Computing


Although cloud computing is a promising innovation with various benefits in the world of computing, it comes with risks. Some of them are discussed below:
Security and Privacy
It is the biggest concern about cloud computing. Since data management and infrastructure management in the cloud are provided by a third party, it is always a risk to hand over sensitive information to cloud service providers.
Although cloud computing vendors ensure highly secure, password-protected accounts, any sign of a security breach may result in loss of customers and business.
Lock In
It is very difficult for the customers to switch from one Cloud Service Provider
(CSP) to another. It results in dependency on a particular CSP for service.
Isolation Failure
This risk involves the failure of isolation mechanism that separates storage,
memory, and routing between the different tenants.
Management Interface Compromise
In the case of a public cloud provider, the customer management interfaces are accessible through the Internet.
Insecure or Incomplete Data Deletion
It is possible that the data requested for deletion may not get deleted. This happens because of either of the following reasons:
• Extra copies of the data are stored but are not available at the time of deletion.
• The disk that stores data of multiple tenants is destroyed.
Virtualization In Cloud Computing

First, we need to understand the meaning of the word virtual: something virtual is a representation of something physically present elsewhere.
Similarly, Virtualization in Cloud Computing is a technology that allows us to
create virtual resources such as servers, networks, and storage in the cloud. All these
resources are allocated from a physical machine that runs somewhere in the world, and
we'll get the software to provision and manage these virtual resources. These physical
machines are operated by cloud providers, who take care of maintenance, and hardware
supplies.
• Virtualization refers to the partitioning of the resources of a physical system (such as computing, storage, network and memory) into multiple virtual resources.
• It is a key enabling technology of cloud computing that allows pooling of resources.
• In cloud computing, resources are pooled to serve multiple users using multi-tenancy.

Some examples of virtualization in cloud computing are as follows:

• EC2 service from Amazon Web Services
• Compute Engine from Google Cloud
• Azure Virtual Machines from Microsoft Azure
Concept Behind Virtualization
The main concept behind virtualization is the hypervisor. A hypervisor is software that partitions the hardware resources of the physical machine and runs virtual machines on them.
It is typically installed on the server's hardware and divides the resources among the virtual machines (VMs).
The server running the hypervisor is called the host, and the VMs using its resources run guest operating systems.

The VMs function like digital files inside the physical device, and they can be moved from one system to another, thereby increasing portability. There are many open-source and paid hypervisors available. Cloud providers use them based on their requirements and business needs.
How Virtualization Works in Cloud Computing

Important Terminologies of Virtualization


1. Virtual Machine (VM)
A virtual machine simulates an actual computer. VMs come with an operating system (OS) already installed and execute the applications installed inside them. These virtual machines are controlled and managed by the hypervisor.
2. Hypervisor
A hypervisor is software that partitions the hardware resources of the physical machine and runs virtual machines on them. It is responsible for creating and provisioning virtual resources when there is a request.
Type-1 Hypervisor:
Type-1 or native hypervisors run directly on the host hardware; they control the hardware and monitor the guest operating systems.

Type-2 Hypervisor:
Type 2 hypervisors or hosted hypervisors run on top of a conventional
(main/host) operating system and monitor the guest operating systems.

3. Virtualization software
A tool for deploying virtualization on the device; this is the software that the user interacts with to specify virtual resource requirements. It communicates the resource requirements to the hypervisor.
4. Virtual Networking
Virtual networking is the networking configured inside servers and logically separated from the physical network. These networks can be scaled across multiple servers and are controlled entirely by software.

Types of Virtualization:
Full Virtualization
Full virtualization is virtualization in which the guest operating system is unaware that it is in a virtualized environment; the hardware is virtualized by the host operating system so that the guest can issue commands to what it thinks is actual hardware, though these are really just simulated hardware devices created by the host.
Para-Virtualization
Para-virtualization is virtualization in which the guest operating system (the one being virtualized) is aware that it is a guest and accordingly has drivers that, instead of issuing hardware commands, simply issue commands directly to the host operating system. This includes things such as memory management as well.
Hardware Virtualization
Hardware assisted virtualization is enabled by hardware features such as Intel’s
Virtualization Technology (VT-x) and AMD’s AMD-V.
In hardware assisted virtualization, privileged and sensitive calls are set to
automatically trap to the hypervisor. Thus, there is no need for either binary translation
or para-virtualization.

Load balancing in Cloud Computing


Cloud load balancing is defined as the method of distributing workloads and computing resources across a cloud computing environment. It enables enterprises to manage workload or application demands by distributing resources among numerous computers, networks or servers. Cloud load balancing includes managing the circulation of workload traffic and demands that exist over the Internet.
Traffic on the Internet is growing rapidly, at roughly 100% of the present traffic annually. Hence, the workload on servers is growing so fast that it leads to overloading of servers, mainly for popular web servers. There are two elementary solutions to overcome the problem of overloading on the servers:
• First is a single-server solution in which the server is upgraded to a higher performance server. However, the new server may also be overloaded soon, demanding another upgrade. Moreover, the upgrading process is arduous and expensive.

• Second is a multiple-server solution in which a scalable service system on a cluster of servers is built. It is more cost effective as well as more scalable to build a server cluster system for network services.
Load balancing is beneficial with almost any type of service, such as HTTP, SMTP, DNS, FTP, and POP/IMAP. It also increases reliability through redundancy. The balancing service is provided by a dedicated hardware device or program. Cloud-based server farms can attain more precise scalability and availability using server load balancing.
Load balancing solutions can be categorized into two types:
1. Software-based load balancers: Software-based load balancers run on standard hardware (desktops, PCs) and standard operating systems.
2. Hardware-based load balancers: Hardware-based load balancers are dedicated boxes which include Application Specific Integrated Circuits (ASICs) adapted for a particular use. ASICs allow high-speed processing of network traffic and are frequently used for transport-level load balancing, because hardware-based load balancing is faster than software solutions.
Load Balancing Algorithms
• Round Robin load balancing
• Weighted Round Robin load balancing
• Low Latency load balancing
• Least Connections load balancing
• Priority load balancing
• Overflow load balancing

[Figures: (a) Round Robin, (b) Weighted Round Robin, (c) Low Latency, (d) Least Connections, (e) Priority, and (f) Overflow load balancing]
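To make the first two algorithms concrete, below is a minimal, provider-agnostic Python sketch of round robin and weighted round robin server selection; the server addresses and weights are made-up examples, not taken from any real deployment.

```python
import itertools

class RoundRobinBalancer:
    """Cycles through backend servers in order, one request at a time."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        return next(self._cycle)

class WeightedRoundRobinBalancer:
    """Like round robin, but a server with weight w appears w times per cycle,
    so it receives proportionally more requests."""
    def __init__(self, servers_with_weights):
        expanded = [s for s, w in servers_with_weights for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def next_server(self):
        return next(self._cycle)

if __name__ == "__main__":
    rr = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    print([rr.next_server() for _ in range(6)])      # each server gets every third request

    wrr = WeightedRoundRobinBalancer([("10.0.0.1", 3), ("10.0.0.2", 1)])
    print([wrr.next_server() for _ in range(8)])     # 10.0.0.1 gets 3 of every 4 requests
```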


Load Balancing- Persistence Approaches
• Since load balancing can route successive requests from a user session to different
servers, maintaining the state or the information of the session is important.
• Persistence Approaches
• Sticky sessions
• Session Database
• Browser cookies
• URL re-writing
Cloud Elasticity
Cloud elasticity is a system’s ability to increase (or decrease) its varying
capacity-related needs such as storage, networking, and computing based on specific
criteria (think: total load on the system).
Simply put, elasticity adapts to both the increase and decrease in workload by provisioning and de-provisioning resources in an autonomic manner.

Here are some of its distinctive characteristics:


• Matches the allocated resources to the actual demand in real time
• Widely used in e-commerce and retail, software as a service (SaaS), DevOps, mobile,
and other cloud environments with ever-changing infrastructure demands

Example of cloud elasticity : Cloud elasticity refers to scaling up (or scaling down) the
computing capacity as needed. It basically helps you understand how well your
architecture can adapt to the workload in real time.
For example, 100 users log in to your website every hour. A single server can easily
handle this volume of traffic. However, what happens if 5000 users log in at the same time?
If your existing architecture can quickly and automatically provision new web servers to
handle this load, your design is elastic.
As you can imagine, cloud elasticity comes in handy when your business
experiences sudden spikes in user activity and, with it, a drastic increase in workload
demand – as happens in businesses such as streaming services or e-commerce
marketplaces.
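As a toy illustration of such an elasticity rule, the following Python sketch computes how many web servers to keep provisioned for the current number of active users. The capacity of 1000 users per server and the pool limits are assumed figures for illustration, not a real provider policy.

```python
import math

def desired_server_count(active_users, users_per_server=1000,
                         min_servers=1, max_servers=20):
    """Provision just enough servers for the current load, within pool limits."""
    needed = math.ceil(active_users / users_per_server)
    return max(min_servers, min(needed, max_servers))

# 100 users fit on one server; a spike to 5000 users provisions five servers;
# when traffic drops back, the pool shrinks again (de-provisioning).
for users in (100, 5000, 250):
    print(users, "active users ->", desired_server_count(users), "server(s)")
```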
Take the video streaming service Netflix, for example. Here’s how Netflix’s
architecture leverages the power of elasticity to scale up and down:

Cloud Scalability
Cloud scalability only adapts to the workload increase through the incremental
provision of resources without impacting the system’s overall performance. This is built in
as part of the infrastructure design instead of makeshift resource allocation (as with cloud
elasticity).
Below are some of its main features:
• Typically handled by adding resources to existing instances, also known as scaling
up or vertical scaling, or by adding more copies of existing instances, also known
as scaling out or horizontal scaling
• Allows companies to implement big data models for machine learning (ML) and
data analysis
• Handles rapid and unpredictable changes in a scalable capacity
• Generally more granular and targeted than elasticity in terms of sizing

• Ideal for businesses with a predictable and preplanned workload where capacity
planning and performance are relatively stable
Example of cloud scalability : Cloud scalability has many examples and use cases. It
allows you to scale up or scale out to meet the increasing workloads. You can scale up a
platform or architecture to increase the performance of an individual server.
Usually, this means that hardware costs increase linearly with demand. On the flip side, you can also add multiple servers alongside the existing one and scale out to enhance server performance and meet the growing demand.
Another good example of cloud scalability is a call center. A call center requires a
scalable application infrastructure as new employees join the organization and customer
requests increase incrementally. As a result, organizations need to add new server features
to ensure consistent growth and quality performance.

Scalability vs. Elasticity: A comparative analysis


Scalability and elasticity are the two sides of the same coin with some notable
differences. Below is a detailed comparative analysis of scalability vs. elasticity:

S.No | Cloud Elasticity | Cloud Scalability
1 | It is used just to fulfil a sudden requirement in the workload for a short period. | It is used to fulfil a static boost in the workload.
2 | It is preferred to satisfy dynamic modifications, where the required resources can increase or decrease. | It is preferred to handle growth in the workload in an organisation.
3 | Cloud elasticity is generally used by small enterprises whose workload expands only for a specific period. | Cloud scalability is utilised by big enterprises.
4 | It is a short-term event that is used to deal with an unplanned or sudden growth in demand. | It is a long-term event that is used to deal with an expected growth in demand.

Types of scalability: Typically, there are three types of scalability:


1. Vertical scaling (scaling up) : This type of scalability is best-suited when you
experience increased workloads and add resources to the existing infrastructure to
improve server performance. If you’re looking for a short-term solution to your immediate
needs, vertical scaling may be your calling.

2. Horizontal scaling (scaling out): It enables companies to add new elements to their existing infrastructure to cope with ever-increasing workload demands. Horizontal scaling is designed for the long term and helps meet current and future resource needs, with plenty of room for expansion.

3. Diagonal scaling : Diagonal scaling involves horizontal and vertical scaling. It’s more
flexible and cost-effective as it helps add or remove resources as per existing workload
requirements. Adding and upgrading resources according to the varying system load and
demand provides better throughput and optimizes resources for even better performance.

Cloud Deployment Model: A deployment model works as your virtual computing environment, with the choice of deployment model depending on how much data you want to store and who has access to the infrastructure.

• Cloud application deployment design is an iterative process that involves:
• Deployment Design
• The variables in this step include the number of servers in each tier, computing, memory and storage capacities of servers, server interconnection, load balancing and replication strategies.
• Performance Evaluation
• To verify whether the application meets the performance
requirements with the deployment.
• Involves monitoring the workload on the application and measuring
various workload parameters such as response time and throughput.
• Utilization of servers (CPU, memory, disk, I/O, etc.) in each tier is also
monitored.
• Deployment Refinement
• Various alternatives can exist in this step such as vertical scaling (or
scaling up), horizontal scaling (or scaling out), alternative server
interconnections, alternative load balancing and replication
strategies, for instance.

Different Types Of Cloud Computing Deployment Models


Most cloud hubs have tens of thousands of servers and storage devices to enable
fast loading. It is often possible to choose a geographic area to put the data "closer" to
users. Thus, deployment models for cloud computing are categorized based on their
location. To know which model would best fit the requirements of your organization, let us
first learn about the various types.

Public Cloud: The name says it all. It is accessible to the public. Public deployment
models in the cloud are perfect for organizations with growing and fluctuating demands. It
also makes a great choice for companies with low-security concerns.
Thus, you pay a cloud service provider for networking services, compute virtualization and storage available on the public Internet. It is also a great delivery model for teams with development and testing needs. Its configuration and deployment are quick and easy, making it an ideal choice for test environments.

Benefits of Public Cloud
o Minimal Investment - As a pay-per-use service, there is no large upfront cost, and it is ideal for businesses that need quick access to resources
o No Hardware Setup - The cloud service providers fully fund the entire infrastructure
o No Infrastructure Management - Using the public cloud does not require an in-house infrastructure management team.
Private Cloud: Companies that look for cost efficiency and greater control over data &
resources will find the private cloud a more suitable choice.
It means that it will be integrated with your data center and managed by your IT
team. Alternatively, you can also choose to host it externally. The private cloud offers
bigger opportunities that help meet specific organizations' requirements when it comes to
customization. It's also a wise choice for mission-critical processes that may have
frequently changing requirements.

Benefits of Private Cloud


o Data Privacy - It is ideal for storing corporate data where only authorized personnel get access
o Security - Segmentation of resources within the same infrastructure can help with better access and higher levels of security.
o Supports Legacy Systems - This model supports legacy systems that cannot
access the public cloud.
Community Cloud : The community cloud operates in a way that is similar to the public
cloud. There's just one difference - it allows access to only a specific set of users who

share common objectives and use cases. This type of deployment model of cloud
computing is managed and hosted internally or by a third-party vendor. However, you can
also choose a combination of all three.

Benefits of Community Cloud


o Smaller Investment - A community cloud is much cheaper than the private & public
cloud and provides great performance
o Setup Benefits - The protocols and configuration of a community cloud must align
with industry standards, allowing customers to work much more efficiently.
Hybrid Cloud : As the name suggests, a hybrid cloud is a combination of two or more
cloud architectures. While each model in the hybrid cloud functions differently, it is all part
of the same architecture. Further, as part of this deployment of the cloud computing model,
the internal or external providers can offer resources.
Let's understand the hybrid model better. A company with critical data will prefer storing it on a private cloud, while less sensitive data can be stored on a public cloud. The hybrid cloud is also frequently used for 'cloud bursting': suppose an organization runs an application on-premises; due to heavy load, it can burst into the public cloud.

Benefits of Hybrid Cloud


o Cost-Effectiveness - The overall cost of a hybrid solution decreases since it majorly
uses the public cloud to store data.
o Security - Since data is properly segmented, the chances of data theft from
attackers are significantly reduced.

o Flexibility - With higher levels of flexibility, businesses can create custom solutions
that fit their exact requirements
Replication:
• Replication is used to create and maintain multiple copies of the data in the cloud.
• Cloud enables rapid implementation of replication solutions for disaster recovery
for organizations.
• With cloud-based data replication organizations can plan for disaster recovery
without making any capital expenditures on purchasing, configuring and managing
secondary site locations.
• Types:
• Array-based Replication
• Network-based Replication
• Host-based Replication

[Figure: Array-based replication]


Cloud Monitoring:
Cloud monitoring is the process of reviewing and managing the operational
workflow and processes within a cloud infrastructure or asset. It’s generally implemented
through automated monitoring software that gives central access and control over the
cloud infrastructure.
Admins can review the operational status and health of cloud servers and
components.
Concerns arise based on the type of cloud structure you have, and your strategy for
using it. If you’re using a public cloud service, you tend to have limited control and visibility
for managing and monitoring the infrastructure. A private cloud, which most large
organizations use, provides the internal IT department more control and flexibility, with
added consumption benefits.
Regardless of the type of cloud structure your company uses, monitoring is critical
to performance and security.
How Cloud Monitoring Works
The cloud has many moving parts, and it’s important to ensure everything works
together seamlessly to optimize performance. Cloud monitoring primarily includes
functions such as:
▶ Website monitoring: Tracking the processes, traffic, availability and resource
utilization of cloud-hosted websites
▶ Virtual machine monitoring: Monitoring the virtualization infrastructure and
individual virtual machines
▶ Database monitoring: Monitoring processes, queries, availability, and consumption
of cloud database resources
▶ Virtual network monitoring: Monitoring virtual network resources, devices,
connections, and performance
▶ Cloud storage monitoring: Monitoring storage resources and their processes
provisioned to virtual machines, services, databases, and applications
Software Defined Networking
• Software-Defined Networking (SDN) is a networking architecture that separates the
control plane from the data plane and centralizes the network controller.
• Conventional network architecture
• The control plane and data plane are coupled. Control plane is the part of
the network that carries the signalling and routing message traffic while the
data plane is the part of the network that carries the payload data traffic.
• SDN Architecture
• The control and data planes are decoupled and the network controller is
centralized.
• Centralized Network Controller
• With the control and data planes decoupled and the network controller centralized, network administrators can rapidly configure the network.
• Programmable Open APIs
• SDN architecture supports programmable open APIs for the interface between the SDN application and control layers (Northbound interface). These open APIs allow implementing various network services such as routing, quality of service (QoS), access control, etc.
• Standard Communication Interface (OpenFlow)
• SDN architecture uses a standard communication interface between the control and infrastructure layers (Southbound interface). OpenFlow, which is defined by the Open Networking Foundation (ONF), is the broadly accepted SDN protocol for the Southbound interface.

OpenFlow
• OpenFlow is the broadly accepted SDN protocol for the Southbound interface.
• With OpenFlow, the forwarding plane of the network devices can be directly
accessed and manipulated.
• OpenFlow uses the concept of flows to identify network traffic based on
pre-defined match rules.
• Flows can be programmed statically or dynamically by the SDN control software.
• OpenFlow protocol is implemented on both sides of the interface between the
controller and the network devices.
An OpenFlow switch comprises one or more flow tables and a group table, which perform packet lookups and forwarding, and an OpenFlow channel to an external controller.
Network Function Virtualization:
• Network Function Virtualization (NFV) is a technology that leverages virtualization
to consolidate the heterogeneous network devices onto industry standard high
volume servers, switches and storage.
• Relationship to SDN
• NFV is complementary to SDN as NFV can provide the infrastructure on
which SDN can run.
• NFV and SDN are mutually beneficial to each other but not dependent.
• Network functions can be virtualized without SDN, similarly, SDN can run
without NFV.
• NFV comprises network functions implemented in software that run on virtualized resources in the cloud.
• NFV enables a separation of the network functions, which are implemented in software, from the underlying hardware.
NFV Architecture
• Key elements of the NFV architecture are
• Virtualized Network Function (VNF): VNF is a software implementation of a
network function which is capable of running over the NFV Infrastructure
(NFVI).
• NFV Infrastructure(NFVI): NFVI includes compute, network and storage
resources that are virtualized.
• NFV Management and Orchestration: NFV Management and Orchestration
focuses on all virtualization-specific management tasks and covers the
orchestration and lifecycle management of physical and/or software
resources that support the infrastructure virtualization, and the lifecycle
management of VNFs.

MapReduce: MapReduce is a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
• Secondly, the reduce task takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.
• The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes.
• Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial.
• But, once we write an application in the MapReduce form, scaling the application to
run over hundreds, thousands, or even tens of thousands of machines in a cluster is
merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
• A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage − The map or mapper’s job is to process the input data. Generally
the input data is in the form of file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the mapper function line by
line. The mapper processes the data and creates several small chunks of
data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from
the mapper. After processing, it produces a new set of output, which will be
stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the nodes.

• Most of the computing takes place on nodes with data on local disks that reduces
the network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
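To make the map, shuffle and reduce stages concrete, here is a small self-contained Python sketch of the classic word-count example. It only imitates the programming model in memory; it does not use Hadoop, HDFS or any real cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    counts = dict(reduce_phase(k, v) for k, v in grouped.items())
    print(counts)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```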

Terminology
▶ PayLoad − Applications implement the Map and the Reduce functions, and form the
core of the job.
▶ Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value
pair.
▶ NameNode − Node that manages the Hadoop Distributed File System (HDFS).
▶ DataNode − Node where data is presented in advance before any processing takes
place.
▶ MasterNode − Node where JobTracker runs and which accepts job requests from
clients.
▶ SlaveNode − Node where Map and Reduce program runs.
▶ JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
▶ Task Tracker − Tracks the task and reports status to JobTracker.
▶ Job − An execution of a Mapper and Reducer across a dataset.
▶ Task − An execution of a Mapper or a Reducer on a slice of data.
▶ Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.
Service level agreements in Cloud computing
A Service Level Agreement (SLA) is the bond for performance negotiated between the cloud services provider and the client. Earlier, in cloud computing, all Service Level Agreements were negotiated individually between a client and the service provider; today, large cloud providers mostly offer standardized SLAs.
Service level agreements are also defined at different levels which are mentioned
below:
Customer-based SLA
Service-based SLA
Multilevel SLA
Few Service Level Agreements are enforceable as contracts; most are agreements or contracts which are more along the lines of an Operating Level Agreement (OLA) and may not have the force of law. It is advisable to have an attorney review the documents before making a major agreement with a cloud service provider. Service Level Agreements usually specify some parameters which are mentioned below:
Availability of the Service (uptime)
Latency or the response time
Service components reliability
Each party accountability
Warranties
Each individual component has its own Service Level Agreements. Below are two major
Service Level Agreements (SLA) described:
1. Windows Azure SLA – Windows Azure has different SLAs for compute and storage.
For compute, there is a guarantee that when a client deploys two or more role instances in separate fault and upgrade domains, the client's Internet-facing roles will have external connectivity a minimum of 99.95% of the time. Moreover, all of the client's role instances are monitored, and there is a guarantee of detecting, 99.9% of the time, when a role instance's process is not running, and of initiating corrective action.
2. SQL Azure SLA – SQL Azure clients will have connectivity between the database and the Internet gateway of SQL Azure. SQL Azure guarantees a "Monthly Availability" of 99.9% within a month. The Monthly Availability Proportion for a particular tenant database is the ratio of the time the database was available to customers to the total time in a month. Time is measured in intervals of minutes over a 30-day monthly cycle. Availability is always calculated for a complete month. A portion of time is marked as unavailable if the customer's attempts to connect to a database are denied by the SQL Azure gateway.
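The "Monthly Availability Proportion" above is simple arithmetic. The following Python sketch shows the ratio for an assumed 30-day month and the downtime budget that a 99.9% target allows; the figures are illustrative, not Microsoft's official calculation.

```python
def monthly_availability(unavailable_minutes, days_in_month=30):
    """Availability ratio = time the database was available / total time in the month."""
    total_minutes = days_in_month * 24 * 60          # 43,200 minutes in a 30-day month
    return (total_minutes - unavailable_minutes) / total_minutes

total_minutes = 30 * 24 * 60
allowed_downtime = total_minutes * (1 - 0.999)       # downtime budget for a 99.9% target
print(f"Downtime allowed at 99.9%: {allowed_downtime:.1f} minutes")          # ~43.2 minutes
print(f"Availability with 60 minutes down: {monthly_availability(60):.4%}")  # 99.8611%
```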
SLA Lifecycle

Steps in SLA Lifecycle


1. Discover service provider: This step involves identifying a service provider that can
meet the needs of the organization and has the capability to provide the required
service. This can be done through research, requesting proposals, or reaching out
to vendors.
2. Define SLA: In this step, the service level requirements are defined and agreed upon
between the service provider and the organization. This includes defining the

service level objectives, metrics, and targets that will be used to measure the
performance of the service provider.
3. Establish Agreement: After the service level requirements have been defined, an
agreement is established between the organization and the service provider
outlining the terms and conditions of the service. This agreement should include
the SLA, any penalties for non-compliance, and the process for monitoring and
reporting on the service level objectives.
4. Monitor SLA violation: This step involves regularly monitoring the service level
objectives to ensure that the service provider is meeting their commitments. If any
violations are identified, they should be reported and addressed in a timely manner.
5. Terminate SLA: If the service provider is unable to meet the service level objectives,
or if the organization is not satisfied with the service provided, the SLA can be
terminated. This can be done through mutual agreement or through the
enforcement of penalties for non-compliance.
6. Enforce penalties for SLA Violation: If the service provider is found to be in
violation of the SLA, penalties can be imposed as outlined in the agreement. These
penalties can include financial penalties, reduced service level objectives, or
termination of the agreement.
Advantages of SLA
1. Improved communication: A better framework for communication between the
service provider and the client is established through SLAs, which explicitly outline
the degree of service that a customer may anticipate. This can make sure that
everyone is talking about the same things when it comes to service expectations.
2. Increased accountability: SLAs give customers a way to hold service providers
accountable if their services fall short of the agreed-upon standard. They also hold
service providers responsible for delivering a specific level of service.
3. Better alignment with business goals: SLAs make sure that the service being given
is in line with the goals of the client by laying down the performance goals and
service level requirements that the service provider must satisfy.
4. Reduced downtime: SLAs can help to limit the effects of service disruptions by
creating explicit protocols for issue management and resolution.
5. Better cost management: By specifying the level of service that the customer can
anticipate and providing a way to track and evaluate performance, SLAs can help to
limit costs. Making sure the consumer is getting the best value for their money can
be made easier by doing this.

Disadvantages of SLA
1. Complexity: SLAs can be complex to create and maintain, and may require
significant resources to implement and enforce.
2. Rigidity: SLAs can be rigid and may not be flexible enough to accommodate
changing business needs or service requirements.

3. Limited service options: SLAs can limit the service options available to the
customer, as the service provider may only be able to offer the specific services
outlined in the agreement.
4. Misaligned incentives: SLAs may misalign incentives between the service provider
and the customer, as the provider may focus on meeting the agreed-upon service
levels rather than on providing the best service possible.
5. Limited liability: SLAs are often not legally binding contracts and often limit the liability of the service provider in case of service failure.
Identity and Access Management
• Identity and Access Management (IDAM) for cloud describes the authentication
and authorization of users to provide secure access to cloud resources.
• Organizations with multiple users can use IDAM services provided by the cloud
service provider for management of user identifiers and user permissions.
• IDAM services allow organizations to centrally manage users, access
permissions, security credentials and access keys.
• Organizations can enable role-based access control to cloud resources and
applications using the IDAM services.
• IDAM services allow creation of user groups where all the users in a group have
the same access permissions.
• Identity and Access Management is enabled by a number of technologies such
as OpenAuth, Role-based Access Control (RBAC), Digital Identities, Security
Tokens, Identity Providers, etc.
Billing
Cloud service providers offer a number of billing models described as follows:
• Elastic Pricing
• In elastic pricing or pay-as-you-use pricing model, the customers are
charged based on the usage of cloud resources.
• Fixed Pricing
• In fixed pricing models, customers are charged a fixed amount per month
for the cloud resources.
• Spot Pricing
• Spot pricing models offer variable pricing for cloud resources which is
driven by market demand.
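As a rough illustration of how these billing models differ, the following Python sketch compares a pay-as-you-use bill with a flat monthly fee; the hourly rate and flat fee are hypothetical numbers, not any provider's actual prices.

```python
def elastic_bill(hours_used, rate_per_hour):
    """Elastic (pay-as-you-use) pricing: charge only for the hours actually consumed."""
    return hours_used * rate_per_hour

def fixed_bill(monthly_fee):
    """Fixed pricing: the same charge every month, regardless of usage."""
    return monthly_fee

# Hypothetical figures: 200 instance-hours at $0.10/hour versus a $50 flat plan.
print("Elastic bill:", elastic_bill(200, 0.10))   # 20.0
print("Fixed bill:  ", fixed_bill(50))            # 50
```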

Cloud Services & Platforms


Cloud Reference Model:

• Infrastructure & Facilities Layer: Includes the physical infrastructure such as
datacenter facilities, electrical and mechanical equipment, etc.
• Hardware Layer: Includes physical compute, network and storage hardware.
• Virtualization Layer: Partitions the physical hardware resources into multiple virtual resources, enabling pooling of resources.
• Platform & Middleware Layer: Builds upon the IaaS layers below and provides standardized stacks of services such as database services, queuing services, application frameworks and run-time environments, messaging services, monitoring services, analytics services, etc.
• Service Management Layer: Provides APIs for requesting, managing and
monitoring cloud resources.
• Applications Layer: Includes SaaS applications such as Email, cloud storage
application, productivity applications, management portals, customer self-service
portals, etc.

Compute Services
• Compute services provide dynamically scalable compute capacity in the cloud.
• Compute resources can be provisioned on-demand in the form of virtual
machines. Virtual machines can be created from standard images provided by
the cloud service provider or custom images created by the users.
• Compute services can be accessed from the web consoles of these services that
provide graphical user interfaces for provisioning, managing and monitoring
these services.
• Cloud service providers also provide APIs for various programming languages
that allow developers to access and manage these services programmatically.
Compute Services – Amazon EC2
Amazon Elastic Compute Cloud (EC2) is a compute service provided by Amazon.

• Launching EC2 Instances: To launch a new instance, click on the launch instance button. This will open a wizard where you can select the Amazon Machine Image (AMI) with which you want to launch the instance. You can also create your own AMIs with custom applications, libraries and data. Instances can be launched with a variety of operating systems.
• Instance Sizes: When you launch an instance you specify the instance type
(micro, small, medium, large, extra-large, etc.), the number of instances to launch
based on the selected AMI and availability zones for the instances.
• Key-pairs: When launching a new instance, the user selects a key-pair from
existing keypairs or creates a new keypair for the instance. Keypairs are used to
securely connect to an instance after it launches.
• Security Groups: The security groups to be associated with the instance can be
selected from the instance launch wizard. Security groups are used to open or
block a specific network port for the launched instances.
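The same launch can also be scripted instead of using the web console. The following sketch uses the AWS boto3 SDK's run_instances call; the AMI ID, key pair name and security group ID are placeholders that would have to exist in your own account, and the region is an assumption.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one t2.micro instance from a chosen AMI, with a key pair for SSH access
# and a security group that opens the required network ports.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # placeholder AMI ID
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",                       # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder security group
)

instance_id = response["Instances"][0]["InstanceId"]
print("Launched instance:", instance_id)
```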

Compute Services – Google Compute Engine


Google Compute Engine is a compute service provided by Google.
• Launching Instances: To create a new instance, the user selects an instance
machine type, a zone in which the instance will be launched, a machine image for
the instance and provides an instance name, instance tags and meta-data.
• Disk Resources: Every instance is launched with a disk resource. Depending on the instance type, the disk resource can be scratch disk space or persistent disk space. The scratch disk space is deleted when the instance terminates, whereas persistent disks live beyond the life of an instance.
• Network Options: Network option allows you to control the traffic to and from the
instances. By default, traffic between instances in the same network, over any
port and any protocol and incoming SSH connections from anywhere are
enabled.

Compute Services – Windows Azure VMs
Windows Azure Virtual Machines is the compute service from Microsoft.
• Launching Instances:
o To create a new instance, you select the instance type and the machine
image.
o You can either provide a user name and password or upload a certificate
file for securely connecting to the instance.
o Any changes made to the VM are persistently stored and new VMs can be
created from the previously stored machine images.

Storage Services
• Cloud storage services allow storage and retrieval of any amount of data, at any
time from anywhere on the web.
• Most cloud storage services organize data into buckets or containers.
• Scalability

• Cloud storage services provide high capacity and scalability. Objects up to several terabytes in size can be uploaded, and multiple buckets/containers can be created in cloud storage.
• Replication
• When an object is uploaded it is replicated at multiple facilities and/or on
multiple devices within each facility.
• Access Policies
• Cloud storage services provide several security features such as Access
Control Lists (ACLs), bucket/container level policies, etc. ACLs can be
used to selectively grant access permissions on individual objects.
Bucket/container level policies can also be defined to allow or deny
permissions across some or all of the objects within a single
bucket/container.
• Encryption
• Cloud storage services provide Server Side Encryption (SSE) options to
encrypt all data stored in the cloud storage.
• Consistency
• Strong data consistency is provided for all upload and delete operations.
Therefore, any object that is uploaded can be immediately downloaded
after the upload is complete.
Storage Services – Amazon S3
• Amazon Simple Storage Service(S3) is an online cloud-based data storage
infrastructure for storing and retrieving any amount of data.
• S3 provides highly reliable, scalable, fast, fully redundant and affordable storage
infrastructure.
• Buckets
• Data stored on S3 is organized in the form of buckets. You must create a
bucket before you can store data on S3.
• Uploading Files to Buckets
• S3 console provides simple wizards for creating a new bucket and
uploading files.
• You can upload any kind of file to S3.
• While uploading a file, you can specify the redundancy and encryption
options and access permissions.
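A minimal boto3 sketch of the bucket-and-object workflow described above, assuming the default us-east-1 region; the bucket name, file names and encryption choice are placeholders for illustration.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-reports-bucket-12345"   # bucket names must be globally unique

# Create the bucket, then upload a local file with server-side encryption
# and private access permissions.
s3.create_bucket(Bucket=bucket)
s3.upload_file(
    "report.pdf", bucket, "reports/report.pdf",
    ExtraArgs={"ServerSideEncryption": "AES256", "ACL": "private"},
)

# Generate a time-limited (1 hour) download URL for the uploaded object.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": "reports/report.pdf"},
    ExpiresIn=3600,
)
print(url)
```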

Storage Services – Google Cloud Storage


• GCS is the Cloud storage service from Google
• Buckets
• Objects in GCS are organized into buckets.
• Access Control Lists
• ACLs are used to control access to objects and buckets.
• ACLs can be configured to share objects and buckets with the entire
world, a Google group, a Google-hosted domain, or specific Google
account holders.

Storage Services – Windows Azure Storage


• Windows Azure Storage is the cloud storage service from Microsoft.
• Windows Azure Storage provides various storage services such as blob storage
service, table service and queue service.
• Blob storage service
• The blob storage service allows storing unstructured binary data or binary
large objects (blobs).
• Blobs are organized into containers.
• Block blobs - can be subdivided into some number of blocks. If a failure
occurs while transferring a block blob, retransmission can resume with the
most recent block rather than sending the entire blob again.
• Page blobs - are divided into a number of pages and are designed for random access. Applications can read and write individual pages at random in a page blob.


Database Services
Cloud database services allow you to set up and operate relational or non-relational databases in the cloud.
• Relational Databases: Popular relational databases provided by various cloud service providers include MySQL, Oracle, SQL Server, etc.
• Non-relational Databases: The non-relational (NoSQL) databases provided by cloud service providers are mostly proprietary solutions.
• Scalability: Cloud database services allow provisioning as much compute and storage capacity as required to meet the application workload levels. Provisioned capacity can be scaled up or down. For read-heavy workloads, read-replicas can be created.
• Reliability: Cloud database services are reliable and provide automated backup and snapshot options.
• Performance: Cloud database services provide guaranteed performance with options such as guaranteed input/output operations per second (IOPS) which can be provisioned upfront.
• Security: Cloud database services provide several security features to restrict access to the database instances and stored data, such as network firewalls and authentication mechanisms.
Database Services – Amazon RDS

• Amazon Relational Database Service (RDS) is a web service that makes it easy to set up, operate and scale a relational database in the cloud.
• Launching DB Instances
• The console provides an instance launch wizard that allows you to select
the type of database to create (MySQL, Oracle or SQL Server) database
instance size, allocated storage, DB instance identifier, DB username and
password. The status of the launched DB instances can be viewed from
the console.
• Connecting to a DB Instance
• Once the instance is available, you can note the instance end point from
the instance properties tab. This end point can then be used for securely
connecting to the instance.
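The same steps can be performed programmatically. The following boto3 sketch launches a small MySQL instance and then reads back its endpoint for connections; the identifier, credentials, instance class and storage size are illustrative values only.

```python
import boto3

rds = boto3.client("rds")

# Create a small MySQL database instance (placeholder identifier and credentials).
rds.create_db_instance(
    DBInstanceIdentifier="example-mysql-db",
    DBInstanceClass="db.t3.micro",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
    AllocatedStorage=20,   # GB
)

# Wait until the instance is available, then read its endpoint address.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="example-mysql-db")
info = rds.describe_db_instances(DBInstanceIdentifier="example-mysql-db")
endpoint = info["DBInstances"][0]["Endpoint"]["Address"]
print("Connect to:", endpoint)
```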

Database Services – Amazon DynamoDB


• Amazon DynamoDB is the non-relational (No-SQL) database service from
Amazon.
• Data Model
• The DynamoDB data model includes tables, items and attributes.
• A table is a collection of items and each item is a collection of attributes.
• To store data in DynamoDB you have to create one or more tables and specify how much throughput capacity you want to provision and reserve for reads and writes.
• Fully Managed Service
• DynamoDB is a fully managed service that automatically spreads the data
and traffic for the stored tables over a number of servers to meet the
throughput requirements specified by the users.
• Replication
• All stored data is automatically replicated across multiple availability
zones to provide data durability.
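A short boto3 sketch of the table/item/attribute model and provisioned throughput described above; the table name, key schema and capacity units are made up for illustration.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Create a table with a hash (partition) key and a range (sort) key,
# reserving read and write throughput capacity up front.
dynamodb.create_table(
    TableName="Music",
    AttributeDefinitions=[
        {"AttributeName": "Artist", "AttributeType": "S"},
        {"AttributeName": "SongTitle", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "Artist", "KeyType": "HASH"},
        {"AttributeName": "SongTitle", "KeyType": "RANGE"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

# Once the table is active, an item (a collection of attributes) can be written.
dynamodb.get_waiter("table_exists").wait(TableName="Music")
dynamodb.put_item(
    TableName="Music",
    Item={"Artist": {"S": "Example Artist"}, "SongTitle": {"S": "Example Song"}},
)
```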

Database Services – Google Cloud SQL
• Google Cloud SQL is the relational database service from Google.
• The Google Cloud SQL service allows you to host MySQL databases in Google's cloud.
• Launching DB Instances
• You can create new database instances from the console and manage
existing instances. To create a new instance you select a region, database
tier, billing plan and replication mode.
• Backups
• You can schedule daily backups for your Google Cloud SQL instances, and
also restore backed-up databases.
• Replication
Cloud SQL provides both synchronous and asynchronous geographic replication and the ability to import/export databases.

Database Services – Google Cloud Datastore


• Google Cloud Datastore is a fully managed non-relational database from Google.
• Cloud Datastore offers ACID transactions and high availability of reads and
writes.
• Data Model
• The Cloud Datastore data model consists of entities. Each entity has one
or more properties (key-value pairs) which can be of one of several
supported data types, such as strings and integers. Each entity has a kind
and a key. The entity kind is used for categorizing the entity for the
purpose of queries and the entity key uniquely identifies the entity.

Database Services – Windows Azure Table Service


• Windows Azure Table Service is a non-relational (No-SQL) database service from
Microsoft.
• Data Model
• The Azure Table Service data model consists of tables having multiple
entities.
• Tables are divided into some number of partitions, each of which can be
stored on a separate machine.
• Each partition in a table holds a specified number of entities, each
containing as many as 255 properties.
• Each property can be one of the several supported data types such as
integers and strings.
• No Fixed Schema
• Tables do not have a fixed schema and different entities in a table can
have different properties.
Application Runtimes & Frameworks
• Cloud-based application runtimes and frameworks allow developers to develop
and host applications in the cloud.
• Support for various programming languages
• Application runtimes provide support for programming languages (e.g.,
Java, Python, or Ruby).
• Resource Allocation
• Application runtimes automatically allocate resources for applications
and handle the application scaling, without the need to run and maintain
servers.
Google App Engine
• Google App Engine is the platform-as-a-service (PaaS) from Google, which
includes both an application runtime and web frameworks.
• Runtimes
• App Engine provides runtime environments for the Java, Python, PHP and Go programming languages.
• Sandbox

Page | 34
• Applications run in a secure sandbox environment isolated from other
applications.
• The sandbox environment provides a limited access to the underlying
operating system.
• Web Frameworks
• App Engine provides a simple Python web application framework called
webapp2. App Engine also supports any framework written in pure
Python that speaks WSGI, including Django, CherryPy, Pylons, web.py, and
web2py.
• Datastore
• App Engine provides a no-SQL data storage service.
• Authentication
• App Engine applications can be integrated with Google Accounts for user
authentication.
• URL Fetch service
• URL Fetch service allows applications to access resources on the Internet,
such as web services or other data.
• Other services
• Email service
• Image Manipulation service
• Memcache
• Task Queues
• Scheduled Tasks service
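As a small illustration of the webapp2 framework mentioned above, here is a minimal request handler of the kind App Engine's Python runtime can serve; the route and message are illustrative only.

```python
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Respond to HTTP GET requests on the mapped route with plain text.
        self.response.headers["Content-Type"] = "text/plain"
        self.response.write("Hello from Google App Engine!")

# The WSGI application object that App Engine serves, with one URL route.
app = webapp2.WSGIApplication([("/", MainPage)], debug=True)
```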

Windows Azure Web Sites


• Windows Azure Web Sites is a Platform-as-a-Service (PaaS) from Microsoft.
• Azure Web Sites allows you to host web applications in the Azure cloud.
• Shared & Standard Options.
• In the shared option, Azure Web Sites run on a set of virtual machines that
may contain multiple web sites created by multiple users.
• In the standard option, Azure Web Sites run on virtual machines (VMs)
that belong to an individual user.

• Azure Web Sites supports applications created in ASP.NET, PHP, Node.js and
Python programming languages.
• Multiple copies of an application can be run in different VMs, with Web Sites
automatically load balancing requests across them.
Content Delivery Services
• Cloud-based content delivery services include Content Delivery Networks (CDNs).
• CDN is a distributed system of servers located across multiple geographic
locations to serve content to end- users with high availability and high
performance.
• CDNs are useful for serving static content such as text, images, scripts, etc., and
streaming media.
• CDNs have a number of edge locations deployed in multiple locations, often over
multiple backbones.
• Requests for static or streaming media content that is served by a CDN are directed to the nearest edge location.
• Amazon CloudFront
• Amazon CloudFront is a content delivery service from Amazon.
CloudFront can be used to deliver dynamic, static and streaming content
using a global network of edge locations.
• Windows Azure Content Delivery Network
• Windows Azure Content Delivery Network (CDN) is the content delivery
service from Microsoft.
Analytics Services
• Cloud-based analytics services allow analyzing massive data sets stored in the
cloud either in cloud storages or in cloud databases using programming models
such as MapReduce.
• Amazon Elastic MapReduce
• Amazon Elastic MapReduce is the MapReduce service from Amazon, based on the Hadoop framework running on Amazon EC2 and S3 (a boto-based sketch follows this list).
• EMR supports various job types such as Custom JAR, Hive program, Streaming job, Pig programs, and HBase.
• Google MapReduce Service
• Google MapReduce Service is a part of the App Engine platform and can
be accessed using the Google MapReduce API.
• Google BigQuery
• Google BigQuery is a service for querying massive datasets. BigQuery
allows querying datasets using SQL-like queries.
• Windows Azure HDInsight
• Windows Azure HDInsight is an analytics service from Microsoft.
HDInsight deploys and provisions Hadoop clusters in the Azure cloud and
makes Hadoop available as a service.
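As a hedged sketch of how a streaming job could be submitted to Amazon EMR using the boto library (the same library used in the Python examples of Unit III), consider the code below; the S3 paths and job names are purely illustrative.

# Sketch: submitting a Hadoop Streaming job to Amazon EMR with boto (illustrative paths)
import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region('us-east-1',
    aws_access_key_id='<enter>',
    aws_secret_access_key='<enter>')

# Streaming step that runs a Python mapper and reducer stored in S3
step = StreamingStep(name='Wordcount step',
    mapper='s3n://mybucket/code/mapper.py',
    reducer='s3n://mybucket/code/reducer.py',
    input='s3n://mybucket/input/',
    output='s3n://mybucket/output/')

jobid = conn.run_jobflow(name='My job flow',
    log_uri='s3n://mybucket/logs/',
    steps=[step])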
Deployment & Management Services
• Cloud-based deployment & management services allow you to easily deploy and
manage applications in the cloud. These services automatically handle
deployment tasks such as capacity provisioning, load balancing, auto-scaling,
and application health monitoring.
• Amazon Elastic Beanstalk
• Amazon provides a deployment service called Elastic Beanstalk that
allows you to quickly deploy and manage applications in the AWS cloud.
• Elastic Beanstalk supports Java, PHP, .NET, Node.js, Python, and Ruby
applications.
• With Elastic Beanstalk you just need to upload the application and specify
configuration settings in a simple wizard and the service automatically
handles instance provisioning, server configuration, load balancing and
monitoring.
• Amazon CloudFormation
• Amazon CloudFormation is a deployment management service from
Amazon.
• With CloudFormation you can create deployments from a collection of AWS
resources such as Amazon Elastic Compute Cloud, Amazon Elastic Block
Store, Amazon Simple Notification Service, Elastic Load Balancing and
Auto Scaling.
• A collection of AWS resources that you want to manage together are
organized into a stack.

Identity & Access Management Services


• Identity & Access Management (IDAM) services allow managing the
authentication and authorization of users to provide secure access to cloud
resources.
• Using IDAM services you can manage user identifiers, user permissions, security
credentials and access keys.
• Amazon Identity & Access Management
• AWS Identity and Access Management (IAM) allows you to manage users
and user permissions for an AWS account.
• With IAM you can manage users, security credentials such as access keys,
and permissions that control which AWS resources users can access.
• Using IAM you can control what data users can access and what
resources users can create.
• IAM also allows you to control the creation, rotation, and revocation of users' security credentials (a boto-based sketch follows at the end of this section).
• Windows Azure Active Directory

• Windows Azure Active Directory is an Identity & Access Management
Service from Microsoft.
• Azure Active Directory provides a cloud-based identity provider that easily
integrates with your on-premises active directory deployments and also
provides support for third party identity providers.
With Azure Active Directory you can control access to your applications in Windows
Azure.
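The sketch below illustrates how a user and an access key might be created with AWS IAM using the boto library; the user name is illustrative and the calls assume boto's IAM support.

# Sketch: creating an IAM user and access key with boto (illustrative user name)
import boto

iam = boto.connect_iam(aws_access_key_id='<enter>',
    aws_secret_access_key='<enter>')

# Create a new user and generate security credentials (an access key) for it
iam.create_user('demo-user')
key_response = iam.create_access_key('demo-user')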
Open Source Private Cloud Software – CloudStack
• Apache CloudStack is an open source cloud software that can be used for
creating private cloud offerings.
• CloudStack manages the network, storage, and compute nodes that make up a
cloud infrastructure.
• A CloudStack installation consists of a Management Server and the cloud
infrastructure that it manages.
• Zones : The Management Server manages one or more zones where each zone is
typically a single datacenter.
• Pods: Each zone has one or more pods. A pod is a rack of hardware comprising a switch and one or more clusters.
• Cluster: A cluster consists of one or more hosts and a primary storage. A host is
a compute node that runs guest virtual machines.
• Primary Storage: The primary storage of a cluster stores the disk volumes for all
the virtual machines running on the hosts in that cluster.
• Secondary Storage: Each zone has a secondary storage that stores templates,
ISO images, and disk volume snapshots.
Open Source Private Cloud Software – Eucalyptus
• Eucalyptus is an open source private cloud software for building private and
hybrid clouds that are compatible with Amazon Web Services (AWS) APIs.
• Node Controller
• NC hosts the virtual machine instances and manages the virtual network
endpoints.
• The cluster-level (availability-zone) consists of three components
• Cluster Controller - which manages the virtual machines and is the
front-end for a cluster.
• Storage Controller – which manages the Eucalyptus block volumes and
snapshots to the instances within its specific cluster. SC is equivalent to
AWS Elastic Block Store (EBS).
• VMWare Broker - which is an optional component that provides an
AWS-compatible interface for VMware environments.
• At the cloud-level there are two components:
• Cloud Controller - which provides an administrative interface for cloud
management and performs high-level resource scheduling, system
accounting, authentication and quota management.
• Walrus - which is equivalent to Amazon S3 and serves as a persistent
storage to all of the virtual machines in the Eucalyptus cloud. Walrus can
be used as a simple Storage-as-a-Service solution.
UNIT – II
Apache Hadoop
Hadoop is an open source framework from Apache that is used to store, process, and analyze data that is very huge in volume.
Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing.
It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks
and stored in nodes over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
3. Map Reduce: This is a framework which helps Java programs to do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by
other Hadoop modules.

Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and
the HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master
node includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave
node includes DataNode and TaskTracker.

Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop.
It has a master/slave architecture. This architecture consists of a single NameNode that performs the role of master, and multiple DataNodes that perform the role of slaves.
Both NameNode and DataNode are capable enough to run on commodity machines.
The Java language is used to develop HDFS. So any machine that supports Java language
can easily run the NameNode and DataNode software.
NameNode
o It is a single master server exist in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing an operation like the opening,
renaming and closing the files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, NameNode provides metadata to Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
The MapReduce comes into existence when the client application submits the
MapReduce job to Job Tracker. In response, the Job Tracker sends the request to the
appropriate Task Trackers. Sometimes, the TaskTracker fails or time out. In such a case,
that part of the job is rescheduled.
Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective as compared to traditional relational database management systems.
o Resilient to failure: HDFS can replicate data over the network, so if one node is down or some other network failure happens, then Hadoop takes the other copy of the data and uses it. Normally, data is replicated thrice, but the replication factor is configurable.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was
the Google File System paper, published by Google.

Let's focus on the history of Hadoop in the following steps: -


o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache
Nutch. It is an open source web crawler software project.

o While working on Apache Nutch, they were dealing with big data. Storing that data would have required a very high cost, which became a limitation of that project. This problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies
the data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also includes Map reduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo runs two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900
node cluster within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.

Year-wise events:
2003: Google released the paper on the Google File System (GFS).
2004: Google released a white paper on MapReduce.
2006: Hadoop introduced; Hadoop 0.1.0 released; Yahoo deploys 300 machines and within this year reaches 600 machines.
2007: Yahoo runs 2 clusters of 1000 machines; Hadoop includes HBase.
2008: YARN JIRA opened; Hadoop becomes the fastest system to sort 1 terabyte of data on a 900-node cluster within 209 seconds; Yahoo clusters loaded with 10 terabytes per day; Cloudera was founded as a Hadoop distributor.
2009: Yahoo runs 17 clusters of 24,000 machines; Hadoop becomes capable enough to sort a petabyte; MapReduce and HDFS become separate subprojects.
2010: Hadoop added support for Kerberos; Hadoop operates 4,000 nodes with 40 petabytes; Apache Hive and Pig released.
2011: Apache ZooKeeper released; Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.
2012: Apache Hadoop 1.0 version released.
2013: Apache Hadoop 2.2 version released.
2014: Apache Hadoop 2.6 version released.
2015: Apache Hadoop 2.7 version released.
2017: Apache Hadoop 3.0 version released.
2018: Apache Hadoop 3.1 version released.
MapReduce
Hadoop MapReduce is the data processing layer. It processes the huge amount of
structured and unstructured data stored in HDFS. MapReduce handles data in parallel by
splitting the job into the set of independent tasks. So, parallel processing increases speed
and reliability.
Hadoop MapReduce data processing occurs in 2 phases- Map and Reduce phase.
▶ Map phase: It is the initial phase of data processing. In this phase, we specify all the complex logic, business rules, and costly code.
▶ Reduce phase: The second phase of processing is the Reduce phase. In this phase, we specify light-weight processing such as aggregation or summation.
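For example, a word count job can be expressed as Hadoop Streaming scripts in Python: the mapper emits a (word, 1) pair for every word, and the reducer sums the counts for each word. The file names are illustrative, and the reducer relies on the Hadoop Streaming convention that its input arrives sorted by key.

# mapper.py: Map phase - emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print '%s\t%s' % (word, 1)

# reducer.py: Reduce phase - sum the counts for each word (input sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print '%s\t%s' % (current_word, current_count)
        current_word, current_count = word, int(count)
if current_word is not None:
    print '%s\t%s' % (current_word, current_count)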
Steps of MapReduce Job Execution flow
MapReduce processes the data in various phases with the help of different
components. Let us discuss the steps of job execution in Hadoop.
Input Files
The data for a MapReduce job is stored in input files, which reside in HDFS. The input file format is arbitrary; line-based log files and binary formats can also be used.
InputFormat
After that InputFormat defines how to divide and read these input files. It selects
the files or other objects for input. InputFormat creates InputSplit.
InputSplits
It represents the data that will be processed by an individual Mapper. For each split,
one map task is created. Therefore the number of map tasks is equal to the number of
InputSplits. The framework divides each split into records, which the mapper processes.
RecordReader
It communicates with the InputSplit and transforms the data into key-value pairs suitable for reading by the Mapper. By default, the RecordReader uses TextInputFormat to transform the data into key-value pairs. It interacts with the InputSplit until the completion of file reading, assigning a byte offset to each line present in the file. These key-value pairs are then sent to the mapper for further processing.
Mapper
It processes input records produced by the RecordReader and generates
intermediate key-value pairs. The intermediate output is entirely different from the input
pair. The output of the mapper is a full set of key-value pairs. The Hadoop framework does not store the output of the mapper on HDFS, as the data is temporary and writing it to HDFS would create unnecessary multiple copies. The Mapper then passes its output to the combiner for further processing.
Combiner
Combiner is Mini-reducer that performs local aggregation on the mapper’s output. It
minimizes the data transfer between mapper and reducer. So, when the combiner
functionality completes, the framework passes the output to the partitioner for further
processing.
Partitioner
Partitioner comes into existence if we are working with more than one reducer. It
grabs the output of the combiner and performs partitioning.
Partitioning of output occurs based on the key in MapReduce. By hash function, the
key (or a subset of the key) derives the partition.
Partitioning of each combiner's output occurs based on the key, and records having the same key go into the same partition. After that, each partition is sent to a reducer.
Partitioning in MapReduce execution permits an even distribution of the map output over the reducers.
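As a sketch of this idea (Hadoop's default HashPartitioner follows the same principle), the partition for a key can be computed as shown below; the function name is illustrative.

# Sketch: hash-based partitioning of map output keys across reducers
def get_partition(key, num_reducers):
    # Records with the same key always land in the same partition/reducer
    return hash(key) % num_reducers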
Shuffling and Sorting
After partitioning, the output is shuffled to the reducer nodes. Shuffling is the physical transfer of the data, which is done over the network. Once all the mappers complete and their output has been shuffled to the reducer nodes, the framework merges and sorts this intermediate output, which is then provided as input to the Reduce phase.
Reducer
The reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs a reducer function on each of them to generate the output. The output of the reducer is the final output, which the framework stores on HDFS.
RecordWriter
It writes these output key-value pairs from the Reducer phase to the output files.
OutputFormat
OutputFormat defines the way the RecordWriter writes the output key-value pairs to the output files. The OutputFormat instances provided by Hadoop write files to HDFS; thus, OutputFormat instances write the final output of the reducer on HDFS.
MapReduce Job Execution Workflow:
• MapReduce job execution starts when the client applications submit jobs to the
Job tracker.
• The JobTracker returns a JobID to the client application. The JobTracker talks to
the NameNode to determine the location of the data.
• The JobTracker locates TaskTracker nodes with available slots at/or near the
data.
• The TaskTrackers send out heartbeat messages to the JobTracker, usually every
few minutes, to reassure the JobTracker that they are still alive. These messages
also inform the JobTracker of the number of available slots, so the JobTracker
can stay up to date with where in the cluster, new work can be delegated.
• The JobTracker submits the work to the TaskTracker nodes when they poll for
tasks. To choose a task for a TaskTracker, the JobTracker uses various
scheduling algorithms (default is FIFO).
• The TaskTracker nodes are monitored using the heartbeat signals that are sent
by the TaskTrackers to JobTracker.
• The TaskTracker spawns a separate JVM process for each task so that any task
failure does not bring down the TaskTracker.
• The TaskTracker monitors these spawned processes while capturing the output
and exit codes. When the process finishes, successfully or not, the TaskTracker
notifies the JobTracker. When the job is completed, the JobTracker updates its
status.

MapReduce 2.0 – YARN
• In Hadoop 2.0 the original processing engine of Hadoop (MapReduce) has been
separated from the resource management (which is now part of YARN).
• This makes YARN effectively an operating system for Hadoop that supports
different processing engines on a Hadoop cluster such as MapReduce for batch
processing, Apache Tez for interactive queries, Apache Storm for stream
processing, etc.
• The YARN architecture divides the two major functions of the JobTracker - resource management and job life-cycle management - into separate components:
• ResourceManager
• ApplicationMaster.
YARN Components
• Resource Manager (RM): RM manages the global assignment of compute
resources to applications. RM consists of two main services:
• Scheduler: The Scheduler is a pluggable service that manages and enforces the resource scheduling policy in the cluster.
• ApplicationsManager (AsM): The AsM manages the running Application Masters in the cluster. The AsM is responsible for starting application masters and for monitoring and restarting them on different nodes in case of failures.
• Application Master (AM): A per-application AM manages the application’s life
cycle. AM is responsible for negotiating resources from the RM and working with
the NMs to execute and monitor the tasks.
• Node Manager (NM): A per-machine NM manages the user processes on
that machine.
• Containers: Container is a bundle of resources allocated by RM (memory, CPU,
network, etc.). A container is a conceptual entity that grants an application the
privilege to use a certain amount of resources on a given machine to run a
component task.

Hadoop Scheduler
Prior to Hadoop 2, Hadoop MapReduce is a software framework for writing
applications that process huge amounts of data (terabytes to petabytes) in-parallel on
the large Hadoop cluster. This framework is responsible for scheduling tasks,
monitoring them, and re-executes the failed task.
In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic idea behind the YARN introduction is to split the functionalities of resource management and job scheduling/monitoring into separate daemons: the ResourceManager, the ApplicationMaster, and the NodeManager.
The ResourceManager is the master daemon that arbitrates resources among all the applications in the system. The NodeManager is the slave daemon responsible for containers, monitoring their resource usage, and reporting the same to the ResourceManager or Scheduler. The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManager in order to execute and monitor the tasks.
The ResourceManager has two main components that are Schedulers and
ApplicationsManager.
The Scheduler in the YARN ResourceManager is a pure scheduler, which is responsible only for allocating resources to the various running applications.

It is not responsible for monitoring or tracking the status of an application. Also, the scheduler does not guarantee restarting of tasks that fail due to either hardware failure or application failure.
It has some pluggable policies that are responsible for partitioning the cluster
resources among the various queues, applications, etc.
The FIFO Scheduler, CapacityScheduler, and FairScheduler are such pluggable
policies that are responsible for allocating resources to the applications.
Let us now study each of these Schedulers in detail.
TYPES OF HADOOP SCHEDULER

1. FIFO Scheduler
First In First Out is the default scheduling policy used in Hadoop. FIFO
Scheduler gives more preference to applications submitted earlier than to those submitted later. It places the applications in a queue and executes them in the order of their
submission (first in, first out).
Here, irrespective of size and priority, the requests of the first application in the queue are served first. Only once the first application's request is satisfied is the next application in the queue served.
Advantage:
• It is simple to understand and doesn’t need any configuration.
• Jobs are executed in the order of their submission.
Disadvantage:
• It is not suitable for shared clusters. If the large application comes before
the shorter one, then the large application will use all the resources in the
cluster, and the shorter application has to wait for its turn. This leads to
starvation.

• It does not take into account the balance of resource allocation between
the long applications and short applications.

2. Capacity Scheduler
The CapacityScheduler allows multiple tenants to securely share a large
Hadoop cluster. It is designed to run Hadoop applications in a shared, multi-tenant
cluster while maximizing the throughput and the utilization of the cluster.
It supports hierarchical queues to reflect the structure of the organizations or groups that utilize the cluster resources. A queue hierarchy contains three types of queues: root, parent, and leaf.
The root queue represents the cluster itself, parent queue represents
organization/group or sub-organization/sub-group, and the leaf accepts application
submission.
The Capacity Scheduler allows the sharing of the large cluster while giving
capacity guarantees to each organization by allocating a fraction of cluster resources
to each queue.
Also, when free resources are available in a queue that has completed its tasks, and other queues are running below capacity, these resources are assigned to the applications in the queues running below capacity. This provides elasticity for the organization in a cost-effective manner.
Apart from it, the CapacityScheduler provides a comprehensive set of limits to
ensure that a single application/user/queue cannot use a disproportionate amount of
resources in the cluster.
To ensure fairness and stability, it also provides limits on initialized and
pending apps from a single user and queue.
Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop
cluster.
• Provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organization
utilizing cluster.
Disadvantage:
• It is the most complex amongst the schedulers.
3. Fair Scheduler

FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters. With FairScheduler, there is no need for reserving a set amount of capacity because it dynamically balances resources between all running applications.
It assigns resources to applications in such a way that all applications get, on
average, an equal amount of resources over time.
The FairScheduler, by default, takes scheduling fairness decisions only on the
basis of memory. We can configure it to schedule with both memory and CPU.
When a single application is running, that app uses the entire cluster resources. When other applications are submitted, the freed-up resources are assigned to the new apps so that every app eventually gets roughly the same amount of resources. FairScheduler enables short apps to finish in a reasonable time without starving the long-lived apps.
Similar to CapacityScheduler, the FairScheduler supports hierarchical queue to
reflect the structure of the long shared cluster.
Apart from fair scheduling, the FairScheduler allows for assigning minimum
shares to queues for ensuring that certain users, production, or group applications
always get sufficient resources. When an app is present in the queue, the app gets its minimum share, but when the queue doesn't need its full guaranteed share, the excess share is split between the other running applications.
Advantages:
• It provides a reasonable way to share the Hadoop Cluster between the number
of users.
• Also, the FairScheduler can work with app priorities where the priorities are
used as weights in determining the fraction of the total resources that each
application should get.
Disadvantage: It requires configuration.

Cloud Application Design

Design Considerations for Cloud Applications


• Scalability
• Scalability is an important factor that drives the application designers to
move to cloud computing environments. Building applications that can
serve millions of users without taking a hit on their performance has
always been challenging. With the growth of cloud computing, application designers can provision adequate resources to meet their workload levels.
• Reliability & Availability
• Reliability of a system is defined as the probability that a system will
perform the intended functions under stated conditions for a specified
amount of time. Availability is the probability that a system will perform
a specified function under given conditions at a prescribed time.
• Security
• Security is an important design consideration for cloud applications
given the outsourced nature of cloud computing environments.
• Maintenance & Upgradation
• To achieve a rapid time-to-market, businesses typically launch their
applications with a core set of features ready and then incrementally
add new features as and when they are complete. In such scenarios, it is
important to design applications with low maintenance and upgradation
costs.
• Performance
• Applications should be designed while keeping the performance
requirements in mind.
Reference Architectures – e-Commerce, Business-to-Business, Banking and
Financial apps
• Load Balancing Tier
• Load balancing tier consists of one or more load balancers.
• Application Tier
• For this tier, it is recommended to configure auto scaling.
• Auto scaling can be triggered when the recorded values for any of the
specified metrics such as CPU usage, memory usage, etc. goes above
defined thresholds.
• Database Tier
• The database tier includes a master database instance and multiple
slave instances.
• The master node serves all the write requests and the read requests are
served from the slave nodes.
• This improves the throughput for the database tier since most
applications have a higher number of read requests than write requests.

Reference Architectures –Content delivery apps
• Figure shows a typical deployment architecture for content delivery
applications such as online photo albums, video webcasting, etc.
• Both relational and non-relational data stores are shown in this deployment.
• A content delivery network (CDN) which consists of a global network of edge
locations is used for media delivery.
• CDN is used to speed up the delivery of static content such as images and
videos.

Reference Architectures –Analytics apps


• Figure shows a typical deployment architecture for compute intensive
applications such as Data Analytics, Media Transcoding, etc.
• Comprises of web, application, storage, computing/analytics and database
tiers.
• The analytics tier consists of cloud-based distributed batch processing
frameworks such as Hadoop which are suitable for analyzing big data.
• Data analysis jobs (such as MapReduce) jobs are submitted to the analytics
tier from the application servers.

• The jobs are queued for execution and upon completion the analyzed data is
presented from the application servers.

Service Oriented Architecture:


• Service Oriented Architecture (SOA) is a well-established architectural approach for designing and developing applications in the form of services that can be shared and reused.
• SOA is a collection of discrete software modules or services that form a part of
an application and collectively provide the functionality of an application.
• SOA services are developed as loosely coupled modules with no hard-wired calls embedded in the services.
• The services communicate with each other by passing messages.
• Services are described using the Web Services Description Language (WSDL).
• WSDL is an XML-based web services description language that is used to
create service descriptions containing information on the functions performed
by a service and the inputs and outputs of the service.

SOA Layers:

1. Business Systems: This layer consists of custom built applications and legacy
systems such as Enterprise Resource Planning (ERP), Customer Relationship
Management (CRM), Supply Chain Management (SCM), etc.
2. Service Components: The service components allow the layers above to
interact with the business systems. The service components are responsible
for realizing the functionality of the services exposed.
3. Composite Services: These are coarse-grained services which are composed of two or more service components. Composite services can be used to create enterprise-scale components or business-unit specific components.
4. Orchestrated Business Processes: Composite services can be orchestrated to create higher level business processes. In this layer the compositions and orchestrations of the composite services are defined to create business processes.
5. Presentation Services: This is the topmost layer that includes user interfaces
that exposes the services and the orchestrated business processes to the
users.
6. Enterprise Service Bus: This layer integrates the services through adapters,
routing, transformation and messaging mechanisms.

Cloud Component Model:


• Cloud Component Model is an application design methodology that provides a
flexible way of creating cloud applications in a rapid, convenient and platform
independent manner.
• CCM is an architectural approach for cloud applications that is not tied to any
specific programming language or cloud platform.
• Cloud applications designed with CCM approach can have innovative hybrid
deployments in which different components of an application can be deployed
on cloud infrastructure and platforms of different cloud vendors.
• Applications designed using CCM have better portability and interoperability.
• CCM based applications have better scalability by decoupling application
components and providing asynchronous communication mechanisms.
CCM Application Design Methodology:
• CCM approach for application design involves:
• Component Design
• Architecture Design
• Deployment Design

CCM Component Design:


• Cloud Component Model is created for the application based on
comprehensive analysis of the application’s functions and building blocks.
• Cloud component model allows identifying the building blocks of a cloud
application which are classified based on the functions performed and type of
cloud resources required.
• Each building block performs a set of actions to produce the desired outputs
for other components.
• Each component takes specific inputs, performs a pre- defined set of actions
and produces the desired outputs.
• Components offer their functions as services through a functional interface
which can be used by other components.
• Components report their performance to a performance database through a
performance interface.

CCM Architecture Design:


• In Architecture Design step, interactions between the application components
are defined.
• CCM components have the following characteristics:
• Loose Coupling
• Components in the Cloud Component Model are loosely coupled.
• Asynchronous Communication
• By allowing asynchronous communication between components,
it is possible to add capacity by adding additional servers when
the application load increases. Asynchronous communication is
made possible by using messaging queues.
• Stateless Design
• Components in the Cloud Component Model are stateless. By storing
session state outside of the component (e.g. in a database), stateless
component design enables distribution and horizontal scaling.
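A minimal sketch of two loosely coupled components exchanging messages asynchronously through a queue, using only the Python standard library (the component and message names are illustrative), is shown below.

# Sketch: loosely coupled components communicating asynchronously via a queue
import threading
import Queue   # named 'queue' in Python 3

messages = Queue.Queue()

def producer():
    # Component A publishes work items without knowing who consumes them
    for i in range(5):
        messages.put('task-%d' % i)
    messages.put(None)   # sentinel to signal completion

def consumer():
    # Component B processes items whenever they become available
    while True:
        item = messages.get()
        if item is None:
            break
        print 'processing', item

threading.Thread(target=producer).start()
consumer()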

CCM Deployment Design:


• In Deployment Design step, application components are mapped to specific
cloud resources such as web servers, application servers, database servers,
etc.
• Since the application components are designed to be loosely coupled and
stateless with asynchronous communication, components can be deployed
independently of each other.
• This approach makes it easy to migrate application components from one
cloud to the other.
• With this flexibility in application design and deployment, the application developers can ensure that the applications meet the performance and cost requirements with changing contexts.

SOA vs CCM:
Similarities
• Standardization & Re-use: SOA advocates principles of reuse and a well defined relationship between service provider and service consumer. CCM is based on reusable components which can be used by multiple cloud applications.
• Loose coupling: SOA is based on loosely coupled services that minimize dependencies. CCM is based on loosely coupled components that communicate asynchronously.
• Statelessness: SOA services minimize resource consumption by deferring the management of state information. CCM components are stateless; state is stored outside of the components.
Differences
• End points: SOA services have a small and well-defined set of endpoints through which many types of data can pass. CCM components have a very large number of endpoints; there is an endpoint for each resource in a component, identified by a URI.
• Messaging: SOA uses a messaging layer above HTTP by using SOAP, which imposes prohibitive constraints on developers. CCM components use HTTP and REST for messaging.
• Security: SOA uses WS-Security, SAML and other standards for security. CCM components use HTTPS for security.
• Interfacing: SOA uses XML for interfacing. CCM allows resources in components to be represented in different formats for interfacing (HTML, XML, JSON, etc.).
• Consumption: Consuming traditional SOA services in a browser is cumbersome. CCM components and the underlying component resources are exposed as XML, JSON (and other formats) over HTTP or REST, and are thus easy to consume in the browser.
Model View Controller:
• Model View Controller (MVC) is a popular software design pattern for web
applications.
• Model
• Model manages the data and the behavior of the applications. Model
processes events sent by the controller. Model has no information about
the views and controllers. Model responds to the requests for
information about its state (from the view) and responds to the
instructions to change state (from controller).
• View
• View prepares the interface which is shown to the user. Users interact
with the application through views. Views present the information that
the model or controller tell the view to present to the user and also
handle user requests and sends them to the controller.
• Controller
• Controller glues the model to the view. Controller processes user
requests and updates the model when the user manipulates the view.
Controller also updates the view when the model changes.

RESTful Web Services:


• Representational State Transfer (REST) is a set of architectural principles by
which you can design web services and web APIs that focus on a system’s
resources and how resource states are addressed and transferred.

• The REST architectural constraints apply to the components, connectors, and
data elements, within a distributed hypermedia system.
• A RESTful web service is a web API implemented using HTTP and REST principles (a sketch follows the list of constraints below).
• The REST architectural constraints are as follows:
• Client-Server
• Stateless
• Cacheable
• Layered System
• Uniform Interface
• Code on demand
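As a sketch of these principles, the code below exposes a single resource over HTTP using the Flask micro-framework. Flask is an assumption here (it is not covered in these notes), and the resource and field names are illustrative.

# Sketch: a tiny RESTful API exposing a 'books' resource over HTTP (Flask assumed)
from flask import Flask, jsonify, request

app = Flask(__name__)
books = {1: {'title': 'Cloud Computing'}}

@app.route('/books', methods=['GET'])
def list_books():
    # GET retrieves representations of the resource
    return jsonify(books)

@app.route('/books', methods=['POST'])
def create_book():
    # POST creates a new resource from the JSON body of the request
    new_id = max(books) + 1
    books[new_id] = request.get_json()
    return jsonify({new_id: books[new_id]}), 201

if __name__ == '__main__':
    app.run()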
Relational Databases:
• A relational database is a database that conforms to the relational model that was popularized by Edgar Codd in 1970.
• The 12 rules that Codd introduced for relational databases include:
• Information rule
• Guaranteed access rule
• Systematic treatment of null values
• Dynamic online catalog based on relational model
• Comprehensive sub-language rule
• View updating rule
• High level insert, update, delete
• Physical data independence
• Logical data independence
• Integrity independence
• Distribution independence
• Non-subversion rule
• Relations:A relational database has a collection of relations (or tables). A
relation is a set of tuples (or rows).
• Schema:Each relation has a fixed schema that defines the set of attributes (or
columns in a table) and the constraints on the attributes.
• Tuples:Each tuple in a relation has the same attributes (columns). The tuples in
a relation can have any order and the relation is not sensitive to the ordering of
the tuples.
• Attributes:Each attribute has a domain, which is the set of possible values for
the attribute.
• Insert/Update/Delete:Relations can be modified using insert, update and delete
operations. Every relation has a primary key that uniquely identifies each tuple
in the relation.
• Primary Key:An attribute can be made a primary key if it does not have
repeated values in different tuples.
ACID Guarantees:
Relational databases provide ACID guarantees.
• Atomicity:Atomicity property ensures that each transaction is either “all or
nothing”. An atomic transaction ensures that all parts of the transaction
complete or the database state is left unchanged.
• Consistency:Consistency property ensures that each transaction brings the
database from one valid state to another. In other words, the data in a
database always conforms to the defined schema and constraints.
• Isolation:Isolation property ensures that the database state obtained after a set
of concurrent transactions is the same as would have been if the transactions
were executed serially. This provides concurrency control, i.e. the results of
incomplete transactions are not visible to other transactions. The transactions
are isolated from each other until they finish.
• Durability:Durability property ensures that once a transaction is committed, the
data remains as it is, i.e. it is not affected by system outages such as power
loss. Durability guarantees that the database can keep track of changes and
can recover from abnormal terminations.
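A small sketch of atomicity using Python's built-in sqlite3 module is shown below (the table and values are illustrative): the two updates either both commit or are both rolled back.

# Sketch: an all-or-nothing transfer using sqlite3 transactions
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)')
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.execute("INSERT INTO accounts VALUES ('bob', 0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
    conn.commit()        # both updates become durable together
except sqlite3.Error:
    conn.rollback()      # neither update is applied on failure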
Non-Relational Databases
• Non-relational databases (or popularly called No-SQL databases) are becoming
popular with the growth of cloud computing.
• Non-relational databases have better horizontal scaling capability and
improved performance for big data at the cost of less rigorous consistency
models.
• Unlike relational databases, non-relational databases do not provide ACID
guarantees.
• Most non-relational databases offer “eventual” consistency, which means that
given a sufficiently long period of time over which no updates are made, all
updates can be expected to propagate eventually through the system and the
replicas will be consistent.
• The driving force behind the non-relational databases is the need for databases
that can achieve high scalability, fault tolerance and availability.

• These databases can be distributed on a large cluster of machines. Fault
tolerance is provided by storing multiple replicas of data on different machines.
Non-Relational Databases – Types
• Key-value store: Key-value store databases are suited for applications that
require storing unstructured data without a fixed schema. Most key-value
stores have support for native programming language data types.
• Document store: Document store databases store semi-structured data in
the form of documents which are encoded in different standards such as
JSON, XML, BSON, YAML, etc.
• Graph store: Graph stores are designed for storing data that has graph
structure (nodes and edges). These solutions are suitable for applications
that involve graph data such as social networks, transportation systems, etc.
• Object store: Object store solutions are designed for storing data in the form
of objects defined in an object-oriented programming language.
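For instance, a record in a document store is typically a self-describing JSON document rather than a fixed-schema row; the sketch below (field names illustrative) builds such a document with Python's standard json module.

# Sketch: a schema-less JSON document of the kind held by a document store
import json

user_doc = {
    "_id": "user:1001",
    "name": "Alice",
    "emails": ["alice@example.com"],                  # nested list, no extra table needed
    "address": {"city": "Hyderabad", "pin": "500001"}
}
print json.dumps(user_doc)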
Python Basics:
Python is a general-purpose high-level programming language and is suitable for providing a solid foundation to the reader in the area of cloud computing.

The main characteristics of Python are:
• Multi-paradigm programming language
• Python supports more than one programming paradigm, including object-oriented programming and structured programming.
• Interpreted Language
• Python is an interpreted language and does not require an explicit compilation step. The Python interpreter executes the program source code directly, statement by statement, as a processor or scripting engine does.
• Interactive Language
• Python provides an interactive mode in which the user can submit commands at the Python prompt and interact with the interpreter directly.
Python – Benefits
• Easy-to-learn, read and maintain
• Python is a minimalistic language with relatively few keywords, uses
English keywords and has fewer syntactical constructions as compared
to other languages. Reading Python programs feels like English with
pseudo-code like constructs. Python is easy to learn yet an extremely
powerful language for a wide range of applications.
• Object and Procedure Oriented
• Python supports both procedure-oriented programming and object-oriented programming. The procedure-oriented paradigm allows programs to be written around procedures or functions that allow reuse of code. The object-oriented paradigm allows programs to be written around objects that include both data and functionality.
• Extendable
• Python is an extendable language and allows integration of low-level
modules written in languages such as C/C++. This is useful when you
want to speed up a critical portion of a program.
• Scalable
• Due to the minimalistic nature of Python, it provides a manageable
structure for large programs.
• Portable
• Since Python is an interpreted language, programmers do not have to
worry about compilation, linking and loading of programs. Python
programs can be directly executed from source
• Broad Library Support
• Python has a broad library support and works on various platforms such
as Windows, Linux, Mac, etc.
Python – Setup
• Windows
• Python binaries for Windows can be downloaded from
http://www.python.org/getit .
• For the examples and exercise in this book, you would require Python 2.7
which can be directly downloaded from:
http://www.python.org/ftp/python/2.7.5/python-2.7.5.msi
• Once the python binary is installed you can run the python shell at the
command prompt using
> python
• Linux
#Install Dependencies
sudo apt-get install build-essential
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev
#Download Python
wget http://python.org/ftp/python/2.7.5/Python-2.7.5.tgz
tar -xvf Python-2.7.5.tgz
cd Python-2.7.5
#Install Python
./configure
make
sudo make install
Numbers
• Numbers
• Number data type is used to store numeric values. Numbers are
immutable data types, therefore changing the value of a number data
type results in a newly allocated object.
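A short illustration of numeric values and their immutability (variable names are arbitrary):

# Numbers: int, long, float and complex values (Python 2.7)
count = 10          # int
big = 10 ** 20      # long (arbitrary precision)
ratio = 3.14        # float
z = 2 + 3j          # complex

# "Changing" a number rebinds the name to a new object
x = 5
print id(x)
x = x + 1           # a new int object is created; the value 5 itself is unchanged
print id(x)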

UNIT – III
Python for Cloud
Outline

• Python for Amazon Web Services

• Python for Google Cloud

• Python for Windows Azure

• Python for MapReduce

• Python Packages of Interest

• Python Web Application Framework - Django

• Development with Django


Amazon EC2 – Python Example
• Boto is a Python package that provides interfaces to Amazon Web Services (AWS).
• In this example, a connection to the EC2 service is first established by calling boto.ec2.connect_to_region. The EC2 region, AWS access key and AWS secret key are passed to this function.
• After connecting to EC2, a new instance is launched using the conn.run_instances function. The AMI-ID, instance type, EC2 key handle and security group are passed to this function.

#Python program for launching an EC2 instance
import boto.ec2
from time import sleep

ACCESS_KEY="<enter access key>"
SECRET_KEY="<enter secret key>"
REGION="us-east-1"
AMI_ID = "ami-d0f89fb9"
EC2_KEY_HANDLE = "<enter key handle>"
INSTANCE_TYPE="t1.micro"
SECGROUP_HANDLE="default"

conn = boto.ec2.connect_to_region(REGION, aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)

reservation = conn.run_instances(image_id=AMI_ID, key_name=EC2_KEY_HANDLE,
    instance_type=INSTANCE_TYPE,
    security_groups = [ SECGROUP_HANDLE, ] )
Amazon AutoScaling – Python Example
• AutoScaling Service
• A connection to the AutoScaling service is first established by calling the boto.ec2.autoscale.connect_to_region function.
• Launch Configuration
• After connecting to the AutoScaling service, a new launch configuration is created by calling conn.create_launch_configuration. A launch configuration contains instructions on how to launch new instances, including the AMI-ID, instance type, security groups, etc.
• AutoScaling Group
• After creating a launch configuration, it is then associated with a new AutoScaling group. An AutoScaling group is created by calling conn.create_auto_scaling_group. The settings for the AutoScaling group include the maximum and minimum number of instances in the group, the launch configuration, availability zones, an optional load balancer to use with the group, etc.

#Python program for creating an AutoScaling group (code excerpt)
import boto.ec2.autoscale

print "Connecting to Autoscaling Service"
conn = boto.ec2.autoscale.connect_to_region(REGION,
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)

print "Creating launch configuration"
lc = LaunchConfiguration(name='My-Launch-Config-2',
    image_id=AMI_ID,
    key_name=EC2_KEY_HANDLE,
    instance_type=INSTANCE_TYPE,
    security_groups = [ SECGROUP_HANDLE, ])
conn.create_launch_configuration(lc)

print "Creating auto-scaling group"
ag = AutoScalingGroup(group_name='My-Group',
    availability_zones=['us-east-1b'],
    launch_config=lc, min_size=1, max_size=2,
    connection=conn)
conn.create_auto_scaling_group(ag)
• AutoScaling Policies
• After creating an AutoScaling group, the policies for scaling up and scaling down are defined.
• In this example, a scale-up policy with adjustment type ChangeInCapacity and scaling_adjustment = 1 is defined.
• Similarly, a scale-down policy with adjustment type ChangeInCapacity and scaling_adjustment = -1 is defined.

#Creating auto-scaling policies
scale_up_policy = ScalingPolicy(name='scale_up',
    adjustment_type='ChangeInCapacity', as_name='My-Group',
    scaling_adjustment=1, cooldown=180)

scale_down_policy = ScalingPolicy(name='scale_down',
    adjustment_type='ChangeInCapacity', as_name='My-Group',
    scaling_adjustment=-1, cooldown=180)

conn.create_scaling_policy(scale_up_policy)
conn.create_scaling_policy(scale_down_policy)
• CloudWatch Alarms
• With the scaling policies defined, the next step is to create Amazon CloudWatch alarms that trigger these policies.
• The scale-up alarm is defined using the CPUUtilization metric with the Average statistic and a threshold greater than 70% for a period of 60 seconds. The scale-up policy created previously is associated with this alarm. This alarm is triggered when the average CPU utilization of the instances in the group becomes greater than 70% for more than 60 seconds.
• The scale-down alarm is defined in a similar manner with a threshold less than 40%.

#Connecting to CloudWatch
cloudwatch = boto.ec2.cloudwatch.connect_to_region(REGION,
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)
alarm_dimensions = {"AutoScalingGroupName": 'My-Group'}

#Creating scale-up alarm
scale_up_alarm = MetricAlarm( name='scale_up_on_cpu',
    namespace='AWS/EC2', metric='CPUUtilization',
    statistic='Average', comparison='>', threshold='70',
    period='60', evaluation_periods=2,
    alarm_actions=[scale_up_policy.policy_arn],
    dimensions=alarm_dimensions)
cloudwatch.create_alarm(scale_up_alarm)

#Creating scale-down alarm
scale_down_alarm = MetricAlarm( name='scale_down_on_cpu',
    namespace='AWS/EC2', metric='CPUUtilization',
    statistic='Average', comparison='<', threshold='40',
    period='60', evaluation_periods=2,
    alarm_actions=[scale_down_policy.policy_arn],
    dimensions=alarm_dimensions)
cloudwatch.create_alarm(scale_down_alarm)
Amazon S3 – Python Example
• In this example, a connection to the S3 service is first established by calling the boto.connect_s3 function.
• The upload_to_s3_bucket_path function uploads the file to the S3 bucket at the specified path.

# Python program for uploading a file to an S3 bucket
import os
import boto.s3

conn = boto.connect_s3(aws_access_key_id='<enter>',
    aws_secret_access_key='<enter>')

def percent_cb(complete, total):
    print ('.')

def upload_to_s3_bucket_path(bucketname, path, filename):
    mybucket = conn.get_bucket(bucketname)
    fullkeyname = os.path.join(path, filename)
    key = mybucket.new_key(fullkeyname)
    key.set_contents_from_filename(filename, cb=percent_cb, num_cb=10)
Amazon RDS – Python Example
• In this example, a connection to the RDS (Relational Database Service) service is first established by calling the boto.rds.connect_to_region function. The RDS region, AWS access key and AWS secret key are passed to this function.
• After connecting to the RDS service, the conn.create_dbinstance function is called to launch a new RDS instance.
• The input parameters to this function include the instance ID, database size, instance type, database username, database password, database port, database engine (e.g. MySQL5.1), database name, security groups, etc.

#Python program for launching an RDS instance (excerpt)
import boto.rds

ACCESS_KEY="<enter>"
SECRET_KEY="<enter>"
REGION="us-east-1"
INSTANCE_TYPE="db.t1.micro"
ID = "MySQL-db-instance-3"
USERNAME = 'root'
PASSWORD = 'password'
DB_PORT = 3306
DB_SIZE = 5
DB_ENGINE = 'MySQL5.1'
DB_NAME = 'mytestdb'
SECGROUP_HANDLE="default"

#Connecting to RDS
conn = boto.rds.connect_to_region(REGION,
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)

#Creating an RDS instance
db = conn.create_dbinstance(ID, DB_SIZE, INSTANCE_TYPE,
    USERNAME, PASSWORD, port=DB_PORT, engine=DB_ENGINE,
    db_name=DB_NAME, security_groups = [ SECGROUP_HANDLE, ] )
Amazon DynamoDB – Python Example
• In this example, a connection to the DynamoDB service is first established by calling boto.dynamodb.connect_to_region.
• After connecting to the DynamoDB service, a schema for the new table is created by calling conn.create_schema. The schema includes the hash key and range key names and types.
• A DynamoDB table is then created by calling the conn.create_table function with the table schema, read units and write units as input parameters.

# Python program for creating a DynamoDB table (excerpt)
import boto.dynamodb

ACCESS_KEY="<enter>"
SECRET_KEY="<enter>"
REGION="us-east-1"

#Connecting to DynamoDB
conn = boto.dynamodb.connect_to_region(REGION,
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)

table_schema = conn.create_schema(
    hash_key_name='msgid',
    hash_key_proto_value=str,
    range_key_name='date',
    range_key_proto_value=str
)

#Creating table with schema
table = conn.create_table(
    name='my-test-table',
    schema=table_schema,
    read_units=1,
    write_units=1
)
Google Compute Engine – Python Example
• This example uses the OAuth 2.0 scope (https://www.googleapis.com/auth/compute) and credentials in the credentials file to request a refresh and access token, which is then stored in the oauth2.dat file.
• After completing the OAuth authorization, an instance of the Google Compute Engine service is obtained.
• To launch a new instance the instances().insert method of the Google Compute Engine API is used.
• The request body to this method contains properties such as instance name, machine type, zone, network interfaces, etc., specified in JSON format.

# Python program for launching a GCE instance (excerpt)
API_VERSION = 'v1beta15'
GCE_SCOPE = 'https://www.googleapis.com/auth/compute'
GCE_URL = 'https://www.googleapis.com/compute/%s/projects/' % (API_VERSION)
DEFAULT_ZONE = 'us-central1-b'
CLIENT_SECRETS = 'client_secrets.json'
OAUTH2_STORAGE = 'oauth2.dat'

def main():
    #OAuth 2.0 authorization.
    flow = flow_from_clientsecrets(CLIENT_SECRETS, scope=GCE_SCOPE)
    storage = Storage(OAUTH2_STORAGE)
    credentials = storage.get()

    if credentials is None or credentials.invalid:
        credentials = run(flow, storage)
    http = httplib2.Http()
    auth_http = credentials.authorize(http)

    gce_service = build('compute', API_VERSION)

    # Create the instance
    request = gce_service.instances().insert(project=PROJECT_ID, body=instance,
        zone=DEFAULT_ZONE)
    response = request.execute(auth_http)
Google Cloud Storage – Python Example
• This example uses the OAuth 2.0 scope (https://www.googleapis.com/auth/devstorage.full_control) and credentials in the credentials file to request a refresh and access token, which is then stored in the oauth2.dat file.
• After completing the OAuth authorization, an instance of the Google Cloud Storage service is obtained.
• To upload a file the objects().insert method of the Google Cloud Storage API is used.
• The request to this method contains the bucket name, file name and a media body containing the MediaIoBaseUpload object created from the file contents.

# Python program for uploading a file to GCS (excerpt)
def main():
    #OAuth 2.0 authorization.
    flow = flow_from_clientsecrets(CLIENT_SECRETS, scope=GS_SCOPE)
    storage = Storage(OAUTH2_STORAGE)
    credentials = storage.get()

    if credentials is None or credentials.invalid:
        credentials = run(flow, storage)
    http = httplib2.Http()
    auth_http = credentials.authorize(http)

    gs_service = build('storage', API_VERSION, http=auth_http)

    # Upload file
    fp = open(FILENAME, 'r')
    fh = io.BytesIO(fp.read())
    media = MediaIoBaseUpload(fh, FILE_TYPE)
    request = gs_service.objects().insert(bucket=BUCKET, name=FILENAME,
        media_body=media)
    response = request.execute()
Google Cloud SQL – Python Example
• This example uses the OAuth 2.0 scope (https://www.googleapis.com/auth/compute) and credentials in the credentials file to request a refresh and access token, which is then stored in the oauth2.dat file.
• After completing the OAuth authorization, an instance of the Google Cloud SQL service is obtained.
• To launch a new instance the instances().insert method of the Google Cloud SQL API is used.
• The request body of this method contains properties such as instance, project, tier, pricingPlan and replicationType.

# Python program for launching a Google Cloud SQL instance (excerpt)
def main():
    #OAuth 2.0 authorization.
    flow = flow_from_clientsecrets(CLIENT_SECRETS, scope=GS_SCOPE)
    storage = Storage(OAUTH2_STORAGE)
    credentials = storage.get()

    if credentials is None or credentials.invalid:
        credentials = run(flow, storage)

    http = httplib2.Http()
    auth_http = credentials.authorize(http)

    gcs_service = build('sqladmin', API_VERSION, http=auth_http)

    # Define request body
    instance = {"instance": "mydb",
                "project": "bahgacloud",
                "settings": {
                    "tier": "D0",
                    "pricingPlan": "PER_USE",
                    "replicationType": "SYNCHRONOUS"}}

    # Create the instance
    request = gcs_service.instances().insert(project=PROJECT_ID, body=instance)
    response = request.execute()
Azure Virtual Machines – Python Example
• To create a virtual machine, a cloud service is first created.
• The virtual machine is created using the create_virtual_machine_deployment method of the Azure service management API.

# Python program for launching an Azure VM instance (excerpt)
from azure import *

sms = ServiceManagementService(subscription_id, certificate_path)
name = '<enter>'
location = 'West US'

# Name of an os image as returned by list_os_images
image_name = '<enter>'

# Destination storage account container/blob where the VM disk will be created
media_link = '<enter>'

# Linux VM configuration
linux_config = LinuxConfigurationSet('bahga', 'arshdeepbahga', 'Arsh~2483', True)

os_hd = OSVirtualHardDisk(image_name, media_link)

#Create instance
sms.create_virtual_machine_deployment(service_name=name,
        deployment_name=name,
        deployment_slot='production', label=name,
        role_name=name, system_config=linux_config,
        os_virtual_hard_disk=os_hd, role_size='Small')
Azure Storage – Python Example
• The Azure Blob service allows you to store large amounts of unstructured text or binary data such as video, audio and images.
• This shows an example of using the Blob service for storing a file.
• Blobs are organized in containers. The create_container method is used to create a new container.
• After creating a container the blob is uploaded using the put_blob method.
• Blobs can be listed using the list_blobs method.
• To download a blob, the get_blob method is used.

# Python example of using Azure Blob Service (excerpt)
from azure.storage import *

blob_service = BlobService(account_name='<enter>', account_key='<enter>')

#Create Container
blob_service.create_container('mycontainer')

#Upload Blob
filename = 'images.txt'
myblob = open(filename, 'r').read()
blob_service.put_blob('mycontainer', filename, myblob, x_ms_blob_type='BlockBlob')

#List Blobs
blobs = blob_service.list_blobs('mycontainer')
for blob in blobs:
    print(blob.name)
    print(blob.url)

#Download Blob
output_filename = 'output.txt'
blob = blob_service.get_blob('mycontainer', 'myblob')
with open(output_filename, 'w') as f:
    f.write(blob)
Python for MapReduce
• The example shows an inverted index mapper program.
• The map function reads the data from the standard input (stdin) and splits the tab-delimited data into document-ID and contents of the document.
• The map function emits key-value pairs where the key is each word in the document and the value is the document-ID.

#Inverted Index Mapper in Python

#!/usr/bin/env python
import sys

for line in sys.stdin:
    # Input records are tab-delimited: <doc_id> \t <document contents>
    doc_id, content = line.strip().split('\t', 1)
    words = content.split()
    for word in words:
        # Emit <word> \t <doc_id> key-value pairs
        print('%s\t%s' % (word, doc_id))
Python for MapReduce
• The example shows an inverted index reducer program.
• The key-value pairs emitted by the map phase are shuffled to the reducers and grouped by the key.
• The reducer reads the key-value pairs grouped by the same key from the standard input (stdin) and creates a list of document-IDs in which the word occurs.
• The output of the reducer contains key-value pairs where the key is a unique word and the value is the list of document-IDs in which the word occurs.

#Inverted Index Reducer in Python

#!/usr/bin/env python
import sys

current_word = None
current_docids = []
word = None

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, doc_id = line.split('\t', 1)
    if current_word == word:
        current_docids.append(doc_id)
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_docids))
        current_docids = []
        current_docids.append(doc_id)
        current_word = word

# emit the document-ID list for the last word
if current_word:
    print('%s\t%s' % (current_word, current_docids))
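As a quick sanity check outside Hadoop, the map, shuffle and reduce steps can be simulated in plain Python. The short sketch below is illustrative only (the sample documents are made up) and builds the same inverted index in-process.

# Minimal in-process simulation of the inverted index MapReduce job
from collections import defaultdict

docs = [("d1", "cloud computing basics"),
        ("d2", "cloud storage services")]

# Map phase: emit (word, doc_id) pairs
pairs = [(word, doc_id) for doc_id, text in docs for word in text.split()]

# Shuffle phase: sort by key and group values, as Hadoop does between map and reduce
index = defaultdict(list)
for word, doc_id in sorted(pairs):
    index[word].append(doc_id)

# Reduce phase output: each word with the list of document-IDs it occurs in
for word, doc_ids in index.items():
    print('%s\t%s' % (word, doc_ids))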
Python Packages of Interest
• JSON
• JavaScript Object Notation (JSON) is an easy-to-read and easy-to-write data-interchange format. JSON is used as an alternative to XML and is easy for machines to parse and generate. JSON is built on two structures - a collection of name-value pairs (e.g. a Python dictionary) and ordered lists of values (e.g. a Python list).
• XML
• XML (Extensible Markup Language) is a data format for structured document interchange. The Python minidom library provides a
minimal implementation of the Document Object Model interface and has an API similar to that in other languages.

• HTTPLib & URLLib


• HTTPLib2 and URLLib2 are Python libraries used in network/internet programming
• SMTPLib
• Simple Mail Transfer Protocol (SMTP) is a protocol which handles sending email and routing e-mail between mail servers. The
Python smtplib module provides an SMTP client session object that can be used to send email.
• NumPy
• NumPy is a package for scientific computing in Python. NumPy provides support for large multi-dimensional arrays and
matrices
• Scikit-learn
• Scikit-learn is an open source machine learning library for Python that provides implementations of various machine learning
algorithms for classification, clustering, regression and dimension reduction problems.
Python Web Application Framework - Django

• Django is an open source web application framework for developing web applications in Python.
• A web application framework in general is a collection of solutions, packages and best practices
that allows development of web applications and dynamic websites.
• Django is based on the Model-Template-View architecture and provides a separation of the data
model from the business rules and the user interface.
• Django provides a unified API to a database backend.
• Thus web applications built with Django can work with different databases without requiring any
code changes.
• With this flexibility in web application design combined with the powerful capabilities of the Python
language and the Python ecosystem, Django is best suited for cloud applications.
• Django consists of an object-relational mapper, a web templating system and a regular-expression-
based URL dispatcher.
Django Architecture
• Django is Model-Template-View (MTV) framework.

• Model
• The model acts as a definition of some stored data and handles the interactions with the database. In a
web application, the data can be stored in a relational database, non-relational database, an XML file,
etc. A Django model is a Python class that outlines the variables and methods for a particular type of
data.
• Template
• In a typical Django web application, the template is simply an HTML page with a few extra
placeholders. Django’s template language can be used to create various forms of text files (XML,
email, CSS, Javascript, CSV, etc.)
• View
• The view ties the model to the template. The view is where you write the code that actually generates
the web pages. View determines what data is to be displayed, retrieves the data from the database and
passes the data to the template.
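As a rough sketch of how the Model, Template and View fit together, consider a hypothetical notes application; the Note model, note_list view and template name below are illustrative assumptions, not part of the original example.

# models.py -- Model: a Python class describing the stored data
from django.db import models

class Note(models.Model):
    title = models.CharField(max_length=100)
    body = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

# views.py -- View: retrieves the data and passes it to a template
from django.shortcuts import render
from .models import Note

def note_list(request):
    notes = Note.objects.order_by('-created')
    return render(request, 'notes/note_list.html', {'notes': notes})

# notes/note_list.html -- Template: an HTML page with placeholders, e.g.
#   {% for note in notes %} <h3>{{ note.title }}</h3> {% endfor %}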
Django Setup on Amazon EC2
Cloud Application Development in Python
Outline

• Design Approaches
• Design methodology for IaaS service model
• Design methodology for PaaS service model
• Cloud application case studies including:
• Image Processing App
• Document Storage App
• MapReduce App
• Social Media Analytics App
Design methodology for IaaS service model
Component Design

• Identify the building blocks of the application and the functions to be performed by each block
•Group the building blocks based on the functions performed and type of cloud resources required and
identify the application components based on the groupings
•Identify the inputs and outputs of each component
•List the interfaces that each component will expose
•Evaluate the implementation alternatives for each component (design patterns such as MVC, etc.)

Architecture Design

•Define the interactions between the application components


•Guidelines for loosely coupled and stateless designs - use messaging queues (for asynchronous
communication), functional interfaces (such as REST for loose coupling) and external status database (for
stateless design)

Deployment Design

•Map the application components to specific cloud resources (such as web servers,
application servers, database servers, etc.)
Design methodology for PaaS service model
• For applications that use the Platform-as-a-service (PaaS) cloud service model, the architecture and
deployment design steps are not required since the platform takes care of the architecture and
deployment.
• Component Design
• In the component design step, the developers have to take into consideration the platform specific features.
• Platform Specific Software
• Different PaaS offerings such as Google App Engine, Windows Azure Web Sites, etc., provide platform specific software
development kits (SDKs) for developing cloud applications.
• Sandbox Environments
• Applications designed for specific PaaS offerings run in sandbox environments and are allowed to perform only those
actions that do not interfere with the performance of other applications.
• Deployment & Scaling
• The deployment and scaling is handled by the platform while the developers focus on the application development
using the platform-specific SDKs.
• Portability
• Portability is a major constraint for PaaS based applications as it is difficult to move the application from one PaaS offering to another.
Image Processing App – Component Design
• Functionality:
• A cloud-based Image Processing application.
• This application provides online image filtering capability.
• Users can upload image files and choose the filters to apply.
• The selected filters are applied to the image and the
processed image can then be downloaded.

• Component Design
• Web Tier: The web tier for the image processing app has front
ends for image submission and displaying processed images.
• Application Tier: The application tier has components for processing the image submission requests, processing the submitted image and processing requests for displaying the results.
• Storage Tier: The storage tier comprises the storage for processed images.

Component design for Image Processing App
Image Processing App – Architecture Design

• Architecture design step which defines the


interactions between the application components.
• This application uses the Django framework,
therefore, the web tier components map to the
Django templates and the application tier
components map to the Django views.
• A cloud storage is used for the storage tier. For
each component, the corresponding code box
numbers are mentioned.
Architecture design for Image Processing App
Image Processing App – Deployment Design

• Deployment for the app is a multi-tier


architecture comprising of load balancer,
application servers and a cloud storage for
processed images.

• For each resource in the deployment the


corresponding Amazon Web Services (AWS)
cloud service is mentioned.

Deployment design for Image Processing App


Cloud Drive App – Component Design
• Functionality:
• A cloud-based document storage (Cloud Drive) application.
• This application allows users to store documents on a cloud- based
storage.

• Component Design
• Web Tier: The web tier for the Cloud Drive app has front ends for
uploading files, viewing/deleting files and user profile.
• Application Tier: The application tier has components for
processing requests for uploading files, processing requests for
viewing/deleting files and the component that handles the
registration, profile and login functions.
• Database Tier: The database tier comprises a user credentials database.
• Storage Tier: The storage tier comprises the storage for files.

Component design for Cloud Drive App
Cloud Drive App – Architecture Design
• Architecture design step which defines the
interactions between the application
components.

• This application uses the Django


framework, therefore, the web tier
components map to the Django templates
and the application tier components map
to the Django views.

• A MySQL database is used for the database tier


and a cloud storage is used for the storage tier.

• For each component, the corresponding Architecture design for Cloud Drive App

code box numbers are mentioned.


Cloud Drive App – Deployment Design

• Deployment for the app is a multi-tier architecture comprising


of load balancer, application servers, cloud storage for storing
documents and a database server for storing user credentials.

• For each resource in the reference architecture the


corresponding Amazon Web Services (AWS) cloud service
is mentioned.

Deployment design for Cloud Drive App


MapReduce App – Component Design
Functionality:
• This application allows users to submit MapReduce jobs for data
analysis.
• This application is based on the Amazon Elastic MapReduce (EMR)
service.
• Users can upload data files to analyze and choose/upload the Map
and Reduce programs.
• The selected Map and Reduce programs along with the input data are
submitted to a queue for processing.
Component Design
• Web Tier: The web tier for the MapReduce app has a front end for
MapReduce job submission.
• Application Tier: The application tier has components for processing requests for uploading files, creating MapReduce jobs and enqueuing jobs, the MapReduce consumer and the component that sends email notifications.
• Analytics Tier: The Hadoop framework is used for the analytics tier and a cloud storage is used for the storage tier.
• Storage Tier: The storage tier comprises the storage for files.

Component design for MapReduce App
MapReduce App – Architecture Design
• Architecture design step which defines the interactions between
the application components.
• This application uses the Django framework, therefore, the web
tier components map to the Django templates and the application
tier components map to the Django views.
• For each component, the corresponding code box numbers
are mentioned.
• To make the application scalable the job submission and job
processing components are separated.
• The MapReduce job requests are submitted to a queue.
• A consumer component that runs on a separate instance retrieves
the MapReduce job requests from the queue and creates the
MapReduce jobs and submits them to the Amazon EMR service. Architecture design for MapReduce App

• The user receives an email notification with the download link for
the results when the job is complete.
MapReduce App – Deployment Design

• Deployment for the app is a multi-tier


architecture comprising of load balancer,
application servers and a cloud storage for
storing MapReduce programs, input data and
MapReduce output.
• For each resource in the deployment the
corresponding Amazon Web Services (AWS)
cloud service is mentioned.

Deployment design for MapReduce App


Social Media Analytics App – Component Design
• Functionality:
• A cloud-based Social Media Analytics application.
• This application collects the social media feeds (Twitter tweets)
on a specified keyword in real time and analyzes the
sentiments of the tweets and provides aggregate results.
• Component Design
• Web Tier: The web tier has a front end for displaying results.
• Application Tier: The application tier has a listener component
that collects social media feeds, a consumer component that
analyzes tweets and a component for rendering the results in the
dashboard.
• Database Tier: A MongoDB database is used for the database tier and a cloud storage is used for the storage tier.
• Storage Tier: The storage tier comprises the storage for files.

Component design for Social Media Analytics App
Social Media Analytics App – Architecture Design
• Architecture design step which defines the interactions
between the application components.
• To make the application scalable the feeds collection
component (Listener) and feeds processing component
(Consumer) are separated.
• The Listener component uses the Twitter API to get feeds
on a specific keyword (or a list of keywords) and enqueues
the feeds to a queue.
• The Consumer component (that runs on a separate
instance) retrieves the feeds from the queue and analyzes
the feeds and stores the aggregated results in a separate
database. Architecture design for Social Media Analytics App

• The aggregate results are displayed to the users from a


Django application.
Social Media Analytics App – Deployment Design
• Deployment for the app is a multi-tier architecture comprising of load balancer, application servers, listener and
consumer instances, a cloud storage for storing raw data and a database server for storing aggregated results.
• For each resource in the deployment the corresponding Amazon Web Services (AWS) cloud service is mentioned.

Deployment design for Social Media Analytics App


Social Media Analytics App – Dashboard
UNIT - 4
Big Data Analytics
Outline

•Big Data analytics approaches


•Approaches for clustering big data
•Approaches for classification of big data
•Recommendation Systems
Big Data
• Big data is defined as collections of data sets whose volume, velocity in terms of time variation, or variety is so large that
it is difficult to store, manage, process and analyze the data using traditional databases and data processing tools.
• Characteristics of big data:
• Volume
• Though there is no fixed threshold for the volume of data to be considered big data, typically the term big data is used for massive scale data that is difficult to store, manage and process using traditional databases and data processing architectures. The volumes of data generated by modern IT, industrial, healthcare and other systems are growing exponentially, driven by the lowering costs of data storage and processing architectures and the need to extract valuable insights from the data to improve business processes, efficiency and service to consumers.
• Velocity
• Velocity is another important characteristic of big data and the primary reason for exponential growth of data.
Velocity of data refers to how fast the data is generated. Modern IT, industrial and other systems are
generating data at increasingly higher speeds generating big data.
• Variety
• Variety refers to the forms of the data. Big data comes in different forms such as structured or unstructured
data, including text data, image, audio, video and sensor data.
Clustering Big Data
• Clustering is the process of grouping similar data items together such that data items that are more
similar to each other (with respect to some similarity criteria) than other data items are put in one
cluster.
• Clustering big data is of much interest, and happens in applications such as:
• Clustering social network data to find a group of similar users
• Clustering electronic health record (EHR) data to find similar patients.
• Clustering sensor data to group similar or related faults in a machine
• Clustering market research data to group similar customers
• Clustering clickstream data to group similar users
• Clustering is achieved by clustering algorithms that belong to a broad category of algorithms called unsupervised machine learning.
• Unsupervised machine learning algorithms find the patterns and hidden structure in data for which no
training data is available.
k-means Clustering
• k-means is a clustering algorithm that groups data items into k clusters, where k is user defined.
• Each cluster is defined by a centroid point.
• k-means clustering begins with a set of k centroid points which are either randomly chosen from the
dataset or chosen using some initialization algorithm such as canopy clustering.
• The algorithm proceeds by finding the distance between each data point in the data set and the
centroid points.
• Based on the distance measure, each data point is assigned to a cluster belonging to the closest
centroid.
• In the next step the centroids are recomputed by taking the mean value of all the data points in a
cluster.
• This process is repeated till the centroids no longer move more than a specified threshold.
k-means Clustering
k-means Clustering Algorithm
Start with k centroid points
while the centroids still move beyond a threshold and the maximum number of iterations is not reached:
    for each point in the dataset:
        for each centroid:
            find the distance between the point and the centroid
        assign the point to the cluster belonging to the nearest centroid
    for each cluster:
        recompute the centroid point by taking the mean value of all points in the cluster

Example of clustering 300 points with k-means: (a) iteration 1, (b)


iteration 2, (c) iteration 3, (d) iteration 5, (e) iteration 10, (f) iteration 100.
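A minimal sketch of the same algorithm using scikit-learn's KMeans (introduced later under Python Packages of Interest) is shown below; the synthetic data and the choice of k = 3 are assumptions made for illustration.

# k-means clustering with scikit-learn (illustrative sketch)
import numpy as np
from sklearn.cluster import KMeans

# Generate 300 random 2-D points around three centers
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                    for c in [(0, 0), (5, 5), (0, 5)]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(kmeans.cluster_centers_)   # recomputed centroids after convergence
print(labels[:10])               # cluster assignment of the first 10 points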
Clustering Documents with k-means
• Document clustering is the most commonly used application of k-means clustering algorithm.
• Document clustering problem occurs in many big data applications such as finding similar news articles, finding similar
patients using electronic health records, etc.
• Before applying k-means algorithm for document clustering, the documents need to be vectorized. Since documents
contain textual information, the process of vectorization is required for clustering documents.
• The process of generating document vectors involves several steps:
• A dictionary of all words used in the tokenized records is generated. Each word in the dictionary has a dimension
number assigned to it which is used to represent the dimension the word occupies in the document vector.
• The number of occurrences or term frequency (TF) of each word is computed.
• Inverse Document Frequency (IDF) for each word is computed. Document Frequency (DF) for a word is the number of
documents (or records) in which the word occurs.
• Weight for each word is computed. The term weight Wi is used in the document vector as the value for the dimension-i.
• Similarity between documents is computed using a distance measure such as Euclidean distance measure.
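The vectorization steps above can be sketched with scikit-learn's TfidfVectorizer, which builds the dictionary and computes the TF and IDF weights in one call; the tiny document set and k = 2 below are illustrative assumptions.

# Document clustering sketch: TF-IDF vectors + k-means
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["cloud computing and cloud storage",
             "machine learning for big data",
             "clustering big data with k-means",
             "virtual machines in cloud computing"]

vectorizer = TfidfVectorizer()            # builds the dictionary and computes TF-IDF weights
doc_vectors = vectorizer.fit_transform(documents)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_vectors)
print(labels)                             # cluster assigned to each document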
k-means with MapReduce
• The data to be clustered is distributed on a distributed file system such as HDFS and split into blocks which are replicated across
different nodes in the cluster.
• Clustering begins with an initial set of centroids. The client program controls the clustering process.
• In the Map phase, the distances between the data samples and centroids are calculated and each sample is assigned to the nearest
centroid.
• In the Reduce phase, the centroids are recomputed using the mean of all the points in each cluster.
• The new centroids are then fed back to the client which checks whether convergence is reached or maximum number of iterations are
completed.
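A simplified sketch of one k-means iteration in the Hadoop-streaming style used earlier is shown below; the initial centroids, the comma-separated point format and the sample data are assumptions. The mapper assigns each point to its nearest centroid and the reducer recomputes the centroid means.

# k-means iteration: streaming-style mapper and reducer (sketch)
def nearest(point, centroids):
    # index of the centroid closest to the point (squared Euclidean distance)
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def mapper(lines, centroids):
    for line in lines:
        point = [float(x) for x in line.split(',')]
        yield nearest(point, centroids), point     # emit <centroid_id, point>

def reducer(pairs):
    groups = {}
    for cid, point in pairs:
        groups.setdefault(cid, []).append(point)
    for cid, pts in groups.items():
        # recompute the centroid as the mean of all points in the cluster
        mean = [sum(dim) / len(pts) for dim in zip(*pts)]
        yield cid, mean

if __name__ == '__main__':
    centroids = [[0.0, 0.0], [5.0, 5.0]]           # initial centroids (assumed)
    data = ["1,1", "0,2", "6,5", "5,6"]            # sample points (assumed)
    print(list(reducer(mapper(data, centroids))))  # new centroid per cluster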
DBSCAN clustering
• DBSCAN is a density clustering algorithm that works on the notions of density reachability and density connectivity.
• Density Reachability
• Is defined on the basis of Eps-neighborhood, where Eps-neighborhood means that for every point p in a cluster C there
is a point q in C so that p is inside of the Eps-neighborhood of q and there are at least a minimum number (MinPts) of
points in an Eps-neighborhood of that point.
• A point p is called directly density-reachable from a point q if it is not farther away than a given distance (Eps) and if it is
surrounded by at least a minimum number (MinPts) of points that may be considered to be part of a cluster.
• Density Connectivity
• A point p is density connected to a point q if there is a point o such that both, p and q are density-reachable from o wrt.
Eps and MinPts.
• A cluster, is then defined based on the following two properties:
• Maximality: For all point p, q if p belongs to cluster C and q is density-reachable from p (wrt. Eps and MinPts), then q
also belongs to the cluster C.
• Connectivity: For all point p, q in cluster C, p is density-connected to q (wrt. Eps and MinPts).
DBSCAN vs K-means

• DBSCAN can find irregular shaped clusters as seen from this example and can even find a cluster completely surrounded
by a different cluster.

• DBSCAN considers some points as noise and does not assign them to any cluster.
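This difference can be reproduced with scikit-learn on an irregular, two-moons shaped dataset; the eps and min_samples values below are illustrative assumptions.

# Comparing k-means and DBSCAN on irregularly shaped clusters
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN recovers the two moon-shaped clusters; points labeled -1 are treated as noise
print(set(kmeans_labels), set(dbscan_labels))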
Classification of Big Data
• Classification is the process of categorizing objects into predefined categories.
• Classification is achieved by classification algorithms that belong to a broad category of algorithms called supervised
machine learning.
• Supervised learning involves inferring a model from a set of input data and known responses to the data (training
data) and then using the inferred model to predict responses to new data.
• Binary classification
• Binary classification involves categorizing the data into two categories. For example, classifying the sentiment of a
news article into positive or negative, classifying the state of a machine into good or faulty, classifying the heath
test into positive or negative, etc.
• Multi-class classification
• Multi-class classification involves more than two classes into which the data is categorized. For example, gene
expression classification problem involves multiple classes.
• Document classification
• Document classification is a type of multi-class classification approach in which the data to be classified is in the form of a text document, for example, classifying news articles into different categories such as politics, sports, etc.
Performance of Classification Algorithms
• Precision: Precision is the fraction of objects assigned to a category that actually belong to that category,
Precision = TP / (TP + FP)

• Recall: Recall is the fraction of objects belonging to a category that are classified correctly,
Recall = TP / (TP + FN)

• Accuracy: Accuracy is the fraction of all objects that are classified correctly,
Accuracy = (TP + TN) / (TP + TN + FP + FN)

• F1-score: F1-score is a measure of accuracy that considers both precision and recall. F1-score is the harmonic mean of precision and recall, given as,
F1 = 2 x (Precision x Recall) / (Precision + Recall)

(TP, FP, TN and FN denote the counts of true positives, false positives, true negatives and false negatives.)
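These metrics can be computed with scikit-learn as in the sketch below, where the true and predicted labels are made-up values used only for illustration.

# Computing classification metrics with scikit-learn
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual categories
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # categories predicted by a classifier

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))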
Naive Bayes
• Naive Bayes is a probabilistic classification algorithm based on the Bayes theorem with a naive assumption about the independence of feature attributes. Given a class variable C and feature variables F1,...,Fn, the conditional probability (posterior) according to the Bayes theorem is given as,

P(C | F1,...,Fn) = P(C) P(F1,...,Fn | C) / P(F1,...,Fn)

• where P(C|F1,...,Fn) is the posterior probability, P(F1,...,Fn|C) is the likelihood, P(C) is the prior probability and P(F1,...,Fn) is the evidence. Naive Bayes makes a naive assumption about the independence of every pair of features, given as,

P(Fi | C, F1,...,Fi-1, Fi+1,...,Fn) = P(Fi | C)

• Since the evidence P(F1,...,Fn) is constant for a given input and does not depend on the class variable C, only the numerator of the posterior probability is important for classification.
• With this simplification, classification can then be done by choosing the class C that maximizes P(C) x P(F1|C) x ... x P(Fn|C).
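A minimal Gaussian Naive Bayes classifier in scikit-learn, with made-up numeric features, might look like the following sketch.

# Naive Bayes classification sketch with scikit-learn
from sklearn.naive_bayes import GaussianNB

# Feature vectors (F1, F2) and class labels (C) -- illustrative data
X_train = [[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]]
y_train = [0, 0, 1, 1]

model = GaussianNB()
model.fit(X_train, y_train)                    # estimates priors P(C) and likelihoods P(Fi|C)

print(model.predict([[1.1, 2.0], [4.0, 4.0]])) # predicted classes
print(model.predict_proba([[1.1, 2.0]]))       # posterior probabilities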
Decision Trees
• Decision Trees are a supervised learning method that use a tree created
from simple decision rules learned from the training data as a predictive
model.
• The predictive model is in the form of a tree that can be used to predict
the value of a target variable based on a several attribute variables.
• Each node in the tree corresponds to one attribute in the dataset on
which the “split” is performed.
• Each leaf in a decision tree represents a value of the target variable.
• The learning process involves recursively splitting on the attributes until all
the samples in the child node have the same value of the target variable
or splitting further results in no further information gain.
• To select the best attribute for splitting at each stage, different metrics
can be used.
Splitting Attributes in Decision Trees
To select the best attribute for splitting at each stage, different metrics can be used such as:
• Information Gain
• Information content of a discrete random variable X with probability mass function (PMF) P(X) is defined as,

I(X) = -log2 P(X)

• Information gain is defined based on the entropy of the random variable, which is defined as,

H(X) = - sum over i of P(xi) log2 P(xi)

• Entropy is a measure of uncertainty in a random variable, and choosing the attribute with the highest information gain results in a split that reduces the uncertainty the most at that stage.
• Gini Coefficient
• The Gini coefficient measures the inequality, i.e. how often a randomly chosen sample that is labeled based on the distribution of labels would be labeled incorrectly. The Gini coefficient is defined as,

G = 1 - sum over i of (pi)^2
Decision Tree Algorithms
• There are different algorithms for building decisions trees, popular ones being ID3 and C4.5.
• ID3:
• Attributes are discrete. If not, discretize the continuous attributes.
• Calculate the entropy of every attribute using the dataset.
• Choose the attribute with the highest information gain.
• Create branches for each value of the selected attribute.
• Repeat with the remaining attributes.
• The ID3 algorithm can result in over-fitting to the training data and can be expensive to train, especially for continuous attributes.
• C4.5
• The C4.5 algorithm is an extension of the ID3 algorithm. C4.5 supports both discrete and continuous attributes.
• To support continuous attributes, C4.5 finds thresholds for the continuous attributes and then splits based on the
threshold values. C4.5 prevents over-fitting by pruning trees after they have been created.
• Pruning involves removing or aggregating those branches which provide little discriminatory power.
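The sketch below trains a decision tree with scikit-learn using entropy-based information gain as the splitting criterion; the toy dataset is an assumption, and scikit-learn's tree implementation is CART-based rather than ID3/C4.5 proper.

# Decision tree classification sketch with scikit-learn
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: [outlook(0-2), humidity(0-1)] -> play(0/1)
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y_train = [1, 0, 1, 1, 0, 0]

tree = DecisionTreeClassifier(criterion='entropy',  # split on information gain
                              max_depth=3)          # limit depth to reduce over-fitting
tree.fit(X_train, y_train)

print(tree.predict([[0, 0], [2, 1]]))   # predicted target values
print(tree.feature_importances_)        # relative importance of each attribute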
Random Forest
• Random Forest is an ensemble learning method that is based on randomized decision trees.
• Random Forest trains a number of decision trees and then takes the majority vote by using the mode of the class predicted by the individual trees.
Breiman’s Algorithm

1.Draw a bootstrap sample (n times with replacement from the N samples in the training set)
from the dataset
2.Train a decision tree
-Until the tree is fully grown (maximum size)
-Choose next leaf node
-Select m attributes (m is much less than the total number of attributes M) at random.
-Choose the best attribute and split as usual
3.Measure out-of-bag error
- Use the rest of the samples (not selected in the bootstrap) to estimate the error of
the tree, by predicting their classes.
4.Repeat steps 1-3 k times to generate k trees.
5.Make a prediction by majority vote among the k trees
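Using scikit-learn, the ensemble described by Breiman's algorithm can be sketched as below; n_estimators corresponds to the k trees, max_features to the m attributes sampled at each split, and oob_score exposes the out-of-bag error estimate. The dataset is generated purely for illustration.

# Random Forest classification sketch with scikit-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=50,     # k trees, each on a bootstrap sample
                                max_features='sqrt', # m attributes chosen at random per split
                                oob_score=True,      # estimate error on out-of-bag samples
                                random_state=0)
forest.fit(X, y)

print("Out-of-bag score:", forest.oob_score_)
print(forest.predict(X[:5]))                         # majority vote among the trees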
Support Vector Machine
• Support Vector Machine (SVM) is a supervised machine
learning approach used for classification and regression.
• The basic form of SVM is a binary classifier that classifies the data points into one of the two classes.
• SVM training involves determining the maximum
margin hyperplane that separates the two classes.
• The maximum margin hyperplane is one which has the
largest separation from the nearest training data point.
• Given a training data set (xi ,yi ) where xi is an n dimensional
vector and yi = 1 if xi is in class 1 and yi = -1 if xi is in class 2.
• A standard SVM finds a hyperplane w.x-b = 0, which correctly
separates the training data points and has a maximum margin
which is the distance between the two hyperplanes w.x-b = 1
and w.x-b = -1
Support Vector Machine

Binary classification with Linear SVM Binary classification with RBF SVM
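A small scikit-learn sketch of binary classification with a linear and an RBF kernel SVM is shown below; the generated dataset is illustrative only.

# SVM binary classification sketch with scikit-learn
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)

linear_svm = SVC(kernel='linear', C=1.0).fit(X, y)    # maximum margin hyperplane w.x - b = 0
rbf_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)  # non-linear decision boundary

print("Linear SVM accuracy:", linear_svm.score(X, y))
print("RBF SVM accuracy:   ", rbf_svm.score(X, y))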
Recommendation Systems

• Recommendation systems are an important part of modern cloud applications such


as e-Commerce, social networks, content delivery networks, etc.
• Item-based or Content-based Recommendation
• Provides recommendations to users (for items such as books, movies, songs, or
restaurants) for unrated items based on the characteristics of the item.
• Collaborative Filtering
• Provides recommendations based on the ratings given by the user and other users
to similar items.
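A toy sketch of item-based similarity of the kind used in recommendation systems, computing cosine similarity over a small user-item rating matrix, is shown below; the ratings are made up for illustration.

# Item-item similarity sketch for a recommendation system
import numpy as np

# Rows = users, columns = items; 0 means the item is unrated by that user
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

n_items = ratings.shape[1]
similarity = np.array([[cosine(ratings[:, i], ratings[:, j])
                        for j in range(n_items)] for i in range(n_items)])

# Items most similar to item 0 (excluding itself) can be recommended
# to users who rated item 0 highly.
print(np.argsort(similarity[0])[::-1][1:])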
Multimedia
Cloud
Outline

• Reference architecture for Multimedia Cloud


• Case study of a live video streaming cloud application
• Case study of a video transcoding cloud application
Multimedia Cloud Reference Architecture
• Infrastructure Services
• In the Multimedia Cloud reference architecture, the first layer is the
infrastructure services layer that includes computing and storage resources.
• Platform Services
• On top of the infrastructure services layer is the platform services layer
that includes frameworks and services for streaming and associated tasks
such as transcoding and analytics that can be leveraged for rapid
development of multimedia applications.
• Applications
• The topmost layer is the applications such as live video streaming, video
transcoding, video-on-demand, multimedia processing etc.
• Cloud-based multimedia applications alleviates the burden of installing and
maintaining multimedia applications locally on the multimedia consumption
devices (desktops, tablets, smartphone, etc) and provide access to rich
multimedia content.
• Service Models
• A multimedia cloud can have various service models such as IaaS, PaaS
and SaaS that offer infrastructure, platform or application services.
Multimedia Cloud - Live Video Streaming
• Workflow of a live video streaming application that uses multimedia cloud:
• The video and audio feeds generated by a number of cameras and microphones are mixed/multiplexed with
video/audio mixers and then encoded by a client application which then sends the encoded feeds to the multimedia
cloud.
• On the cloud, streaming instances are created on-demand and the streams are then broadcast over the internet.
• The streaming instances also record the event streams which are later moved to the cloud storage for video
archiving.

Workflow for live video streaming using multimedia cloud


Streaming Protocols
• RTMP Dynamic Streaming (Unicast)
• High-quality, low-latency media streaming with support for live and on-demand and full adaptive bitrate.
• RTMPE (encrypted RTMP)
• Real-time encryption of RTMP.
• RTMFP (multicast)
• IP multicast encrypted with support for both ASM or SSM multicast for multicast-enabled network.
• RTMFP (P2P)
• P2P live video delivery between Flash Player clients.
• RTMFP (multicast fusion)
• IP and P2P working together to support higher QoS within enterprise networks.
• HTTP Dynamic Streaming (HDS)
• Enabling on-demand and live adaptive bitrate video streaming of standards-based MP4 media over regular HTTP
connections.
• Protected HTTP Dynamic Streaming (PHDS)
• Real-time encryption of HDS.
• HTTP Live Streaming (HLS)
• HTTP streaming to iOS devices or devices that support the HLS format; optional encryption with AES128 encryption
standard.
RTMP Streaming
• Real Time Messaging Protocol (RTMP) is a protocol for streaming audio, video and data over the Internet.
• The plain version of RTMP protocol works on top of TCP. RTMPS is a secure variation of RTMP that works
over TLS/SSL.
• RTMP provides a bidirectional message multiplex service over a reliable stream transport, such as TCP.
• RTMP maintains persistent TCP connections that allow low-latency communication.
• RTMP is intended to carry parallel streams of video, audio, and data messages, with associated timing
information, between a pair of communicating peers.
• Streams are split into fragments so that the streams can be delivered smoothly.
• The size of the stream fragments is either fixed or negotiated dynamically between the client and server.
• Default fragment sizes used are 64-bytes for audio data, and 128 bytes for video data.
• RTMP implementations typically assign different priorities to different classes of messages, which can
affect the order in which messages are enqueued to the underlying stream transport when transport
capacity is constrained.
HTTP Live Streaming
• HTTP Live Streaming (HLS) can dynamically adjust playback quality to match the available speed of
wired or wireless networks.
• HLS supports multiple alternate streams at different bit rates, and the client software can switch
streams intelligently as network bandwidth changes.
• HLS also provides for media encryption and user authentication over HTTPS, allowing publishers to
protect their work.
• The protocol works by splitting the stream into small chunks which are specified in a playlist file.
• Playlist file is an ordered list of media URIs and informational tags.
• The URIs and their associated tags specify a series of media segments.
• To play the stream, the client first obtains the playlist file and then obtains and plays each media
segment in the playlist.
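An illustrative, simplified media playlist of the kind described above might look like the following; the segment names and durations are assumptions.

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:9.9,
segment0.ts
#EXTINF:9.9,
segment1.ts
#EXTINF:9.9,
segment2.ts
#EXT-X-ENDLIST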
HTTP Dynamic Streaming
• HTTP Dynamic Streaming (HDS) enables on-demand and live adaptive bitrate video delivery of standards-based MP4 media (H.264 or VP6) over regular HTTP connections.
• HDS combines HTTP (progressive download) and RTMP (streaming download) to provide the ability to deliver video content in a streaming manner over HTTP.
• HDS supports adaptive bitrate which allows HDS to detect the client’s bandwidth and
computer resources and serve content fragments encoded at the most appropriate
bitrate for the best viewing experience.
• HDS supports high-definition video up to 1080p, with bitrates from 700 kbps up to
and beyond 6 Mbps, using either H.264 or VP6 video codecs, or AAC and MP3 audio
codecs.
• HDS allows leveraging existing caching infrastructures, content delivery networks
(CDNs) and standard HTTP server hardware to deliver on-demand and live content.
Live Video Steaming App – Case Study
• Functionality
• Live video streaming application allows on-demand creation of video streaming instances in the cloud.
• Development
• The live streaming application is created using the Django framework and uses Amazon EC2 cloud instances.
• For video stream encoding and publishing, the Adobe Flash Media Live Encoder and Flash Media Server are used
Video Transcoding App – Case Study
• Functionality
• Video transcoding application is based on multimedia cloud.
• The transcoding application allows users to upload video files and choose the conversion presets.
• Development
• The application is built upon the Amazon Elastic Transcoder.
• Elastic Transcoder is highly scalable, relatively easy to use service from Amazon that allows converting video files from
their source format into versions that will playback on mobile devices like smartphones, tablets and PCs.
Video Transcoding App – Demo
Cloud Application
Benchmarking & Tuning
Outline

• Cloud application workload characteristics


• Performance metrics for cloud applications
• Cloud application testing
• Performance testing tools
• Load test and bottleneck detection case study
Benchmarking
• Benchmarking of cloud applications is important for the following reasons:
• Provisioning and capacity planning
• The process of provisioning and capacity planning for cloud applications involves determining the
amount of computing, memory and network resources to provision for the application.
• Benchmarking can help in comparing alternative deployment architectures and choosing the best
and most cost effective deployment architecture that can meet the application performance
requirements.
• Ensure proper utilization of resources
• Benchmarking can help in determining the utilization of computing, memory and network
resources for applications and identify resources which are either under-utilized or over-
provisioned and hence save deployments costs.
• Market readiness of applications
• Performance of an application depends on the characteristics of the workloads it experiences. Different types of workloads can lead to different performance for the same application.
• To ensure the market readiness of an application it is important to model all types of workloads the
application can experience and benchmark the application with such workloads.
Cloud Application Benchmarking - Steps
• Trace Collection/Generation
• The first step in benchmarking cloud applications is to collect/generate traces of real application workloads. For
generating a trace of workload, the application is instrumented to log information such as the requests
submitted by the users, the time-stamps of the requests, etc.
• Workload Modeling
• Workload modeling involves creation of mathematical models that can be used for generation of synthetic
workloads.
• Workload Specification
• Since the workload models of each class of cloud computing applications can have different workload
attributes, a Workload Specification Language (WSL) is often used for specification of application workloads.
WSL can provide a structured way for specifying the workload attributes that are critical to the performance of
the applications. WSL can be used by synthetic workload generators for generating workloads with slightly
varying the characteristics.
• Synthetic Workload Generation
• Synthetic workloads are used for benchmarking cloud applications. An important requirement for a synthetic
workload generator is that the generated workloads should be representative of the real workloads.
Synthetic Workload Generation Approaches

• Empirical approach
• In this approach traces of applications are sampled and replayed to generate the synthetic workloads.
• The empirical approach lacks flexibility as the real traces obtained from a particular system are used for
workload generation which may not well represent the workloads on other systems with different
configurations and load conditions.

• Analytical approach
• Uses mathematical models to define the workload characteristics that are used by a synthetic workload
generator.
• Analytical approach is flexible and allows generation of workloads with different characteristics by
varying the workload model attributes.
• With the analytical approach it is possible to modify the workload model parameters one at a time and
investigate the effect on application performance to measure the application sensitivity to different
parameters.
User Emulation vs Aggregate Workloads
The commonly used techniques for workload generation are:
• User Emulation
• Each user is emulated by a separate thread that mimics the actions of a user by alternating between making
requests and lying idle.
• The attributes for workload generation in the user emulation method include think time, request types, inter-
request dependencies, for instance.
• User emulation allows fine-grained control over modeling the behavioral aspects of the users interacting with the system under test; however, it does not allow controlling the exact time instants at which the requests arrive at the system.
• Aggregate Workload Generation:
• Allows specifying the exact time instants at which the requests should arrive at the system under test.
• However, there is no notion of an individual user in aggregate workload generation, therefore, it is not possible
to use this approach when dependencies between requests need to be satisfied.
• Dependencies can be of two types inter-request and data dependencies.
• An inter-request dependency exists when the current request depends on the previous request, whereas a data dependency exists when the current request requires input data which is obtained from the response of the previous request.
Workload Characteristics
• Session
• A set of successive requests submitted by a user constitute a session.
• Inter-Session Interval
• Inter-session interval is the time interval between successive sessions.
• Think Time
• In a session, a user submits a series of requests in succession. The time interval between
two successive requests is called think time.
• Session Length
• The number of requests submitted by a user in a session is called the session length.
• Workload Mix
• Workload mix defines the transitions between different pages of an application and the
proportion in which the pages are visited.
Application Performance Metrics

The most commonly used performance metrics for cloud applications are:

• Response Time
• Response time is the time interval between the moment when the user submits a
request to the application and the moment when the user receives a response.

• Throughput
• Throughput is the number of requests that can be serviced per second.
Considerations for Benchmarking Methodology
• Accuracy
• Accuracy of a benchmarking methodology is determined by how closely the generated synthetic workloads
mimic the realistic workloads.
• Ease of Use
• A good benchmarking methodology should be user friendly and should involve minimal hand coding effort
for writing scripts for workload generation that take into account the dependencies between requests,
workload attributes, for instance.
• Flexibility
• A good benchmarking methodology should allow fine grained control over the workload attributes such as
think time, inter-session interval, session length, workload mix, for instance, to perform sensitivity analysis.
• Sensitivity analysis is performed by varying one workload characteristic at a time while keeping the others
constant.
• Wide Application Coverage
• A good benchmarking methodology is one that works for a wide range of applications and not tied to the
application architecture or workload types.
Types of Tests
• Baseline Tests
• Baseline tests are done to collect the performance metrics data of the entire application or a component of the
application.
• The performance metrics data collected from baseline tests is used to compare various performance tuning
changes which are subsequently made to the application or a component.
• Load Tests
• Load tests evaluate the performance of the system with multiple users and workload levels that are encountered in
the production phase.
• The number of users and workload mix are usually specified in the load test configuration.
• Stress Tests
• Stress tests load the application to a point where it breaks down.
• These tests are done to determine how the application fails, the conditions in which the application fails and the
metrics to monitor which can warn about impending failures under elevated workload levels.
• Soak Tests
• Soak tests involve subjecting the application to a fixed workload level for long periods of time.
• Soak tests help in determining the stability of the application under prolonged use and how the performance changes
with time.
Deployment Prototyping
• Deployment prototyping can help in making deployment architecture design choices.
• By comparing performance of alternative deployment architectures, deployment
prototyping can help in choosing the best and most cost effective deployment
architecture that can meet the application performance requirements.

• Deployment design is an iterative process that involves the following steps:


• Deployment Design
• Create the deployment with various tiers as specified in the deployment configuration
and deploy the application.
• Performance Evaluation
• Verify whether the application meets the performance requirements with the
deployment.
• Deployment Refinement
• Deployments are refined based on the performance evaluations. Various alternatives
can exist in this step such as vertical scaling, horizontal scaling, for instance.
Performance Evaluation Workflow
Semi-Automated Workflow (Traditional Approach)
• In traditional approach to capture workload characteristics, a real user’s
interactions with a cloud application are first recorded as virtual user
scripts.
• The recorded virtual user scripts then are parameterized to account for
randomness in application and workload parameters.
• Multiple scripts have to be recorded to create different workload scenarios.
This approach involves a lot of manual effort.
• To add new specifications for workload mix and new requests, new scripts need
to be recorded and parameterized.
• Traditional approaches which are based on manually generating virtual user
scripts by interacting with a cloud application, are not able to generate synthetic
workloads which have the same characteristics as real workloads.
• Traditional approaches do not allow rapidly comparing various deployment architectures.
Performance Evaluation Workflow
Fully-Automated Workflow (Modern Approach)
• In the automated approach real traces of a multi-tier application which are logged
on web servers, application servers and database servers are analyzed to generate
benchmark and workload models that capture the cloud application and workload
characteristics.
• A statistical analysis of the user requests in the real traces is performed to identify
the right distributions that can be used to model the workload model attributes.
• Real traces are analyzed to generate benchmark and workload models.
• Various workload scenarios can be created by changing the specifications of the
workload model.
• Since real traces from a cloud application are used to capture workload and
application characteristics into workload and benchmark models, the generated
synthetic workloads have the same characteristics as real workloads.
• An architecture model captures the deployment configurations of multi-tier
applications.
Benchmarking Case Study
• Fig (a) shows the average throughput and response time. The observed throughput increases as the demanded request rate increases. As more requests are served per second by the application, the response time also increases. The observed throughput saturates beyond a demanded request rate of 50 req/sec.
• Fig (b) shows the CPU usage density of one of the application servers. This plot shows that the application server CPU is a non-saturated resource.
• Fig (c) shows the database server CPU usage density. From this density plot we observe that
the database CPU spends a large percentage of time at high utilization levels for demanded
request rate more than 40 req/sec.
• Fig (d) shows the density plot of the database disk I/O bandwidth.
• Fig (e) shows the network out rate for one of the application servers
• Fig (f) shows the density plot of the network out rate for the database server. From this plot
we observe a continuous saturation of the network out rate around 200 KB/s.
• Analysis
• Throughput continuously increases as the demanded request rate increases from 10 to
40 req/sec. Beyond 40 req/sec demanded request rate, we observe that throughput
saturates, which is due to the high CPU utilization density of the database server CPU.
From the analysis of density plots of various system resources we observe that the
database CPU is a system bottleneck.
UNIT -V
Cloud Security
Introduction: -
More and more organizations are moving their applications and associated data to the cloud to reduce costs and the operational and maintenance overheads, and one of the important considerations is the security of the data in the cloud. Most cloud service providers implement advanced security features similar to those that exist in in-house IT environments. However, due to the outsourced nature of the cloud, resource pooling and multi-tenanted architectures, security remains an important concern in the adoption of cloud computing. In addition to the traditional vulnerabilities that exist for web applications, cloud applications have additional vulnerabilities because of the shared usage of resources and virtualized resources. Key security challenges for cloud applications include:
1)Authentication
2)Authorization
3)Security of Data at Rest
4)Security of Data at Motion
5)Data Integrity
6)Auditing
1)Authentication:-
Authentication refers to digitally confirming the identity of the entity requesting access to
some protected information.
In a traditional in-house IT environment, authentication policies are under the control of the organization. However, in cloud computing environments, where applications and data are accessed over the internet, the complexity of digital authentication mechanisms increases rapidly.
2)Authorization:-
Authorization refers to digitally specifying the access rights to the protected resources using access policies. In a traditional in-house IT environment, the access policies are controlled by the organization and can be altered at their convenience.
An organization, for example, can provide different access policies for different
departments.
Authorization in a cloud computing environment requires the use of the cloud service
providers services for specifying the access policies.
3)Security of Data at Rest:-
Due to the multi-tenant environments used in the cloud, the application and database servers of different applications belonging to different organizations can be provisioned side-by-side, increasing the complexity of securing the data.
Appropriate separation mechanisms are required to ensure the isolation between applications and data from different organizations.

4)Security of Data at Motion:-


In traditional in-house IT environments, all the data exchanged between the applications and users remains within the organization's control and geographical boundaries. Organizations believe that they have complete visibility of all the data exchanged and control the IT infrastructure. With the adoption of the cloud model, the applications and the data are moved out of the in-house IT infrastructure to the cloud provider. In such a scenario, organizations have to access their applications over the internet, with the data moving in and out of the cloud. Therefore, appropriate security mechanisms are required to ensure the security of data while in motion.
5)Data Integrity: -
Data integrity ensures that the data is not altered in an unauthorized manner after it is created, transmitted or stored.
Due to the outsourcing of data storage in cloud computing environments, ensuring the integrity of data is important. Appropriate mechanisms are required for detecting accidental and/or intentional changes in the data.
6)Auditing
Auditing is very important for applications deployed in cloud computing environments.
In traditional in-house IT environments, organizations have complete visibility of their
applications and accesses to the protected information.
For cloud applications, appropriate auditing mechanisms are required to get visibility into the application, data accesses and actions performed by the application users, including mobile users and devices such as wireless laptops and smartphones.

CSA Cloud Security Architecture


Introduction: -
The Cloud Security Alliance (CSA) provides a Trusted Cloud Initiative (TCI) Reference Architecture, which is a methodology and a set of tools that enable cloud application developers and security architects to assess where their internal IT and their cloud providers are in terms of security capabilities, and to plan a roadmap to meet the security needs of their business.
The Security and Risk Management (SRM) domain within the TCI Reference
Architecture provides the core components of an organization’s information security program to
safeguard assets and detect, assess, and monitor risks inherent in operating activities.
The sub-domains of SRM include:
1)Governance, Risk Management and Compliance:-
This sub-domain deals with the identification and implementation of the
appropriate organizational structures, processes, and controls to maintain effective information
security governance, risk management and compliance.
2)Information Security Management:-
This sub-domain deals with the implementation of appropriate measures
(such as capability maturity models, capability mapping models, security architecture roadmaps
and risk portfolios) in order to minimize or eliminate the impact that security related threats and
vulnerabilities might have on an organization.
3)Privilege Management Infrastructure:-
The objective of this sub-domain is to ensure that users have access and privileges
required to execute their duties and responsibilities with Identity and Access Management (IAM)
functions such as identity management, authentication services, authorization services, and
privilege usage management.
4)Threat and Vulnerability Management:-
This sub-domain deals with core security functions such as vulnerability management, threat
management, compliance testing, and penetration testing.
5)Infrastructure Protection Service:-
The objective of this sub-domain is to secure the Server, End-Point, Network and
Application layers.
6)Data Protection
This sub-domain deals with data lifecycle management, data leakage prevention,
intellectual property protection with digital rights management, and cryptographic services such
as key management and PKI/symmetric encryption.
7)Policies and Standards:-
Security policies and standards are derived from risk-based
business requirements and exist at a number of different levels, including the Information Security
Policy, Physical Security Policy, Business Continuity Policy, Infrastructure Security Policies and
Application Security Policies, as well as the over-arching Business Operational Risk
Management Policy.
The below diagram shows the SRM domain within the TCI Reference Architecture of the CSA.
Authentication
Introduction: -
• Authentication refers to confirming the digital identity of the entity requesting access to
some protected information.
• The process of authentication involves, but is not limited to, validating at least one factor
of identification of the entity to be authenticated.
• A factor can be something the entity or the user knows (password or pin), something the
user has (such as a smart card), or something that can uniquely identify the user (such as
fingerprints).
• In multi-factor authentication, more than one of these factors is used for authentication.
• There are various mechanisms for authentication including:
-SSO
-OTP
1)Single Sign On (SSO):-
• Single Sign-on (SSO) enables users to access multiple systems or applications after
signing in only once, for the first time.
• When a user signs in, the user identity is recognized and there is no need to sign in again
and again to access related systems or applications.
• Since different systems or applications may internally use different authentication
mechanisms, SSO, upon receiving the initial credentials, translates them into the different
credentials required by each system or application.
• The benefit of using SSO is that it reduces human error and saves the time spent in
authenticating with different systems or applications for the same identity.
• There are different implementation mechanisms:
• SAML-Token
• Kerberos
a) SAML Token: -
• Security Assertion Markup Language (SAML) is an XML-based open standard data
format for exchanging security information (authentication and authorization data)
between an identity provider and a service provider.
• SAML-token based SSO authentication works as follows:
• When a user tries to access the cloud application, a SAML request is
generated and the user is redirected to the identity provider.
• The identity provider parses the SAML request and authenticates the user. A SAML
token is returned to the user, who then accesses the cloud application with the
token.
• SAML prevents man-in-the-middle and replay attacks by requiring the use of SSL
encryption when transmitting assertions and messages.
• SAML also provides a digital signature mechanism that enables the assertion
to have a validity time range to prevent replay attacks.

The below diagram shows the Authentication flow for a Cloud Application using SAML SSO
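To make the validity-window idea concrete, below is a minimal sketch (not part of the original notes) that checks the NotBefore/NotOnOrAfter conditions of a SAML 2.0 assertion using only the Python standard library; signature verification, which a real service provider must also perform, is omitted, and the assertion string is assumed to have been supplied by the identity provider.

import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

def assertion_within_validity_window(assertion_xml: str) -> bool:
    # Parse the assertion and locate its <Conditions> element.
    root = ET.fromstring(assertion_xml)
    conditions = root.find(".//saml:Conditions", SAML_NS)
    if conditions is None:
        return False
    # SAML timestamps are UTC, e.g. "2024-01-01T10:00:00Z".
    not_before = datetime.fromisoformat(conditions.attrib["NotBefore"].replace("Z", "+00:00"))
    not_on_or_after = datetime.fromisoformat(conditions.attrib["NotOnOrAfter"].replace("Z", "+00:00"))
    now = datetime.now(timezone.utc)
    # Reject assertions presented outside their validity time range (replay protection).
    return not_before <= now < not_on_or_after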
b) Kerberos: -
• Kerberos is an open authentication protocol that was developed at MIT.
• Kerberos uses tickets for authenticating clients to services that communicate over an
insecure network.
• Kerberos provides mutual authentication, i.e., both the client and the server authenticate
with each other.
• The below diagram shows the Kerberos authentication flow:
2)One Time Password (OTP):-
• A one-time password is another authentication mechanism that uses passwords which are
valid for a single use only, within a single transaction or session.
• Authentication mechanisms based on OTP tokens are more secure because they are not
vulnerable to replay attacks.
• Text messaging (SMS) is the most common delivery mode for OTP tokens.
• The most common approach for generating OTP tokens is time synchronization.
• Time-based OTP algorithm (TOTP) is a popular time synchronization based algorithm
for generating OTPs.
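As an illustration of how time-synchronized OTPs work, the following is a minimal TOTP sketch in Python following RFC 4226/6238 (HMAC-SHA1 over a time-step counter); the Base32 secret shown is a made-up example, and a production system would normally rely on a vetted library plus server-side clock-drift handling.

import base64, hashlib, hmac, struct, time

def totp(secret_b32: str, period: int = 30, digits: int = 6) -> str:
    # Decode the shared secret (Base32 is the common provisioning format).
    key = base64.b32decode(secret_b32, casefold=True)
    # The moving factor is the number of time steps since the Unix epoch.
    counter = int(time.time()) // period
    msg = struct.pack(">Q", counter)
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226) picks 4 bytes based on the last nibble.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)

# Hypothetical shared secret held by both the token generator and the server.
print(totp("JBSWY3DPEHPK3PXP"))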

Authorization
Introduction: -
• Authorization refers to specifying the access rights to the protected resources using
access policies.
• OAuth:
o OAuth is an open standard for authorization that allows resource owners to
share their private resources stored on one site with another site without
handing out the credentials.
o In the OAuth model, an application (which is not the resource owner) requests
access to resources controlled by the resource owner (but hosted by the server).
o The resource owner grants permission to access the resources in the form of a
token and matching shared-secret.
o Tokens make it unnecessary for the resource owner to share its credentials with
the application.
o Tokens can be issued with a restricted scope and limited lifetime, and
revoked independently.
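The token exchange at the heart of OAuth 2.0 can be sketched as follows; the endpoint URL, client credentials and authorization code are hypothetical placeholders, and the example assumes the widely used authorization-code grant together with the third-party requests library.

import requests

# Hypothetical values obtained during client registration and user consent.
TOKEN_URL = "https://auth.example.com/oauth/token"
payload = {
    "grant_type": "authorization_code",
    "code": "AUTH_CODE_RETURNED_BY_PROVIDER",
    "redirect_uri": "https://app.example.com/callback",
    "client_id": "my-client-id",
    "client_secret": "my-client-secret",
}

# The application exchanges the short-lived code for an access token;
# the resource owner's password is never shared with the application.
response = requests.post(TOKEN_URL, data=payload, timeout=10)
token = response.json()["access_token"]

# The token (with limited scope and lifetime) is then sent with API calls.
api = requests.get(
    "https://api.example.com/v1/photos",
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)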

Identity and Access Management


Introduction: -
• Identity management provides consistent methods for digitally identifying
persons and maintaining associated identity attributes for the users across
multiple organizations.
• Access management deals with user privileges.
• Identity and access management deal with user identities, their
authentication, authorization and access policies.
• Federated Identity Management
o Federated identity management allows users of one domain to securely access data or
systems of another domain seamlessly without the need for maintaining identity
information separately for multiple domains.
o Federation is enabled through the use of single sign-on mechanisms such as SAML tokens and
Kerberos.
• Role-based access control
o Used for restricting access to confidential information to authorized users.
o These access control policies allow defining different roles for different users.

Below Diagram shows an example of the Role Based Access Control Framework
in the cloud.
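A minimal sketch of role-based access control is given below; the roles, permissions and users are invented for illustration, and real deployments would typically rely on the IAM service of the cloud provider rather than an in-application table.

# Hypothetical role-to-permission mapping defined by an access policy.
ROLE_PERMISSIONS = {
    "admin":   {"read", "write", "delete", "manage_users"},
    "analyst": {"read"},
    "editor":  {"read", "write"},
}

# Hypothetical user-to-role assignments.
USER_ROLES = {
    "alice": "admin",
    "bob": "analyst",
}

def is_allowed(user: str, permission: str) -> bool:
    # A request is allowed only if the user's role grants the permission.
    role = USER_ROLES.get(user)
    return permission in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("bob", "write"))     # False: analysts may only read
print(is_allowed("alice", "delete"))  # True: admins may delete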
Data Security
Introduction: -
• Securing data in the cloud is critical for cloud applications as the data flows
from applications to storage and vice versa. Cloud applications deal with both
data at rest and data in motion.

• There are various types of threats to data in the cloud, such as
denial of service, replay attacks, man-in-the-middle attacks, and unauthorized
access/modification.

1)Securing Data at Rest: -


• Data at rest is the data that is stored in a database in the form of tables/records,
as files on a file server, or as raw data on distributed storage or a storage area network
(SAN).

• Data at rest is secured by encryption.

• Encryption is the process of converting data from its original form (i.e.,
plaintext) to a scrambled form (ciphertext) that is unintelligible. Decryption
converts data from ciphertext to plaintext.

• Encryption can be of two types:


• Symmetric Encryption (symmetric-key algorithms)
• Asymmetric Encryption (public-key algorithms)

a) Symmetric Encryption:
• Symmetric encryption uses the same secret key for both encryption and
decryption.

• The secret key is shared between the sender and the receiver.

• Symmetric encryption is best suited for securing data at rest since the data is
accessed by known entities from known locations.

• Popular symmetric encryption algorithms include:


• Advanced Encryption Standard (AES)
• Twofish
• Blowfish
• Triple Data Encryption Standard (3DES)
• Serpent
• RC6
• MARS
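A minimal sketch of symmetric encryption for data at rest, using AES in GCM mode from the third-party cryptography package, is shown below; key storage and key management are deliberately left out, and in practice the key would come from a key management service rather than being generated inline.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# A 256-bit AES key; in practice this would be fetched from a key manager.
key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

record = b"customer-id=42; balance=1000"
nonce = os.urandom(12)  # a unique nonce is required for every encryption

# AES-GCM provides confidentiality plus an integrity tag in one step.
ciphertext = aesgcm.encrypt(nonce, record, None)
plaintext = aesgcm.decrypt(nonce, ciphertext, None)
assert plaintext == record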

b) Asymmetric Encryption:
• Asymmetric encryption uses two keys, one for encryption (public key) and
other for decryption (private key).
• The two keys are linked to each other such that one key encrypts plaintext to
ciphertext and other decrypts ciphertext back to plaintext.
• Public key can be shared or published while the private key is known only to
the user.
• Asymmetric encryption is best suited for securing data that is exchanged
between two parties where symmetric encryption can be unsafe because the
secret key has to be exchanged between the parties and anyone who manages
to obtain the secret key can decrypt the data.
• In asymmetric encryption a separate key is used for decryption which is kept
private.
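The following sketch, again based on the cryptography package, illustrates asymmetric (public-key) encryption with RSA-OAEP; it is only meant to show the public/private key split, and because RSA can encrypt only small payloads, real systems normally encrypt a symmetric session key this way rather than the data itself.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# The receiver generates the key pair and publishes only the public key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

# Anyone can encrypt with the public key...
ciphertext = public_key.encrypt(b"per-session symmetric key", oaep)
# ...but only the holder of the private key can decrypt.
plaintext = private_key.decrypt(ciphertext, oaep)
assert plaintext == b"per-session symmetric key"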
c)Encryption Levels:
Encryption can be performed at various levels described as follows:
- Application
- Host
- Network
- Device
Application:
• Application-level encryption involves encrypting application data right at the point where
it originates, i.e., within the application.
• Application-level encryption protects data from threats at the level of the operating system
as well as from other applications.
• An application encrypts all data generated in the application before it flows to the lower
levels and presents decrypted data to the user.
Host:
• In host-level encryption, encryption is performed at the file-level for all applications running
on the host.
• Host-level encryption can be done in software, in which case additional computational resources
are required for encryption, or it can be performed with specialized hardware such as a
cryptographic accelerator card.

Network:
• Network-level encryption is best suited for cases where the threats to data are at the network or
storage level and not at the application or host level.
• Network-level encryption is performed when moving the data from its creation point to its
destination, using specialized hardware that encrypts all incoming data in real time.

Device:
• Device-level encryption is performed on a disk controller or a storage server.
• Device-level encryption is easy to implement and is best suited for cases where the primary
concern about data security is to protect data residing on storage media.

Below Diagram shows various Encryption levels:

2)Securing Data in Motion: -


• Securing data in motion, i.e., when the data flows between a client and a server over a
potentially insecure network, is important to ensure data confidentiality and integrity.
Data confidentiality:
Data confidentiality means limiting the access to data so that only authorized recipients
can access it.

Data integrity:
Data integrity means that the data remains unchanged when moving from sender to
receiver, i.e., the data is not altered in an unauthorized manner after it is created, transmitted
or stored.
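As a small illustration of an integrity check, the sketch below uses an HMAC over the message with a shared secret (a simplified stand-in for the message authentication codes used inside TLS); the key and message are made-up values.

import hashlib, hmac

shared_key = b"hypothetical-shared-secret"
message = b"transfer 100 units to account 42"

# Sender computes a tag and transmits it alongside the message.
tag = hmac.new(shared_key, message, hashlib.sha256).hexdigest()

# Receiver recomputes the tag and compares in constant time;
# any unauthorized change to the message invalidates the tag.
expected = hmac.new(shared_key, message, hashlib.sha256).hexdigest()
print(hmac.compare_digest(tag, expected))  # True only if the data is unchanged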

Transport Layer Security (TLS) and Secure Socket Layer (SSL) are the mechanisms used for
securing data in motion.

Below diagram shows the TLS Handshake protocol:

TLS and SSL are used to encrypt web traffic carried over the Hypertext Transfer Protocol (HTTP), i.e., HTTPS.

TLS and SSL use asymmetric cryptography for authentication of key exchange, symmetric
encryption for confidentiality and message authentication codes for message integrity.
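To show what securing data in motion looks like at the socket level, here is a minimal sketch using Python's standard ssl module to open a TLS connection; example.com is just a placeholder host, and the default context performs certificate and hostname verification.

import socket
import ssl

HOST = "example.com"  # placeholder server name

# create_default_context() enables certificate validation and sane defaults.
context = ssl.create_default_context()

with socket.create_connection((HOST, 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
        # The handshake has completed; traffic is now encrypted in transit.
        print(tls_sock.version(), tls_sock.cipher())
        tls_sock.sendall(
            b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
        )
        print(tls_sock.recv(200))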
Key Management
Introduction: -
Management of encryption keys is critical to ensure security of encrypted
data. The key management lifecycle involves different phases including:
Creation: Creation of keys is the first step in the key management lifecycle. Keys must be
created in a secure environment and must have adequate strength. It is recommended to
encrypt the keys themselves with a separate master key.
Backup: Backup of keys must be made before putting them into production because in the
event of loss of keys, all encrypted data can become useless.
Deployment: In this phase the new key is deployed for encrypting the data. Deployment of
a new key involves re-keying existing data.
Monitoring: After a key has been deployed, monitoring the performance of the encryption
environment is done to ensure that the key has been deployed correctly.
Rotation: Key rotation involves creating a new key and re-encrypting all data with the new
key.
Expiration: Key expiration phase begins after the key rotation is complete. It is
recommended to complete the key rotation process before the expiry of the existing key.
Archival: Archival is the phase before the key is finally destroyed. It is recommended to
archive old keys for some period of time to account for scenarios where there is still some
data in the system that is encrypted with the old key.
Destruction: Expired keys are finally destroyed after ensuring that there is no data
encrypted with the expired keys.
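The rotation phase can be illustrated with the Fernet/MultiFernet helpers from the cryptography package, as in the hedged sketch below: data encrypted under an old key is re-encrypted under a new key while both keys temporarily remain available.

from cryptography.fernet import Fernet, MultiFernet

# Existing key currently protecting stored data.
old_key = Fernet(Fernet.generate_key())
token = old_key.encrypt(b"archived record")

# Rotation: create a new key and list it first so it is used for encryption,
# while the old key is still accepted for decryption.
new_key = Fernet(Fernet.generate_key())
keyring = MultiFernet([new_key, old_key])

rotated_token = keyring.rotate(token)  # re-encrypted under new_key
assert keyring.decrypt(rotated_token) == b"archived record"

# Once all data has been rotated, the old key can be expired, archived
# and eventually destroyed.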

Below diagram shows an example of the key Management approach:


Auditing
Introduction: -
• Auditing is mandated by most data security regulations.

• Auditing requires that all read and write accesses to data be logged.

• Logs can include the user involved, type of access, timestamp, actions performed and
records accessed.

• The main purpose of auditing is to find security breaches, so that necessary changes can be made in
the application and its deployment to prevent further security breaches.
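A minimal sketch of the kind of audit record such logging might produce is shown below; the field names and the log destination are illustrative only, and production systems would normally ship these records to a tamper-evident, centrally managed log store.

import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="audit.log", level=logging.INFO, format="%(message)s")
audit_log = logging.getLogger("audit")

def record_access(user: str, action: str, record_id: str, granted: bool) -> None:
    # One structured entry per read/write access, as required for auditing.
    audit_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,          # e.g. "read" or "write"
        "record": record_id,
        "granted": granted,
    }))

record_access("alice", "read", "patient/1001", True)
record_access("mallory", "write", "patient/1001", False)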

Objectives:
The objectives of auditing include:
• Verify the efficiency and compliance of identity and access management controls as per established
access policies.
• Verify that authorized users are granted access to data and services based on their roles.
• Verify whether access policies are updated in a timely manner upon a change in the roles of the users.
• Verify whether the data protection policies are sufficient.
• Assess support activities such as problem management.

Cloud Computing for Education.


Introduction: -
Cloud computing is bringing a transformative impact in the field of education by
improving the reach of quality education to students through the use of online learning platforms
and collaboration tools.
In recent years, the concept of Massive Open Online Courses (MOOCs) has
been gaining popularity worldwide, with large numbers of students enrolling for online
courses.

Some of the Education programs running on the Cloud platforms are listed
below:
MOOCs
• MOOCs are aimed at large audiences and use cloud technologies for providing
audio/video content, readings, assignments and exams.
• Cloud-based auto-grading applications are used for grading exams and assignments.
Cloud-based applications for peer grading of exams and assignments are also used in
some MOOCs
Online Programs
• Many universities across the world are using cloud platforms for providing online
degree programs.
• Lectures are delivered through live/recorded video using cloud-based content delivery
networks to students across the world.
Online Proctoring
• Online proctoring for distance learning programs is also becoming popular through the
use of cloud-based live video streaming technologies where online proctors observe test
takers remotely through video.
Virtual Labs
• Access to virtual labs is provided to distance learning students through the cloud. Virtual
labs provide remote access to the same software and applications that are used by
students on campus.
Course Management Platforms
• Cloud-based course management platforms are used for sharing reading materials,
providing assignments and releasing grades, for instance.
• Cloud-based collaboration applications, such as online forums, can help students discuss
common problems and seek guidance from experts.
Information Management
• Universities, colleges and schools can use cloud-based information management systems
to improve administrative efficiency, offer online and distance education programs and
online exams, track the progress of students, and collect feedback from students, for instance.
Reduce Cost of Education
• Cloud computing thus has the potential of helping in bringing down the cost of
education by increasing the student-teacher ratio through the use of online learning
platforms and new evaluation approaches without sacrificing quality.

Below diagram shows the generic use of the Cloud for Education:
******ALL THE BEST******
