Cloud Computing
Benefits
Cloud Computing has numerous advantages. Some of them are listed below -
• One can access applications as utilities over the Internet.
• One can manipulate and configure applications online at any time.
• It does not require installing any software to access or manipulate cloud applications.
• Cloud Computing offers online development and deployment tools and programming runtime environments through the PaaS model.
• Cloud resources are available over the network in a manner that provides platform-independent access to any type of client.
• Cloud Computing offers on-demand self-service. The resources can be used without interaction with the cloud service provider.
• Cloud Computing is highly cost-effective because it operates at high efficiency with optimum utilization. It just requires an Internet connection.
• Cloud Computing offers load balancing that makes it more reliable.
Characteristics of Cloud Computing
There are four key characteristics of cloud computing. They are shown in the
following diagram:
Cloud Models:
There are certain services and models working behind the scenes that make cloud computing feasible and accessible to end users. Following are the working models for cloud computing:
• Deployment Models
• Service Models
Deployment Models
Deployment models define the type of access to the cloud, i.e., how the cloud is located. A cloud can have any of four types of access: Public, Private, Hybrid, and Community.
Public Cloud
The public cloud allows systems and services to be easily accessible to the general
public. Public cloud may be less secure because of its openness.
Private Cloud
The private cloud allows systems and services to be accessible within an
organization. It is more secured because of its private nature.
Community Cloud
The community cloud allows systems and services to be accessible by a group of
organizations.
Hybrid Cloud
The hybrid cloud is a mixture of public and private clouds, in which critical activities are performed using the private cloud while non-critical activities are performed using the public cloud.
Service Models
Cloud computing is based on service models. These are categorized into three
basic service models which are -
• Infrastructure-as–a-Service (IaaS)
• Platform-as-a-Service (PaaS)
• Software-as-a-Service (SaaS)
Anything-as-a-Service (XaaS) is yet another service model, which includes Network-
as-a-Service, Business-as-a-Service, Identity-as-a-Service, Database-as-a-Service or Strategy-
as-a-Service.
Infrastructure-as-a-Service (IaaS) is the most basic level of service. Each of the service models inherits the security and management mechanisms from the underlying model, as shown in the following diagram:
Infrastructure-as-a-Service (IaaS)
IaaS provides access to fundamental resources such as physical machines, virtual
machines, virtual storage, etc.
Platform-as-a-Service (PaaS)
PaaS provides the runtime environment for applications, development and
deployment tools, etc.
Software-as-a-Service (SaaS)
The SaaS model allows software applications to be used as a service by end users.
We need to understand the meaning of the word virtual. The word virtual means
that it is a representation of something physically present elsewhere.
Similarly, Virtualization in Cloud Computing is a technology that allows us to
create virtual resources such as servers, networks, and storage in the cloud. All these
resources are allocated from a physical machine that runs somewhere in the world, and
we'll get the software to provision and manage these virtual resources. These physical
machines are operated by cloud providers, who take care of maintenance, and hardware
supplies.
• Virtualization refers to the partitioning of the resources of a physical system (such as computing, storage, network and memory) into multiple virtual resources.
• It is a key enabling technology of cloud computing that allows pooling of resources.
• In cloud computing, resources are pooled to serve multiple users using
multi-tenancy.
The VMs function like digital files inside the physical device, and they can be moved from one system to another, thereby increasing portability. There are many open-source and paid hypervisors available. Cloud providers use them based on their requirements and business needs.
How Virtualization Works in Cloud Computing
Type-2 Hypervisor:
Type 2 hypervisors or hosted hypervisors run on top of a conventional
(main/host) operating system and monitor the guest operating systems.
3. Virtualization software
A tool that deploys virtualization on the device; this is the software that the user interacts with to specify virtual resource requirements. This software communicates the resource requirements to the hypervisor.
4. Virtual Networking
In virtual networking, the network that is configured inside the servers is separated logically. These networks can be scaled across multiple servers and can be controlled entirely by software.
Types of Virtualization:
Full Virtualization
Full Virtualization is virtualization in which the guest operating system is unaware that it is in a virtualized environment. The hardware is virtualized by the host operating system, so the guest can issue commands to what it thinks is actual hardware, but which are really just simulated hardware devices created by the host.
Para-Virtualization
Para-Virtualization is virtualization in which the guest operating system (the one being virtualized) is aware that it is a guest and accordingly has drivers that, instead of issuing hardware commands, simply issue commands directly to the host operating system. This includes things such as memory management as well.
Hardware Virtualization
Hardware assisted virtualization is enabled by hardware features such as Intel’s
Virtualization Technology (VT-x) and AMD’s AMD-V.
In hardware assisted virtualization, privileged and sensitive calls are set to
automatically trap to the hypervisor. Thus, there is no need for either binary translation
or para-virtualization.
• The second is a multiple-server solution, in which a scalable service system is built on a cluster of servers. It is more cost-effective and more scalable to build a server cluster system for network services.
Load balancing is beneficial with almost any type of service, such as HTTP, SMTP, DNS, FTP, and POP/IMAP. It also increases reliability through redundancy. The balancing service is provided by a dedicated hardware device or program. Cloud-based server farms can attain more precise scalability and availability using server load balancing.
Load balancing solutions can be categorized into two types –
1. Software-based load balancers: Software-based load balancers run on standard hardware (desktops, PCs) and standard operating systems.
2. Hardware-based load balancers: Hardware-based load balancers are dedicated boxes which include Application Specific Integrated Circuits (ASICs) adapted for a particular use. ASICs allow high-speed forwarding of network traffic and are frequently used for transport-level load balancing, because hardware-based load balancing is faster than software solutions.
Load Balancing Algorithms
• Round Robin load balancing
• Weighted Round Robin load balancing
• Low Latency load balancing
• Least Connections load balancing
• Priority load balancing
• Overflow load balancing
Figure: (a) Round Robin Load Balancing, (b) Weighted Round Robin Load Balancing, (c) Low Latency Load Balancing, (d) Least Connections Load Balancing
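The first two policies in the list above can be illustrated with a short, self-contained Python sketch. The server names and weights below are invented purely for illustration; this is a conceptual model, not a production load balancer.

```python
import itertools

# Hypothetical back-end servers and weights, for illustration only.
SERVERS = ["server-a", "server-b", "server-c"]
WEIGHTS = {"server-a": 3, "server-b": 1, "server-c": 1}

def round_robin(servers):
    """Cycle through the servers, handing out one per request."""
    return itertools.cycle(servers)

def weighted_round_robin(servers, weights):
    """Repeat each server according to its weight, then cycle."""
    expanded = [s for s in servers for _ in range(weights[s])]
    return itertools.cycle(expanded)

rr = round_robin(SERVERS)
wrr = weighted_round_robin(SERVERS, WEIGHTS)
print("Round robin:         ", [next(rr) for _ in range(6)])
print("Weighted round robin:", [next(wrr) for _ in range(10)])
```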
Example of cloud elasticity : Cloud elasticity refers to scaling up (or scaling down) the
computing capacity as needed. It basically helps you understand how well your
architecture can adapt to the workload in real time.
For example, 100 users log in to your website every hour. A single server can easily
handle this volume of traffic. However, what happens if 5000 users log in at the same time?
If your existing architecture can quickly and automatically provision new web servers to
handle this load, your design is elastic.
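A minimal sketch of that elastic decision, assuming a hypothetical capacity of 100 concurrent users per web server (the capacity figure and the simulated load values are illustrative, not from any real provider):

```python
# Assumed capacity: 100 concurrent users per web server (illustrative).
USERS_PER_SERVER = 100

def required_servers(active_users: int) -> int:
    """How many servers the current load needs (ceiling division)."""
    return max(1, -(-active_users // USERS_PER_SERVER))

def scale(current_servers: int, active_users: int) -> int:
    """Scale out when load rises, scale in when it falls."""
    needed = required_servers(active_users)
    if needed > current_servers:
        print(f"Scaling out: {current_servers} -> {needed} servers")
    elif needed < current_servers:
        print(f"Scaling in:  {current_servers} -> {needed} servers")
    return needed

servers = 1
for load in (100, 5000, 800, 100):   # simulated hourly login counts
    servers = scale(servers, load)
```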
As you can imagine, cloud elasticity comes in handy when your business
experiences sudden spikes in user activity and, with it, a drastic increase in workload
demand – as happens in businesses such as streaming services or e-commerce
marketplaces.
Take the video streaming service Netflix, for example. Netflix's architecture leverages the power of elasticity to scale up and down as viewer demand rises and falls.
Cloud Scalability
Cloud scalability only adapts to the workload increase through the incremental
provision of resources without impacting the system’s overall performance. This is built in
as part of the infrastructure design instead of makeshift resource allocation (as with cloud
elasticity).
Below are some of its main features:
• Typically handled by adding resources to existing instances, also known as scaling
up or vertical scaling, or by adding more copies of existing instances, also known
as scaling out or horizontal scaling
• Allows companies to implement big data models for machine learning (ML) and
data analysis
• Handles rapid and unpredictable changes in a scalable capacity
• Generally more granular and targeted than elasticity in terms of sizing
• Ideal for businesses with a predictable and preplanned workload where capacity
planning and performance are relatively stable
Example of cloud scalability : Cloud scalability has many examples and use cases. It
allows you to scale up or scale out to meet the increasing workloads. You can scale up a
platform or architecture to increase the performance of an individual server.
Usually, this means that hardware costs increase linearly with demand. On the flip side, you can also add multiple servers alongside an existing server and scale out to enhance overall performance and meet the growing demand.
Another good example of cloud scalability is a call center. A call center requires a
scalable application infrastructure as new employees join the organization and customer
requests increase incrementally. As a result, organizations need to add new server features
to ensure consistent growth and quality performance.
Cloud Elasticity vs. Cloud Scalability:
1. Elasticity is used just to fulfil a sudden requirement in the workload for a short period, whereas scalability is used to fulfil a static boost in the workload.
4. Elasticity is a short-term event used to deal with unplanned or sudden growth in demand, whereas scalability is a long-term event used to deal with expected growth in demand.
3. Diagonal scaling : Diagonal scaling involves horizontal and vertical scaling. It’s more
flexible and cost-effective as it helps add or remove resources as per existing workload
requirements. Adding and upgrading resources according to the varying system load and
demand provides better throughput and optimizes resources for even better performance.
• Cloud application deployment design is an iterative process that involves:
• Deployment Design
• The variables in this step include the number of servers in each tier, the computing, memory and storage capacities of servers, server interconnection, load balancing and replication strategies.
• Performance Evaluation
• To verify whether the application meets the performance
requirements with the deployment.
• Involves monitoring the workload on the application and measuring
various workload parameters such as response time and throughput.
• Utilization of servers (CPU, memory, disk, I/O, etc.) in each tier is also
monitored.
• Deployment Refinement
• Various alternatives can exist in this step such as vertical scaling (or
scaling up), horizontal scaling (or scaling out), alternative server
interconnections, alternative load balancing and replication
strategies, for instance.
Public Cloud: The name says it all. It is accessible to the public. Public deployment
models in the cloud are perfect for organizations with growing and fluctuating demands. It
also makes a great choice for companies with low-security concerns.
Thus, you pay a cloud service provider for networking services, compute virtualization and storage available on the public internet. It is also a great delivery model for teams doing development and testing. Its configuration and deployment are quick and easy, making it an ideal choice for test environments.
Benefits of Public Cloud
o Minimal Investment - As a pay-per-use service, there is no large upfront cost, and it is ideal for businesses that need quick access to resources.
o No Hardware Setup - The cloud service providers fully fund the entire infrastructure.
o No Infrastructure Management - Using the public cloud does not require an in-house team to manage the infrastructure.
Private Cloud: Companies that look for cost efficiency and greater control over data &
resources will find the private cloud a more suitable choice.
It means that it will be integrated with your data center and managed by your IT
team. Alternatively, you can also choose to host it externally. The private cloud offers
bigger opportunities that help meet specific organizations' requirements when it comes to
customization. It's also a wise choice for mission-critical processes that may have
frequently changing requirements.
Community Cloud: A community cloud allows systems and services to be accessible by a group of organizations that share common objectives and use cases. This type of cloud computing deployment model is managed and hosted internally or by a third-party vendor. However, you can also choose a combination of all three.
o Flexibility - With higher levels of flexibility, businesses can create custom solutions
that fit their exact requirements
Replication:
• Replication is used to create and maintain multiple copies of the data in the cloud.
• Cloud enables rapid implementation of replication solutions for disaster recovery
for organizations.
• With cloud-based data replication organizations can plan for disaster recovery
without making any capital expenditures on purchasing, configuring and managing
secondary site locations.
• Types:
• Array-based Replication
• Network-based Replication
• Host-based Replication
OpenFlow
• OpenFlow is the broadly accepted SDN protocol for the Southbound interface.
• With OpenFlow, the forwarding plane of the network devices can be directly
accessed and manipulated.
• OpenFlow uses the concept of flows to identify network traffic based on
pre-defined match rules.
• Flows can be programmed statically or dynamically by the SDN control software.
• OpenFlow protocol is implemented on both sides of the interface between the
controller and the network devices.
An OpenFlow switch comprises one or more flow tables and a group table, which perform packet lookups and forwarding, and an OpenFlow channel to an external controller.
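As a rough illustration of the flow and match-rule idea, the sketch below models a flow table as a list of match rules with actions. This is a conceptual toy model in Python, not the actual OpenFlow wire protocol, and all addresses and port names are made up.

```python
# Toy model of an OpenFlow-style flow table: each entry has match fields
# and an action. Illustrates matching traffic against pre-defined rules only.
FLOW_TABLE = [
    {"match": {"dst_ip": "10.0.0.5", "tcp_port": 80}, "action": "forward:port2"},
    {"match": {"dst_ip": "10.0.0.9"},                 "action": "forward:port3"},
    {"match": {},                                     "action": "drop"},  # table-miss entry
]

def lookup(packet: dict) -> str:
    """Return the action of the first flow entry whose match fields all agree."""
    for entry in FLOW_TABLE:
        if all(packet.get(k) == v for k, v in entry["match"].items()):
            return entry["action"]
    return "drop"

print(lookup({"dst_ip": "10.0.0.5", "tcp_port": 80}))  # forward:port2
print(lookup({"dst_ip": "192.168.1.1"}))               # drop (table-miss)
```

In a real deployment these entries would be programmed statically or pushed dynamically by the SDN controller over the OpenFlow channel.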
Network Function Virtualization:
• Network Function Virtualization (NFV) is a technology that leverages virtualization
to consolidate the heterogeneous network devices onto industry standard high
volume servers, switches and storage.
• Relationship to SDN
• NFV is complementary to SDN as NFV can provide the infrastructure on
which SDN can run.
• NFV and SDN are mutually beneficial to each other but not dependent.
• Network functions can be virtualized without SDN, similarly, SDN can run
without NFV.
• NFV comprises network functions implemented in software that run on virtualized resources in the cloud.
• NFV enables a separation of the network functions, which are implemented in software, from the underlying hardware.
NFV Architecture
• Key elements of the NFV architecture are
• Virtualized Network Function (VNF): VNF is a software implementation of a
network function which is capable of running over the NFV Infrastructure
(NFVI).
• NFV Infrastructure (NFVI): NFVI includes compute, network and storage resources that are virtualized.
• NFV Management and Orchestration: NFV Management and Orchestration
focuses on all virtualization-specific management tasks and covers the
orchestration and lifecycle management of physical and/or software
resources that support the infrastructure virtualization, and the lifecycle
management of VNFs.
MapReduce: MapReduce is a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
• Secondly, reduce task, which takes the output from a map as an input and
combines those data tuples into a smaller set of tuples. As the sequence of the
name MapReduce implies, the reduce task is always performed after the map job.
• The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes.
• Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial.
• But, once we write an application in the MapReduce form, scaling the application to
run over hundreds, thousands, or even tens of thousands of machines in a cluster is
merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
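The mapper/reducer decomposition can be sketched in plain Python with the classic word-count example. This is a local simulation of the model (including the shuffle step), not the Hadoop API itself:

```python
from collections import defaultdict

def mapper(line):
    """Map: break a line into (word, 1) key/value pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce: combine all values for one key into a smaller result."""
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return [reducer(k, v) for k, v in sorted(groups.items())]

print(map_reduce(["the cat sat", "the dog sat"]))
# [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```

Scaling this model out simply means running many mapper and reducer instances on different machines, which is exactly what the Hadoop framework automates.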
The Algorithm
• Generally MapReduce paradigm is based on sending the computer to where the
data resides!
• MapReduce program executes in three stages, namely map stage, shuffle stage,
and reduce stage.
o Map stage − The map or mapper’s job is to process the input data. Generally
the input data is in the form of file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the mapper function line by
line. The mapper processes the data and creates several small chunks of
data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from
the mapper. After processing, it produces a new set of output, which will be
stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks that reduces
the network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
Terminology
▶ PayLoad − Applications implement the Map and the Reduce functions, and form the
core of the job.
▶ Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value
pair.
▶ NamedNode − Node that manages the Hadoop Distributed File System (HDFS).
▶ DataNode − Node where data is presented in advance before any processing takes
place.
▶ MasterNode − Node where JobTracker runs and which accepts job requests from
clients.
▶ SlaveNode − Node where Map and Reduce program runs.
▶ JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
▶ Task Tracker − Tracks the task and reports status to JobTracker.
▶ Job − A program that is an execution of a Mapper and Reducer across a dataset.
▶ Task − An execution of a Mapper or a Reducer on a slice of data.
▶ Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.
Service level agreements in Cloud computing
A Service Level Agreement (SLA) is the bond for performance negotiated between the cloud services provider and the client. Earlier, in cloud computing, all Service Level Agreements were negotiated between a client and the service provider.
Service level agreements are also defined at different levels which are mentioned
below:
Customer-based SLA
Service-based SLA
Multilevel SLA
A few Service Level Agreements are enforceable as contracts, but most are agreements more along the lines of an Operating Level Agreement (OLA) and may not have the force of law. It is wise to have an attorney review the documents before making a major agreement with a cloud service provider. Service Level Agreements usually specify some parameters, which are mentioned below:
Availability of the Service (uptime)
Latency or the response time
Service component reliability
Accountability of each party
Warranties
Each individual component has its own Service Level Agreements. Below are two major
Service Level Agreements (SLA) described:
1. Windows Azure SLA – Windows Azure has different SLAs for compute and storage. For compute, there is a guarantee that when a client deploys two or more role instances in separate fault and upgrade domains, the client's internet-facing roles will have external connectivity at least 99.95% of the time. Moreover, all of the client's role instances are monitored, and there is a guarantee that 99.9% of the time it will be detected when a role instance's process is not running, so that corrective action can be initiated.
2. SQL Azure SLA – SQL Azure clients will have connectivity between the database and the internet gateway of SQL Azure. SQL Azure will maintain a "Monthly Availability" of 99.9% within a month. The Monthly Availability proportion for a particular tenant database is the ratio of the time the database was available to customers to the total time in a month. Time is measured in intervals of minutes over a 30-day monthly cycle. Availability is always calculated for a complete month. A portion of time is marked as unavailable if the customer's attempts to connect to a database are denied by the SQL Azure gateway.
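The monthly availability ratio described above is simply available time divided by total time. A small illustrative calculation follows; the downtime figure is made up and is not from any real SLA report:

```python
# Illustrative monthly availability calculation; the downtime value is invented.
minutes_in_month = 30 * 24 * 60          # 30-day monthly cycle
downtime_minutes = 40                    # hypothetical unavailable intervals

availability = (minutes_in_month - downtime_minutes) / minutes_in_month
print(f"Monthly availability: {availability:.4%}")   # ~99.9074% in this example
print("99.9% SLA met:", availability >= 0.999)
```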
SLA Lifecycle
2. Define SLA: In this step, the organization and the service provider define the service level objectives, metrics, and targets that will be used to measure the performance of the service provider.
3. Establish Agreement: After the service level requirements have been defined, an
agreement is established between the organization and the service provider
outlining the terms and conditions of the service. This agreement should include
the SLA, any penalties for non-compliance, and the process for monitoring and
reporting on the service level objectives.
4. Monitor SLA violation: This step involves regularly monitoring the service level
objectives to ensure that the service provider is meeting their commitments. If any
violations are identified, they should be reported and addressed in a timely manner.
5. Terminate SLA: If the service provider is unable to meet the service level objectives,
or if the organization is not satisfied with the service provided, the SLA can be
terminated. This can be done through mutual agreement or through the
enforcement of penalties for non-compliance.
6. Enforce penalties for SLA Violation: If the service provider is found to be in
violation of the SLA, penalties can be imposed as outlined in the agreement. These
penalties can include financial penalties, reduced service level objectives, or
termination of the agreement.
Advantages of SLA
1. Improved communication: A better framework for communication between the
service provider and the client is established through SLAs, which explicitly outline
the degree of service that a customer may anticipate. This can make sure that
everyone is talking about the same things when it comes to service expectations.
2. Increased accountability: SLAs give customers a way to hold service providers
accountable if their services fall short of the agreed-upon standard. They also hold
service providers responsible for delivering a specific level of service.
3. Better alignment with business goals: SLAs make sure that the service being given
is in line with the goals of the client by laying down the performance goals and
service level requirements that the service provider must satisfy.
4. Reduced downtime: SLAs can help to limit the effects of service disruptions by
creating explicit protocols for issue management and resolution.
5. Better cost management: By specifying the level of service that the customer can
anticipate and providing a way to track and evaluate performance, SLAs can help to
limit costs. Making sure the consumer is getting the best value for their money can
be made easier by doing this.
Disadvantages of SLA
1. Complexity: SLAs can be complex to create and maintain, and may require
significant resources to implement and enforce.
2. Rigidity: SLAs can be rigid and may not be flexible enough to accommodate
changing business needs or service requirements.
3. Limited service options: SLAs can limit the service options available to the
customer, as the service provider may only be able to offer the specific services
outlined in the agreement.
4. Misaligned incentives: SLAs may misalign incentives between the service provider
and the customer, as the provider may focus on meeting the agreed-upon service
levels rather than on providing the best service possible.
5. Limited liability: SLAs are often not legally binding contracts and often limit the liability of the service provider in case of service failure.
Identity and Access Management
• Identity and Access Management (IDAM) for cloud describes the authentication
and authorization of users to provide secure access to cloud resources.
• Organizations with multiple users can use IDAM services provided by the cloud
service provider for management of user identifiers and user permissions.
• IDAM services allow organizations to centrally manage users, access
permissions, security credentials and access keys.
• Organizations can enable role-based access control to cloud resources and
applications using the IDAM services.
• IDAM services allow creation of user groups where all the users in a group have
the same access permissions.
• Identity and Access Management is enabled by a number of technologies such
as OpenAuth, Role-based Access Control (RBAC), Digital Identities, Security
Tokens, Identity Providers, etc.
Billing
Cloud service providers offer a number of billing models described as follows:
• Elastic Pricing
• In elastic pricing or pay-as-you-use pricing model, the customers are
charged based on the usage of cloud resources.
• Fixed Pricing
• In fixed pricing models, customers are charged a fixed amount per month
for the cloud resources.
• Spot Pricing
• Spot pricing models offer variable pricing for cloud resources which is
driven by market demand.
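A toy comparison of the three billing models is sketched below; every rate and figure is invented purely for illustration and does not reflect any provider's actual prices:

```python
# Toy cost comparison of the three billing models; all rates are invented.
def elastic_cost(hours_used, rate_per_hour=0.10):
    """Pay-as-you-use: charge only for the hours actually consumed."""
    return hours_used * rate_per_hour

def fixed_cost(monthly_fee=50.0):
    """Fixed pricing: flat amount per month regardless of usage."""
    return monthly_fee

def spot_cost(hours_used, market_prices):
    """Spot pricing: the hourly rate varies with market demand."""
    return sum(market_prices[h % len(market_prices)] for h in range(hours_used))

print("Elastic:", elastic_cost(hours_used=300))
print("Fixed:  ", fixed_cost())
print("Spot:   ", spot_cost(hours_used=5, market_prices=[0.03, 0.08, 0.05]))
```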
• Infrastructure & Facilities Layer: Includes the physical infrastructure such as
datacenter facilities, electrical and mechanical equipment, etc.
• Hardware Layer: Includes physical compute, network and storage hardware.
• Virtualization Layer: Partitions the physical hardware resources into multiple virtual resources, enabling pooling of resources.
• Platform & Middleware Layer: Builds upon the IaaS layers below and provides standardized stacks of services such as database services, queuing services, application frameworks and run-time environments, messaging services, monitoring services, analytics services, etc.
• Service Management Layer: Provides APIs for requesting, managing and
monitoring cloud resources.
• Applications Layer: Includes SaaS applications such as Email, cloud storage
application, productivity applications, management portals, customer self-service
portals, etc.
Compute Services
• Compute services provide dynamically scalable compute capacity in the cloud.
• Compute resources can be provisioned on-demand in the form of virtual
machines. Virtual machines can be created from standard images provided by
the cloud service provider or custom images created by the users.
• Compute services can be accessed from the web consoles of these services that
provide graphical user interfaces for provisioning, managing and monitoring
these services.
• Cloud service providers also provide APIs for various programming languages
that allow developers to access and manage these services programmatically.
Compute Services – Amazon EC2
Amazon Elastic Compute Cloud (EC2) is a compute service provided by Amazon.
• Launching EC2 Instances: To launch a new instance, click on the launch instance button. This will open a wizard where you can select the Amazon Machine Image (AMI) with which you want to launch the instance. You can also create your own AMIs with custom applications, libraries and data. Instances can be launched with a variety of operating systems.
• Instance Sizes: When you launch an instance you specify the instance type
(micro, small, medium, large, extra-large, etc.), the number of instances to launch
based on the selected AMI and availability zones for the instances.
• Key-pairs: When launching a new instance, the user selects a key-pair from
existing keypairs or creates a new keypair for the instance. Keypairs are used to
securely connect to an instance after it launches.
• Security Groups: The security groups to be associated with the instance can be
selected from the instance launch wizard. Security groups are used to open or
block a specific network port for the launched instances.
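The same launch parameters (AMI, instance type, key pair, security group) can also be supplied programmatically. Below is a sketch using the boto3 Python SDK; it assumes AWS credentials are already configured, and the AMI ID, key pair name and security group ID are placeholders:

```python
import boto3  # AWS SDK for Python; assumes credentials are configured

ec2 = boto3.client("ec2", region_name="us-east-1")

# The AMI ID, key pair name and security group ID below are placeholders.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # machine image to launch from
    InstanceType="t2.micro",                    # instance size
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",                       # key pair used to connect securely
    SecurityGroupIds=["sg-0123456789abcdef0"],  # controls which ports are open
)

for instance in response["Instances"]:
    print("Launched instance:", instance["InstanceId"])
```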
Compute Services – Windows Azure VMs
Windows Azure Virtual Machines is the compute service from Microsoft.
• Launching Instances:
o To create a new instance, you select the instance type and the machine
image.
o You can either provide a user name and password or upload a certificate
file for securely connecting to the instance.
o Any changes made to the VM are persistently stored and new VMs can be
created from the previously stored machine images.
Storage Services
• Cloud storage services allow storage and retrieval of any amount of data, at any
time from anywhere on the web.
• Most cloud storage services organize data into buckets or containers.
• Scalability
• Cloud storage services provide high capacity and scalability. Objects up to several terabytes in size can be uploaded, and multiple buckets/containers can be created on cloud storage.
• Replication
• When an object is uploaded it is replicated at multiple facilities and/or on
multiple devices within each facility.
• Access Policies
• Cloud storage services provide several security features such as Access
Control Lists (ACLs), bucket/container level policies, etc. ACLs can be
used to selectively grant access permissions on individual objects.
Bucket/container level policies can also be defined to allow or deny
permissions across some or all of the objects within a single
bucket/container.
• Encryption
• Cloud storage services provide Server Side Encryption (SSE) options to
encrypt all data stored in the cloud storage.
• Consistency
• Strong data consistency is provided for all upload and delete operations.
Therefore, any object that is uploaded can be immediately downloaded
after the upload is complete.
Storage Services – Amazon S3
• Amazon Simple Storage Service (S3) is an online cloud-based data storage infrastructure for storing and retrieving any amount of data.
• S3 provides highly reliable, scalable, fast, fully redundant and affordable storage
infrastructure.
• Buckets
• Data stored on S3 is organized in the form of buckets. You must create a
bucket before you can store data on S3.
• Uploading Files to Buckets
• S3 console provides simple wizards for creating a new bucket and
uploading files.
• You can upload any kind of file to S3.
• While uploading a file, you can specify the redundancy and encryption
options and access permissions.
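Creating a bucket and uploading a file can also be done through the boto3 SDK. The sketch below uses a placeholder bucket name and file and assumes AWS credentials with a default region of us-east-1:

```python
import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")

# Placeholder name; S3 bucket names are globally unique.
bucket = "my-example-bucket-name"

# A bucket must exist before data can be stored. (Outside us-east-1 a
# CreateBucketConfiguration with the region is also required.)
s3.create_bucket(Bucket=bucket)

# Upload a local file as an object, requesting server-side encryption.
s3.upload_file(
    Filename="report.csv",
    Bucket=bucket,
    Key="backups/report.csv",
    ExtraArgs={"ServerSideEncryption": "AES256"},
)
print("Objects in bucket:", s3.list_objects_v2(Bucket=bucket).get("KeyCount"))
```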
Database Services
Cloud database services allow you to set-up and operate relational or non-relational
databases in the cloud.
• Relational Databases: Popular relational databases provided by various cloud service providers include MySQL, Oracle, SQL Server, etc.
• Non-relational Databases: The non-relational (NoSQL) databases provided by cloud service providers are mostly proprietary solutions.
• Scalability: Cloud database services allow provisioning as much compute and storage resources as required to meet the application workload levels. Provisioned capacity can be scaled up or down. For read-heavy workloads, read-replicas can be created.
• Reliability: Cloud database services are reliable and provide automated backup and snapshot options.
• Performance: Cloud database services provide guaranteed performance with options such as guaranteed input/output operations per second (IOPS), which can be provisioned upfront.
• Security: Cloud database services provide several security features to restrict access to the database instances and stored data, such as network firewalls and authentication mechanisms.
Database Services – Amazon RDS
• Amazon Relational Database Service (RDS) is a web service that makes it easy
to setup, operate and scale a relational database in the cloud.
• Launching DB Instances
• The console provides an instance launch wizard that allows you to select
the type of database to create (MySQL, Oracle or SQL Server) database
instance size, allocated storage, DB instance identifier, DB username and
password. The status of the launched DB instances can be viewed from
the console.
• Connecting to a DB Instance
• Once the instance is available, you can note the instance end point from
the instance properties tab. This end point can then be used for securely
connecting to the instance.
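Once the endpoint is known, the instance can be reached like any other MySQL server. A sketch using the PyMySQL driver follows; the endpoint, credentials and database name are placeholders:

```python
import pymysql  # pip install pymysql; endpoint and credentials are placeholders

connection = pymysql.connect(
    host="mydbinstance.abc123xyz.us-east-1.rds.amazonaws.com",  # RDS endpoint
    user="admin",
    password="my-master-password",
    database="mydb",
    port=3306,
)
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT VERSION()")       # simple connectivity check
        print("Connected, MySQL version:", cursor.fetchone())
finally:
    connection.close()
```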
Database Services – Google Cloud SQL
• Google Cloud SQL is the relational database service from Google.
• Google Cloud SQL service allows you to host MySQL databases in the Google’s
cloud.
• Launching DB Instances
• You can create new database instances from the console and manage
existing instances. To create a new instance you select a region, database
tier, billing plan and replication mode.
• Backups
• You can schedule daily backups for your Google Cloud SQL instances, and
also restore backed-up databases.
• Replication
• Cloud SQL provides both synchronous and asynchronous geographic replication and the ability to import/export databases.
• Applications run in a secure sandbox environment isolated from other
applications.
• The sandbox environment provides a limited access to the underlying
operating system.
• Web Frameworks
• App Engine provides a simple Python web application framework called
webapp2. App Engine also supports any framework written in pure
Python that speaks WSGI, including Django, CherryPy, Pylons, web.py, and
web2py.
• Datastore
• App Engine provides a no-SQL data storage service.
• Authentication
• App Engine applications can be integrated with Google Accounts for user
authentication.
• URL Fetch service
• URL Fetch service allows applications to access resources on the Internet,
such as web services or other data.
• Other services
• Email service
• Image Manipulation service
• Memcache
• Task Queues
• Scheduled Tasks service
• Azure Web Sites supports applications created in ASP.NET, PHP, Node.js and
Python programming languages.
• Multiple copies of an application can be run in different VMs, with Web Sites
automatically load balancing requests across them.
Content Delivery Services
• Cloud-based content delivery services include Content Delivery Networks (CDNs).
• A CDN is a distributed system of servers located across multiple geographic locations to serve content to end-users with high availability and high performance.
• CDNs are useful for serving static content such as text, images, scripts, etc., and
streaming media.
• CDNs have a number of edge locations deployed in multiple locations, often over
multiple backbones.
• Requests for static or streaming media content that is served by a CDN are directed to the nearest edge location.
• Amazon CloudFront
• Amazon CloudFront is a content delivery service from Amazon.
CloudFront can be used to deliver dynamic, static and streaming content
using a global network of edge locations.
• Windows Azure Content Delivery Network
• Windows Azure Content Delivery Network (CDN) is the content delivery
service from Microsoft.
Analytics Services
• Cloud-based analytics services allow analyzing massive data sets stored in the
cloud either in cloud storages or in cloud databases using programming models
such as MapReduce.
• Amazon Elastic MapReduce
• Amazon Elastic MapReduce (EMR) is the MapReduce service from Amazon, based on the Hadoop framework running on Amazon EC2 and S3.
• EMR supports various job types such as Custom JAR, Hive programs, Streaming jobs, Pig programs and HBase.
• Google MapReduce Service
• Google MapReduce Service is a part of the App Engine platform and can
be accessed using the Google MapReduce API.
• Google BigQuery
• Google BigQuery is a service for querying massive datasets. BigQuery allows querying datasets using SQL-like queries (see the sketch after this list).
• Windows Azure HDInsight
• Windows Azure HDInsight is an analytics service from Microsoft.
HDInsight deploys and provisions Hadoop clusters in the Azure cloud and
makes Hadoop available as a service.
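A short sketch of the SQL-like querying mentioned for BigQuery, using the google-cloud-bigquery client library. It assumes a configured Google Cloud project and application-default credentials, and it queries a public sample table; swap in your own dataset and table as needed:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Assumes a GCP project and application-default credentials are configured.
client = bigquery.Client()

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():   # result() waits for the job to finish
    print(row.word, row.total)
```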
Deployment & Management Services
• Cloud-based deployment & management services allow you to easily deploy and
manage applications in the cloud. These services automatically handle
deployment tasks such as capacity provisioning, load balancing, auto-scaling,
and application health monitoring.
• Amazon Elastic Beanstalk
• Amazon provides a deployment service called Elastic Beanstalk that
allows you to quickly deploy and manage applications in the AWS cloud.
• Elastic Beanstalk supports Java, PHP, .NET, Node.js, Python, and Ruby
applications.
• With Elastic Beanstalk you just need to upload the application and specify
configuration settings in a simple wizard and the service automatically
handles instance provisioning, server configuration, load balancing and
monitoring.
• Amazon CloudFormation
• Amazon CloudFormation is a deployment management service from
Amazon.
• With CloudFormation you can create deployments from a collection of AWS resources such as Amazon Elastic Compute Cloud, Amazon Elastic Block Store, Amazon Simple Notification Service, Elastic Load Balancing and Auto Scaling.
• A collection of AWS resources that you want to manage together are
organized into a stack.
• Windows Azure Active Directory is an Identity & Access Management
Service from Microsoft.
• Azure Active Directory provides a cloud-based identity provider that easily
integrates with your on-premises active directory deployments and also
provides support for third party identity providers.
With Azure Active Directory you can control access to your applications in Windows
Azure.
Open Source Private Cloud Software – CloudStack
• Apache CloudStack is an open source cloud software that can be used for
creating private cloud offerings.
• CloudStack manages the network, storage, and compute nodes that make up a
cloud infrastructure.
• A CloudStack installation consists of a Management Server and the cloud
infrastructure that it manages.
• Zones : The Management Server manages one or more zones where each zone is
typically a single datacenter.
• Pods: Each zone has one or more pods. A pod is a rack of hardware comprising a switch and one or more clusters.
• Cluster: A cluster consists of one or more hosts and a primary storage. A host is
a compute node that runs guest virtual machines.
• Primary Storage: The primary storage of a cluster stores the disk volumes for all
the virtual machines running on the hosts in that cluster.
• Secondary Storage: Each zone has a secondary storage that stores templates,
ISO images, and disk volume snapshots.
Open Source Private Cloud Software – Eucalyptus
• Eucalyptus is an open source private cloud software for building private and
hybrid clouds that are compatible with Amazon Web Services (AWS) APIs.
• Node Controller
• NC hosts the virtual machine instances and manages the virtual network
endpoints.
• The cluster-level (availability-zone) consists of three components
• Cluster Controller - which manages the virtual machines and is the
front-end for a cluster.
• Storage Controller – which manages the Eucalyptus block volumes and
snapshots to the instances within its specific cluster. SC is equivalent to
AWS Elastic Block Store (EBS).
• VMWare Broker - which is an optional component that provides an
AWS-compatible interface for VMware environments.
• At the cloud-level there are two components:
• Cloud Controller - which provides an administrative interface for cloud
management and performs high-level resource scheduling, system
accounting, authentication and quota management.
• Walrus - which is equivalent to Amazon S3 and serves as a persistent
storage to all of the virtual machines in the Eucalyptus cloud. Walrus can
be used as a simple Storage-as-a-Service
UNIT – II
Apache Hadoop
Hadoop is an open source framework from Apache that is used to store, process and analyze data which is very huge in volume.
Hadoop is written in Java and is not OLAP (online analytical processing). It is used
for batch/offline processing.
It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more.
Moreover it can be scaled up just by adding nodes in the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks
and stored in nodes over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
3. Map Reduce: This is a framework which helps Java programs to do the parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and then the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by
other Hadoop modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and
the HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master
node includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave
node includes DataNode and TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop.
It has a master/slave architecture. This architecture consists of a single NameNode that performs the role of master, and multiple DataNodes that perform the role of slaves.
Both NameNode and DataNode are capable enough to run on commodity machines.
The Java language is used to develop HDFS. So any machine that supports Java language
can easily run the NameNode and DataNode software.
NameNode
o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing an operation like the opening,
renaming and closing the files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
o The role of Job Tracker is to accept the MapReduce jobs from client and process
the data by using NameNode.
o In response, NameNode provides metadata to Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives the task and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
The MapReduce layer comes into existence when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes, a TaskTracker fails or times out. In such a case, that part of the job is rescheduled.
Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property of replicating data over the network, so if one node is down or some other network failure happens, Hadoop takes the other copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper, published by Google.
o While working on Apache Nutch, they were dealing with big data. Storing that data would have cost a great deal, which became a problem for that project. This problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies
the data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900
node cluster within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.
Year: 2008
o The YARN JIRA was opened.
o Hadoop became the fastest system to sort 1 terabyte of data on a 900-node cluster within 209 seconds.
o Yahoo clusters were loaded with 10 terabytes per day.
o Cloudera was founded as a Hadoop distributor.
MapReduce
Hadoop MapReduce is the data processing layer. It processes the huge amount of
structured and unstructured data stored in HDFS. MapReduce handles data in parallel by
splitting the job into the set of independent tasks. So, parallel processing increases speed
and reliability.
Hadoop MapReduce data processing occurs in 2 phases- Map and Reduce phase.
▶ Map phase: It is the initial phase of data processing. In this phase, we specify all the complex logic, business rules and costly code.
▶ Reduce phase: The second phase of processing is the Reduce phase. In this phase, we specify light-weight processing such as aggregation or summation.
Steps of MapReduce Job Execution flow
MapReduce processes the data in various phases with the help of different
components. Let us discuss the steps of job execution in Hadoop.
Input Files
The data for a MapReduce job is stored in input files, which reside in HDFS. The input file format is arbitrary; line-based log files and binary formats can also be used.
InputFormat
After that InputFormat defines how to divide and read these input files. It selects
the files or other objects for input. InputFormat creates InputSplit.
InputSplits
It represents the data that will be processed by an individual Mapper. For each split, one map task is created. Therefore, the number of map tasks is equal to the number of InputSplits. The framework divides each split into records, which the mapper processes.
RecordReader
It communicates with the InputSplit and transforms the data into key-value pairs suitable for reading by the Mapper. The RecordReader by default uses TextInputFormat to transform data into key-value pairs. It interacts with the InputSplit until the reading of the file is complete. It allocates a byte offset to each line present in the file. Then, these key-value pairs are sent to the mapper for further processing.
Mapper
It processes input records produced by the RecordReader and generates
intermediate key-value pairs. The intermediate output is entirely different from the input
pair. The output of the mapper is a full group of key-value pairs. Hadoop framework does
not store the output of the mapper on HDFS. Mapper doesn’t store, as data is temporary
and writing on HDFS will create unnecessary multiple copies. Then Mapper passes the
output to the combiner for extra processing.
Combiner
Combiner is Mini-reducer that performs local aggregation on the mapper’s output. It
minimizes the data transfer between mapper and reducer. So, when the combiner
functionality completes, the framework passes the output to the partitioner for further
processing.
Partitioner
The Partitioner comes into existence if we are working with more than one reducer. It takes the output of the combiner and performs partitioning.
Partitioning of the output occurs based on the key in MapReduce: by a hash function, the key (or a subset of the key) determines the partition.
Based on the key value, the output of each combiner is partitioned, and records having the same key go into the same partition. After that, each partition is sent to a reducer.
Partitioning in MapReduce execution permits even distribution of the map output over the reducers.
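The default rule described here (hash of the key modulo the number of reducers) can be written in a couple of lines. The Python sketch below mirrors the behaviour of a hash partitioner conceptually; it is not Hadoop's own HashPartitioner class:

```python
# Sketch of hash partitioning: records with the same key always land in the
# same partition, and keys are spread roughly evenly across reducers.
# Note: Python's built-in hash() for strings is salted per process; Hadoop
# uses the key's hashCode() instead, but the idea is the same.
NUM_REDUCERS = 3

def partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Choose the reducer index for a key, hash-partitioner style."""
    return hash(key) % num_reducers

for key in ["cat", "dog", "cat", "sat"]:
    print(key, "-> reducer", partition(key))
```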
Shuffling and Sorting
After partitioning, the output is shuffled to the reducer nodes. The shuffling is the physical transfer of the data, which is done over the network. Once all the mappers complete and their output is shuffled onto the reducer nodes, the framework merges and sorts this intermediate output. It is then provided as input to the Reduce phase.
Reducer
The reducer takes the set of intermediate key-value pairs produced by the mappers as input and then runs a reducer function on each of them to generate the output. The output of the reducer is the final output, which the framework stores on HDFS.
RecordWriter
It writes these output key-value pairs from the Reducer phase to the output files.
OutputFormat
OutputFormat defines the way the RecordWriter writes these output key-value pairs to the output files. The instances provided by Hadoop write files in HDFS. Thus, OutputFormat instances write the final output of the reducer to HDFS.
MapReduce Job Execution Workflow:
• MapReduce job execution starts when the client applications submit jobs to the
Job tracker.
• The JobTracker returns a JobID to the client application. The JobTracker talks to
the NameNode to determine the location of the data.
• The JobTracker locates TaskTracker nodes with available slots at/or near the
data.
• The TaskTrackers send out heartbeat messages to the JobTracker, usually every
few minutes, to reassure the JobTracker that they are still alive. These messages
also inform the JobTracker of the number of available slots, so the JobTracker
can stay up to date with where in the cluster, new work can be delegated.
• The JobTracker submits the work to the TaskTracker nodes when they poll for
tasks. To choose a task for a TaskTracker, the JobTracker uses various
scheduling algorithms (default is FIFO).
• The TaskTracker nodes are monitored using the heartbeat signals that are sent
by the TaskTrackers to JobTracker.
• The TaskTracker spawns a separate JVM process for each task so that any task
failure does not bring down the TaskTracker.
• The TaskTracker monitors these spawned processes while capturing the output
and exit codes. When the process finishes, successfully or not, the TaskTracker
notifies the JobTracker. When the job is completed, the JobTracker updates its
status.
MapReduce 2.0 – YARN
• In Hadoop 2.0 the original processing engine of Hadoop (MapReduce) has been
separated from the resource management (which is now part of YARN).
• This makes YARN effectively an operating system for Hadoop that supports
different processing engines on a Hadoop cluster such as MapReduce for batch
processing, Apache Tez for interactive queries, Apache Storm for stream
processing, etc.
• The YARN architecture divides the two major functions of the JobTracker - resource management and job life-cycle management - into separate components:
• ResourceManager
• ApplicationMaster
YARN Components
• Resource Manager (RM): RM manages the global assignment of compute
resources to applications. RM consists of two main services:
• Scheduler: The Scheduler is a pluggable service that manages and enforces the resource scheduling policy in the cluster.
• Applications Manager (AsM): The AsM manages the running Application Masters in the cluster. The AsM is responsible for starting application masters and for monitoring and restarting them on different nodes in case of failures.
• Application Master (AM): A per-application AM manages the application’s life
cycle. AM is responsible for negotiating resources from the RM and working with
the NMs to execute and monitor the tasks.
• Node Manager (NM): A per-machine NM manages the user processes on
that machine.
• Containers: Container is a bundle of resources allocated by RM (memory, CPU,
network, etc.). A container is a conceptual entity that grants an application the
privilege to use a certain amount of resources on a given machine to run a
component task.
Hadoop Scheduler
Prior to Hadoop 2, Hadoop MapReduce was a software framework for writing applications that process huge amounts of data (terabytes to petabytes) in parallel on a large Hadoop cluster. This framework is responsible for scheduling tasks, monitoring them, and re-executing failed tasks.
In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic idea behind the YARN introduction is to split the functionalities of resource management and job scheduling or monitoring into separate daemons: the ResourceManager, the ApplicationMaster, and the NodeManager.
The ResourceManager is the master daemon that arbitrates resources among all the applications in the system. The NodeManager is the slave daemon responsible for containers, monitoring their resource usage, and reporting the same to the ResourceManager or Schedulers. The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManager in order to execute and monitor the task.
The ResourceManager has two main components that are Schedulers and
ApplicationsManager.
The Scheduler in the YARN ResourceManager is a pure scheduler which is responsible for allocating resources to the various running applications.
It is not responsible for monitoring or tracking the status of an application. Also,
the scheduler does not guarantee about restarting the tasks that are failed either due
to hardware failure or application failure.
It has some pluggable policies that are responsible for partitioning the cluster
resources among the various queues, applications, etc.
The FIFO Scheduler, CapacityScheduler, and FairScheduler are such pluggable
policies that are responsible for allocating resources to the applications.
Let us now study each of these Schedulers in detail.
TYPES OF HADOOP SCHEDULER
1. FIFO Scheduler
First In First Out is the default scheduling policy used in Hadoop. The FIFO
Scheduler gives more preference to applications coming first than to those coming
later. It places the applications in a queue and executes them in the order of their
submission (first in, first out).
Here, irrespective of size and priority, the requests of the first application in the
queue are allocated first. Only once the first application's requests are satisfied is the
next application in the queue served.
Advantage:
• It is simple to understand and doesn’t need any configuration.
• Jobs are executed in the order of their submission.
Disadvantage:
• It is not suitable for shared clusters. If a large application arrives before
a shorter one, the large application will use all the resources in the
cluster, and the shorter application has to wait for its turn. This leads to
starvation.
Page | 49
• It does not take into account the balance of resource allocation between
the long applications and short applications.
2. Capacity Scheduler
The CapacityScheduler allows multiple tenants to securely share a large
Hadoop cluster. It is designed to run Hadoop applications in a shared, multi-tenant
cluster while maximizing the throughput and the utilization of the cluster.
It supports hierarchical queues to reflect the structure of the organizations or
groups that utilize the cluster resources. A queue hierarchy contains three types of
queues: root, parent, and leaf.
The root queue represents the cluster itself, parent queue represents
organization/group or sub-organization/sub-group, and the leaf accepts application
submission.
The Capacity Scheduler allows the sharing of the large cluster while giving
capacity guarantees to each organization by allocating a fraction of cluster resources
to each queue.
Also, when a queue has free resources because it has completed its tasks or does not
currently need its full capacity, those free resources can be allocated to applications
in other queues that have demand beyond their own guaranteed capacity. This
provides elasticity for the organizations in a cost-effective manner.
Apart from it, the CapacityScheduler provides a comprehensive set of limits to
ensure that a single application/user/queue cannot use a disproportionate amount of
resources in the cluster.
To ensure fairness and stability, it also provides limits on initialized and
pending apps from a single user and queue.
Advantages:
Page | 50
• It maximizes the utilization of resources and throughput in the Hadoop
cluster.
• Provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations
utilizing the cluster.
Disadvantage:
• It is the most complex among the Hadoop schedulers.
3. Fair Scheduler
Page | 51
The FairScheduler allocates resources so that, on average, all running applications
get an equal share of resources over time. Each queue (pool)
gets its minimum share, but when the queue doesn't need its full guaranteed share,
then the excess share is split between the other running applications.
Advantages:
• It provides a reasonable way to share the Hadoop cluster among a number
of users.
• Also, the FairScheduler can work with app priorities where the priorities are
used as weights in determining the fraction of the total resources that each
application should get.
Disadvantage: It requires configuration.
Page | 52
Cloud Application Design
Page | 53
Reference Architectures –Content delivery apps
• Figure shows a typical deployment architecture for content delivery
applications such as online photo albums, video webcasting, etc.
• Both relational and non-relational data stores are shown in this deployment.
• A content delivery network (CDN) which consists of a global network of edge
locations is used for media delivery.
• CDN is used to speed up the delivery of static content such as images and
videos.
Page | 54
• The jobs are queued for execution and upon completion the analyzed data is
presented from the application servers.
SOA Layers:
Page | 55
1. Business Systems: This layer consists of custom built applications and legacy
systems such as Enterprise Resource Planning (ERP), Customer Relationship
Management (CRM), Supply Chain Management (SCM), etc.
2. Service Components: The service components allow the layers above to
interact with the business systems. The service components are responsible
for realizing the functionality of the services exposed.
3. Composite Services: These are coarse-grained services which are composed of
two or more service components. Composite services can be used to create
enterprise scale components or business-unit specific components.
4. Orchestrated Business Processes: Composite services can be orchestrated to
create higher level business processes. In this layer the compositions and
orchestrations of the composite services are defined to create business
processes.
5. Presentation Services: This is the topmost layer that includes user interfaces
that expose the services and the orchestrated business processes to the
users.
6. Enterprise Service Bus: This layer integrates the services through adapters,
routing, transformation and messaging mechanisms.
Page | 58
SOA vs CCM:
Similarities
Differences
• End points: SOA services have a small and well-defined set of endpoints through which
many types of data can pass. CCM components have a very large number of endpoints;
there is an endpoint for each resource in a component, identified by a URI.
• Messaging: SOA uses a messaging layer above HTTP by using SOAP, which imposes
prohibitive constraints on developers. CCM components use HTTP and REST for messaging.
• Security: SOA uses WS-Security, SAML and other standards for security. CCM components
use HTTPS for security.
• Interfacing: SOA uses XML for interfacing. CCM allows resources in components to be
represented in different formats for interfacing (HTML, XML, JSON, etc.).
• Consumption: Consuming traditional SOA services in a browser is cumbersome. CCM
components and the underlying component resources are exposed as XML, JSON (and
other formats) over HTTP or REST, and are thus easy to consume in the browser.
Page | 59
Model View Controller:
• Model View Controller (MVC) is a popular software design pattern for web
applications.
• Model
• Model manages the data and the behavior of the applications. Model
processes events sent by the controller. Model has no information about
the views and controllers. Model responds to the requests for
information about its state (from the view) and responds to the
instructions to change state (from controller).
• View
• View prepares the interface which is shown to the user. Users interact
with the application through views. Views present the information that
the model or controller tells them to present to the user, and also
handle user requests and send them to the controller.
• Controller
• Controller glues the model to the view. Controller processes user
requests and updates the model when the user manipulates the view.
Controller also updates the view when the model changes.
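To make the division of responsibilities concrete, the following is a minimal sketch of the MVC pattern in plain Python; the class and method names are illustrative assumptions, not part of any particular framework.

# Minimal MVC sketch (illustrative; names are assumptions, not a framework API)
class Model(object):
    """Manages the application data; knows nothing about views or controllers."""
    def __init__(self):
        self.items = []
    def add_item(self, item):          # state change requested by the controller
        self.items.append(item)
    def get_items(self):               # state queried for the view
        return list(self.items)

class View(object):
    """Prepares the interface shown to the user."""
    def render(self, items):
        print("Items: " + ", ".join(items))

class Controller(object):
    """Glues the model to the view: processes user requests and updates both."""
    def __init__(self, model, view):
        self.model = model
        self.view = view
    def handle_add(self, item):        # a user request arriving from the view
        self.model.add_item(item)                  # update the model
        self.view.render(self.model.get_items())   # refresh the view

controller = Controller(Model(), View())
controller.handle_add("photo1.jpg")
controller.handle_add("photo2.jpg")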
Page | 60
• The REST architectural constraints apply to the components, connectors, and
data elements, within a distributed hypermedia system.
• A RESTful web service is a web API implemented using HTTP and REST
principles.
• The REST architectural constraints are as follows:
• Client-Server
• Stateless
• Cacheable
• Layered System
• Uniform Interface
• Code on demand
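As a hedged illustration of these principles, the sketch below implements a small RESTful web service using the third-party Flask framework (an assumption; any HTTP framework could be used). Each book is a resource identified by a URI, requests are stateless, and the uniform interface is provided by the HTTP methods.

# Minimal RESTful service sketch using Flask (assumed installed: pip install flask)
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory "resource" store; a real service would use a database
books = {1: {"id": 1, "title": "Cloud Computing"}}

@app.route("/books", methods=["GET"])
def list_books():
    # Stateless: everything needed to serve the request is in the request itself
    return jsonify(list(books.values()))

@app.route("/books/<int:book_id>", methods=["GET"])
def get_book(book_id):
    # Each resource is identified by a URI such as /books/1
    return jsonify(books[book_id])

@app.route("/books", methods=["POST"])
def create_book():
    new_id = max(books) + 1
    books[new_id] = {"id": new_id, "title": request.json["title"]}
    return jsonify(books[new_id]), 201

if __name__ == "__main__":
    app.run()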
Relational Databases:
• A relational database is a database that conforms to the relational model that
was popularized by Edgar Codd in 1970.
• The 12 rules that Codd introduced for relational databases include:
• Information rule
• Guaranteed access rule
• Systematic treatment of null values
• Dynamic online catalog based on relational model
• Comprehensive sub-language rule
• View updating rule
• High level insert, update, delete
• Physical data independence
• Logical data independence
• Integrity independence
• Distribution independence
• Non-subversion rule
• Relations: A relational database has a collection of relations (or tables). A
relation is a set of tuples (or rows).
• Schema: Each relation has a fixed schema that defines the set of attributes (or
columns in a table) and the constraints on the attributes.
• Tuples: Each tuple in a relation has the same attributes (columns). The tuples in
a relation can have any order and the relation is not sensitive to the ordering of
the tuples.
• Attributes: Each attribute has a domain, which is the set of possible values for
the attribute.
• Insert/Update/Delete: Relations can be modified using insert, update and delete
operations. Every relation has a primary key that uniquely identifies each tuple
in the relation.
• Primary Key: An attribute can be made a primary key if it does not have
repeated values in different tuples.
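The concepts above (relation, schema, tuples, attributes, primary key and insert/update/delete) can be illustrated with Python's built-in sqlite3 module; the table and data below are illustrative assumptions.

# Relational database concepts illustrated with the sqlite3 module (stdlib)
import sqlite3

conn = sqlite3.connect(":memory:")          # in-memory database for illustration
cur = conn.cursor()

# Schema: a fixed set of attributes with constraints, including a primary key
cur.execute("CREATE TABLE students (roll_no INTEGER PRIMARY KEY, "
            "name TEXT NOT NULL, grade TEXT)")

# Insert tuples (rows) into the relation (table)
cur.execute("INSERT INTO students VALUES (?, ?, ?)", (1, "Asha", "A"))
cur.execute("INSERT INTO students VALUES (?, ?, ?)", (2, "Ravi", "B"))

# Update and delete operations
cur.execute("UPDATE students SET grade = ? WHERE roll_no = ?", ("A", 2))
cur.execute("DELETE FROM students WHERE roll_no = ?", (1,))

conn.commit()
for row in cur.execute("SELECT roll_no, name, grade FROM students"):
    print(row)
conn.close()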
Page | 61
ACID Guarantees:
Relational databases provide ACID guarantees.
• Atomicity: The atomicity property ensures that each transaction is either "all or
nothing". An atomic transaction ensures that all parts of the transaction
complete or the database state is left unchanged.
• Consistency: The consistency property ensures that each transaction brings the
database from one valid state to another. In other words, the data in a
database always conforms to the defined schema and constraints.
• Isolation: The isolation property ensures that the database state obtained after a set
of concurrent transactions is the same as would have been obtained if the transactions
were executed serially. This provides concurrency control, i.e. the results of
incomplete transactions are not visible to other transactions. The transactions
are isolated from each other until they finish.
• Durability: The durability property ensures that once a transaction is committed, the
data remains as it is, i.e. it is not affected by system outages such as power
loss. Durability guarantees that the database can keep track of changes and
can recover from abnormal terminations.
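A minimal sketch of the atomicity guarantee using sqlite3: if any statement in a transaction fails, the whole transaction is rolled back and the database state is left unchanged (the account data is an illustrative assumption).

# Atomicity sketch with sqlite3: an all-or-nothing money transfer
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    print("Transfer failed; transaction rolled back")

# Balances are unchanged because the failing transfer was atomic
print(conn.execute("SELECT name, balance FROM accounts").fetchall())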
Non-Relational Databases
• Non-relational databases (or popularly called No-SQL databases) are becoming
popular with the growth of cloud computing.
• Non-relational databases have better horizontal scaling capability and
improved performance for big data at the cost of less rigorous consistency
models.
• Unlike relational databases, non-relational databases do not provide ACID
guarantees.
• Most non-relational databases offer “eventual” consistency, which means that
given a sufficiently long period of time over which no updates are made, all
updates can be expected to propagate eventually through the system and the
replicas will be consistent.
• The driving force behind the non-relational databases is the need for databases
that can achieve high scalability, fault tolerance and availability.
Page | 62
• These databases can be distributed on a large cluster of machines. Fault
tolerance is provided by storing multiple replicas of data on different machines.
Non-Relational Databases – Types
• Key-value store: Key-value store databases are suited for applications that
require storing unstructured data without a fixed schema. Most key-value
stores have support for native programming language data types.
• Document store: Document store databases store semi-structured data in
the form of documents which are encoded in different standards such as
JSON, XML, BSON, YAML, etc.
• Graph store: Graph stores are designed for storing data that has graph
structure (nodes and edges). These solutions are suitable for applications
that involve graph data such as social networks, transportation systems, etc.
• Object store: Object store solutions are designed for storing data in the form
of objects defined in an object-oriented programming language.
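As a local stand-in for a key-value store, the sketch below uses Python's built-in shelve module to show schema-less storage of native data types under keys; it is an analogy for illustration, not a cloud NoSQL service.

# Key-value storage sketch using the shelve module (stdlib stand-in for a key-value store)
import shelve

db = shelve.open("kvstore_demo")

# Values need not follow a fixed schema and can be native Python data types
db["user:1001"] = {"name": "Asha", "roles": ["admin"]}
db["session:abc"] = {"user": "user:1001", "expires": 1700000000}

print(db["user:1001"]["name"])   # look up a value by its key

del db["session:abc"]            # delete by key
db.close()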
Python Basics:
Python is a general-purpose high-level programming language and is suitable for
providing a solid foundation to the reader in the area of cloud computing.
The main characteristics of Python are:
Multi-paradigm programming language
Python supports more than one programming paradigm, including object-oriented
programming and structured programming.
Interpreted Language
Python is an interpreted language and does not require an explicit
compilation step. The Python interpreter executes the program source code directly,
statement by statement, as a processor or scripting engine does.
Interactive Language
Python provides an interactive mode in which the user can submit commands at
the Python prompt and interact with the interpreter directly.
Python – Benefits
• Easy-to-learn, read and maintain
• Python is a minimalistic language with relatively few keywords, uses
English keywords and has fewer syntactical constructions as compared
to other languages. Reading Python programs feels like English with
pseudo-code like constructs. Python is easy to learn yet an extremely
powerful language for a wide range of applications.
• Object and Procedure Oriented
• Python supports both procedure-oriented programming and
object-oriented programming. Procedure oriented paradigm allows
programs to be written around procedures or functions that allow reuse
Page | 63
of code. Object oriented paradigm allows programs to be written
around objects that include both data and functionality.
• Extendable
• Python is an extendable language and allows integration of low-level
modules written in languages such as C/C++. This is useful when you
want to speed up a critical portion of a program.
• Scalable
• Due to the minimalistic nature of Python, it provides a manageable
structure for large programs.
• Portable
• Since Python is an interpreted language, programmers do not have to
worry about compilation, linking and loading of programs. Python
programs can be directly executed from source
• Broad Library Support
• Python has a broad library support and works on various platforms such
as Windows, Linux, Mac, etc.
Python – Setup
• Windows
• Python binaries for Windows can be downloaded from
http://www.python.org/getit .
• For the examples and exercise in this book, you would require Python 2.7
which can be directly downloaded from:
http://www.python.org/ftp/python/2.7.5/python-2.7.5.msi
• Once the python binary is installed you can run the python shell at the
command prompt using
> python
• Linux
#Install Dependencies
sudo apt-get install build-essential
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev \
    libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev
#Download Python
wget http://python.org/ftp/python/2.7.5/Python-2.7.5.tgz
tar -xvf Python-2.7.5.tgz
cd Python-2.7.5
#Install Python
./configure
make
sudo make install
Numbers
• Numbers
• Number data type is used to store numeric values. Numbers are
immutable data types, therefore changing the value of a number data
type results in a newly allocated object.
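A quick way to see this immutability is to inspect the object identity before and after "changing" a number; rebinding the name points it to a different object.

# Numbers are immutable: "changing" a number rebinds the name to a new object
x = 10
print(id(x))   # identity of the object currently bound to x

x = x + 1      # this does not modify 10; it binds x to the object 11
print(id(x))   # a different identity, i.e. a different (newly allocated or cached) object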
Page | 64
UNIT – III
Python for Cloud
Outline
#Scaling policies (the scale-up policy is assumed to be defined symmetrically)
scale_up_policy = ScalingPolicy(name='scale_up',
    adjustment_type='ChangeInCapacity', as_name='My-Group',
    scaling_adjustment=1, cooldown=180)
scale_down_policy = ScalingPolicy(name='scale_down',
    adjustment_type='ChangeInCapacity', as_name='My-Group',
    scaling_adjustment=-1, cooldown=180)
conn.create_scaling_policy(scale_up_policy)
conn.create_scaling_policy(scale_down_policy)
Amazon AutoScaling – Python Example
• With the scaling policies defined, the next step is to create Amazon CloudWatch
alarms that trigger these policies.
• The scale-up alarm is defined using the CPUUtilization metric with the Average
statistic and a threshold greater than 70% for a period of 60 seconds. The scale-up
policy created previously is associated with this alarm. This alarm is triggered when
the average CPU utilization of the instances in the group becomes greater than 70%
for more than 60 seconds.

#Connecting to CloudWatch
cloudwatch = boto.ec2.cloudwatch.connect_to_region(REGION,
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)
alarm_dimensions = {"AutoScalingGroupName": 'My-Group'}

#Creating scale-up alarm
scale_up_alarm = MetricAlarm(name='scale_up_on_cpu',
    namespace='AWS/EC2', metric='CPUUtilization',
    statistic='Average', comparison='>', threshold='70',
    period='60', evaluation_periods=2,
    alarm_actions=[scale_up_policy.policy_arn],
    dimensions=alarm_dimensions)
cloudwatch.create_alarm(scale_up_alarm)

#Creating scale-down alarm
scale_down_alarm = MetricAlarm(name='scale_down_on_cpu',
    namespace='AWS/EC2', metric='CPUUtilization',
    statistic='Average', comparison='<', threshold='40',
    period='60', evaluation_periods=2,
    alarm_actions=[scale_down_policy.policy_arn],
    dimensions=alarm_dimensions)
cloudwatch.create_alarm(scale_down_alarm)
Amazon S3 – Python Example
#Connecting to S3
conn = boto.connect_s3(aws_access_key_id='<enter>',
    aws_secret_access_key='<enter>')

Amazon DynamoDB – Python Example
• After connecting to the DynamoDB service, a schema for the new table is created
by calling the schema creation method on the connection.

#Connecting to DynamoDB
conn = boto.dynamodb.connect_to_region(REGION,
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)
Google Cloud Storage – Python Example
• To upload a file, the objects().insert method of the Google Cloud Storage API is
used.
• The request to this method contains the bucket name, the file name and a media
body containing the MediaIoBaseUpload object created from the file contents.

gs_service = build('storage', API_VERSION, http=auth_http)

# Upload file
fp = open(FILENAME, 'r')
fh = io.BytesIO(fp.read())
media = MediaIoBaseUpload(fh, FILE_TYPE)
request = gs_service.objects().insert(bucket=BUCKET,
    name=FILENAME,
    media_body=media)
response = request.execute()
Google Cloud SQL – Python Example
• This example uses the OAuth 2.0 scope
(https://www.googleapis.com/auth/compute) and the credentials in the
credentials file to request a refresh and access token, which is then stored in the
oauth2.dat file.
• After completing the OAuth authorization, an instance of the Google Cloud SQL
service is obtained.
• To launch a new instance, the instances().insert method of the Google Cloud SQL
API is used.
• The request body of this method contains properties such as instance, project,
tier, pricingPlan and replicationType.

# Python program for launching a Google Cloud SQL instance (excerpt)
def main():
    #OAuth 2.0 authorization.
    flow = flow_from_clientsecrets(CLIENT_SECRETS, scope=GS_SCOPE)
    storage = Storage(OAUTH2_STORAGE)
    credentials = storage.get()

    if credentials is None or credentials.invalid:
        credentials = run(flow, storage)
    http = httplib2.Http()
    auth_http = credentials.authorize(http)

    gcs_service = build('sqladmin', API_VERSION, http=auth_http)

    # Define request body
    instance = {"instance": "mydb",
                "project": "bahgacloud",
                "settings": {
                    "tier": "D0",
                    "pricingPlan": "PER_USE",
                    "replicationType": "SYNCHRONOUS"}}
# Linux VM configuration
linux_config = LinuxConfigurationSet('bahga', 'arshdeepbahga', 'Arsh~2483', True)

#Create instance
sms.create_virtual_machine_deployment(service_name=name,
    deployment_name=name, deployment_slot='production',
    label=name, role_name=name, system_config=linux_config,
    os_virtual_hard_disk=os_hd, role_size='Small')
Azure Storage – Python Example
# Python example of using Azure Blob Service (excerpt)

Python for MapReduce
#Inverted Index Mapper in Python
#!/usr/bin/env python
import sys

for line in sys.stdin:
    # each input line is assumed to be of the form: doc_id <tab> content
    doc_id, content = line.split('\t', 1)
    words = content.split()
    for word in words:
        print '%s\t%s' % (word, doc_id)

• The reducer reads the key-value pairs grouped by the same key from the standard
input (stdin) and creates a list of document-IDs in which the word occurs.
• The output of the reducer contains key-value pairs where the key is a unique word
and the value is the list of document-IDs in which the word occurs.

#Inverted Index Reducer in Python
#!/usr/bin/env python
import sys

current_word = None
current_docids = []

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, doc_id = line.split('\t')
    if current_word == word:
        current_docids.append(doc_id)
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_docids)
        current_docids = []
        current_docids.append(doc_id)
        current_word = word

# output the document-ID list for the last word
if current_word:
    print '%s\t%s' % (current_word, current_docids)
Python Packages of Interest
• JSON
• JavaScript Object Notation (JSON) is an easy to read and write data-interchange format. JSON is used as an alternative to XML and
is easy for machines to parse and generate. JSON is built on two structures - a collection of name-value pairs (e.g. a Python
dictionary) and ordered lists of values (e.g. a Python list).
• XML
• XML (Extensible Markup Language) is a data format for structured document interchange. The Python minidom library provides a
minimal implementation of the Document Object Model interface and has an API similar to that in other languages.
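A short sketch of both packages: the json module converts between Python objects and JSON text, and xml.dom.minidom parses an XML document; the sample data is an illustrative assumption.

# JSON and XML handling with the standard library
import json
from xml.dom import minidom

# JSON: a Python dictionary (name-value pairs) and list (ordered values)
record = {"device": "sensor-1", "readings": [21.5, 22.0, 22.4]}
text = json.dumps(record)            # serialize to a JSON string
print(text)
print(json.loads(text)["readings"])  # parse JSON text back into Python objects

# XML: parse a document and read an element's attribute and text with the DOM API
doc = minidom.parseString("<devices><device id='sensor-1'>lab</device></devices>")
element = doc.getElementsByTagName("device")[0]
print(element.getAttribute("id"), element.firstChild.data)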
• Django is an open source web application framework for developing web applications in Python.
• A web application framework in general is a collection of solutions, packages and best practices
that allows development of web applications and dynamic websites.
• Django is based on the Model-Template-View architecture and provides a separation of the data
model from the business rules and the user interface.
• Django provides a unified API to a database backend.
• Thus web applications built with Django can work with different databases without requiring any
code changes.
• With this flexibility in web application design combined with the powerful capabilities of the Python
language and the Python ecosystem, Django is best suited for cloud applications.
• Django consists of an object-relational mapper, a web templating system and a regular-expression-
based URL dispatcher.
Django Architecture
• Django is a Model-Template-View (MTV) framework.
• Model
• The model acts as a definition of some stored data and handles the interactions with the database. In a
web application, the data can be stored in a relational database, non-relational database, an XML file,
etc. A Django model is a Python class that outlines the variables and methods for a particular type of
data.
• Template
• In a typical Django web application, the template is simply an HTML page with a few extra
placeholders. Django’s template language can be used to create various forms of text files (XML,
email, CSS, Javascript, CSV, etc.)
• View
• The view ties the model to the template. The view is where you write the code that actually generates
the web pages. View determines what data is to be displayed, retrieves the data from the database and
passes the data to the template.
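A minimal hedged sketch of the model and view pieces in Django (assumed installed); the field names, view and template names are illustrative, and a real project additionally needs settings, URL configuration and migrations.

# models.py -- a Django model outlining the stored data (illustrative fields)
from django.db import models

class Photo(models.Model):
    title = models.CharField(max_length=100)
    uploaded_on = models.DateTimeField(auto_now_add=True)

# views.py -- a view that retrieves the data and passes it to a template
from django.shortcuts import render

def photo_list(request):
    photos = Photo.objects.order_by("-uploaded_on")
    return render(request, "photo_list.html", {"photos": photos})

The assumed template photo_list.html would loop over the photos using the Django template language to produce the HTML shown to the user.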
Django Setup on Amazon EC2
Cloud Application Development in Python
Outline
• Design Approaches
• Design methodology for IaaS service model
• Design methodology for PaaS service model
• Cloud application case studies including:
• Image Processing App
• Document Storage App
• MapReduce App
• Social Media Analytics App
Design methodology for IaaS service model
Component Design
•Identify the building blocks of the application and the functions to be performed by each block
•Group the building blocks based on the functions performed and the type of cloud resources required, and
identify the application components based on the groupings
•Identify the inputs and outputs of each component
•List the interfaces that each component will expose
•Evaluate the implementation alternatives for each component (design patterns such as MVC, etc.)
Architecture Design
•Define the interactions between the application components
Deployment Design
•Map the application components to specific cloud resources (such as web servers,
application servers, database servers, etc.)
Design methodology for PaaS service model
• For applications that use the Platform-as-a-service (PaaS) cloud service model, the architecture and
deployment design steps are not required since the platform takes care of the architecture and
deployment.
• Component Design
• In the component design step, the developers have to take into consideration the platform specific features.
• Platform Specific Software
• Different PaaS offerings such as Google App Engine, Windows Azure Web Sites, etc., provide platform specific software
development kits (SDKs) for developing cloud applications.
• Sandbox Environments
• Applications designed for specific PaaS offerings run in sandbox environments and are allowed to perform only those
actions that do not interfere with the performance of other applications.
• Deployment & Scaling
• The deployment and scaling is handled by the platform while the developers focus on the application development
using the platform-specific SDKs.
• Portability
• Portability is a major constraint for PaaS based applications as it is difficult to move the
application from one PaaS vendor to another.
Image Processing App – Component Design
• Functionality:
• A cloud-based Image Processing application.
• This application provides online image filtering capability.
• Users can upload image files and choose the filters to apply.
• The selected filters are applied to the image and the
processed image can then be downloaded.
• Component Design
• Web Tier: The web tier for the image processing app has front
ends for image submission and displaying processed images.
• Application Tier: The application tier has components for
processing the image submission requests, processing the
submitted image and processing requests for displaying the
results.
• Storage Tier: The storage tier comprises the storage for processed images.
Component design for Image Processing App
Image Processing App – Architecture Design
Cloud Drive App – Component Design
• Component Design
• Web Tier: The web tier for the Cloud Drive app has front ends for
uploading files, viewing/deleting files and user profile.
• Application Tier: The application tier has components for
processing requests for uploading files, processing requests for
viewing/deleting files and the component that handles the
registration, profile and login functions.
• Database Tier: The database tier comprises a user credentials database.
• Storage Tier: The storage tier comprises the storage for files.
Component design for Cloud Drive App
Cloud Drive App – Architecture Design
• The architecture design step defines the interactions between the application
components.
• For each component, the corresponding interactions with the other components are defined.
Architecture design for Cloud Drive App
• The user receives an email notification with the download link for
the results when the job is complete.
MapReduce App – Deployment Design
Clustering Big Data – DBSCAN
• DBSCAN can find irregular shaped clusters as seen from this example and can even find a cluster completely surrounded
by a different cluster.
• DBSCAN considers some points as noise and does not assign them to any cluster.
Classification of Big Data
• Classification is the process of categorizing objects into predefined categories.
• Classification is achieved by classification algorithms that belong to a broad category of algorithms called supervised
machine learning.
• Supervised learning involves inferring a model from a set of input data and known responses to the data (training
data) and then using the inferred model to predict responses to new data.
• Binary classification
• Binary classification involves categorizing the data into two categories. For example, classifying the sentiment of a
news article into positive or negative, classifying the state of a machine into good or faulty, classifying a health
test into positive or negative, etc.
• Multi-class classification
• Multi-class classification involves more than two classes into which the data is categorized. For example, gene
expression classification problem involves multiple classes.
• Document classification
• Document classification is a type of multi-class classification approach in which the data to be classified is in the form
of text documents, for example, classifying news articles into different categories such as politics, sports, etc.
Performance of Classification Algorithms
• Precision: Precision is the fraction of objects assigned to a category that actually belong to that category.
• Recall: Recall is the fraction of objects belonging to a category that are classified correctly.
• Accuracy: Accuracy is the fraction of all objects that are classified correctly.
• F1-score: F1-score is a measure of accuracy that considers both precision and recall. F1-score is the harmonic
mean of precision and recall, given as: F1 = 2 * (precision * recall) / (precision + recall)
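A small worked example with assumed counts of true positives (TP), false positives (FP) and false negatives (FN) for one category:

# Precision, recall and F1-score from assumed confusion-matrix counts
tp, fp, fn = 80, 20, 10          # hypothetical counts for one category

precision = tp / float(tp + fp)  # fraction of objects assigned to the category that are correct
recall = tp / float(tp + fn)     # fraction of objects in the category that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(precision, recall, f1)     # 0.8, 0.888..., 0.842...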
Naive Bayes
• Naive Bayes is a probabilistic classification algorithm based on the Bayes theorem
with a naive assumption about the independence of the feature attributes. Given a
class variable C and feature variables F1,...,Fn, the conditional probability
(posterior) according to Bayes theorem is given as:
P(C | F1,...,Fn) = P(C) * P(F1,...,Fn | C) / P(F1,...,Fn)
• Since the evidence P(F1,...,Fn) is constant for a given input and does not depend
on the class variable C, only the numerator of the posterior probability is
important for classification.
• With this simplification and the naive independence assumption, classification can then be done as:
predicted class = argmax over C of P(C) * P(F1 | C) * ... * P(Fn | C)
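A hedged sketch using scikit-learn's GaussianNB classifier (a third-party library, assumed installed); the tiny training set is made up for illustration.

# Naive Bayes classification sketch with scikit-learn (assumed: pip install scikit-learn)
from sklearn.naive_bayes import GaussianNB

# Feature vectors (e.g. two sensor readings) and their known class labels
X_train = [[5.0, 1.0], [5.5, 1.2], [1.0, 4.8], [0.8, 5.2]]
y_train = ["good", "good", "faulty", "faulty"]

model = GaussianNB()
model.fit(X_train, y_train)                      # infer the model from training data

print(model.predict([[5.2, 0.9], [0.9, 5.0]]))   # predict classes for new data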
Decision Trees
• Decision Trees are a supervised learning method that use a tree created
from simple decision rules learned from the training data as a predictive
model.
• The predictive model is in the form of a tree that can be used to predict
the value of a target variable based on several attribute variables.
• Each node in the tree corresponds to one attribute in the dataset on
which the “split” is performed.
• Each leaf in a decision tree represents a value of the target variable.
• The learning process involves recursively splitting on the attributes until all
the samples in the child node have the same value of the target variable
or splitting further results in no further information gain.
• To select the best attribute for splitting at each stage, different metrics
can be used.
Splitting Attributes in Decision Trees
To select the best attribute for splitting at each stage, different metrics can be used such as:
• Information Gain
• Information gain is defined based on the entropy of the random variable, which is defined
as: H(X) = - Σ p(x) log2 p(x), where the sum is over the possible values x of the variable.
• Entropy is a measure of uncertainty in a random variable, and choosing the attribute with the
highest information gain results in a split that reduces the uncertainty the most at that
stage.
• Gini Coefficient
• Gini coefficient measures the inequality, i.e. how often a randomly chosen sample that is
labeled based on the distribution of labels would be labeled incorrectly. Gini coefficient is
defined as: G = 1 - Σ pi^2, where pi is the fraction of samples with label i.
Decision Tree Algorithms
• There are different algorithms for building decision trees, popular ones being ID3 and C4.5.
• ID3:
• Attributes are discrete. If not, discretize the continuous attributes.
• Calculate the entropy of every attribute using the dataset.
• Choose the attribute with the highest information gain.
• Create branches for each value of the selected attribute.
• Repeat with the remaining attributes.
• The ID3 algorithm can result in over-fitting to the training data and can be expensive to train, especially for
continuous attributes.
• C4.5
• The C4.5 algorithm is an extension of the ID3 algorithm. C4.5 supports both discrete and continuous attributes.
• To support continuous attributes, C4.5 finds thresholds for the continuous attributes and then splits based on the
threshold values. C4.5 prevents over-fitting by pruning trees after they have been created.
• Pruning involves removing or aggregating those branches which provide little discriminatory power.
Random Forest
• Random Forest is an ensemble learning method that is based on randomized decision trees.
• Random Forest trains a number of decision trees and then takes the majority vote by using the mode of the class predicted by
the individual trees.
Breiman’s Algorithm
1.Draw a bootstrap sample (n times with replacement from the N samples in the training set)
from the dataset
2.Train a decision tree
-Until the tree is fully grown (maximum size)
-Choose next leaf node
-Select m attributes (m is much less than the total number of attributes M) at random.
-Choose the best attribute and split as usual
3.Measure out-of-bag error
- Use the rest of the samples (not selected in the bootstrap) to estimate the error of
the tree, by predicting their classes.
4.Repeat steps 1-3 k times to generate k trees.
5.Make a prediction by majority vote among the k trees
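A hedged sketch of the ensemble idea with scikit-learn's RandomForestClassifier (assumed installed); n_estimators is the number of randomized trees whose majority vote gives the prediction.

# Random Forest sketch with scikit-learn (assumed: pip install scikit-learn)
from sklearn.ensemble import RandomForestClassifier

X_train = [[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [3, 3]]
y_train = [0, 1, 0, 1, 1, 1]                 # made-up binary labels

# 50 randomized decision trees; each split considers a random subset of attributes
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

print(forest.predict([[0, 0], [2, 3]]))      # majority vote of the individual trees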
Support Vector Machine
• Support Vector Machine (SVM) is a supervised machine
learning approach used for classification and regression.
• The basic form of SVM is a binary classifier that classifies the
data points into one of the two classes.
• SVM training involves determining the maximum
margin hyperplane that separates the two classes.
• The maximum margin hyperplane is one which has the
largest separation from the nearest training data point.
• Given a training data set (xi ,yi ) where xi is an n dimensional
vector and yi = 1 if xi is in class 1 and yi = -1 if xi is in class 2.
• A standard SVM finds a hyperplane w.x-b = 0, which correctly
separates the training data points and has a maximum margin
which is the distance between the two hyperplanes w.x-b = 1
and w.x-b = -1
Support Vector Machine
Binary classification with Linear SVM and binary classification with RBF SVM
Recommendation Systems
Multimedia Cloud Reference Architecture
• Infrastructure Services
• In the Multimedia Cloud reference architecture, the first layer is the
infrastructure services layer that includes computing and storage resources.
• Platform Services
• On top of the infrastructure services layer is the platform services layer
that includes frameworks and services for streaming and associated tasks
such as transcoding and analytics that can be leveraged for rapid
development of multimedia applications.
• Applications
• The topmost layer is the applications such as live video streaming, video
transcoding, video-on-demand, multimedia processing etc.
• Cloud-based multimedia applications alleviate the burden of installing and
maintaining multimedia applications locally on the multimedia consumption
devices (desktops, tablets, smartphones, etc.) and provide access to rich
multimedia content.
• Service Models
• A multimedia cloud can have various service models such as IaaS, PaaS
and SaaS that offer infrastructure, platform or application services.
Multimedia Cloud - Live Video Streaming
• Workflow of a live video streaming application that uses multimedia cloud:
• The video and audio feeds generated by a number of cameras and microphones are mixed/multiplexed with
video/audio mixers and then encoded by a client application, which then sends the encoded feeds to the multimedia
cloud.
• On the cloud, streaming instances are created on-demand and the streams are then broadcast over the internet.
• The streaming instances also record the event streams which are later moved to the cloud storage for video
archiving.
Synthetic Workload Generation Approaches
• Empirical approach
• In this approach, traces of applications are sampled and replayed to generate synthetic workloads.
• The empirical approach lacks flexibility as the real traces obtained from a particular system are used for
workload generation which may not well represent the workloads on other systems with different
configurations and load conditions.
• Analytical approach
• Uses mathematical models to define the workload characteristics that are used by a synthetic workload
generator.
• Analytical approach is flexible and allows generation of workloads with different characteristics by
varying the workload model attributes.
• With the analytical approach it is possible to modify the workload model parameters one at a time and
investigate the effect on application performance to measure the application sensitivity to different
parameters.
User Emulation vs Aggregate Workloads
The commonly used techniques for workload generation are:
• User Emulation
• Each user is emulated by a separate thread that mimics the actions of a user by alternating between making
requests and lying idle.
• The attributes for workload generation in the user emulation method include think time, request types and inter-
request dependencies, for instance.
• User emulation allows fine grained control over modeling the behavioral aspects of the users interacting with the
system under test; however, it does not allow controlling the exact time instants at which the requests arrive at the
system.
• Aggregate Workload Generation:
• Allows specifying the exact time instants at which the requests should arrive at the system under test.
• However, there is no notion of an individual user in aggregate workload generation; therefore, it is not possible
to use this approach when dependencies between requests need to be satisfied.
• Dependencies can be of two types: inter-request dependencies and data dependencies.
• An inter-request dependency exists when the current request depends on the previous request, whereas a data
dependency exists when the current request requires input data which is obtained from the response of the
previous request.
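A minimal sketch of the user emulation technique: each emulated user runs in its own thread, alternating between issuing a request and lying idle for a think time. The target URL, session length and think time range are assumptions.

# User emulation sketch: one thread per emulated user with think time between requests
import threading
import time
import random
try:
    from urllib.request import urlopen       # Python 3
except ImportError:
    from urllib2 import urlopen               # Python 2

TARGET_URL = "http://localhost:8080/"        # assumed system under test
SESSION_LENGTH = 5                            # requests per emulated session
THINK_TIME_RANGE = (1.0, 3.0)                 # seconds of idle time between requests

def emulate_user(user_id):
    for i in range(SESSION_LENGTH):
        start = time.time()
        try:
            urlopen(TARGET_URL).read()
        except Exception as e:
            print("user %d request %d failed: %s" % (user_id, i, e))
        else:
            print("user %d request %d took %.3fs" % (user_id, i, time.time() - start))
        time.sleep(random.uniform(*THINK_TIME_RANGE))   # think time

threads = [threading.Thread(target=emulate_user, args=(u,)) for u in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()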
Workload Characteristics
• Session
• A set of successive requests submitted by a user constitute a session.
• Inter-Session Interval
• Inter-session interval is the time interval between successive sessions.
• Think Time
• In a session, a user submits a series of requests in succession. The time interval between
two successive requests is called think time.
• Session Length
• The number of requests submitted by a user in a session is called the session length.
• Workload Mix
• Workload mix defines the transitions between different pages of an application and the
proportion in which the pages are visited.
Application Performance Metrics
The most commonly used performance metrics for cloud applications are:
• Response Time
• Response time is the time interval between the moment when the user submits a
request to the application and the moment when the user receives a response.
• Throughput
• Throughput is the number of requests that can be serviced per second.
Considerations for Benchmarking Methodology
• Accuracy
• Accuracy of a benchmarking methodology is determined by how closely the generated synthetic workloads
mimic the realistic workloads.
• Ease of Use
• A good benchmarking methodology should be user friendly and should involve minimal hand coding effort
for writing scripts for workload generation that take into account the dependencies between requests,
workload attributes, for instance.
• Flexibility
• A good benchmarking methodology should allow fine grained control over the workload attributes such as
think time, inter-session interval, session length, workload mix, for instance, to perform sensitivity analysis.
• Sensitivity analysis is performed by varying one workload characteristic at a time while keeping the others
constant.
• Wide Application Coverage
• A good benchmarking methodology is one that works for a wide range of applications and not tied to the
application architecture or workload types.
Types of Tests
• Baseline Tests
• Baseline tests are done to collect the performance metrics data of the entire application or a component of the
application.
• The performance metrics data collected from baseline tests is used to compare various performance tuning
changes which are subsequently made to the application or a component.
• Load Tests
• Load tests evaluate the performance of the system with multiple users and workload levels that are encountered in
the production phase.
• The number of users and workload mix are usually specified in the load test configuration.
• Stress Tests
• Stress tests load the application to a point where it breaks down.
• These tests are done to determine how the application fails, the conditions in which the application fails and the
metrics to monitor which can warn about impending failures under elevated workload levels.
• Soak Tests
• Soak tests involve subjecting the application to a fixed workload level for long periods of time.
• Soak tests help in determining the stability of the application under prolonged use and how the performance changes
with time.
Deployment Prototyping
• Deployment prototyping can help in making deployment architecture design choices.
• By comparing performance of alternative deployment architectures, deployment
prototyping can help in choosing the best and most cost effective deployment
architecture that can meet the application performance requirements.
The below diagram shows the Authentication flow for a Cloud Application using SAML SSO
b) Kerberos: -
• Kerberos is an open authentication protocol that was developed at MIT.
• Kerberos uses tickets for authenticating clients to a service, where the client and
the service communicate over an insecure network.
• Kerberos provides mutual authentication, i.e. both the client and the server authenticate
with each other.
• Below diagram shown Kerberos Authentication Flow:
2)One Time Password (OTP) :-
• A one time password is another authentication mechanism that uses passwords which are
valid for a single use only, for a single transaction or session.
• Authentication mechanisms based on OTP tokens are more secure because they are not
vulnerable to replay attacks.
• Text messaging (SMS) is the most common delivery mode for OTP tokens.
• The most common approach for generating OTP tokens is time synchronization.
• Time-based OTP algorithm (TOTP) is a popular time synchronization based algorithm
for generating OTPs.
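A hedged sketch of the TOTP idea: an HMAC of the current time step, computed with a shared secret, is truncated to a short numeric code (following the RFC 6238/4226 approach; the secret below is an illustrative assumption).

# Time-based OTP (TOTP) sketch following the RFC 6238/4226 approach
import base64
import hashlib
import hmac
import struct
import time

def totp(shared_secret_b32, interval=30, digits=6):
    key = base64.b32decode(shared_secret_b32, casefold=True)
    counter = int(time.time()) // interval                 # current time step
    msg = struct.pack(">Q", counter)                       # 8-byte big-endian counter
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = ord(digest[-1:]) & 0x0F                       # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)

# Illustrative base32 secret shared between the user's token and the server
print(totp("JBSWY3DPEHPK3PXP"))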
Authorization
Introduction: -
• Authorization refers to specifying the access rights to the protected resources using
access policies.
• OAuth:
o OAuth is an open standard for authorization that allows resource owners to
share their private resources stored on one site with another site without
handing out the credentials.
o In the OAuth model, an application (which is not the resource owner) requests
access to resources controlled by the resource owner (but hosted by the server).
o The resource owner grants permission to access the resources in the form of a
token and matching shared-secret.
o Tokens make it unnecessary for the resource owner to share its credentials with
the application.
o Tokens can be issued with a restricted scope and limited lifetime, and
revoked independently.
Below Diagram shows an example of the Role Based Access Control Framework
in the cloud.
Data Security
Introduction: -
• Securing data in the cloud is critical for cloud applications as the data flows
from applications to storage and vice versa. Cloud applications deal with both
data at rest and data in motion.
• There are various types of threats that can exist for data in the cloud such as
denial of service, replay attacks, man-in-the-middle attacks, unauthorized
access/modification, etc.
• Encryption is the process of converting data from its original form (i.e.,
plaintext) to a scrambled form (ciphertext) that is unintelligible. Decryption
converts data from ciphertext to plaintext.
a) Symmetric Encryption:
• Symmetric encryption uses the same secret key for both encryption and
decryption.
• The secret key is shared between the sender and the receiver.
• Symmetric encryption is best suited for securing data at rest since the data is
accessed by known entities from known locations.
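A hedged sketch of symmetric encryption using the Fernet recipe from the third-party cryptography package (assumed installed); in practice the shared secret key must itself be distributed and stored securely.

# Symmetric encryption sketch with Fernet (assumed: pip install cryptography)
from cryptography.fernet import Fernet

key = Fernet.generate_key()               # the shared secret key
cipher = Fernet(key)

plaintext = b"patient record #1234"
ciphertext = cipher.encrypt(plaintext)    # scrambled form, suitable for storing at rest
print(ciphertext)

print(cipher.decrypt(ciphertext))         # only holders of the key can recover the plaintext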
b) Asymmetric Encryption:
• Asymmetric encryption uses two keys, one for encryption (the public key) and the
other for decryption (the private key).
• The two keys are linked to each other such that one key encrypts plaintext to
ciphertext and the other decrypts ciphertext back to plaintext.
• Public key can be shared or published while the private key is known only to
the user.
• Asymmetric encryption is best suited for securing data that is exchanged
between two parties where symmetric encryption can be unsafe because the
secret key has to be exchanged between the parties and anyone who manages
to obtain the secret key can decrypt the data.
• In asymmetric encryption a separate key is used for decryption which is kept
private.
c)Encryption Levels:
Encryption can be performed at various levels described as follows:
- Application
- Host
- Network
- Device
Application:
• Application-level encryption involves encrypting application data right at the point where
it originates i.e. within the application.
• Application-level encryption provides security for the data both at the level of the
operating system and from other applications.
• An application encrypts all data generated in the application before it flows to the lower
levels and presents decrypted data to the user.
Host:
• In host-level encryption, encryption is performed at the file-level for all applications running
on the host.
• Host level encryption can be done in software in which case additional computational resource
is required for encryption or it can be performed with specialized hardware such as a
cryptographic accelerator card.
Network:
• Network-level encryption is best suited for cases where the threats to data are at the network or
storage level and not at the application or host level.
• Network-level encryption is performed when moving the data from a creation point to its
destination, using specialized hardware that encrypts all incoming data in real-time.
Device:
• Device-level encryption is performed on a disk controller or a storage server
• Device level encryption is easy to implement and is best suited for cases where the primary
concern about data security is to protect data residing on storage media
Data integrity:
Data integrity means that the data remains unchanged when moving from sender to
receiver.
Data integrity ensures that the data is not altered in an unauthorized manner after it is
created, transmitted or stored.
Transport Layer Security (TLS) and Secure Socket Layer (SSL) are the mechanisms used for
securing data in motion.
TLS and SSL are used to encrypt web traffic carried over the Hypertext Transfer Protocol (HTTP).
TLS and SSL use asymmetric cryptography for authentication of key exchange, symmetric
encryption for confidentiality and message authentication codes for message integrity.
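Independent of TLS/SSL, the idea of an integrity check can be sketched with a cryptographic hash: the receiver recomputes the digest and compares it with the digest supplied by the sender (the messages are illustrative; for authenticated integrity a keyed MAC such as HMAC would be used, as TLS does).

# Data integrity sketch: detect modification using a SHA-256 digest
import hashlib

message = b"transfer 100 to account 42"
sent_digest = hashlib.sha256(message).hexdigest()     # computed by the sender

# ... message and digest travel over the network ...

received = b"transfer 900 to account 42"              # tampered with in transit
if hashlib.sha256(received).hexdigest() != sent_digest:
    print("Integrity check failed: data was altered")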
Key Management
Introduction: -
Management of encryption keys is critical to ensure security of encrypted
data. The key management lifecycle involves different phases including:
Creation: Creation of keys is the first step in the key management lifecycle. Keys must be
created in a secure environment and must have adequate strength. It is recommended to
encrypt the keys themselves with a separate master key.
Backup: Backup of keys must be made before putting them into production because in the
event of loss of keys, all encrypted data can become useless.
Deployment: In this phase the new key is deployed for encrypting the data. Deployment of
a new key involves re-keying existing data.
Monitoring: After a key has been deployed, monitoring the performance of the encryption
environment is done to ensure that the key has been deployed correctly.
Rotation: Key rotation involves creating a new key and re-encrypting all data with the new
key.
Expiration: Key expiration phase begins after the key rotation is complete. It is
recommended to complete the key rotation process before the expiry of the existing key.
Archival: Archival is the phase before the key is finally destroyed. It is recommended to
archive old keys for some period of time to account for scenarios where there is still some
data in the system that is encrypted with the old key.
Destruction: Expired keys are finally destroyed after ensuring that there is no data
encrypted with the expired keys.
• Auditing requires that all read and write accesses to data be logged.
• Logs can include the user involved, type of access, timestamp, actions performed and
records accessed.
• The main purpose of auditing is to find security breaches, so that necessary changes can be made in
the application and deployment to prevent further security breaches.
Objectives:
The objectives of auditing include:
• Verify efficiency and compliance of identity and access management controls as per established
access policies.
• Verifying that authorized users are granted access to data and services based on their roles.
• Verify whether access policies are updated in a timely manner upon change in the roles of the users.
• Verify whether the data protection policies are sufficient.
• Assessment of support activities such as problem management
Some of the Education programs running on the Cloud platforms are listed
below:
MOOCs
• MOOCs are aimed at large audiences and use cloud technologies for providing
audio/video content, readings, assignment and exams.
• Cloud-based auto-grading applications are used for grading exams and assignments.
Cloud-based applications for peer grading of exams and assignments are also used in
some MOOCs
Online Programs
• Many universities across the world are using cloud platforms for providing online
degree programs.
• Lectures are delivered through live/recorded video using cloud-based content delivery
networks to students across the world.
Online Proctoring
• Online proctoring for distance learning programs is also becoming popular through the
use of cloud-based live video streaming technologies where online proctors observe test
takers remotely through video.
Virtual Labs
• Access to virtual labs is provided to distance learning students through the cloud. Virtual
labs provide remote access to the same software and applications that are used by
students on campus.
Course Management Platforms
• Cloud-based course management platforms are used for sharing reading materials,
providing assignments and releasing grades, for instance.
• Cloud-based collaboration applications such as online forums can help students discuss
common problems and seek guidance from experts.
Information Management
• Universities, colleges and schools can use cloud-based information management systems
to improve administrative efficiency, offer online and distance education programs,
online exams, track progress of students, collect feedback from students, for instance.
Reduce Cost of Education
• Cloud computing thus has the potential of helping in bringing down the cost of
education by increasing the student-teacher ratio through the use of online learning
platforms and new evaluation approaches without sacrificing quality.
Below diagram shows the generic use of the Cloud for Education:
******ALL THE BEST******