Business-Logic Layer Design: Architecting With Google Cloud Platform: Design and Process

Uploaded by

Daniel Reyes

Business-logic Layer Design

Architecting with Google Cloud Platform:


Design and Process

Last modified 2018-08-08


© 2017 Google Inc. All rights reserved. Google
and the Google logo are trademarks of Google Inc.
All other company and product names may be
trademarks of the respective companies with
which they are associated.
“In computer software, business logic
or domain logic is the part of the
program that encodes the real-world
business rules that determine how data
can be created, stored, and changed.”
Wikipedia

© 2018 Google Inc. All rights reserved. Google and the Google logo
are trademarks of Google Inc. All other company and product names may
be trademarks of the respective companies with which they are associated.

https://en.wikipedia.org/wiki/Business_logic
Agenda

Microservices architecture

GCP 12-factor support

Mapping compute needs to platform products

Compute system provisioning

The photo service is slow

Design challenge #1: Log aggregation

GCP lab Deployment Manager: Package and deploy

Microservices architecture

A microservice architecture is a method of developing software applications as a
suite of independently deployable, small, modular services.

Each service runs a unique process and communicates through a well-defined,
lightweight mechanism.

Each service contributes to a business goal.

http://microservices.io/patterns/microservices.html
https://cloud.google.com/appengine/docs/standard/python/microservices-on-app-engine
Microservices architecture is considered a specific type of service-oriented
architecture (SOA).
Benefits of microservices design
Benefits of small separate services:

● Atomic, single-purpose code is easier to develop and maintain


● Does one thing and does it very well
● Supports A/B testing

Independently developed services aid in:

● Fault isolation
● Debugging
● Redundancy and resiliency

BUT it’s harder to understand how the microservices interoperate.

● Unit testing is easier, integration testing is harder.

Allow for independent deployment cycles, including rollback. Facilitate concurrent
A/B release testing on subsystems. Minimize test automation and quality-assurance
overhead. Improve clarity of logging and monitoring. Provide fine-grained cost
accounting. Increase overall application scalability and reliability.

Small is good
● Easier to understand, faster to develop, more productive, A/B testing
● Faster startup (parallelism in system startup/boot)
● Granular cost
Independently developed and deployed versions, modular/replaceable parts
● Each microservice can be developed and deployed independently
● Easier to deploy new versions
● Modular/replaceable parts reduce "lock in" to a single solution or technology
Improved fault isolation
● Limits the system impact of a failure (for example, a memory leak)
● Easier debugging
● Better reliability/redundancy
Distributed design
● Difficult to implement/Difficult to manage the business logic
● Distributed transactions
● IDEs not geared for it
● Interservice communications
● Testing complexity
Deployment complications
● Resource overhead
● Isolation comes at a cost; for example, multiple VMs instead of one VM means
multiple VM overhead

Post about the drawbacks of microservices:
http://www.ptone.com/dablog/2015/07/microservices-may-be-the-new-premature-optimization/
How microservices complicate business logic

[Diagram: a Unified Banking Service compared with separate Deposit, Withdrawal, and
Accounting Microservices connected by cross-services communications]

Person "A" wants to deposit money in the bank in one location so that person "B" can
withdraw the money in another location.

In the unified service implementation, the deposit results in an immediate change to
the account. The deposit transaction is atomic. When it is completed, the withdrawal
transaction is permitted. The withdrawal is also atomic, so the account is reduced at
the same time the cash is physically released from the ATM.

In a microservices design, the Deposit Microservice is separate from the Withdrawal
Microservice, and both are separate from the Accounting Microservice. Performing a
deposit or a withdrawal now requires cross-services communications. Several
messages have to pass between the services to implement a transaction.
Allowances have to be made in the business logic in case the cross-services
communication drops.

Example 1: When is the cash accepted at the ATM by the Deposit Microservice, and
when is the account value increased by the Accounting Microservice? If the cash is
accepted before the account is updated, and communication drops, then the deposit
might not register. If the account is increased before the cash is accepted by the
ATM, then there is an incentive to disrupt the ATM after the transaction has started
and before the cash is ingested by the ATM.

Example 2: When a withdrawal is being made, a similar complication arises in
cross-service communications. If the cash is released from the machine before the
account is updated, the money could be lost. If the account is reduced first, and the
communication drops before the cash can be released from the machine, then the
account will show the money was taken out, but the user will not have received it.

Making both of these examples work requires a multi-trip negotiation between the
services to "open" a transaction, hold state on both sides of the transaction, and only
"close" the transaction when it is verified that all of the constituent actions have been
completed. Only by holding state can the microservices guard against loss of
communications.
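
The "open, hold state, close" negotiation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a banking implementation: the `Transaction`, `Ledger`, and `close` names are hypothetical, both services run in one process, and a dropped message is simulated by simply never setting a flag.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    OPEN = "open"
    CLOSED = "closed"
    ABORTED = "aborted"


@dataclass
class Transaction:
    """State held for one deposit while the services negotiate."""
    tx_id: str
    amount: int
    cash_accepted: bool = False      # confirmed by the (hypothetical) Deposit side
    account_credited: bool = False   # confirmed by the (hypothetical) Accounting side
    state: State = State.OPEN


@dataclass
class Ledger:
    """Hypothetical stand-in for the Accounting Microservice."""
    balance: int = 0
    applied: set = field(default_factory=set)

    def credit(self, tx: Transaction) -> None:
        # Idempotent: a retried message for the same tx_id is applied once.
        if tx.tx_id not in self.applied:
            self.balance += tx.amount
            self.applied.add(tx.tx_id)
        tx.account_credited = True

    def reverse(self, tx: Transaction) -> None:
        # Undo a credit that was applied for a transaction that never completed.
        if tx.tx_id in self.applied:
            self.balance -= tx.amount
            self.applied.discard(tx.tx_id)
        tx.account_credited = False


def close(tx: Transaction, ledger: Ledger) -> State:
    """Close only when every constituent action is verified; otherwise undo."""
    if tx.cash_accepted and tx.account_credited:
        tx.state = State.CLOSED
    else:
        ledger.reverse(tx)  # a real ATM would also have to return any cash
        tx.state = State.ABORTED
    return tx.state
```

Because each side records what it has confirmed, a retry after a dropped message is safe (the credit is applied once), and a transaction that never reaches both confirmations is reversed rather than left half-done.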

Use microservices where they
make sense in the design

Microservices make sense when there are many consumers of an atomic unit of
functionality.

When there is one consumer of tightly coupled functionality, microservices add
overhead without much benefit.

Advice
Smaller and independent isn't always better: sometimes centralized control is better.
Do you have the staff/processes to deal with thousands of tiny microservices?
Can you debug thousands of loosely connected microservices?
How can you track the business logic? What if it changes? Do you need to modify
thousands of applications to implement the change?
Consider the processes, not just the technical design.
This is not a "set it and forget it" strategy: you need to plan and implement
processes to monitor the microservices and decide when to expand them.

Cloud Functions is useful in a microservices design

Google Cloud Functions is a lightweight compute solution for developers to create
single-purpose, stand-alone functions that respond to cloud events without the
need to manage a server or runtime environment. Cloud Functions runs JavaScript
in Node.js and supports both frontend HTTP functions and background functions.

However, there are some limitations:

● Cloud Functions is not a low-latency service.
● Because it is serverless, there are few resources that can be adjusted for
price/performance tradeoffs.

Cloud Functions is the name of the Google service. A single entity of this service is a
cloud function. A cloud function is Google's implementation of what is commonly
known in computer science as an anonymous function, a lambda function, or a
function literal. The self-contained function is registered with Cloud Functions,
triggered by an event, and executes without the overhead of a full application
environment.
https://en.wikipedia.org/wiki/Anonymous_function
Microservices example using Cloud Functions

https://cloud.google.com/functions/docs/tutorials/ocr
[1] An image is uploaded to storage, triggering a Cloud Function.
The Cloud Function uses the Vision API to pull out the text from the image.
The Cloud Function:
● [2] Queues the string of text to the Translate API.
● [3] Uses Cloud Pub/Sub to pass state and trigger a new Cloud
Function to translate the queued text to the desired language.
● [4] Uses Cloud Pub/Sub to pass state and trigger a new Cloud Function
to post the translation to storage.

1. An image is uploaded to storage, triggering a cloud function.
2. The cloud function uses Vision API to pull out the text from the image.
3. The cloud function:
   a. Queues the string of text to the Translate API.
   b. Uses Cloud Pub/Sub to pass state and trigger a new cloud function to
      translate the queued text to the desired language.
   c. Uses Cloud Pub/Sub to pass state and trigger a new cloud function to
      post the translation to storage.
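
The event chain above can be simulated locally to show how state is passed from one single-purpose function to the next. The sketch below is an assumption-laden illustration: `InMemoryPubSub` is a toy message bus and not the Cloud Pub/Sub client, the topic names are invented, and `extract_text` and `translate` are placeholder callables standing in for the Vision and Translate APIs.

```python
from collections import defaultdict, deque


class InMemoryPubSub:
    """Toy stand-in for a message bus; not the Cloud Pub/Sub client."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.queue = deque()

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        self.queue.append((topic, message))

    def run(self):
        # Deliver messages until no function publishes anything new.
        while self.queue:
            topic, message = self.queue.popleft()
            for handler in self.subscribers[topic]:
                handler(message)


def build_pipeline(bus, storage, extract_text, translate):
    """Wire three single-purpose functions together through the bus."""

    def on_image_uploaded(event):
        # Steps 1-2: the upload triggers text extraction.
        text = extract_text(event["image"])
        bus.publish("translate", {"name": event["name"],
                                  "lang": event["lang"],
                                  "text": text})

    def on_translate(message):
        # Step 3b: a second function translates the queued text.
        translated = translate(message["text"], message["lang"])
        bus.publish("save", {"name": message["name"],
                             "lang": message["lang"],
                             "text": translated})

    def on_save(message):
        # Step 3c: a third function posts the translation to storage.
        storage[f'{message["name"]}.{message["lang"]}.txt'] = message["text"]

    bus.subscribe("image-uploaded", on_image_uploaded)
    bus.subscribe("translate", on_translate)
    bus.subscribe("save", on_save)
```

Each function knows only the message it receives and the topic it publishes to, which is what lets the three pieces be deployed and replaced independently.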

Microservices design on App Engine

Microservices can be implemented as App Engine services:

● Full code isolation
● Can be written in different languages
● Code executed through HTTP invocation/RESTful API

However, there are some limitations:

● There are shared services that must be isolated in the application design
● One master app per project
● Multiple apps incur additional overhead

[Diagram: PROJECT-1 contains SERVICE-1 and SERVICE-2, each with Version-1 and
Version-2. Shared services: Memcache, Cloud Datastore, Task Queues]

https://cloud.google.com/appengine/docs/standard/python/microservices-on-app-engine
GCP 12-factor support

12-factor design is a popular methodology for developers to follow when
building modern web-based and cloud-based applications. Details: 12factor.net

GCP technologies support 12-factor design.

12-factor development tools and platforms in GCP

One codebase tracked in version control; many deployments

● Cloud Shell for building and deploying
● Cloud Source Repositories / GitHub support

Strictly separate build and run stages

● Build the App Engine app in Cloud Shell, upload to App Engine

Keep development, staging, and production as similar as possible

● Deployment Manager templates

Codebase: Put all your code into a source control system. Begin with all the code in a
single repository. As the code grows in complexity, move the code for specific parts of
the application into separate repositories. In a distributed application, code that
communicates to other code is an indicator that it should be considered for a separate
repository. Cloud Source Repositories:
https://cloud.google.com/source-repositories/

Build, Deploy, Run: The idea is that after an outage, the application should come
back without human intervention.
● Build: The process that wraps the code into a package of scripts, binaries,
and assets.
● Deploy (Release): Sends the package to the servers/services along with
separate configuration for the environment.
● Run: Runs the code. Should be simple and reliable.
App Engine Tutorials:
https://cloud.google.com/appengine/docs/standard/python/tutorials

Dev / Prod Parity: It is common to have rapid development and deployment cycles,
making changes to your application and deploying them within hours. Keep the
development and production environments as similar as possible to reduce the area
of vulnerability where issues could arise. Use the same backing services, same
configuration techniques, and same versions. Deployment Manager Templates:
https://cloud.google.com/deployment-manager/docs/step-by-step-guide/create-a-template

12-factor software design infrastructure services in GCP

Explicitly declare and isolate dependencies

● Custom images

Store config in the environment

● Metadata server, GCS

Maximize robustness with fast startup and graceful shutdown

● Instance templates, managed instance groups, and autoscaling

Dependencies: All systems have dependencies in every environment where they
run. Never let your application simply assume that these are present.

● Option 1: Ensure that they are present by "baking" them into the application.
● Option 2: At startup, list expected versions and download and update libraries
to the correct version so you know that the dependent resources are in place.
Eliminate guessing by making this an explicit and dynamic process.
Custom images:
https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images
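
As a rough illustration of Option 2, the startup check below compares a declared list of expected versions with what is actually installed. It is a sketch under assumptions: the function name and the shape of the `expected` mapping are hypothetical, it only reports problems (downloading or updating is left out), and it relies on Python's standard `importlib.metadata` module.

```python
from importlib import metadata


def find_dependency_problems(expected, get_version=metadata.version):
    """Return {package: problem} for every missing or mismatched package.

    `expected` maps a package name to the exact version the app declares.
    `get_version` is injectable so the check can be exercised without
    installing anything.
    """
    problems = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"missing (expected {wanted})"
            continue
        if found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems
```

An application could call this first thing at startup and refuse to serve (or trigger an update step) when the returned dictionary is not empty, which makes the dependency assumption explicit instead of implicit.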

Config: Configuration data should be stored separately from the code and read at run
time. Anything that might vary between environments should be in the environment:
location of a resource, logging or debugging settings, and usernames and passwords.
Metadata Server: https://cloud.google.com/compute/docs/storing-retrieving-metadata
Another place to store configuration data is in GCS, which provides multiple region
access, reliability, and features like customer-supplied encryption.
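
A minimal sketch of the Config factor follows: the code reads everything that varies between environments from environment variables at run time. The variable names (`DATABASE_URL`, `LOG_LEVEL`, `DEBUG`) and the `Config` fields are assumptions chosen for illustration; the same pattern would apply to values fetched from a metadata server or a storage bucket.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    database_url: str   # location of a resource
    log_level: str      # logging setting
    debug: bool         # debugging setting


def load_config(env=None):
    """Build the configuration from the environment, never from the code."""
    env = os.environ if env is None else env
    missing = [key for key in ("DATABASE_URL",) if key not in env]
    if missing:
        # Fail fast instead of assuming a default for a required setting.
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return Config(
        database_url=env["DATABASE_URL"],
        log_level=env.get("LOG_LEVEL", "INFO"),
        debug=env.get("DEBUG", "false").lower() in ("1", "true", "yes"),
    )
```

The same build can then be deployed to development, staging, and production; only the environment differs.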

Disposability: The idea is that when part of your application starts, it should quickly
be able to start serving: so design it to avoid doing a lot of setup work at startup,
which can cause complexity and delays in scaling. Store state in high speed
databases or cache for fast recovery. Avoid mandatory cleanup processes that could
harm the application if they don't complete during a crash scenario.
Instance Templates: https://cloud.google.com/compute/docs/instance-templates
Instance Groups: https://cloud.google.com/compute/docs/instance-groups/
Autoscaling: https://cloud.google.com/compute/docs/autoscaler/

12-factor "store state in the environment" has tradeoffs

(1 ms = 1,000,000 ns)

Operation                                      Time in ns    Time in ms
L1 cache reference                                      1
Branch misprediction                                    3
L2 cache reference                                      4
Mutex lock/unlock                                      17
Main memory reference                                 100
Compress 1 kB with Zippy                            2,000         0.002
Send 2 kB over 1-Gbps network                       2,000         0.002
Read 1 MB sequentially from memory                  7,000         0.007
SSD random read                                    16,000         0.016
Read 1 MB sequentially from SSD                   123,000         0.123
Round trip within same data center                500,000         0.5
Read 1 MB sequentially from disk               10,000,000        10
Read 1 MB sequentially from 1-Gbps network     10,000,000        10
Disk seek                                      10,000,000        10
Send packet CA->Netherlands->CA               150,000,000       150

10 / 0.123 = 81 times slower!

Look at the difference, for example, between reading 1 MB sequentially from a local
SSD and reading it over a 1-Gbps network. That's 123,000 nanoseconds versus
10,000,000 nanoseconds. The 12-factor advice is to store configuration information
and state information separately from the processing, that is, separately from the
VM. Moving that data from the SSD to networked storage is immediately 81 times
slower. This is an example of the real cost of choosing reliability over speed.
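
The "81 times slower" figure follows directly from the two table values quoted above. The short calculation below reproduces it; the variable names are illustrative only.

```python
NS_PER_MS = 1_000_000

ssd_read_ns = 123_000         # read 1 MB sequentially from SSD
network_read_ns = 10_000_000  # read 1 MB sequentially from a 1-Gbps network

# Convert to milliseconds to match the second column of the table.
ssd_read_ms = ssd_read_ns / NS_PER_MS
network_read_ms = network_read_ns / NS_PER_MS

slowdown = network_read_ns / ssd_read_ns
print(f"{network_read_ms:g} ms / {ssd_read_ms:g} ms = {slowdown:.1f}x slower")
```

The ratio is about 81.3, which the slide rounds to 81.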
Mapping Compute Needs to Platform Products

Business logic (the application) uses CPU

Where does it get compute resources?

Which platform can service speed, reliability, and scale with the least amount
of effort?

If you can use a native GCP service, you may not need to design anything fancier.

App Engine is code-first so it is easy to use to create new applications. It
autoscales. It is highly available and reliable. App Engine is often a sufficient
solution on its own.

Start with App Engine. If you identify exceptions that can't be handled by App Engine
standard environment, look at containers in both App Engine flexible environment and
Kubernetes Engine. If those don't handle the exceptions, look to Compute Engine.
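
The escalation path above can be written down as a small decision helper. This is only a sketch of the ordering described in the notes: the function name and the four boolean flags are assumptions about what an "exception" might look like, not an official selection rule.

```python
def choose_compute_platform(needs_specific_os=False,
                            needs_hardware_access=False,
                            containerized=False,
                            needs_platform_control=False):
    """Start with App Engine and escalate only when a requirement forces it."""
    if needs_specific_os or needs_hardware_access:
        # Neither App Engine nor containers handle these exceptions.
        return "Compute Engine"
    if containerized:
        # Both container options apply; control over the platform decides.
        if needs_platform_control:
            return "Kubernetes Engine"
        return "App Engine flexible environment"
    return "App Engine standard environment"
```

With no special requirements the helper returns the App Engine standard environment, which matches the advice to begin there.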

App Engine

Code first

Focus on programming, minimize IT work

Minimize operations overhead

Have scale and reliability handled by the platform

Containers can be run on App Engine flexible environment

https://cloud.google.com/docs/choosing-a-compute-option
App Engine use cases: Web sites, mobile app and gaming backends, RESTful APIs,
Internal Line of Business (LOB) apps, Internet of things (IoT) apps.
Kubernetes Engine

Platform independence

Separate application from OS

No OS dependencies

Already using Kubernetes and need to scale

Application can be containerized

The choice for containers is between App Engine flexible environment (which can run
containers) and Kubernetes Engine. App Engine flexible environment is code-first, the
platform is proprietary, and some elements are not exposed or controllable by the
application. Kubernetes Engine is container-first, the platform is open, and some
elements of the platform are not handled automatically for you, so you should plan on
doing some IT work.
Kubernetes Engine use cases: Containerized workloads, cloud-native distributed
systems, hybrid applications.
Compute Engine

Migrating an application from a Data Center without rewriting it

Dependencies on a specific OS

Required to use an existing VM image

Direct hardware access (GPUs, SSDs)

Driver-level access

Hardware performance is critical

In general, if you already know which is more important in your design: (1)
infrastructure control, (2) code development, or (3) a balance of both, then you should
consider (1) Compute Engine, (2) App Engine, or (3) Kubernetes Engine,
respectively.

Compute Engine use cases: Any workload requiring a specific OS or OS
configuration; currently deployed, on-premises software that you want to run in the
cloud.
Compute System Provisioning

Deciding how the system will acquire new compute resources and adapt to changing
requirements.

App Engine is an autoscaling platform.

Kubernetes Engine autoscales based on containers and pods within the cluster.

So this section applies to Compute Engine and VMs.

How to grow the thumbnail service?

[Diagram: vertical scaling compared with horizontal scaling]

Do you have a funeral when the server dies?


In the vertical scaling model, the answer is "yes". Your users and your service are
attached to and dependent on one piece of hardware.
In the horizontal scaling model, the answer is "no". One server loss is only part of the
service, and other parts can pick up the load and keep running.

Small servers (horizontal scaling)


● Easy to schedule (e.g., binpack)
● Decrease per-instance failure cost (capacity, recovery, etc.)
● Incremental scaling
● Cheaper at large scale
Large servers (vertical scaling)
● Lowers overhead (unlikely to matter)
● Cheaper at small scale, maybe
Horizontal scaling issues and answers

More server lifecycles to manage (deployment complexity)

● Automation makes this easy

End-to-end latency increases slightly

● Requirements will indicate whether the latency matters

More overhead, but unlikely to matter

● Outweighed by the benefits of decoupling scaling, failures, upgrades,
configuration, and so forth

A hard disk might have a 4-million-hour average time to failure. At cloud scale this
means hard disks are constantly failing.
That means there are more server lifecycles to manage. But virtualization and
automation effectively solve this problem.

To manage traffic to multiple servers, there must be a decision point, such as a load
balancer. That slightly increases latency to the server compared with a vertically
scaled single-server approach.

There is more overhead in operating and maintaining a lot of servers. But this is
unlikely to matter compared to the benefits of being able to add capacity and to
handle individual server failures without impacting the overall availability of the service
to the users.
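
To see why a 4-million-hour average time to failure still means constant failures at scale, consider the back-of-the-envelope calculation below. It assumes a constant failure rate and a hypothetical fleet of one million disks; neither assumption comes from the slides.

```python
MTTF_HOURS = 4_000_000  # average time to failure quoted in the notes


def expected_failures_per_day(disk_count, mttf_hours=MTTF_HOURS):
    """Expected disk failures per day, assuming a constant failure rate."""
    return disk_count * 24 / mttf_hours


# Hypothetical fleet sizes, for illustration only.
for disks in (1, 10_000, 1_000_000):
    print(f"{disks:>9,} disks: {expected_failures_per_day(disks):.6f} failures/day")
```

A single disk almost never fails on a given day, but under these assumptions a fleet of one million disks loses about six disks every day, so replacement has to be an automated routine rather than an event.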
Horizontal scaling design

Keep servers simple: Do one thing well

● Minimize complexity
● Construct simple and concise APIs
● Identify where tasks are separable
● Split into separate servers

Prefer small, stateless servers

● Easy to scale; no state to shard/rebalance
● Failure is cheap; no state to migrate/recover
● Easy to load balance; no hot-spotting

[Diagram: Cloud Load Balancing receives N qps (queries per second) and sends N/3 qps
to each of three servers. For stateful servers: ranges of keys mapping to server]
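
The N qps to N/3 qps split is easy to demonstrate when the servers hold no state, because any request can go to any server. The sketch below uses plain round-robin as an assumed policy; it is not a description of how Cloud Load Balancing chooses backends, and the function and server names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def distribute(requests, servers):
    """Assign each request to the next server in turn (round-robin)."""
    assignment = Counter()
    next_server = cycle(servers)
    for _ in requests:
        # Stateless servers: no key-to-server mapping is needed.
        assignment[next(next_server)] += 1
    return assignment


# 300 requests in one second across three servers: 100 qps each.
load = distribute(range(300), ["server-a", "server-b", "server-c"])
```

Adding a fourth server only changes the list passed in; there is no state to shard or rebalance, and no server becomes a hot spot.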

Be very careful of caching in horizontal scaling designs. Stateless servers can serve
stale copies of data stored at stateful servers if the cache is not properly invalidated.
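
The stale-cache risk can be reproduced with a toy model. In this sketch the `Store` class stands in for a stateful server and `CachingServer` for a stateless front end with a local cache; both names and the version-check scheme are assumptions made for illustration, and a real design might instead use expiry times or explicit invalidation messages.

```python
class Store:
    """Hypothetical stateful server: the source of truth."""

    def __init__(self):
        self.data = {}
        self.version = {}

    def write(self, key, value):
        self.data[key] = value
        self.version[key] = self.version.get(key, 0) + 1


class CachingServer:
    """Hypothetical stateless server that keeps a local cache."""

    def __init__(self, store):
        self.store = store
        self.cache = {}  # key -> (version, value)

    def read(self, key, validate=True):
        cached = self.cache.get(key)
        if cached and (not validate or cached[0] == self.store.version[key]):
            return cached[1]
        # Cache miss or invalidated entry: fetch the current value.
        value = self.store.data[key]
        self.cache[key] = (self.store.version[key], value)
        return value
```

Reading with `validate=False` after a write returns the old value, which is exactly the stale copy the note warns about. Validation fixes that at the cost of a version lookup on every read, so the tradeoff between freshness and latency remains.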
Tradeoffs: Balance latency, capacity, scalability, and cost

Small stateless servers increase        Large stateful servers reduce
reliability and scalability             complexity and latency

Divide into parts                       Unify
Duplicate and coordinate                Simplify and consolidate
Separate and isolate                    Coalesce and colocate

Methods of achieving balance in your design

● What are your SLOs? What do your users value?
● What is the optimal size and number of parts?
● Sometimes central control is necessary/optimal
● Plan on adjusting and build adjustment processes

The previous slides make a strong case for small stateless servers. However, as
mentioned in "Defining the service", you have to consider the design in context. And
no design solution fits every case. Determine whether that design makes sense for
specific components of your service.

The history of computing shows many periods when design skewed towards
centralization and unified control, followed by periods when design skewed towards
decentralization and distributed systems. Choosing the degree of centralization or
decentralization is not a philosophical debate, but a practical necessity. You need
to strike a balance between these two by examining each part of your application and
balancing the tradeoffs in terms of latency, capacity, scalability, and cost. It is possible
to divide an application into small enough parts that it becomes unmanageable and
unmaintainable. Strike the right balance using measurement—SLOs and SLIs—to
determine what solution is optimal for your users.
This also means that your initial design may need to be adjusted as you get actual
measurable feedback. That's okay.
Build that feedback and adjustment process into your design of both the technical and
behavioral (human, operations, process) elements.
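
Measurement against an SLO can be as simple as computing one indicator from request data and comparing it with a target. The sketch below assumes a latency-based SLI (the fraction of requests served within a threshold); the function names, the 500 ms threshold, and the 95% target are hypothetical values chosen for illustration.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Fraction of requests served within the latency threshold."""
    if not latencies_ms:
        raise ValueError("no measurements")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, slo_target):
    """True when the measured indicator reaches the objective."""
    return sli >= slo_target


# Hypothetical measurements: three of four requests finish within 500 ms.
sli = latency_sli([100, 200, 300, 900], threshold_ms=500)
needs_adjustment = not meets_slo(sli, slo_target=0.95)
```

When the check fails, that is the measurable feedback that should trigger the adjustment process, for example splitting a component further or consolidating two that are too chatty.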
Design first, dimension later

Trying to dimension the solution before the design is completed and before it is iterated
and evolved can lead to confusion. The same is true of cost optimization.

How many machines of what capacity?

● Network: queries per second or bandwidth
● Memory: data stored in memory for speed; MB or GB
● Storage: data stored on local disk (PD or SSD); GB or TB

How can the cost be minimized?

Great questions. They just come later in the design process.

You arrive at the design by thinking about the qualities that you want in your
system: what parts you want to scale and how you get reliability into the system.
Then you handle dimensioning separately.
See the current documentation for the current sizes and prices of GCP resources.
The photo service is slow

This section continues developing the design of the thumbnail photo service.

What is the cause of the slow service?

What can be done about it?

What lessons does this offer for design?

Business problem: Users complain that the service is slow

Business Logic

So the problem is that this process is slow, and this is the users’ experience of the
service. Let’s think about what you can do.
PROCESS
Systematic logical troubleshooting
1. Segment and reduce the problem space.
2. Step through the system manually in your mind.
3. Add more monitoring or logging.

Intermittent failures can be caused by multiple simultaneous factors.


It’s necessary to have a systematic and logical process for troubleshooting. Start by
segmenting and reducing the problem space. Where could slow-downs affect the user
experience? Maybe every component does, maybe none do.
Step through the system manually in your mind and get an idea of what the business
logic looks like.
Add monitoring or logging if you’re not getting that information. Once your
experience no longer supports you, you need to rely on actual data to troubleshoot
the problem, and without monitoring there is none. So this is where you ask
yourself: do I have the ability to add more information to this so I can troubleshoot
it in the future?

Reference: SRE Book: Chapter 12 - Effective Troubleshooting
Reference: SRE Book: Chapter 16 - Tracking Outages

Each step should reduce the set of possible causes
● Don't check hypotheses at random; partition the problem space
Leverage system knowledge to step through the system operation mentally
● Keep diagrams and docs updated
● Sanity check expected component behavior
If you can't identify the root cause, add more monitoring or logging
Intermittent failures and performance degradation are the hardest to troubleshoot
● Can be multiple simultaneous factors, appearing random
● Historical sub-operation performance data helps diagnose performance issues
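
Historical sub-operation performance data only exists if each step is timed as it runs. The sketch below is one minimal way to collect it in Python. The `timed` helper, the stage names, and the in-memory `timings` store are all hypothetical; a real service would send these measurements to its monitoring system rather than keep them in a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: stage name -> list of durations in seconds.
timings = defaultdict(list)


@contextmanager
def timed(stage):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)


def slowest_stage(data):
    """Return the stage name with the highest mean duration."""
    return max(data, key=lambda stage: sum(data[stage]) / len(data[stage]))


# Example: time two placeholder sub-operations (sleep stands in for real work).
with timed("ingest"):
    time.sleep(0.01)
with timed("thumbnail_conversion"):
    time.sleep(0.05)

print(slowest_stage(timings))
```

Comparing per-stage means over time is what lets each troubleshooting step narrow the set of possible causes instead of guessing.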
PROCESS
Collaboration and communication

● Five Whys
● Being a hero can lead to longer downtime
● Closed-group conversations can cause confusion instead of coordination
● People are never the root cause. Thinking they are...
○ stops analysis early
○ leads to fixing the wrong things


Collaboration and communication are important because troubleshooting rarely
involves just one person. Use the Five Whys iterative interrogation technique to explore possible
cause-and-effect relationships. Unfortunately, being a hero can lead to longer
downtime, and closed-group conversations can cause confusion rather than
coordination. How are others going to benefit from the experience of this
troubleshooting if there’s no collaboration? History shows that people are never the
root cause of a problem. Thinking they are will often lead to the analysis process
ending prematurely and the wrong things being fixed. In addition, focusing on the
individual is counter-productive when trying to encourage collaboration.

Don't try to "fix it yourself" or "be a hero"; it can lead to longer downtime
● Spend 15-20 minutes looking for a quick fix, then declare an incident
Keep communications regular and broad
● Closed-group conversations can cause confusion and a lack of coordination
among incident responders
Don't consider people as the root cause. Ask the right questions.
● Blaming people stops analysis before finding the real root cause
● Blaming people leads to fixing the wrong things

https://pixabay.com/en/building-blocks-insert-2065238/
https://pixabay.com/en/meeting-together-cooperation-1015316/
Useful (blameless) questions to ask about people and processes:

How can you make the systems, tools, and processes more immune to human
fallibility?
How can you give people better information to make decisions?
Was the information they had flawed or misleading?
Is there an automated way to fix the information?

The system should not have been able to fail this way.
Could software have prevented or mitigated this error?
Can this activity be automated so it doesn't require human intervention?

How likely is it that the next person could cause the same problem?
Could a new hire have made this mistake?
What is going on inside the photo thumbnail sample application?
Business Logic

Ingest → Image Storage → Thumbnail Conversion (Processing) → Thumbnail Storage → Serving Thumbnails

User Experience


So let’s break down the business logic behind your photo service: You have the user
experience;
you ingest data;
you store it;
you do some kind of processing;
you store it again;
and then you serve that back up to the user. So if the user experience is slow, there's
something going on internally with one of these systems. The first thing that comes to
mind is usually based on your experience. But do you understand the different
attributes of each of these services? Well, let's start by identifying those.

https://pixabay.com/en/countryside-tree-landscape-sunlight-2175353/
What are the characteristics of each of the functions?
Business Logic
Single App Server Solution

User Experience
● HTTP Server
● Dynamic / Static Content
● Session Handling

Ingest
● HTTP
● High Throughput
● Low disk I/O

Thumbnail Conversion
● High CPU
● Memory
● Low disk I/O

Image Storage
● Large files
● Small files
● Many files
● High r/w

Serving Thumbnails
● HTTP
● High read I/O
● Dynamic / Static content
● Session Handling


So looking at the user experience, you know there's a web server. Web servers are
pretty fast, but there’s dynamic and static content. So you might want to determine
things like: Is the web server responding? Is it DNS servers? Is there some kind of
rogue JavaScript, for example? What about session handling? Well, you’re on a
single box so you don't have to worry about that now, but that could come into play
later on. What if retries have to occur because you lost session info?
Now you get into ingest. So, ingesting is also going to be done through an HTTP
protocol. You know that you're going to do high throughput writes to disk. All you're
doing is writing long streams of large images, so the disk I/O - the actual transactions
- shouldn't be a problem.
Thumbnail conversions are going to consume a lot of the machine's resources. They're
going to be very CPU intensive and will consume a lot of memory. Disk I/O is probably
low, though, because all the conversion has to do is read the image into memory, process
it, then generate an output.
What about the image storage? Well, this could be large and small files, so you might
keep those copies around. There's going to be a lot of files, so you might have to
worry about file systems. How do you keep track of millions or hundreds of millions of
files? It's going to be very disk I/O intensive, and that could be a potential issue
too.
And then you have considerations in serving that thumbnail back.
Now this might be quick to troubleshoot, and you might draw initial conclusions, but
what you initially identify might not be the problem.

https://pixabay.com/en/countryside-tree-landscape-sunlight-2175353/
Segregate services for better performance and scalability.

Business Issue: User experience is being impacted by the thumbnail conversion
processing.

● Offloaded the high CPU process to another service.
● Reduced memory allocation on the web servers.
● Reduced some disk I/O.
● Moved image storage to the thumbnail processor.

Web Server (User Experience): Ingest, Serving Thumbnails
App Server (Thumbnail Processing): Image Storage, Thumbnail Conversion (Processing), Thumbnail Storage

The business issue is that the user experience is being impacted by the thumbnail
conversion process. So what do you do? Well, why don't you offload that high CPU
process to another service? What is the benefit, though, and how does it change our
attributes?

Well, you’ve reduced the memory allocations on the web servers, so now you can
have more sessions on the web servers. You've removed the conversion's small
contribution to disk I/O, and the web servers hardly use any disk I/O themselves,
so that's okay. You've also moved the image storage to the thumbnail
processor. So now the web server itself doesn’t need to hold as much local storage,
because storage and processing are combined on the app server thumbnail processing machine.

The web server now handles the user experience, both ingesting images and serving
thumbnails. The image storage, the processing, and the output thumbnail storage are
handled by a separate machine. This makes logical sense, and it's simple. You're not
over-complicating it. This is a natural, logical design. In fact, you can work through
this on paper before you do it in production. It doesn't mean you have to keep
deploying in this way, but you're using these processes.

https://pixabay.com/en/countryside-tree-landscape-sunlight-2175353/
Objectives and Indicators

Objectives                                            Indicators
Availability, 23/24 hours/day = 95.83% availability   Aggregated server up/down time
99% of user operations completed in < 1 minute        End-to-end latency


What about our service level objectives and indicators? Even though we added more
servers, our availability SLI remained the same. We’re looking at the aggregated
server uptime instead of a specific server, but in reality the SLI is still just the measure
of the service uptime from the user’s perspective.

The real change is that we’re going to add a new performance-based SLI to the
service. We determined that the users require our service to respond in under one
minute. Therefore, our new SLO will be: “Complete the user’s operation in less than
one minute” and we will use the end to end latency as our SLI, which represents the
overall latency that the user experiences while using the service. In this scenario, we
could find this by evaluating the latency of the HTTP requests.
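
As a rough illustration of how the two objectives could be checked against measurements, here is a minimal Python sketch. The function names and the sample latencies are hypothetical; only the 23/24 availability figure and the "99% of operations in under one minute" objective come from the slide.

```python
def availability(up_hours, total_hours):
    """Fraction of time the service was up."""
    return up_hours / total_hours


def fraction_within(latencies_s, threshold_s):
    """Fraction of operations that completed in under threshold_s seconds."""
    return sum(1 for t in latencies_s if t < threshold_s) / len(latencies_s)


# Availability objective from the slide: up 23 of every 24 hours.
target_availability = availability(23, 24)
print(f"{target_availability:.2%}")  # 95.83%

# Hypothetical end-to-end latencies, in seconds, for five user operations.
sample_latencies = [2.0, 14.5, 59.0, 61.2, 8.3]
meets_latency_slo = fraction_within(sample_latencies, 60) >= 0.99
print(meets_latency_slo)  # False: only 4 of 5 (80%) finished in under a minute
```

The indicator (SLI) is the measured value, such as the fraction of operations under 60 seconds; the objective (SLO) is the threshold that value is compared against.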
YOUR TURN

Design challenge #1
Log aggregation


https://pixabay.com/en/the-strategy-win-champion-1080527/
Introducing log files for the photo thumbnail service

Single App Server Solution

Log files (one record = 288 B)

Field                   Size
ID                      8 B
Entry ID (e.g. 12345)   16 B
Timestamp               8 B
Payload                 256 B


When we were logging data for the thumbnail service, we were storing
the log files on the local machine. Now, what do the log files look like? Well, each
entry has an ID for the log entry; a session ID to help us map the session to a user; the
timestamp; then a payload, which consists of application-specific information.
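
To make the 288-byte record concrete, the following Python sketch packs and unpacks one entry with the standard `struct` module. The slide gives only the field sizes (8 B, 16 B, 8 B, 256 B); the field types, the big-endian byte order, the null padding, and the function names are assumptions made for illustration.

```python
import struct

# Assumed layout (not specified in the source): big-endian, 8-byte unsigned ID,
# 16-byte entry/session ID, 8-byte unsigned timestamp, 256-byte payload.
# Short byte strings are null-padded by struct to the fixed field width.
RECORD = struct.Struct(">Q16sQ256s")


def pack_entry(log_id, entry_id, timestamp, payload):
    """Serialize one log entry into a fixed-size 288-byte record."""
    return RECORD.pack(log_id, entry_id.encode(), timestamp, payload.encode())


def unpack_entry(raw):
    """Parse a 288-byte record back into its four fields."""
    log_id, entry_id, timestamp, payload = RECORD.unpack(raw)
    return (log_id, entry_id.rstrip(b"\0").decode(),
            timestamp, payload.rstrip(b"\0").decode())


raw = pack_entry(1, "12345", 1533686400, "GET /upload 200")
print(RECORD.size)  # 288
print(unpack_entry(raw))
```

A fixed-size record keeps storage estimates simple: the size of a log file is the record count multiplied by 288 bytes.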
Log data is now segregated

Business Issue: For proper troubleshooting the segregated log files must be joined.

Design challenge:
Combine log files from separate locations and join
them together into a single log.

Web Server log files + App Server (Thumbnail Processing) log files → Joined log files

Web record: ID (8 B) | 12345 (16 B) | Timestamp (8 B) | Payload (256 B) = 288 B
App record: ID (8 B) | 12345 (16 B) | Timestamp (8 B) | Payload (256 B) = 288 B

Common field: 12345


Recall that each log entry contains an Entry ID, timestamp and payload. The objective
of the challenge is to design a system that appends log entries of type Web + App
based on a shared Entry ID.
Developing one solution together


Remember, there are multiple valid solutions to this challenge.
The class is going to walk through one of these solutions together.
In later exercises you will be challenged to develop or add to the design and compare
your solution to the elements and reasoning in the sample solution.
Logs on two servers, aggregate to a single log

Web Server (Web Logs) + App Server (App Logs) → Logging Server (Logs)

Logging Server, Daily Cron Batch Job:
Web Logs + App Logs → Ingest → Append → Transform → Output


When the design migrated to two servers, the log files ended up on two different
servers. For troubleshooting, the log files should be re-integrated into a single
joined log file.

The slide shows the business logic defined for the new Logging Server component:

Ingest
Append
Transform
Output
Business logic
Log entries: the Entry ID field is shared.

Record layout: ID | Entry ID | Timestamp | Payload

Webserver Logs (A): ID A | 12345 | Timestamp A | Payload A
App Logs (B):       ID B | 12345 | Timestamp B | Payload B

APPEND (on matching Entry ID)

Webserver + App logs: ID A | 12345 | Timestamp A | Payload A | Timestamp B | Payload B


There are two types of log entries: Webserver Logs (identified as type 'A') and
Application Logs (App Logs, identified as type 'B').

The business logic is "Append", which joins parts of the App Log entry to the
matching Webserver Log entry.
Each log entry has an Entry ID, a timestamp, and some entry-specific payload.

The output is a series of appended logs.
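
The "Append" step described above can be sketched in a few lines of Python. This is one possible reading, not the course's reference solution: the `LogEntry` type and `append_logs` function are hypothetical, and the sketch assumes at most one App Log entry per Entry ID and silently skips Webserver entries that have no match.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    log_id: int
    entry_id: str
    timestamp: int
    payload: str


def append_logs(web_logs, app_logs):
    """Join each web entry (type A) to the app entry (type B) sharing its Entry ID."""
    # Assumption: Entry IDs are unique within the app logs.
    app_by_entry = {b.entry_id: b for b in app_logs}
    joined = []
    for a in web_logs:
        b = app_by_entry.get(a.entry_id)
        if b is not None:
            # Keep A's ID and the shared Entry ID, then append B's timestamp and payload.
            joined.append((a.log_id, a.entry_id, a.timestamp, a.payload,
                           b.timestamp, b.payload))
    return joined


web = [LogEntry(1, "12345", 100, "upload received"),
       LogEntry(2, "99999", 101, "no matching app entry")]
app = [LogEntry(7, "12345", 105, "thumbnail created")]
print(append_logs(web, app))
```

Indexing the App Logs by Entry ID first means each Webserver entry is matched with a single dictionary lookup rather than a scan of the whole App Log.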


Output log

● App Log = 25 thousand records / day
● The Payload for A and B records has a maximum size of 256 B

Output record (552 B total):

Field                   Size
ID A                    8 B
Entry ID (e.g. 12345)   16 B
Timestamp A             8 B
Payload A               256 B
Timestamp B             8 B
Payload B               256 B
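
These figures give a quick upper bound on daily output volume. The Python sketch below does the arithmetic, assuming one joined record per App Log record and every payload at its 256 B maximum; both assumptions are for illustration and are not stated on the slide.

```python
# Field sizes of one joined output record, in bytes (from the slide).
FIELD_SIZES = {
    "id_a": 8,
    "entry_id": 16,
    "timestamp_a": 8,
    "payload_a": 256,
    "timestamp_b": 8,
    "payload_b": 256,
}

record_bytes = sum(FIELD_SIZES.values())
records_per_day = 25_000  # App Log volume from the slide

# Assumption: every App Log record produces exactly one joined record.
daily_bytes = record_bytes * records_per_day

print(record_bytes)       # 552
print(daily_bytes)        # 13800000
print(daily_bytes / 1e6)  # 13.8 (megabytes per day)
```

At roughly 13.8 MB per day, the joined log is small enough that a daily batch job on a single logging server is a reasonable starting point.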

GCP lab

Deployment Manager: Package and deploy

Lab 2: How to customize an instance, install software and run an application at boot from
Deployment Manager. (Echo application).

In the lab, you will be working with the simplest service possible: an echo application.

In this lab you will learn to take the pre-written Python echo application, which uses
application framework libraries, and package it for deployment on the cloud using the
Python package manager.
In the previous lab you just started an instance. In this lab you will bring up an
instance and perform the customization necessary to update and install software,
host the echo application, and configure other elements in the
environment, such as networking.

You will build on this Deployment Manager experience in subsequent labs.


