High Availability
24 hours a day, 7 days a week, 365 days a year
Vik Nagjee
Product Manager, Core Technologies
InterSystems Corporation
Topics
What is High Availability (HA)?
Current HA strategies
What's coming?
Questions & Discussion
What is High Availability (HA)?
Reliability
Fault-tolerance
Continuity
Redundancy
High Uptime
Operational
Minimal Disruption

Availability %   Downtime per year   Downtime per month   Downtime per week
90%              36.5 days           72 hours             16.8 hours
95%              18.25 days          36 hours             8.4 hours
99%              3.65 days           7.20 hours           1.68 hours
99.9%            8.76 hours          43.2 minutes         10.1 minutes
99.99%           52.6 minutes        4.32 minutes         1.01 minutes
99.999%          5.26 minutes        25.9 seconds         6.05 seconds
99.9999%         31.5 seconds        2.59 seconds         0.605 seconds
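The downtime figures for each availability level follow from simple arithmetic. A minimal sketch, assuming a 365-day year, a 30-day month, and a 7-day week (these period lengths reproduce the slide's numbers; the function name `downtime_hours` is hypothetical):

```python
# Hours in each period, under the assumptions stated above.
HOURS_IN_PERIOD = {
    "year": 365 * 24,
    "month": 30 * 24,
    "week": 7 * 24,
}

def downtime_hours(availability_percent: float, period: str) -> float:
    """Return the downtime, in hours, permitted in `period` at the given availability."""
    unavailable_percent = 100 - availability_percent
    return HOURS_IN_PERIOD[period] * unavailable_percent / 100

if __name__ == "__main__":
    for level in (90, 99, 99.9, 99.999):
        print(level, round(downtime_hours(level, "year"), 4), "hours per year")
```

For example, 99.9% availability over a year leaves 8760 × 0.001 = 8.76 hours of downtime.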
High Availability vs. Disaster Recovery
High Availability = fault detection & correction procedures to
maximize availability of critical services and applications,
often in an automated fashion.
Disaster Recovery = process of preparing for recovery or
continuation of technology infrastructure critical to an
organization after a natural or human-induced disaster.
High Availability ≠ Disaster Recovery!
Current HA Strategies
Failover = Automatic switch to redundant system
Uses some type of heartbeat software (e.g., HACMP)
Current Failover Options:
Failover Clusters
Concurrent Clusters
ECP Clusters
With Failover Cluster for Database
With Concurrent Cluster for Database
Failover Clusters
One active system (PROD), and one standby
system (STDBY), with a heartbeat connection
Windows Cluster, IBM HACMP, Sun Cluster,
HP Serviceguard, Red Hat Cluster Suite,
Veritas Cluster Services
Needs shared disk for install directory, WIJ,
database files, and journal files
Users/Applications connect to a DNS name that is
mapped to PROD
In event of failure, 3rd party cluster software
fails Caché over to STDBY node
Caché performs recovery on STDBY node
before allowing connections - open transactions are
rolled back, open locks are released, etc.
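The heartbeat idea behind a failover cluster can be sketched in a few lines. This is an illustrative toy, not how HACMP, Serviceguard, or any other cluster product is implemented: the class `FailoverCluster`, its fields, and the single-timeout rule are all assumptions, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass

@dataclass
class FailoverCluster:
    heartbeat_timeout: float      # seconds of silence before the active node is declared failed
    active: str = "PROD"
    standby: str = "STDBY"
    last_heartbeat: float = 0.0   # time at which the active node was last heard from

    def record_heartbeat(self, now: float) -> None:
        """Note that the active node was heard from at time `now`."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Return the node that should be serving; swap roles if the heartbeat is stale."""
        if now - self.last_heartbeat > self.heartbeat_timeout:
            # Fail over: the standby becomes active. In a real deployment this is
            # the point at which recovery would run before connections are accepted.
            self.active, self.standby = self.standby, self.active
            self.last_heartbeat = now  # the new active node starts with a fresh heartbeat
        return self.active

if __name__ == "__main__":
    cluster = FailoverCluster(heartbeat_timeout=5.0)
    cluster.record_heartbeat(0.0)
    print(cluster.check(3.0))   # heartbeat still fresh
    print(cluster.check(6.0))   # heartbeat stale, roles swap
```

Real cluster software typically adds safeguards this sketch omits, such as requiring several missed heartbeats and fencing the failed node from the shared disk.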
Concurrent Clusters
AKA Caché Clusters
Can be configured on OpenVMS and
Tru64 UNIX
Two or more servers, each running an
instance of Caché and each with
access to all disks, concurrently
provide access to all data
Users connect to either one of the
clustered nodes; Caché provides data
and lock synchronization across nodes
If one machine fails, users can
immediately reconnect to any of the
remaining cluster nodes
Caché performs cluster-wide recovery
during failover; logical and physical
data integrity is maintained
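From the client side, reconnecting after a node failure amounts to picking any surviving node. A minimal sketch, assuming a hypothetical `is_up` health-check callable supplied by the caller (no real Caché API is used here):

```python
from typing import Callable, Optional, Sequence

def pick_surviving_node(
    nodes: Sequence[str],
    is_up: Callable[[str], bool],
    failed_node: str,
) -> Optional[str]:
    """Return the first node other than `failed_node` that reports as up, or None."""
    for node in nodes:
        if node != failed_node and is_up(node):
            return node
    return None

if __name__ == "__main__":
    status = {"nodeA": False, "nodeB": True, "nodeC": True}
    print(pick_surviving_node(list(status), lambda n: status[n], "nodeA"))
```

Because every node in a concurrent cluster has access to all disks, any surviving node is a valid target; the sketch does not model the cluster-wide recovery that runs during failover.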
ECP Clusters with DB as Failover Cluster
Enterprise Cache Protocol (ECP) provides a
distributed, tiered system
Typical configuration:
N+1 application servers
Users load-balanced across app
servers
If any app server goes down, users can be
reconnected to other remaining app servers
If database goes down, users on app
servers will experience pause while DB
failover completes (here DB is configured as
a failover cluster)
Application servers will reconnect after
database has performed recovery
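The N+1 application server arrangement can be illustrated with a toy load balancer. This is a sketch under stated assumptions: round-robin assignment and least-loaded redistribution are choices made for the example, and the function names `balance` and `fail_server` are hypothetical rather than part of ECP.

```python
from typing import Dict, List, Sequence

def balance(users: Sequence[str], servers: Sequence[str]) -> Dict[str, List[str]]:
    """Assign users to app servers round-robin."""
    assignment: Dict[str, List[str]] = {server: [] for server in servers}
    for index, user in enumerate(users):
        assignment[servers[index % len(servers)]].append(user)
    return assignment

def fail_server(assignment: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new assignment with the failed server's users moved to the survivors."""
    survivors = {s: list(u) for s, u in assignment.items() if s != failed}
    if not survivors:
        raise RuntimeError("no app servers remain")
    for user in assignment.get(failed, []):
        # Reconnect each displaced user to whichever survivor currently has the fewest users.
        target = min(survivors, key=lambda s: len(survivors[s]))
        survivors[target].append(user)
    return survivors

if __name__ == "__main__":
    users = [f"user{i}" for i in range(9)]
    before = balance(users, ["app1", "app2", "app3"])
    after = fail_server(before, "app2")
    print({server: len(assigned) for server, assigned in after.items()})
```

Sizing the tier as N+1 means the N surviving servers can still carry the full user load after a single app server failure.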
ECP Clusters with DB as Concurrent Cluster
Similar to previous example,
except DB server is configured
as a concurrent cluster
(OpenVMS or Tru64 UNIX)
App servers can connect to any
one of the nodes
If any node fails, the app
server(s) connected to that node
will reconnect to another
surviving node after failover
Caché performs cluster-wide
recovery during failover; logical
and physical data integrity is
maintained
High Availability: What's Coming?
Database Mirroring:
Delivers faster, automated failover
Eliminates requirement for shared disk configurations
Reduces dependency on 3rd party clustering software
Uses multiple redundant servers
Integrated ECP recovery
Database Mirroring
Multiple servers in Mirror Set - one is Primary,
others are Backup (1+)
TCP connections between mirror members
Primary PUSHES journal updates to Backups,
who ack and continuously de-journal
Primary role can flip from one server to another
within moments (automated failover)
All clients (except ECP) connect to a Mirror
Virtual IP; the mirror handles appropriate redirection
to current Primary
ECP is mirror-aware: app servers will
connect directly to current primary, and will fail
over to new primary as appropriate. ECP will
perform recovery on reconnection.
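The push, acknowledge, and de-journal flow described above can be modeled in memory. This is a toy sketch, not the Caché mirroring protocol: the classes `Member` and `MirrorSet`, the `(key, value)` journal record shape, and the promote-first-backup rule are all assumptions made for illustration, and no TCP traffic is involved.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]  # hypothetical journal record: (key, new value)

class Member:
    def __init__(self, name: str) -> None:
        self.name = name
        self.journal: List[Record] = []   # journal records received, in order
        self.data: Dict[str, int] = {}    # database state after de-journaling

    def receive(self, record: Record) -> bool:
        """Store a journal record, apply (de-journal) it, and acknowledge it."""
        self.journal.append(record)
        key, value = record
        self.data[key] = value
        return True  # the ack

class MirrorSet:
    def __init__(self, names: List[str]) -> None:
        self.members = [Member(name) for name in names]
        self.primary = self.members[0]

    def backups(self) -> List[Member]:
        return [m for m in self.members if m is not self.primary]

    def write(self, key: str, value: int) -> bool:
        """Apply an update on the primary and push it to every backup; True if all ack."""
        record = (key, value)
        self.primary.receive(record)
        return all(backup.receive(record) for backup in self.backups())

    def fail_over(self) -> str:
        """Remove the failed primary and promote the first remaining backup."""
        self.members.remove(self.primary)
        if not self.members:
            raise RuntimeError("no backup available to promote")
        self.primary = self.members[0]
        return self.primary.name

if __name__ == "__main__":
    mirror = MirrorSet(["serverA", "serverB"])
    mirror.write("balance", 100)
    print(mirror.fail_over(), mirror.primary.data)
```

Because each backup has already applied every acknowledged journal record, the promoted member holds the same data as the failed primary, which is what lets the role flip quickly without a shared disk.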
Wrap-up
Questions & Discussion