Best Practices and Sizing Guide For Smart Data Integration When Used in SAP Data Warehouse Cloud
This guide is intended for implementors of real-time replication using SAP HANA smart data integration
within SAP Data Warehouse Cloud and discusses the best practices that help achieve stable and performant
replication.
In SAP Data Warehouse Cloud, SAP HANA smart data integration based federation and replication
technology is used when federating or replicating data with Remote Tables. In such a case, a Connection
in SAP Data Warehouse Cloud is established based on an SAP HANA Remote Source that itself uses one of
the available SAP HANA smart data integration Adapters.
The relevant connection types in SAP Data Warehouse Cloud using SAP HANA smart data integration
technology are:
SAP Data Warehouse Cloud Connection Type    SAP HANA smart data integration Adapter
Oracle                                      OracleLogReaderAdapter
SAP BW                                      ABAPAdapter
SAP S/4HANA Cloud                           CloudDataIntegrationAdapter
THE BASICS
SAP HANA smart data integration implements a generic framework to connect to any remote sources (SAP
Data Warehouse Cloud: Connection) to browse metadata, federate/move data into SAP HANA On Premise
or SAP HANA Cloud / SAP Data Warehouse Cloud. It also adds the necessary functions to move change
data in near real-time.
To browse metadata, federate queries, and move data, it extends SAP HANA smart data access. The
data provisioning framework provides adapters to connect to remote sources. The adapters provided by SAP
are written in C++ and Java:
- The C++ adapters are hosted by SAP HANA – Data provisioning Server (aka DP server). In SAP Data
Warehouse Cloud this affects the “OData” and “SAP SuccessFactors for Analytical Dashboards”
connection types.
- The Java adapters are hosted by Data Provisioning Agent (aka DP agent) that runs outside of SAP
Data Warehouse Cloud / SAP HANA landscape. In SAP Data Warehouse Cloud, this affects all
other connection types mentioned above. Please find here the relevant documentation for preparing
the Data Provisioning Agent Connectivity in SAP Data Warehouse Cloud.
For real-time change data capture, SAP HANA smart data integration adapters can use different change
data capture technologies (SAP Data Warehouse Cloud uses only the so-called trigger-based change
data capture (CDC) option) and send the changes to SAP HANA DP server for further processing.
The materialization step is handled by SAP HANA TASK framework which executes SQL on virtual tables.
The requests to SAP HANA smart data integration adapters are routed automatically via SAP HANA DP
server. The change data is captured and pushed by adapter to SAP HANA DP server, where the changes
are processed and applied to the target table. Change data capture is expressed using an object called a
remote subscription.
In SAP Data Warehouse Cloud, all these objects (SAP HANA Virtual Table, target table for replication, SAP
HANA Task and SAP HANA Remote Subscription) are generated based on SAP Data Warehouse Cloud
Remote Tables and the respective user interactions, and they are not visible to end users.
All available modeling and administration actions are performed from the Remote Table Editor or Remote
Table Monitor.
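For reference, the SQL below is a simplified sketch of the objects that SAP HANA smart data integration works with underneath a Remote Table. In SAP Data Warehouse Cloud, equivalent objects are generated automatically when a Remote Table is deployed and set to real-time replication; the schema, object, and remote source names used here are purely illustrative.

-- Virtual table pointing to a source table exposed by the remote source (the Connection)
CREATE VIRTUAL TABLE "DWC_SCHEMA"."VT_ORDERS"
    AT "MY_REMOTE_SOURCE"."<NULL>"."SOURCE_SCHEMA"."ORDERS";

-- Target table that will hold the replicated data
CREATE COLUMN TABLE "DWC_SCHEMA"."ORDERS" LIKE "DWC_SCHEMA"."VT_ORDERS";

-- Remote subscription describing the change data to capture and where to apply it
CREATE REMOTE SUBSCRIPTION "DWC_SCHEMA"."SUB_ORDERS"
    ON "DWC_SCHEMA"."VT_ORDERS"
    TARGET TABLE "DWC_SCHEMA"."ORDERS";

-- QUEUE starts change capture, the initial load (SAP HANA task) materializes the data,
-- and DISTRIBUTE starts applying the captured changes to the target table
ALTER REMOTE SUBSCRIPTION "DWC_SCHEMA"."SUB_ORDERS" QUEUE;
ALTER REMOTE SUBSCRIPTION "DWC_SCHEMA"."SUB_ORDERS" DISTRIBUTE;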
Components
• Data Provisioning Agent (DPAgent) – Hosts the adapters used to connect to the data source and
sends data to SAP HANA
• Data Provisioning Server (DPServer) – The following are the critical subcomponents:
o Receiver – Receives data from the agent and stores it in persistent storage (disk)
o Distributor – Reads data from the Receiver’s queue and distributes it to the relevant subscriptions
o Applier – Receives data from the Distributor and applies the appropriate DML (insert, update,
delete, upsert) to the target
Remote Sources (SAP Data Warehouse Cloud: Connection) define the connection to the source; virtual
tables and remote subscriptions define the change data that the agent/adapter will send. An instance of
receiver, distributor, applier is started for each remote source. The change data sent by the agent is applied
in a transactionally consistent manner, serially and in the order it was received. The change data received
and stored in the persistent storage by the receiver is tied to the remote source.
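Outside of SAP Data Warehouse Cloud, in a plain SAP HANA context, the remote sources and remote subscriptions behind this pipeline can typically be listed through system views; the exact views and columns available depend on the SAP HANA version and your privileges, so treat the statements below only as a sketch (in SAP Data Warehouse Cloud the same information is surfaced in the Remote Table Monitor).

SELECT * FROM "SYS"."REMOTE_SOURCES";        -- remote sources (Connections) known to the system
SELECT * FROM "SYS"."REMOTE_SUBSCRIPTIONS";  -- remote subscriptions and their target tables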
DEPLOYMENT GUIDELINES
We generally recommend that you install the Data Provisioning Agent close to the data sources. If you have
data sources scattered across multiple geographical locations separated by distance and network latency,
you can deploy multiple Data Provisioning Agents. Install at least one in each geographical location.
The best performance is achieved when the agent is directly installed on the same server as the data source.
However, various operational and IT policy reasons may prevent you from installing the agent directly on the
data source servers. In these situations, we recommend that you install the Data Provisioning Agent on a
supported virtual machine close to the source. Installing the agents on the same machine as SAP HANA
server is NOT recommended.
When there is a firewall between the Data Provisioning Agent and the Data Provisioning Server in SAP
HANA, the connection is automatically configured to use JDBC/HTTPS mode. When using JDBC/HTTPS
mode, Gzip compression is used by default.
When the Data Provisioning Agent connects to the SAP HANA server over TCP/IP, compression is not
enabled by default, because the network latency is assumed to be negligible due to geographic proximity. If
the TCP/IP connection introduces significant network latency, configure compression using the Data
Provisioning Agent’s command-line configuration tool (in dpagentconfig.ini, set framework.compressData to 3
to enable compression for TCP connections).
The Data Provisioning Agent can be installed on various versions of Windows and Linux. The operating
system required for the Data Provisioning Agent may depend on the operating system of the data source
system, where applicable.
For the latest complete information about operating system support for the Data Provisioning Agent and data
sources, refer to the Product Availability Matrix (PAM).
IMPLEMENTATION GUIDELINES
Partitioning
To improve performance for large source objects, consider partitioning the target replication table and the
underlying task of a Remote Table. Task partitioning allows SAP HANA to read, process and commit the
partitioned virtual table input sources in parallel.
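Conceptually, a partitioned load splits the read from the virtual table into several range-restricted statements that can be processed independently, as in the sketch below; the table, virtual table, and partitioning column are hypothetical, and in SAP Data Warehouse Cloud the partitions are simply defined in the Remote Table editor.

-- Illustrative only: an initial load split into two partitions on a hypothetical column ORDER_DATE
INSERT INTO "DWC_SCHEMA"."ORDERS"
    SELECT * FROM "DWC_SCHEMA"."VT_ORDERS" WHERE "ORDER_DATE" <  '2023-01-01';
INSERT INTO "DWC_SCHEMA"."ORDERS"
    SELECT * FROM "DWC_SCHEMA"."VT_ORDERS" WHERE "ORDER_DATE" >= '2023-01-01';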
For more details, please refer to Creating Partitions for Your Remote Tables.
To reduce the amount of data loaded, you can also define filters; for more details, please refer to Restrict Remote Table Data Loads.
Think of a remote source as a pipe. The Data Provisioning Agent pushes change data into this pipe, and the
server then applies the data serially, in the order it was received, and in a transactionally consistent
manner. Using more remote sources therefore increases parallelism and speeds up replication.
Source tables should then be selected for replication from one of the remote sources, making sure that a
source table is not replicated in another remote source. In SAP Data Warehouse Cloud, this would involve
using several connections pointing to the same source with each source entity using one Remote Table in
one (and only one) of these connections.
Consider the following criteria for deciding how to distribute the intended set of source tables across the
remote sources:
• Size of transactions: large or small transactions, depending on the amount of change data per transaction.
Large transactions are more efficient to replicate since they can be batched up for the
apply process on the target.
• Type of change data: insert, update, delete, upsert.
Consecutive inserts/upserts or consecutive deletes can be batched up for the Applier process and are
therefore faster. Updates require a two-step process (first the old row is deleted, then the new row is
inserted) and are hence much slower.
• Row size: overall size of each row in table
• LOBs: Tables containing LOB columns typically require multiple trips to the source to retrieve the LOB
data. This overhead slows down replication.
• Criticality of a set of tables: group critical tables into one remote source
Note that the full table size does not matter for real-time replication. The main consideration for performance
is the volume of (unfiltered) change data per day and the type of changes.
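The daily change volume and the change mix can usually be estimated in the source system before distributing tables across connections. The query below is only a sketch against a hypothetical change log table (CHANGE_LOG with columns OPERATION and CHANGE_TIME); which table actually records changes depends entirely on the source system.

-- Share of inserts, updates, and deletes over the last day (hypothetical change log)
SELECT "OPERATION", COUNT(*) AS CHANGES_LAST_DAY
    FROM "SOURCE_SCHEMA"."CHANGE_LOG"
   WHERE "CHANGE_TIME" >= ADD_DAYS(CURRENT_TIMESTAMP, -1)
   GROUP BY "OPERATION";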
OPERATIONAL GUIDELINES
Suspend replication
For planned downtimes/maintenance, we recommend suspending replication prior to the maintenance
window and then resuming replication for normal operations.
In SAP Data Warehouse Cloud, you can do this by Pausing and Restarting Real-Time Replication.
Exceptions
Real-Time Replication exceptions for Remote Tables are displayed on connection level in the Remote Table
Monitor in SAP Data Warehouse Cloud. They need to be processed with a “Retry” action that replays the
failed transaction. It is important to process the exceptions as soon as possible.
Any replication exception causes the replication to stop for all subscriptions attached to that connection
(remote source). For trigger-based replication as used in SAP Data Warehouse Cloud, replication stoppage
means that the change data keeps accumulating in the source system.
In SAP Data Warehouse Cloud, you can perform this by Resuming Real-Time Replication After a Fail.
Assumptions
The SAP HANA smart data integration Data Provisioning Agent sizing approach for initial load and real-time
replication focuses on the simplest use case where data from one source SAP HANA system is replicated to
a single SAP HANA target system via the SAP HANA adapter without any complex data transformation.
Other variants, such as replicating data from multiple different source systems, utilizing similar log-based
adapters, or loading to multiple SAP HANA targets can be calculated based on this sizing information. You
can therefore extrapolate the requirements of the single SAP HANA smart data integration configuration to
calculate the overall expected capacity as described in this document. More advanced variants, such as
adapters that require more complex processing, may require additional information to determine
proper sizing.
Categorization of Replication Tables
As input for sizing SAP HANA smart data integration for an SAP HANA scenario, you need to analyze tables
which will be federated and/or replicated and classify them into categories. Determine the following
information for all tables (or at least for the most frequently modified ones, i.e. those with the most inserts, updates, and deletes):
a) The weighted average number of table columns (one value)
b) The weighted average record length (one value)
Based on the analysis, determine the appropriate category for the volume of data, either small (S), medium
(M), large (L), or extra-large (XL).
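As a sketch of this step, the query below computes both weighted averages from a hypothetical statistics table TABLE_STATS (one row per replication-relevant table with its number of columns, average record length in bytes, and daily modification rate); using the modification rate as the weight means that frequently changed tables dominate the result.

-- Weighted average number of columns and record length, weighted by daily modification rate
SELECT
    SUM("NUM_COLUMNS"   * "DAILY_CHANGES") / SUM("DAILY_CHANGES") AS WEIGHTED_AVG_COLUMNS,
    SUM("RECORD_LENGTH" * "DAILY_CHANGES") / SUM("DAILY_CHANGES") AS WEIGHTED_AVG_RECORD_LENGTH
FROM "SIZING"."TABLE_STATS";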
Example: A weighted categorization of the replication-relevant tables in the relevant system, based on their
characteristics and modification rate, results in a weighted category of M.
SMALL
Use Case: A small scenario with:
• One source system
• Up to 40 tables
• A weighted table size category of S-M
• Initial load of tables balanced based on SAP HANA target capacity
• Modification rate less than 1,500,000/hour
The example above fits here.
SAP Data Warehouse Cloud / SAP HANA Cloud target (for replication only):
• Single Remote Source / Connection
• ~ 1 additional CPU core
• < 1 GB memory (not including memory growth over time as data volume increases)

MEDIUM
Use Case: A midrange scenario with:
• Approximately 1-3 different source systems
• And/or up to 100 tables in total
• A weighted table size category of M-L
• Initial load of tables done sequentially across sources and balanced based on SAP HANA target capacity
SAP Data Warehouse Cloud / SAP HANA Cloud target (for replication only):
• Separate remote source(s) / connections for high volume modification rate tables
• ~ 2-4 additional CPU cores
• 1-2 GB memory (not including memory growth over time as data volume increases)

LARGE
Use Case: An upper mid-range scenario with:
• Up to 6 different source systems
• And/or up to 300 tables in total
• A weighted table size category of M-XL
• Initial load of tables done sequentially across sources and balanced based on SAP HANA target capacity
SAP Data Warehouse Cloud / SAP HANA Cloud target (for replication only):
• Separate remote source(s) / connections for high volume modification rate tables
• ~ 4-8 additional CPU cores
• 2-4 GB memory (not including memory growth over time as data volume increases)
Additional Insights
Initial load sizing is largely dependent on the SAP Data Warehouse Cloud / SAP HANA Cloud target
capacity. SAP Note 2688382 provides additional information for an SAP HANA On Premise context;
certain aspects of it can serve as useful background when configuring your SAP Data Warehouse Cloud
Tenant.
As noted in the scenario table above, for source tables with a high-volume modification rate (>5M/Hr), it is
recommended to create separate connections / remote sources to ensure near real-time data replication.
If your use case exceeds the parameters in the Large scenario above, either by number of source systems,
number of tables, number of tables to federate in parallel or modification rate volume, it may be necessary to
deploy multiple DPAgent instances, replicating the sizing guidelines referenced above for each DPAgent.
These sizing configurations are based upon the default DPAgent configuration settings. If modifications are
made to the configuration settings, the sizing considerations within this document could be impacted.
* Java processes can consume more than the Xmx setting for a variety of reasons. The JVM overhead can
range from just a few percent to over one hundred percent more than the Xmx setting. We do not
recommend setting the Xmx value larger than 128 GB.
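For example, with an Xmx value of 64 GB, the agent’s Java process could consume anywhere from slightly above 64 GB to well over 128 GB of memory, so the agent host must be sized with this overhead in mind.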
SOURCE IMPACT
This chapter will provide additional input on the impact that SAP HANA smart data integration based Remote
Table Replication will have on the source system. Compared to data federation in SAP Data Warehouse
Cloud or frequent data snapshots, real-time replication requires only a few additional resources.
When you enable real-time access for a remote table in SAP Data Warehouse Cloud, the resource
consumption for the initial load is similar to that of a snapshot load, and the impact can be mitigated by the same
measures: partitioning, filtering, and projection. During the actual real-time replication, the resource
consumption depends on the data-change volume and the change-data-capture technique. SAP Data
Warehouse Cloud makes use of trigger-based replication through database adapters and API-based
replication for ABAP-based source systems:
• Triggers update generated shadow tables and a so-called "trigger queue" in the source database.
Execution of these triggers slows down booking transactions; a simplified sketch of such a trigger is
shown after this list. This impact can be mitigated by fine-tuning the advanced connection properties in
SAP Data Warehouse Cloud to match the update behavior in the source database (see chapter
"Connection (Remote Source) properties"). Multiple connections to the same source database can also
help to ensure that data changes are fetched fast enough to avoid congestion.
• In ABAP-based source systems, the ODP API uses a set of techniques to capture data changes:
ABAP background jobs to extract data changes through a so-called "extractor", database triggers or
similar hooks into booking transactions. Data changes are compressed and queued in a so-called
"delta queue" where they are kept after being fetched for a retention time of 24 hours. For further
details refer to Operational Data Provisioning.
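As referenced in the first bullet above, the sketch below shows what a trigger-based capture mechanism can look like in principle, written here in generic Oracle-style PL/SQL; the shadow table, trigger queue table, and sequence names are hypothetical, and the triggers actually generated by the adapters are source-specific and more elaborate.

-- Illustrative only: record each inserted row in a shadow table and the trigger queue
CREATE OR REPLACE TRIGGER orders_ins_trg
AFTER INSERT ON source_schema.orders
FOR EACH ROW
BEGIN
    -- keep a copy of the change so the adapter can pick it up later
    INSERT INTO source_schema.orders_shadow (order_id, operation, change_time)
    VALUES (:NEW.order_id, 'I', SYSTIMESTAMP);
    -- register the change in the trigger queue that the adapter scans
    INSERT INTO source_schema.trigger_queue (table_name, sequence_id)
    VALUES ('ORDERS', source_schema.queue_seq.NEXTVAL);
END;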
For both trigger-based replication and API-based replication, creating a dedicated user for replicating data
into SAP Data Warehouse Cloud helps to control the resource consumption in the source system.
The following sections provide additional input on the impact that SAP HANA smart data integration based
Remote Table Replication will have on the capacity required for your SAP Data Warehouse Cloud (SAP
HANA Cloud) tenant.
Please find all relevant information for calculating the number of capacity units required and for
configuring the size of your SAP Data Warehouse Cloud Tenant in our community:
https://community.sap.com/topics/data-warehouse-cloud
In addition to guidance provided under the above link, SAP HANA smart data integration replication has
additional overhead. We recommend sizing your compute resources (memory) such that after all planned
source tables are replicated, at least 40% of the Global Allocation Limit (GAL) remains available as free memory.
Due consideration should be made for future data growth.
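For example, on a tenant with a Global Allocation Limit of 1 TB, this means planning so that no more than roughly 600 GB is occupied once all planned source tables have been replicated.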
The real-time portion of replication is not expected to utilize large amounts of memory unless extremely large
transactions are the dominant type of transaction. The main consideration for the real-time portion is the
amount of disk space available, since pending unapplied transactions are stored in the persistent store
temporarily.
During the initial load of a Remote Table, the additional memory utilized is at least 3 times the raw uncompressed
data size of the partition being loaded. Another factor that influences the memory utilized is the partitioning
that may be defined for the table.
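For example, a Remote Table partition holding 20 GB of raw uncompressed source data should be expected to require at least 60 GB of additional memory while its initial load is running.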
The following are observed memory utilizations for typical SAP Tables MARM (30 columns), MBEW (110
columns), EKPO (345 columns):
EKPO:
#Partitions   FetchSize   Memory Used: DPAgent (MB)   Memory Used: DPServer (MB)   Memory Used: Indexserver (MB)
1             10,000      17200                       421                          600022
www.sap.com/contactsap
The information contained herein may be changed without prior notice. Some software products marketed by SAP SE and its distributors contain proprietary software components of other software vendors.
National product specifications may vary.
These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP or its affiliated companies shall not be liable
for errors or omissions with respect to the materials. The only warranties for SAP or SAP affiliate company products and services are those that are set forth in the express warranty statements
accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.
In particular, SAP SE or its affiliated companies have no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality
mentioned therein. This document, or any related presentation, and SAP SE’s or its affiliated companies’ strategy and possible future developments, products, and/or platform directions and functionality are
all subject to change and may be changed by SAP SE or its affiliated companies at any time for any reason without notice. The information in this document is not a commitment, promise, or legal obligation
to deliver any material, code, or functionality. All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are
cautioned not to place undue reliance on these forward-looking statements, and they should not be relied upon in making purchasing decisions.
SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other
countries. All other product and service names mentioned are the trademarks of their respective companies. See www.sap.com/trademark for additional trademark information and notices.