Improving S3 Throughput with AWS SDK for CPP v1.9
We’re happy to announce the General Availability of the AWS SDK for C++ v1.9. This release features deeper integration with the AWS Common Runtime (CRT) and improvements to the process of building the AWS SDK for C++ from source. It also introduces a new Amazon S3 client that provides high throughput for Amazon S3 GET and PUT operations. The all-new S3 client is implemented on top of the AWS Common Runtime (CRT) libraries, and is aptly named the "S3 CRT client". Lastly, there are configuration updates related to Endpoint Discovery.
On this page, we go through the following updates in v1.9:
- AWS Common Runtime (CRT) Library Integration and S3 CRT Client
- Building AWS SDK for C++ and AWS Common Runtime libraries from source
- Configuring memory management for AWS Common Runtime libraries
- Configuring logging system for AWS Common Runtime libraries
- Working with the new S3 CRT client
- Endpoint Discovery Configuration Improvements
- Instructions to Migrate to v1.9
We appreciate all the feedback we receive through GitHub -- keep it coming! We have addressed feedback about simplifying the dependencies and the CMake build process. In v1.9 we introduce a new C++ wrapper around the existing aws-c-* libraries, which changes how the C++ SDK consumes dependencies, and also allows us to easily extend AWS Common Runtime (CRT) integration with more CRT libraries.
This deeper integration with the CRT requires a different build process and considerations regarding memory management and message logging.
In AWS SDK for C++ v1.7, we started to integrate AWS Common Runtime (CRT) libraries by adding dependencies on aws-c-common, aws-checksums and aws-c-event-stream. Unfortunately, this design pattern led to continually adding more aws-c-* libraries as new dependencies of the SDK for C++.
For SDK for C++ v1.8 and earlier, all CRT libraries are built and installed via the CMake command. This led to unexpected behaviors like the CMake command requiring root permission. Debugging those CRT libraries is challenging because it is hard to find and update source code.
Version 1.9 addresses these issues by introducing an intermediate layer between the aws-sdk-cpp and aws-c-* libraries. This new layer is called aws-crt-cpp, and is a git submodule of the SDK for C++. aws-crt-cpp also has the aws-c-* libraries (including aws-c-common, aws-checksums, aws-c-event-stream, etc.) as its own git submodules. This allows the C++ SDK to get all CRT libraries recursively. It also pushes the installation of dependencies to the end of the process, rather than it occurring during the initial CMake command. Going forward, we will follow a clean separation of roles where 1) CMake will only do configuration, 2) the build step will only build the SDK and dependencies, and 3) the install step will only do installation.
A change in git syntax (provided in the last section, Instructions to Migrate to v1.9) is required to get the source code of all CRT libraries included as git submodules. This must be completed to migrate to v1.9. After you have recursively obtained the new source structure, you can build and install the C++ SDK as usual. By default, the SDK for C++ will install CRT libraries for you. If you don’t want the SDK to build and install those dependencies, just specify -DBUILD_DEPS=OFF in your CMake command.
The AWS Common Runtime (CRT) libraries have their own default memory management that controls memory allocation and deallocation. When integrated with the SDK for C++, these defaults are overridden by any memory management used in the SDK for C++. Thus, you don’t need to update your memory management settings with this release. Just be aware that your custom memory management will apply to both the SDK for C++ and the CRT.
As a quick reminder, you can specify custom memory management like this:
MyMemoryManager sdkMemoryManager;
SDKOptions options;
options.memoryManagementOptions.memoryManager = &sdkMemoryManager;
See more details here: https://github.com/aws/aws-sdk-cpp/blob/master/Docs/Memory_Management.md
Currently, we don’t support using different memory systems for the SDK for C++ and Common Runtime libraries. If the ability to independently set these is important to you, let us know by submitting a feature request.
All log messages from the AWS Common Runtime (CRT) libraries will be redirected to the SDK for C++ by default. The log level and logging system you specify for the SDK for C++ also applies to the CRT. No changes are required in your application.
In the example that follows, log levels for both the SDK for C++ and the CRT are set to Trace. The default logging system will put the log messages from both the SDK for C++ and the CRT into the same file.
SDKOptions options;
options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Trace;
If you are using the Trace log level for both, it’s likely that your log file will be overwhelmed by logs from CRT libraries, and become too large. You can independently control the logging for the CRT libraries, either by redirecting its output to a separate log file, or by setting a less verbose log level for messages from the CRT.
In the example that follows, the log level for CRT output only is set to Info so it will not clutter the logs:
options.loggingOptions.crt_logger_create_fn =
[](){ return Aws::MakeShared<Aws::Utils::Logging::DefaultCRTLogSystem>("CRTLogSystem", Aws::Utils::Logging::LogLevel::Info); };
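To make the effect of separate thresholds concrete, here is a minimal, SDK-independent sketch of level filtering. The names `Source`, `LogThresholds`, and `ShouldLog` are hypothetical and not part of the SDK. Only the ordering of levels (Off, Fatal, Error, Warn, Info, Debug, Trace) mirrors Aws::Utils::Logging::LogLevel. The actual filtering inside the SDK and the CRT may be implemented differently.

```cpp
// Verbosity levels ordered from least to most verbose, mirroring the
// ordering of Aws::Utils::Logging::LogLevel.
enum class LogLevel { Off = 0, Fatal, Error, Warn, Info, Debug, Trace };

// Hypothetical tag for where a message originates.
enum class Source { Sdk, Crt };

// Hypothetical pair of independent thresholds, one per source.
struct LogThresholds {
    LogLevel sdk;
    LogLevel crt;
};

// A message is written when its level is no more verbose than the
// threshold configured for its source.
bool ShouldLog(const LogThresholds& thresholds, Source source, LogLevel messageLevel) {
    const LogLevel threshold =
        (source == Source::Crt) ? thresholds.crt : thresholds.sdk;
    return messageLevel != LogLevel::Off &&
           static_cast<int>(messageLevel) <= static_cast<int>(threshold);
}
```

With thresholds of Trace for the SDK and Info for the CRT, a Debug message from the CRT is dropped in this model, while a Debug message from the SDK is still written.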
You can also implement your own logging system by extending CRTLogSystemInterface and specifying it via options.loggingOptions.crt_logger_create_fn.
The S3 CRT client has a similar interface to the regular S3 client, but has a totally different underlying implementation for PutObject and GetObject in order to achieve better performance. The overall download performance can reach 90 Gbps on a c5n.18xlarge EC2 instance (with 100 Gbps network bandwidth).
Behind the scenes, each GET or PUT operation turns into multiple HTTP requests. These requests are performed in parallel across connections to multiple S3 server addresses, allowing the client’s total throughput to exceed what it can get from any individual server. A custom DNS resolver continually searches for additional addresses. Splitting one large operation into smaller requests also means an individual failed request can be retried without restarting the operation for the whole file. A PUT operation turns into a multipart upload. A GET operation turns into multiple "ranged" GET requests.
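The arithmetic behind splitting one GET into "ranged" GET requests can be sketched as follows. This is an illustration only, not the S3 CRT client's implementation: `ByteRange`, `SplitIntoRanges`, and `ToRangeHeader` are hypothetical names, and the part size is an arbitrary parameter here. The only external fact used is the standard HTTP Range header form, `bytes=first-last`, with an inclusive last offset.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// One ranged GET: an inclusive byte range, as used by the HTTP Range header.
struct ByteRange {
    std::uint64_t first;  // offset of the first byte
    std::uint64_t last;   // offset of the last byte (inclusive)
};

// Split an object of objectSize bytes into consecutive ranges of at most
// partSize bytes each. Returns no ranges for an empty object or a zero
// part size.
std::vector<ByteRange> SplitIntoRanges(std::uint64_t objectSize, std::uint64_t partSize) {
    std::vector<ByteRange> ranges;
    if (objectSize == 0 || partSize == 0) {
        return ranges;
    }
    for (std::uint64_t first = 0; first < objectSize; first += partSize) {
        // The final range may be shorter than partSize.
        const std::uint64_t last = std::min(first + partSize, objectSize) - 1;
        ranges.push_back({first, last});
    }
    return ranges;
}

// Format a range as an HTTP Range header value.
std::string ToRangeHeader(const ByteRange& range) {
    return "bytes=" + std::to_string(range.first) + "-" + std::to_string(range.last);
}
```

For a 20-byte object and an 8-byte part size, `SplitIntoRanges(20, 8)` yields the ranges 0-7, 8-15, and 16-19. Because each range is an independent request, a failed range can be retried on its own, which is the property the paragraph above describes.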
The new S3 CRT client and the regular S3 client use different HTTP libraries. The regular S3 client uses libcurl on Linux and Mac, and WinHTTP, WinINet and IXMLHTTPRequest2 on Windows. Its asynchronous model is based on a thread pool construct. The S3 CRT client, by contrast, uses an HTTP library from the AWS Common Runtime (CRT) called aws-c-http, which provides asynchronous I/O based on an event loop construct.
The S3 CRT client is available with a new library: aws-cpp-sdk-s3-crt. It has the same API signature as the existing aws-cpp-sdk-s3 library, but uses the new implementation for significantly better performance. To transfer data using the new functionality, use the ClientConfiguration defined in the Aws::S3Crt namespace, rather than the standard ClientConfiguration class defined in the Aws::Client namespace.
The AWS SDK for C++ already offers a high-performance TransferManager class for S3 GET and PUT in the aws-cpp-sdk-transfer library. The new aws-cpp-sdk-s3-crt service client should offer better performance and uses async I/O based on an event loop group, while being simpler to configure and using fewer system resources.
In your CMake configuration, link against the new service client’s library:
find_package(AWSSDK REQUIRED COMPONENTS s3-crt)
Sample code that uploads and downloads a file can be found here: aws-doc-sdk-examples
Endpoint Discovery allows SDKs to access service endpoints (URLs to access various resources) while still allowing flexibility for AWS to alter URLs as needed. This pattern allows your code to automatically detect new endpoints. Some services have no fixed endpoints. Instead, you obtain the current endpoints at runtime by first making a request to get them, and the code then uses those endpoints for other operations. For example, the Amazon Timestream C++ client makes a DescribeEndpoints request to retrieve the available endpoints, and then uses those to complete other operations such as CreateDatabase or CreateTable. You may find more details here: https://docs.aws.amazon.com/timestream/latest/developerguide/Using-API.endpoint-discovery.html. Endpoint discovery is required in some services, and optional in others.
In SDK for C++ v1.8 and before, enableEndpointDiscovery defaulted to false, and you could turn Endpoint Discovery on or off explicitly only in the client configuration:
Aws::Client::ClientConfiguration config;
config.enableEndpointDiscovery = true;
In v1.9, we introduce a breaking change to this default behavior. If you don’t set this value, the client will decide whether it should use Endpoint Discovery or not. Endpoint Discovery is turned on by default for service clients that require endpoint discovery, e.g. the Amazon Timestream C++ client. It is turned off by default for service clients where endpoint discovery is optional, e.g. the Amazon DynamoDB C++ client.
To let the C++ SDK decide the default behavior automatically, we changed the type of config.enableEndpointDiscovery from bool to Aws::Crt::Optional<bool>.
You don’t need to update your application code if you’ve already turned on or off this value explicitly. However, if you have not explicitly set this, the SDK can determine if endpoint discovery should be turned on or off for different service clients.
Additionally, we’ve added support for configuring it through environment variables or the configuration file. The following order of precedence is used to resolve whether Endpoint Discovery is enabled or disabled:
- In the client configuration, enableEndpointDiscovery is set to true or false
- In the environment variables, AWS_ENABLE_ENDPOINT_DISCOVERY is set to true or false
- In the configuration file, endpoint_discovery_enabled is set to true or false
- Defaults to either true or false depending on whether the service requires Endpoint Discovery (e.g. Timestream defaults to true, DynamoDB defaults to false)
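The precedence order above can be sketched as a small resolution function. This is a standalone model rather than the SDK's code: `EndpointDiscoverySources`, `ParseBool`, and `ResolveEndpointDiscovery` are hypothetical names, `std::optional` (C++17) stands in for Aws::Crt::Optional, and the assumption that only the exact strings "true" and "false" count as set may not match how the SDK parses these values.

```cpp
#include <optional>
#include <string>

// Parse "true"/"false" text. Anything else, including a missing value,
// counts as "not set" (an assumption of this sketch).
std::optional<bool> ParseBool(const std::optional<std::string>& text) {
    if (!text) return std::nullopt;
    if (*text == "true") return true;
    if (*text == "false") return false;
    return std::nullopt;
}

// Hypothetical bundle of the four inputs, in order of precedence.
struct EndpointDiscoverySources {
    std::optional<bool> clientConfig;        // enableEndpointDiscovery
    std::optional<std::string> environment;  // AWS_ENABLE_ENDPOINT_DISCOVERY
    std::optional<std::string> configFile;   // endpoint_discovery_enabled
    bool serviceRequiresDiscovery;           // e.g. true for Timestream
};

// Return the first explicitly set value, falling back to the
// per-service default when nothing is set.
bool ResolveEndpointDiscovery(const EndpointDiscoverySources& sources) {
    if (sources.clientConfig) return *sources.clientConfig;
    if (auto fromEnv = ParseBool(sources.environment)) return *fromEnv;
    if (auto fromFile = ParseBool(sources.configFile)) return *fromFile;
    return sources.serviceRequiresDiscovery;
}
```

The optional type is what makes "not set" distinguishable from "set to false": a plain bool could not tell the resolver to keep looking at lower-precedence sources.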
The v1.9 SDK will build and install AWS Common Runtime (CRT) libraries for you by default, but you need to check out all git submodules before running CMake commands:
- New users: If you haven’t downloaded the source code for SDK for C++, you can get all git submodules recursively by:
git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
- Existing users: If you’ve already downloaded source code for SDK for C++, e.g. in directory aws-sdk-cpp, you can update the git submodule by:
cd aws-sdk-cpp
git checkout main
git pull origin main
git submodule update --init --recursive
For Endpoint Discovery configuration, v1.9 changes the default behavior for Amazon Timestream clients. Endpoint Discovery for Timestream clients is turned on by default in v1.9, whereas the default was off in prior versions. To turn it off, you now need to specify that explicitly.
GitHub issues. Customers can leave public feedback by opening a GitHub issue in the new repository. This is the preferred mechanism to give feedback so that other customers can engage in the conversation or add their 👍 to issues.
You can open pull requests for fixes or additions to the AWS SDK for C++ v1.9. All pull requests must be submitted under the Apache 2.0 license and will be reviewed by an SDK team member before merging. Accompanying unit tests are appreciated.