Highly customized and optimized BERT inference directly on NVIDIA (CUDA, CUBLAS) or Intel MKL, without tensorflow and its framework overhead.
ONLY BERT (Transformer) is supported.
- Tesla P4
- 28 * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
- Debian GNU/Linux 8 (jessie)
- gcc (Debian 4.9.2-10+deb8u1) 4.9.2
- CUDA: release 9.0, V9.0.176
- MKL: 2019.0.1.20181227
- tensorflow: 1.12.0
- BERT: seq_length = 32
batch size | 128 (ms) | 32 (ms) |
---|---|---|
tensorflow | 255.2 | 70.0 |
cuBERT | 184.6 | 54.5 |
batch size | 128 (ms) | 1 (ms) |
---|---|---|
tensorflow | 1504.0 | 69.9 |
mklBERT | 984.9 | 24.0 |
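The speedups implied by the two tables can be worked out directly. The short sketch below only divides the latencies listed above. The dictionary layout and the `speedup` helper are illustrative names, not part of cuBERT.

```python
# Latencies in milliseconds, copied from the tables above:
# (library, batch size) -> (tensorflow latency, library latency)
LATENCY_MS = {
    ("cuBERT", 128): (255.2, 184.6),
    ("cuBERT", 32): (70.0, 54.5),
    ("mklBERT", 128): (1504.0, 984.9),
    ("mklBERT", 1): (69.9, 24.0),
}


def speedup(baseline_ms, optimized_ms):
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


for (library, batch_size), (tf_ms, lib_ms) in LATENCY_MS.items():
    ratio = speedup(tf_ms, lib_ms)
    print(f"{library} batch={batch_size}: {ratio:.2f}x over tensorflow")
```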
Note: MKL should be run with OMP_NUM_THREADS=? set to control its number of threads. Other environment variables and their possible values include:
- KMP_BLOCKTIME=0
- KMP_AFFINITY=granularity=fine,verbose,compact,1,0
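As a minimal sketch, these variables can be prepared in Python before the process that loads MKL is started. The helper name `mkl_env` is hypothetical. The thread count is left as a parameter because the note above does not prescribe a value.

```python
import os


def mkl_env(num_threads):
    """Return a copy of the current environment with the MKL/OpenMP settings applied."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_threads)  # thread count: choose per machine
    env["KMP_BLOCKTIME"] = "0"
    env["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
    return env


# Example usage (assumes the test binary has been built in the current directory):
#   import subprocess
#   subprocess.run(["./cuBERT_test"], env=mkl_env(4), check=True)
```

The environment is passed to a child process because OpenMP runtimes typically read these variables when they initialize. Changing them after the library is loaded may have no effect.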
cuBERT can be accelerated by Tensor Cores and mixed precision on NVIDIA Volta and Turing GPUs. We support mixed precision in which variables are stored in fp16 and computation is performed in fp32. The typical accuracy error is less than 1% compared with single-precision inference, while inference runs more than 2x faster.
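To illustrate the idea of fp16 storage with fp32 computation, the sketch below rounds weights to half precision and accumulates a dot product with values rounded to single precision. It is a CPU-side approximation using only the Python standard library, not cuBERT's GPU kernels, and all function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_mixed(weights, inputs):
    """Dot product with weights stored in fp16 and accumulation rounded to fp32."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        stored = to_fp16(w)  # the variable as it would be stored
        acc = to_fp32(acc + stored * to_fp32(x))  # approximate fp32 accumulate
    return acc


weights = [0.1 * (i + 1) for i in range(64)]
inputs = [0.01 * (i + 1) for i in range(64)]
reference = sum(w * x for w, x in zip(weights, inputs))
relative_error = abs(dot_mixed(weights, inputs) - reference) / reference
print(f"relative error: {relative_error:.2e}")
```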
We support the following two pooling methods.
- The standard BERT pooler, which is defined as:
    with tf.variable_scope("pooler"):
      # We "pool" the model by simply taking the hidden state corresponding
      # to the first token. We assume that this has been pre-trained
      first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
      self.pooled_output = tf.layers.dense(
          first_token_tensor,
          config.hidden_size,
          activation=tf.tanh,
          kernel_initializer=create_initializer(config.initializer_range))
- Simple average pooler:
    self.pooled_output = tf.reduce_mean(self.sequence_output, axis=1)
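For readers without TensorFlow at hand, the two poolers can be sketched in plain Python on nested lists. This mirrors the two snippets above for illustration only. The function names and the [input][output] kernel layout are assumptions, not cuBERT's internal API.

```python
import math


def first_token_pooler(sequence_output, kernel, bias):
    """tanh(first_token @ kernel + bias) for each item in the batch.

    sequence_output: [batch][seq_length][hidden]
    kernel: [hidden][hidden] (assumed [input][output] layout), bias: [hidden]
    """
    pooled = []
    for tokens in sequence_output:
        first = tokens[0]  # hidden state of the first token
        pooled.append([
            math.tanh(sum(first[i] * kernel[i][j] for i in range(len(first))) + bias[j])
            for j in range(len(bias))
        ])
    return pooled


def average_pooler(sequence_output):
    """Mean of the hidden states over the sequence axis."""
    pooled = []
    for tokens in sequence_output:
        hidden_size = len(tokens[0])
        pooled.append([
            sum(token[j] for token in tokens) / len(tokens)
            for j in range(hidden_size)
        ])
    return pooled


batch = [[[1.0, 2.0], [3.0, 4.0]]]  # batch=1, seq_length=2, hidden=2
identity = [[1.0, 0.0], [0.0, 1.0]]
print(average_pooler(batch))
print(first_token_pooler(batch, identity, [0.0, 0.0]))
```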
The following outputs are supported:
cuBERT_OutputType | python code |
---|---|
cuBERT_LOGITS | model.get_pooled_output() * output_weights + output_bias |
cuBERT_PROBS | probs = tf.nn.softmax(logits, axis=-1) |
cuBERT_POOLED_OUTPUT | model.get_pooled_output() |
cuBERT_SEQUENCE_OUTPUT | model.get_sequence_output() |
cuBERT_EMBEDDING_OUTPUT | model.get_embedding_output() |
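The relationship between the pooled output, cuBERT_LOGITS and cuBERT_PROBS can be sketched in plain Python for a single example. The function names and the [hidden][num_labels] weight layout are assumptions made for illustration. Check your own model's classifier head for the actual layout.

```python
import math


def compute_logits(pooled_output, output_weights, output_bias):
    """logits[k] = sum_i pooled_output[i] * output_weights[i][k] + output_bias[k]."""
    return [
        sum(pooled_output[i] * output_weights[i][k] for i in range(len(pooled_output)))
        + output_bias[k]
        for k in range(len(output_bias))
    ]


def softmax(logits):
    """Numerically stable softmax over the last (only) axis."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


pooled = [0.5, -0.5]  # hidden_size = 2
weights = [[1.0, 0.0], [0.0, 1.0]]  # assumed [hidden][num_labels]
bias = [0.0, 0.0]
logits = compute_logits(pooled, weights, bias)
probs = softmax(logits)
print(logits, probs)
```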
mkdir build && cd build
# to build with CUDA
cmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_GPU=ON -DCUDA_ARCH_NAME=Common ..
# or to build with MKL
cmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_MKL_SUPPORT=ON ..
make -j4
# install to /usr/local
# it will also install MKL if -DcuBERT_ENABLE_MKL_SUPPORT=ON
sudo make install
If you would like to run tfBERT_benchmark for performance comparison, please first install the tensorflow C API from https://www.tensorflow.org/install/lang_c.
Download the BERT test model bert_frozen_seq32.pb and vocab.txt from Dropbox, and put them under the build directory before running make test or ./cuBERT_test.
We provide a simple Python wrapper built with Cython. It can be built and installed after the C++ build as follows:
cd python
python setup.py bdist_wheel
# install
pip install dist/cuBERT-xxx.whl
# test
python cuBERT_test.py
Please check the Python API usage and examples at cuBERT_test.py for more details.
The Java wrapper is implemented through JNA. After installing Maven and building the C++ library, it can be built as follows:
cd java
mvn clean package # -DskipTests
When using the Java JAR, you need to set jna.library.path to the location of libcuBERT.so if it is not installed on the system path. In addition, jna.encoding should be set to UTF8 by passing -Djna.encoding=UTF8 in the JVM start-up script.
Please check the Java API usage and example at ModelTest.java for more details.
A pre-built Python binary package (currently only with MKL on Linux) can be installed as follows:
- Download and install MKL to the system path.
- Download the wheel package and run pip install cuBERT-xxx-linux_x86_64.whl
- Run python -c 'import libcubert' to verify your installation.
cuBERT is built with protobuf-c to avoid version and code conflicts with tensorflow protobuf.
Libraries compiled with different CUDA versions are not compatible.
MKL is dynamically linked. Both cuBERT and MKL are installed by sudo make install.
We assume the typical use case of cuBERT is online serving, where concurrent requests of different batch_size should be served as fast as possible. Thus, throughput and latency should be balanced, especially in a pure CPU environment.
As the vanilla class Bert is not thread-safe because of its internal buffers for computation, a wrapper class BertM is written to hold locks for different Bert instances for thread safety. BertM chooses one underlying Bert instance in a round-robin manner, and consecutive requests to the same Bert instance might be queued by its corresponding lock.
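The round-robin-plus-lock scheme described above can be sketched in Python as follows. This is an illustration of the pattern only, not the actual C++ BertM implementation. The class name `RoundRobinPool` is hypothetical, and plain callables stand in for Bert instances.

```python
import itertools
import threading


class RoundRobinPool:
    """Dispatch calls to non-thread-safe models, one lock per model."""

    def __init__(self, models):
        if not models:
            raise ValueError("at least one model is required")
        self._slots = [(model, threading.Lock()) for model in models]
        self._counter = itertools.count()
        self._counter_lock = threading.Lock()

    def compute(self, *args):
        # Pick the next model in round-robin order.
        with self._counter_lock:
            index = next(self._counter) % len(self._slots)
        model, lock = self._slots[index]
        # Requests that land on the same model are queued by its lock.
        with lock:
            return model(*args)


# Two stand-in "models" that tag their output so the dispatch order is visible.
pool = RoundRobinPool([lambda x: ("model-0", x), lambda x: ("model-1", x)])
print([pool.compute(i) for i in range(4)])
```

Keeping one lock per model, rather than one global lock, lets requests that land on different models run concurrently.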
One Bert is placed on one GPU card. The maximum number of concurrent requests is the number of usable GPU cards on one machine, which can be controlled by CUDA_VISIBLE_DEVICES if it is specified.
For a pure CPU environment, things are more complicated than on GPU. There are two levels of parallelism:
- Request level. Concurrent requests will compete for CPU resources if the online server itself is multi-threaded. If the server is single-threaded (for example, some server implementations in Python), things will be much easier.
- Operation level. The matrix operations are parallelized by OpenMP and MKL. The maximum parallelism is controlled by OMP_NUM_THREADS, MKL_NUM_THREADS, and many other environment variables. We recommend that users first read Using Threaded Intel® MKL in Multi-Thread Application and Recommended settings for calling Intel MKL routines from multi-threaded applications.
Thus, we introduce CUBERT_NUM_CPU_MODELS for better control of request-level parallelism. This variable specifies the number of Bert instances created on CPU/memory, and it plays the same role as CUDA_VISIBLE_DEVICES does for GPU.
- If you have a limited number of CPU cores (old or desktop CPUs, or in Docker), it is not necessary to use CUBERT_NUM_CPU_MODELS. For example, with 4 CPU cores, a request-level parallelism of 1 and an operation-level parallelism of 4 should work quite well.
- But if you have many CPU cores, such as 40, it might be better to try a request-level parallelism of 5 and an operation-level parallelism of 8.
In summary, OMP_NUM_THREADS or MKL_NUM_THREADS defines how many threads one model can use, and CUBERT_NUM_CPU_MODELS defines how many models there are in total.
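A starting point for these two settings can be derived from the core count, as in the sketch below. The helper `cpu_plan` is hypothetical, and the rule it applies (models times threads should not exceed the number of cores) is an assumed heuristic that reproduces the two examples above, not a guarantee of the best configuration.

```python
def cpu_plan(total_cores, threads_per_model):
    """Return environment settings so that models * threads <= total_cores."""
    if total_cores < 1 or threads_per_model < 1:
        raise ValueError("core and thread counts must be positive")
    threads = min(threads_per_model, total_cores)
    models = max(1, total_cores // threads)
    return {
        "CUBERT_NUM_CPU_MODELS": str(models),  # request-level parallelism
        "OMP_NUM_THREADS": str(threads),       # operation-level parallelism
    }


print(cpu_plan(4, 4))   # small machine: one model using all cores
print(cpu_plan(40, 8))  # large machine: several models, 8 threads each
```

The resulting dictionary can be merged into the environment of the serving process before it starts.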
Again, per-request latency and overall throughput should be balanced, and the right balance depends on the model seq_length, batch_size, your CPU cores, your server QPS, and many other factors. You should run many benchmarks to find the best trade-off. Good luck!
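A minimal timing harness for such experiments might look like the sketch below. It measures sequential calls to an arbitrary callable, so it does not model concurrent load. The function name `benchmark` and the dummy workload are placeholders. Substitute a call into your own inference wrapper.

```python
import statistics
import time


def benchmark(fn, num_requests=100, warmup=10):
    """Time sequential calls to fn; return mean/p99 latency (ms) and QPS."""
    for _ in range(warmup):
        fn()  # warm-up calls are not measured
    latencies_ms = []
    start = time.perf_counter()
    for _ in range(num_requests):
        t0 = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    elapsed = time.perf_counter() - start
    latencies_ms.sort()
    p99_index = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "p99_ms": latencies_ms[p99_index],
        "qps": num_requests / elapsed if elapsed > 0 else float("inf"),
    }


# Placeholder workload standing in for one inference request.
result = benchmark(lambda: sum(range(10000)), num_requests=50, warmup=5)
print(result)
```

Running the same harness across different OMP_NUM_THREADS and CUBERT_NUM_CPU_MODELS settings gives comparable numbers for choosing a configuration.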
- fanliwen
- wangruixin
- fangkuan
- sunxian