TensorRT  7.2.1.6
NVIDIA TensorRT
bertQKVToContextPlugin

Description

Takes query, key and value tensors and computes scaled multi-head attention: it computes the scaled dot-product attention scores softmax(K'Q/sqrt(HeadSize)) and returns the values weighted by these attention scores.
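As an illustration of the formula above, here is a minimal pure-Python sketch of scaled dot-product attention for a single head. It is a reference for the arithmetic only, not the plugin's CUDA implementation; the function names `softmax` and `attention_single_head` are hypothetical and do not appear in TensorRT.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_single_head(q, k, v):
    """Scaled dot-product attention for one head.

    q, k, v are lists of S rows; each row is a list of H floats,
    where H is the head size. Returns S rows of H floats.
    """
    head_size = len(q[0])
    scale = 1.0 / math.sqrt(head_size)
    out = []
    for q_i in q:
        # Score of this query against every key, scaled by 1/sqrt(H).
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        probs = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(p * v_j[d] for p, v_j in zip(probs, v))
                    for d in range(len(v[0]))])
    return out
```

When all keys are identical, every position receives the same attention weight, so the output is the plain average of the value rows.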

Structure

The bertQKVToContextPlugin takes two inputs: input and, optionally, input_mask.

input input is a tensor with shape [S, B, 3 * E] where S is the sequence length, B is the batch size and E is the hidden size. This plugin makes strong assumptions about its input:

  • The input tensor contains all 3 matrices Q, K, V
  • This input tensor is computed by multiplying a tensor of size [S, B, E] with the weights W_qkv of size [E, 3 * E]
  • The weight matrix W_qkv is NOT just the vertical concatenation of the individual matrices W_tmp = [W_q', W_k', W_v']'. Instead, start with W_tmp, reshape it into [E, 3, N, H] (where N * H = E, N is the number of heads and H is the head size), transpose it into [E, N, 3, H] and reshape it back to [E, 3 * E]. The interpretation is to lay out the k-th heads of Q, K and V next to each other, instead of first all N heads of Q, then all N heads of K, then all N heads of V
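The reshape/transpose/reshape sequence described in the last bullet amounts to a permutation of the columns of W_tmp. The sketch below spells that permutation out in pure Python, assuming W_tmp is given as E rows of 3 * E columns ordered as all of Q, then all of K, then all of V. The function name `interleave_qkv_columns` is hypothetical.

```python
def interleave_qkv_columns(w_tmp, num_heads):
    """Reorder columns from [E, 3, N, H] layout to [E, N, 3, H] layout.

    w_tmp: list of E rows, each with 3 * E entries ordered as
           [Q head 0..N-1 | K head 0..N-1 | V head 0..N-1].
    Returns a new matrix whose columns are ordered as
           [Q0 K0 V0 | Q1 K1 V1 | ...], one block of 3 * H per head.
    """
    hidden = len(w_tmp)
    head_size = hidden // num_heads
    if num_heads * head_size != hidden:
        raise ValueError("hidden size must be divisible by num_heads")
    out = []
    for row in w_tmp:
        if len(row) != 3 * hidden:
            raise ValueError("each row must have 3 * E columns")
        new_row = [None] * (3 * hidden)
        for t in range(3):                  # 0: Q, 1: K, 2: V
            for n in range(num_heads):      # head index
                for h in range(head_size):  # position within the head
                    src = t * hidden + n * head_size + h
                    dst = (n * 3 + t) * head_size + h
                    new_row[dst] = row[src]
        out.append(new_row)
    return out
```

For E = 4 and N = 2, a row labelled `q0a q0b q1a q1b k0a k0b k1a k1b v0a v0b v1a v1b` becomes `q0a q0b k0a k0b v0a v0b q1a q1b k1a k1b v1a v1b`.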

input_mask input_mask is a tensor of shape [B] where B is the batch size. The input mask is encoded in the format described in embLayerNormPlugin, and contains the number of valid elements from the start of the sequence. If provided, the attention scores, i.e. the softmax distribution, are only computed over the elements designated as valid by the input mask.
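To make the effect of the mask concrete, the following sketch computes a softmax restricted to the first `valid_len` positions of one row of attention scores. It assumes the mask has already been decoded to a plain integer count per sequence; the actual encoding is the one defined by embLayerNormPlugin and is not reproduced here. The function name `masked_softmax` is hypothetical.

```python
import math


def masked_softmax(scores, valid_len):
    """Softmax over the first valid_len scores; padded positions get 0.

    scores:    list of S attention scores for one query position.
    valid_len: number of valid elements from the start of the sequence
               (assumed here to be a plain integer).
    """
    if not 0 < valid_len <= len(scores):
        raise ValueError("valid_len must be in 1..len(scores)")
    valid = scores[:valid_len]
    m = max(valid)
    exps = [math.exp(s - m) for s in valid]
    total = sum(exps)
    # Positions beyond valid_len receive zero attention weight.
    return [e / total for e in exps] + [0.0] * (len(scores) - valid_len)
```

Scores at padded positions never influence the result, however large they are.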

The bertQKVToContextPlugin generates the following output:

output output is a tensor with shape [S, B, E] where B is the batch size.

Parameters

bertQKVToContextPlugin has plugin creator class QKVToContextPluginDynamicCreator and plugin class CustomQKVToContextPluginDynamic.

The parameters are defined below and consist of the following attributes:

| Type  | Parameter   | Version | Description                                                                 |
|-------|-------------|---------|-----------------------------------------------------------------------------|
| int   | type_id     | 1, 2    | Integer encoding the DataType (0: FP32, 1: FP16).                           |
| int   | hidden_size | 1, 2, 3 | The hidden size, denoted by E above.                                        |
| int   | num_heads   | 1, 2, 3 | The number of self-attention heads.                                         |
| bool  | has_mask    | 1, 2    | Whether to use the input_mask input.                                        |
| float | dq_probs    | 1, 2, 3 | Inner layer scale factor when run in int8 precision, default 1.f/127.f.     |
| int   | var_seqlen  | 2       | Whether to use variable sequence length (0: disable, 1: enable), default 0. |
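The version column determines which fields a given plugin version accepts. As a rough illustration, the sketch below encodes the table as a dictionary and checks a set of parameters against a plugin version, including the N * H = E constraint stated earlier. The names `PARAM_VERSIONS` and `check_params` are hypothetical helpers and are not part of the TensorRT API.

```python
# Parameter name -> plugin versions that accept it, copied from the table.
PARAM_VERSIONS = {
    "type_id": {1, 2},
    "hidden_size": {1, 2, 3},
    "num_heads": {1, 2, 3},
    "has_mask": {1, 2},
    "dq_probs": {1, 2, 3},
    "var_seqlen": {2},
}


def check_params(version, params):
    """Return a list of problems found in params for this plugin version."""
    problems = []
    for name in sorted(params):
        if name not in PARAM_VERSIONS:
            problems.append(f"unknown parameter: {name}")
        elif version not in PARAM_VERSIONS[name]:
            problems.append(f"{name} is not a version {version} parameter")
    hidden = params.get("hidden_size")
    heads = params.get("num_heads")
    # The hidden size E must split evenly into N heads of size H.
    if hidden is not None and heads is not None and hidden % heads != 0:
        problems.append("hidden_size must be divisible by num_heads")
    return problems
```

An empty list means the parameter set is consistent with the table for that version.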

Additional resources

Networks:

License

For terms and conditions for use, reproduction, and distribution, see the TensorRT Software License Agreement documentation.

Changelog

October 2020
Add v2 plugin that supports variable sequence length.
Add v3 plugin that supports int8 interleaved variable sequence length.

November 2019
This is the first release of this README.md file.

Known issues

This plugin only supports GPUs with compute capability >= 7.0. For more information, see the CUDA GPU Compute Capability Support Matrix.