This sample, samplePlugin, defines a custom layer that supports multiple data formats and demonstrates how to serialize and deserialize plugin layers. It also demonstrates how to use a fully connected plugin (`FCPlugin`) as a custom layer and how to integrate it with NvCaffeParser.
This sample implements the MNIST model (`data/samples/mnist/mnist.prototxt`), with the difference that the custom layer implements the Caffe InnerProduct layer using GEMM routines (matrix multiplication) in cuBLAS and tensor addition in cuDNN (bias offset). Normally, the Caffe InnerProduct layer can be implemented in TensorRT using the IFullyConnected layer. However, in this sample, we use `FCPlugin` for this layer as an example of how to use plugins. The sample demonstrates plugin usage through the `IPluginExt` interface and uses `nvcaffeparser1::IPluginFactoryExt` to add the plugin object to the network.
Specifically, this sample covers how to define the custom plugin and its outputs, register it with the Caffe parser through a plugin factory, configure the formats it supports when building the engine, serialize and deserialize the plugin, and manage its resources while executing the custom layer at runtime.
The `FCPlugin` redefines the InnerProduct layer, which has a single output. Accordingly, `getNbOutputs` returns `1`, and `getOutputDimensions` performs validation checks and returns the dimensions of the output.
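A minimal sketch of what these two methods can look like, assuming the plugin stores its channel counts in members named `mNbInputChannels` and `mNbOutputChannels` (these names are illustrative, not taken verbatim from the sample source):

```cpp
// Sketch of the two shape-related FCPlugin member functions (IPluginExt API).
// mNbInputChannels / mNbOutputChannels are assumed member variables.
int getNbOutputs() const override
{
    return 1; // InnerProduct produces exactly one output tensor
}

nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) override
{
    assert(index == 0 && nbInputDims == 1 && inputs[0].nbDims == 3);
    // A fully connected layer consumes all C*H*W input elements per batch item.
    assert(mNbInputChannels == inputs[0].d[0] * inputs[0].d[1] * inputs[0].d[2]);
    // The output is a vector: mNbOutputChannels channels with 1x1 spatial extent.
    return nvinfer1::Dims3(mNbOutputChannels, 1, 1);
}
```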
The model is imported using the Caffe parser (see Importing A Caffe Model Using The C++ Parser API and Using Custom Layers When Importing A Model From A Framework). To use the `FCPlugin` implementation for the InnerProduct layer, a plugin factory is defined that recognizes the name of the InnerProduct layer (`ip2` in Caffe).
The factory can then instantiate `FCPlugin` objects as directed by the parser. The `createPlugin` method receives the layer name and a set of weights extracted from the Caffe model file, which are then passed to the plugin constructor. Since the lifetime of the weights and that of the newly created plugin are decoupled, the plugin makes a copy of the weights in its constructor.
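A sketch of such a factory, under the assumption that the plugin is held in a smart-pointer member (`mPlugin`) so the factory controls its lifetime:

```cpp
#include <NvInfer.h>
#include <NvCaffeParser.h>
#include <cassert>
#include <cstring>
#include <memory>

// Illustrative plugin factory; "ip2" is the InnerProduct layer name in the MNIST prototxt.
class PluginFactory : public nvinfer1::IPluginFactory, public nvcaffeparser1::IPluginFactoryExt
{
public:
    bool isPlugin(const char* name) override { return isPluginExt(name); }

    bool isPluginExt(const char* name) override
    {
        return std::strcmp(name, "ip2") == 0; // intercept only the InnerProduct layer
    }

    // Called by the Caffe parser with the weights read from the model file.
    nvinfer1::IPlugin* createPlugin(const char* layerName, const nvinfer1::Weights* weights,
                                    int nbWeights) override
    {
        assert(isPluginExt(layerName));
        // FCPlugin copies the weights in its constructor, because their
        // lifetime is not tied to the parser's.
        mPlugin = std::unique_ptr<FCPlugin>(new FCPlugin(weights, nbWeights));
        return mPlugin.get();
    }

    // The deserialization overload of createPlugin() is shown later, in the
    // discussion of deserialization.

private:
    std::unique_ptr<FCPlugin> mPlugin;
};
```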
`FCPlugin` does not need any scratch space; therefore, for building the engine, the most important methods deal with the formats supported and the configuration. `FCPlugin` supports two formats: NCHW in both single and half precision, as defined in the `supportsFormat` method.
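A sketch of what that check can look like (`PluginFormat::kNCHW` denotes the linear NCHW layout):

```cpp
// Sketch of an FCPlugin member: report support for NCHW in FP32 and FP16 only.
bool supportsFormat(nvinfer1::DataType type, nvinfer1::PluginFormat format) const override
{
    return (type == nvinfer1::DataType::kFLOAT || type == nvinfer1::DataType::kHALF)
        && format == nvinfer1::PluginFormat::kNCHW;
}
```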
Supported configurations are selected during the build phase. The builder selects a configuration by calling the plugin's `configureWithFormat()` method, giving the plugin a chance to select an algorithm based on its inputs. In this example, the inputs are checked to ensure they are in a supported format, and the selected format is recorded in a member variable. No other information needs to be stored in this simple case; in more complex cases, you may need to store additional state or even choose an ad-hoc algorithm for the given configuration.
The configuration takes place at build time; therefore, any information or state determined here that is required at runtime should be stored as a member variable of the plugin, and serialized and deserialized.
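A sketch of such a `configureWithFormat()` implementation; the only state this simple plugin needs to remember is the data type chosen by the builder, stored here in an assumed member `mDataType`:

```cpp
// Sketch of an FCPlugin member: validate the builder's choice and remember
// the data type for use at runtime.
void configureWithFormat(const nvinfer1::Dims* inputDims, int nbInputs,
                         const nvinfer1::Dims* outputDims, int nbOutputs,
                         nvinfer1::DataType type, nvinfer1::PluginFormat format,
                         int maxBatchSize) override
{
    // Only combinations reported by supportsFormat() should ever arrive here.
    assert((type == nvinfer1::DataType::kFLOAT || type == nvinfer1::DataType::kHALF)
        && format == nvinfer1::PluginFormat::kNCHW);
    mDataType = type; // needed again at runtime, so it must also be serialized
}
```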
Fully compliant plugins support serialization and deserialization, as described in Serializing A Model In C++. In this example, `FCPlugin` stores the number of channels and weights, the format selected, and the actual weights. The combined size of these variables makes up the size of the serialized image, which is returned by `getSerializationSize`.
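A sketch of such a size computation, assuming a small helper `typeSize()` that maps a `DataType` to its size in bytes:

```cpp
// Sketch of an FCPlugin member: the serialized image holds the fixed-size
// members plus the raw weight and bias values.
size_t getSerializationSize() override
{
    return sizeof(mNbInputChannels) + sizeof(mNbOutputChannels) + sizeof(mDataType)
         + (mKernelWeights.count + mBiasWeights.count) * typeSize(mDataType);
}
```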
Eventually, when the engine is serialized, these variables are serialized, the weights are converted if needed, and everything is written to a buffer.
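A sketch of the corresponding `serialize()` method; `write()` and `serializeWeights()` are assumed helpers that copy a value into the buffer and advance the write pointer:

```cpp
// Sketch of an FCPlugin member: members are written in a fixed order that
// deserialization must mirror exactly.
void serialize(void* buffer) override
{
    char* d = static_cast<char*>(buffer);
    write(d, mNbInputChannels);
    write(d, mNbOutputChannels);
    write(d, mDataType);
    // The helper converts the weights to mDataType if needed before copying.
    serializeWeights(d, mKernelWeights);
    serializeWeights(d, mBiasWeights);
    assert(d == static_cast<char*>(buffer) + getSerializationSize());
}
```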
Then, when the engine is deployed, it is deserialized. As the runtime scans the serialized image, whenever a plugin image is encountered, the runtime creates a new plugin instance via the factory. The plugin object created during deserialization (shown below using `new`) is destroyed when the engine is destroyed, by calling `FCPlugin::destroy()`.
In the same order as during serialization, the variables are read and their values restored. In addition, at this point the weights have already been converted to the selected format and can be stored directly on the device.
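A sketch of the factory's deserialization path; this is the second `createPlugin` overload, inherited from `nvinfer1::IPluginFactory`:

```cpp
// Sketch of a PluginFactory member: called by the runtime when it encounters
// the plugin's serialized image.
nvinfer1::IPlugin* createPlugin(const char* layerName, const void* serialData,
                                size_t serialLength) override
{
    assert(isPluginExt(layerName));
    // The deserializing constructor reads the members back in the order they
    // were written; the weights are already in the selected format.
    mPlugin = std::unique_ptr<FCPlugin>(new FCPlugin(serialData, serialLength));
    return mPlugin.get();
}
```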
Before a custom layer is executed, the plugin is initialized. This is where resources that are held for the lifetime of the plugin can be acquired and initialized. In this example, the weights are kept in CPU memory at first so that, during the build phase, for each configuration tested, the weights can be converted to the desired format and then copied to the device when the plugin is initialized. The `initialize` method creates the required cuBLAS and cuDNN handles, sets up tensor descriptors, allocates device memory, and copies the weights to device memory. Conversely, `terminate` destroys the handles and frees the memory allocated on the device.
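A sketch of this resource handling; `CHECK()` and `copyToDevice()` are assumed error-checking and upload helpers:

```cpp
// Sketch of FCPlugin members: acquire long-lived resources in initialize(),
// release them in terminate().
int initialize() override
{
    CHECK(cudnnCreate(&mCudnn));   // cuDNN handle, used for the bias addition
    CHECK(cublasCreate(&mCublas)); // cuBLAS handle, used for the GEMM
    CHECK(cudnnCreateTensorDescriptor(&mSrcDescriptor));
    CHECK(cudnnCreateTensorDescriptor(&mDstDescriptor));
    // Upload the weights (already converted to the configured type).
    mDeviceKernel = copyToDevice(mKernelWeights.values, mKernelWeights.count * typeSize(mDataType));
    mDeviceBias   = copyToDevice(mBiasWeights.values, mBiasWeights.count * typeSize(mDataType));
    return 0;
}

void terminate() override
{
    CHECK(cudnnDestroyTensorDescriptor(mDstDescriptor));
    CHECK(cudnnDestroyTensorDescriptor(mSrcDescriptor));
    CHECK(cublasDestroy(mCublas));
    CHECK(cudnnDestroy(mCudnn));
    cudaFree(mDeviceKernel);
    cudaFree(mDeviceBias);
}
```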
The core of the plugin is `enqueue`, which is used to execute the custom layer at runtime. The call parameters include the actual batch size, inputs, and outputs. The handles for cuBLAS and cuDNN operations are placed on the given stream; then, according to the data type and format configured, the plugin executes in single or half precision.
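A sketch of the single-precision path through `enqueue()` (the half-precision branch would use `cublasHgemm` and `CUDNN_DATA_HALF` instead):

```cpp
// Sketch of an FCPlugin member for the FP32 case: one GEMM for the whole
// batch, followed by a broadcast bias addition through cuDNN.
int enqueue(int batchSize, const void* const* inputs, void** outputs,
            void* /*workspace*/, cudaStream_t stream) override
{
    float kOne = 1.0f, kZero = 0.0f;
    // Put both libraries on the stream TensorRT hands us.
    cublasSetStream(mCublas, stream);
    cudnnSetStream(mCudnn, stream);

    // outputs = W^T * inputs, computed for all batch items at once.
    cublasSgemm(mCublas, CUBLAS_OP_T, CUBLAS_OP_N,
                mNbOutputChannels, batchSize, mNbInputChannels, &kOne,
                static_cast<const float*>(mDeviceKernel), mNbInputChannels,
                static_cast<const float*>(inputs[0]), mNbInputChannels, &kZero,
                static_cast<float*>(outputs[0]), mNbOutputChannels);

    // Broadcast-add the bias vector over the batch.
    cudnnSetTensor4dDescriptor(mSrcDescriptor, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, mNbOutputChannels, 1, 1);
    cudnnSetTensor4dDescriptor(mDstDescriptor, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               batchSize, mNbOutputChannels, 1, 1);
    cudnnAddTensor(mCudnn, &kOne, mSrcDescriptor, mDeviceBias,
                   &kOne, mDstDescriptor, outputs[0]);
    return 0;
}
```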
Note: The two handles are part of the plugin object, therefore, the same engine cannot be executed concurrently on multiple streams. In order to enable multiple streams of execution, plugins must be re-entrant and handle stream-specific data accordingly.
The plugin object created in the sample is cloned by each of the network, builder, and engine by calling the `FCPlugin::clone()` method. The `clone()` method calls the plugin constructor and can also clone plugin parameters, if necessary.
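A sketch of what that `clone()` can look like; the constructor arguments here are illustrative, the point being that a brand-new `FCPlugin` with its own copy of any parameters is handed back:

```cpp
// Sketch of an FCPlugin member: each caller gets an independent plugin instance.
FCPlugin* clone() const
{
    // The constructor copies the weights, so the clone owns its own parameters.
    return new FCPlugin(mKernelWeights, mBiasWeights, mDataType);
}
```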
The cloned plugin objects are deleted when the network, builder, or engine is destroyed. This is done by invoking the `FCPlugin::destroy()` method:
```cpp
void destroy() { delete this; }
```
In this sample, the following layers are used. For more information about these layers, see the TensorRT Developer Guide: Layers documentation.
Activation layer: The Activation layer implements element-wise activation functions. Specifically, this sample uses the Activation layer with the type `kRELU`.
Convolution layer: The Convolution layer computes a 2D (channel, height, and width) convolution, with or without bias.
FullyConnected layer: The FullyConnected layer implements a matrix-vector product, with or without bias.
Pooling layer: The Pooling layer implements pooling within a channel. Supported pooling types are `maximum`, `average`, and `maximum-average blend`.
Scale layer: The Scale layer implements a per-tensor, per-channel, or per-element affine transformation and/or exponentiation by constant values.
SoftMax layer: The SoftMax layer applies the SoftMax function on the input tensor along an input dimension specified by the user.
Compile this sample by running `make` in the `<TensorRT root directory>/samples/samplePlugin` directory. The binary named `sample_plugin` will be created in the `<TensorRT root directory>/bin` directory.
```sh
cd <TensorRT root directory>/samples/samplePlugin
make
```
Where `<TensorRT root directory>` is where you installed TensorRT.

Verify that the sample ran successfully. If the sample runs successfully you should see output similar to the following:
```
&&&& RUNNING TensorRT.sample_plugin # ./build/x86_64-linux/sample_plugin
[I] [TRT] Detected 1 input and 1 output network tensors.
[I] Input:
(ASCII-art rendering of the input digit omitted)
[I] Output:
0:
1:
2:
3:
4:
5:
6:
7:
8:
9: **********
&&&& PASSED TensorRT.sample_plugin # ./build/x86_64-linux/sample_plugin
```
This output shows that the sample ran successfully; `PASSED`.
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option.
The following resources provide a deeper understanding about samplePlugin:
- Models
- Documentation
For terms and conditions for use, reproduction, and distribution, see the TensorRT Software License Agreement documentation.
February 2019: This is the first release of this `README.md` file.
There are no known issues in this sample.