This sample, sampleINT8, performs INT8 calibration and inference.
Specifically, this sample demonstrates how to perform inference in 8-bit integer (INT8). INT8 inference is available only on GPUs with compute capability 6.1 or 7.x. After the network is calibrated for execution in INT8, the output of the calibration is cached to avoid repeating the process. You can then reproduce your own experiments with any deep learning framework in order to validate your results on ImageNet networks.
INT8 engines are built from 32-bit network definitions, similarly to 32-bit and 16-bit engines, but with more configuration steps. In particular, the builder and network must be configured to use INT8, which requires per-tensor dynamic ranges. The INT8 calibrator can determine how best to represent weights and activations as 8-bit integers and sets the per-tensor dynamic ranges accordingly. Alternatively, you can set custom per-tensor dynamic ranges; this is covered in sampleINT8API.
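As an illustration of the manual alternative, a minimal sketch of setting a tensor's dynamic range directly on the network definition (the tensor chosen and the range value are hypothetical placeholders; in practice the range would come from your own quantization analysis):

```
// Hypothetical example: manually set the dynamic range of one tensor.
float maxAbs = 2.0f; // placeholder value from external analysis
nvinfer1::ITensor* tensor = network->getLayer(0)->getOutput(0);
tensor->setDynamicRange(-maxAbs, maxAbs);
```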
This sample requires the MNIST training set. Download this script into the `TensorRT-x.x.x.x/data/mnist` directory, where `x.x.x.x` is your installed version of TensorRT, and execute the script to download the required data. The packaged MNIST model that is shipped with this sample is based on lenet.prototxt. For more information, see the MNIST BVLC Caffe example.
The data set used by this sample is based on the MNIST data set. The batch file generation from the above data set is described in Batch files for calibration.
Specifically, this sample performs the following steps:

- Defines the network using the Caffe parser
- Sets up calibration data and the INT8 calibrator
- Configures the builder for INT8
- Builds the engine
- Runs the engine
- Verifies the output
Defining a network for INT8 execution is exactly the same as for any other precision. Weights should be imported as FP32 values, and the builder will calibrate the network to find appropriate quantization factors to reduce the network to INT8 precision. This sample imports the network using `NvCaffeParser`:
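(A minimal sketch; `mParams` fields and `locateFile` follow the conventions used throughout the TensorRT samples.)

```
// Parse the Caffe deploy file and weights; weights are imported as FP32.
const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor = parser->parse(
    locateFile(mParams.prototxtFileName, mParams.dataDirs).c_str(),
    locateFile(mParams.weightsFileName, mParams.dataDirs).c_str(),
    *network, nvinfer1::DataType::kFLOAT);
```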
Calibration is an additional step required when building networks for INT8. The application must provide TensorRT with sample input, in other words, calibration data. TensorRT will then perform inference in FP32 and gather statistics about intermediate activation layers that it will use to build the reduced precision INT8 engine.
Calibration must be performed using images representative of those which will be used at runtime. Since the sample is based around Caffe, any image preprocessing that Caffe would perform prior to running the network (such as scaling, cropping, or mean subtraction) is done in Caffe and captured as a set of files. The sample uses a utility class (`MNISTBatchStream`) to read these files and create appropriate input for calibration. Generation of these files is discussed in Batch files for calibration.
You can create a calibration data stream (`calibrationStream`), for example:
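(A sketch based on the sample's conventions; the constructor arguments are the calibration batch size, the number of calibration batches, the raw MNIST file names, and the data directories.)

```
// Reads the raw MNIST files and serves nbCalBatches batches of
// calBatchSize images each for calibration.
MNISTBatchStream calibrationStream(mParams.calBatchSize, mParams.nbCalBatches,
    "train-images-idx3-ubyte", "train-labels-idx1-ubyte", mParams.dataDirs);
```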
The `MNISTBatchStream` class provides helper methods used to retrieve batch data. The batch stream object is used by the calibrator to retrieve batch data while calibrating. In general, a `BatchStream` class should provide implementations for `getBatch()` and `getBatchSize()`, which can be invoked by `IInt8Calibrator::getBatch()` and `IInt8Calibrator::getBatchSize()`. Ideally, you can write your own custom `BatchStream` class to serve calibration data. For more information, see `BatchStream.h`.
Note: The calibration data must be representative of the input provided to TensorRT at runtime; for example, for image classification networks, it should not consist of images from just a small subset of categories. For ImageNet networks, around 500 calibration images is adequate.
The application must implement the `IInt8Calibrator` interface to provide calibration data and helper methods for reading and writing the calibration table file.

We can create the calibrator object (`calibrator`), for example: `std::unique_ptr<IInt8Calibrator> calibrator;`

TensorRT provides four implementations of `IInt8Calibrator`:

1. `IInt8EntropyCalibrator`
2. `IInt8EntropyCalibrator2`
3. `IInt8MinMaxCalibrator`
4. `IInt8LegacyCalibrator`
See `NvInfer.h` for more information on the `IInt8Calibrator` interface variants.
This sample uses `IInt8EntropyCalibrator2` by default. We can set the calibrator interface to use `IInt8EntropyCalibrator2` as shown below.
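(A sketch, assuming the helper class `Int8EntropyCalibrator2` from the sample's `EntropyCalibrator.h` and the parameter names used earlier in this README.)

```
// Wrap the batch stream in an entropy calibrator; the input tensor name
// tells the calibrator which binding to fill with calibration data.
calibrator.reset(new Int8EntropyCalibrator2<MNISTBatchStream>(
    calibrationStream, 0 /* first batch */, networkName.c_str(),
    mParams.inputTensorNames[0].c_str()));
```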
where `calibrationStream` is an `MNISTBatchStream` object. The calibrator object should be configured to use the calibration batch stream.
In order to perform calibration, the interface must provide implementations for `getBatchSize()` and `getBatch()` to retrieve data from the `BatchStream` object.
The builder calls the `getBatchSize()` method once, at the start of calibration, to obtain the batch size for the calibration set. The `getBatch()` method is then called repeatedly to obtain batches from the application, until the method returns false. Every calibration batch must include exactly the number of images specified as the batch size.
For each input tensor, a pointer to input data in GPU memory must be written into the `bindings` array. The `names` array contains the names of the input tensors. The position for each tensor in the `bindings` array matches the position of its name in the `names` array. Both arrays have size `nbBindings`.
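A minimal sketch of these two methods, assuming the calibrator owns a batch stream (`mStream`) and a preallocated device buffer (`mDeviceInput` of `mInputCount` floats), loosely following the sample's `EntropyCalibrator.h`:

```
int getBatchSize() const noexcept override
{
    return mStream.getBatchSize();
}

bool getBatch(void* bindings[], const char* names[], int nbBindings) noexcept override
{
    if (!mStream.next())
        return false; // No more calibration data; calibration stops here.

    // Copy the host-side batch into GPU memory and expose it via bindings[0].
    cudaMemcpy(mDeviceInput, mStream.getBatch(), mInputCount * sizeof(float),
        cudaMemcpyHostToDevice);
    assert(!strcmp(names[0], mInputBlobName));
    bindings[0] = mDeviceInput;
    return true;
}
```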
Since the calibration step is time consuming, you can choose to provide an implementation of `writeCalibrationCache()` to write the calibration table to an appropriate location for use in later runs, and implement the `readCalibrationCache()` method to read the calibration table file back from that location.
During calibration, the builder checks whether a calibration file already exists using `readCalibrationCache()`. The builder re-calibrates only if the calibration file does not exist or is incompatible with the current TensorRT version or the calibrator variant with which it was generated.
For more information on implementing the `IInt8Calibrator` interface, see `EntropyCalibrator.h`.
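As a rough sketch of the two caching hooks (member names such as `mCacheFile`, `mCalibrationCache`, and `mReadCache` are assumptions modeled on the sample's `EntropyCalibrator.h`; `<fstream>` and `<iterator>` are assumed to be included):

```
const void* readCalibrationCache(size_t& length) noexcept override
{
    // Load the cached calibration table from disk, if present and enabled.
    mCalibrationCache.clear();
    std::ifstream input(mCacheFile, std::ios::binary);
    input >> std::noskipws;
    if (mReadCache && input.good())
    {
        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(),
            std::back_inserter(mCalibrationCache));
    }
    length = mCalibrationCache.size();
    return length ? mCalibrationCache.data() : nullptr;
}

void writeCalibrationCache(const void* cache, size_t length) noexcept override
{
    // Persist the calibration table so later builds can skip calibration.
    std::ofstream output(mCacheFile, std::ios::binary);
    output.write(reinterpret_cast<const char*>(cache), length);
}
```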
A calibration file stores activation scales for each network tensor. Activation scales are calculated from the dynamic range generated by the calibration algorithm, in other words, `abs(max_dynamic_range) / 127.0f`.
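For example, if calibration determines a dynamic range of ±2.0 for a tensor, the stored scale would be `2.0f / 127.0f`, approximately `0.0157`.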
The calibration file is called `CalibrationTable<NetworkName>`, where `<NetworkName>` is the name of your network, for example `mnist`. The file is located in the `TensorRT-x.x.x.x/data/mnist` directory, where `x.x.x.x` is your installed version of TensorRT.
If the `CalibrationTable` file is not found, the builder will run the calibration algorithm again to create it. The `CalibrationTable` contents include:
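(The following excerpt is illustrative only; the tensor names and hexadecimal scale values are hypothetical, not taken from a real run. The header line format matches the calibration cache name seen in the sample output below.)

```
TRT-7000-EntropyCalibration2
data: 3c008912
conv1: 3c88edfc
conv2: 3ddc858b
prob: 3c010a14
```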
Where:

- `<TRT-xxxx>-<xxxxxxx>`: The TensorRT version followed by the calibration algorithm, for example, `EntropyCalibration2`.
- `<layer name> : value`: Corresponds to the floating point activation scale determined during calibration for each tensor in the network.

The `CalibrationTable` file is generated during the build phase while running the calibration algorithm. After the calibration file is created, it can be read for subsequent runs without running the calibration again. You can provide an implementation of `readCalibrationCache()` to load the calibration file from a desired location. If the read calibration file is compatible with the calibrator type (which was used to generate the file) and the TensorRT version, the builder skips the calibration step and instead uses the per-tensor scale values from the calibration file.
The builder is then configured to build an INT8 engine, as shown below.

```
// Limit kernel timing iterations to speed up the build.
config->setAvgTimingIterations(1);
config->setMinTimingIterations(1);
config->setMaxWorkspaceSize(1_GiB);
// Enable INT8 mode and attach the calibrator created earlier.
config->setFlag(BuilderFlag::kINT8);
builder->setMaxBatchSize(mParams.batchSize);
config->setInt8Calibrator(calibrator.get());
```
After we configure the builder, we can build the engine similarly to building any FP32 engine:

```
mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
    builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());
```
After the engine has been built, it can be used just like an FP32 engine; inputs and outputs remain in 32-bit floating point.
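For instance, a minimal sketch of running inference with the INT8 engine (assuming `bindings` holds device pointers for the FP32 input and output buffers, and reusing `samplesCommon::InferDeleter` from above):

```
// The inference API is identical to the FP32 case; INT8 is internal to the engine.
auto context = std::unique_ptr<nvinfer1::IExecutionContext, samplesCommon::InferDeleter>(
    mEngine->createExecutionContext());
bool ok = context->execute(mParams.batchSize, bindings);
```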
This sample outputs Top-1 and Top-5 metrics for both FP32 and INT8 precision, as well as for FP16 if it is natively supported by the hardware. These numbers should be within 1% of each other.
In this sample, the following layers are used. For more information about these layers, see the TensorRT Developer Guide: Layers documentation.

- **Activation layer:** The Activation layer implements element-wise activation functions.
- **Convolution layer:** The Convolution layer computes a 2D (channel, height, and width) convolution, with or without bias.
- **FullyConnected layer:** The FullyConnected layer implements a matrix-vector product, with or without bias.
- **SoftMax layer:** The SoftMax layer applies the SoftMax function on the input tensor along an input dimension specified by the user.
To set up the sample:

1. Download the MNIST dataset.
2. Unzip the dataset using the `gunzip` utility. For example, `gunzip t10k-labels-idx1-ubyte.gz`.
3. Copy the unzipped files into the `<TensorRT root directory>/samples/data/int8/mnist/` directory.

Compile this sample by running `make` in the `<TensorRT root directory>/samples/sampleINT8` directory. The binary named `sample_int8` will be created in the `<TensorRT root directory>/bin` directory.
```
cd <TensorRT root directory>/samples/sampleINT8
make
```

Where `<TensorRT root directory>` is where you installed TensorRT.
Run the sample:

```
./sample_int8
```

Verify that the sample ran successfully. If the sample runs successfully, you should see output similar to the following:
```
&&&& RUNNING TensorRT.sample_int8 # ./sample_int8
[I] Building and running a GPU inference engine for INT8 sample
[I] FP32 run:1800 batches of size 32 starting at 16
[I] [TRT] Detected 1 inputs and 1 output network tensors.
[I] Processing next set of max 100 batches
...
[I] Processing next set of max 100 batches
[I] Top1: 0.998646, Top5: 1
[I] Processing 57600 images averaged 0.00262977 ms/image and 0.0834112 ms/batch.
[I] FP16 run:1800 batches of size 32 starting at 16
[I] [TRT] Detected 1 inputs and 1 output network tensors.
[I] Processing next set of max 100 batches
...
[I] Processing next set of max 100 batches
[I] Top1: 0.998646, Top5: 1
[I] Processing 57600 images averaged 0.00298152 ms/image and 0.0945682 ms/batch.
[I] INT8 run:1800 batches of size 32 starting at 16
[I] [TRT] Detected 1 inputs and 1 output network tensors.
[I] [TRT] Starting Calibration with batch size 50.
[I] [TRT] Calibrated batch 0 in 0.00211989 seconds.
...
[I] [TRT] Calibrated batch 9 in 0.00207574 seconds.
[I] [TRT] Post Processing Calibration data in 0.180447 seconds.
[I] [TRT] Calibration completed in 0.317902 seconds.
[I] [TRT] Writing Calibration Cache for calibrator: TRT-7000-EntropyCalibration2
[I] [TRT] Detected 1 inputs and 1 output network tensors.
[I] Processing next set of max 100 batches
...
[I] Processing next set of max 100 batches
[I] Top1: 0.998576, Top5: 1
[I] Processing 57600 images averaged 0.00227856 ms/image and 0.0722715 ms/batch.
&&&& PASSED TensorRT.sample_int8 # ./sample_int8
```
This output shows that the sample ran successfully; `PASSED`.
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option.
The following resources provide a deeper understanding of how to perform inference in INT8 using custom calibration: the other INT8 samples shipped with TensorRT (such as sampleINT8API), pretrained models, and NVIDIA's blogs, videos, and developer documentation on INT8 inference.
For terms and conditions for use, reproduction, and distribution, see the TensorRT Software License Agreement documentation.
December 2019: This is the second release of this `README.md` file.
There are no known issues in this sample.