TensorRT이용한 Xavier DLA (NVDLA) 실행
공식 문서의 챕터6을 토대로 실행본 것을 정리함.
Chapter 6: Working with DLA
trtexec에서 사용 가능한 option
- --avgRuns=100
- --deploy=resnet50.prototxt
- --fp16
- --batch=8
- --iterations=10000
- --output=prob
- --useDLACore=1
--useSpinWait: multi-process execution을 할 때 multiple inference sessions이 in parallel하게 실행되면 성능 저하를 발생 시킨다. 이러한 문제를 해결하기 위한 한 방법이
cudaEventBlockingSync
대신에--useSpinWait
를 사용해서 spin-wait based synchronization을 수행 하는 것이다.--allowGPUFallback: 아래 네 가지 경우에 대해서 DLA로 실행 불가능 할 때 GPU로 동작
- Thelayer operation is not supported on DLA.
- The parameters specified are out of supported range for DLA.
- The given batch size exceeds the maximum permissible DLA batch size. For more information, see DLA Supported Layers.
- A combination of layers in the network causes the internal state to exceed what the DLA is capable of supporting.
Multiple instances of trtexec can be launched simultaneously in this fashion for concurrent execution of the GPU and DLA’s.
- DLA supports a maximum batch size of 32 depending on the network, while the GPU can run higher batch sizes concurrently.
6.2. DLA Supported Layers
This section lists the layers supported by DLA along with the constraints associated with each layer.
Generic restrictions while running on DLA (applicable to all layers)
- Max batch size supported is 32.
- Note: Batch size for DLA is the product of all index dimensions except the CHW dimensions. For example, if input dimensions are NPQRS, the effective batch size is N*P.
- Input and output tensor data format should be FP16.
Layer specific restrictions
- Convolution, Deconvolution, and Fully Connected Layers
- Convolution and Deconvolution Layers
- Width and height of kernel size must be in the range [1, 32]
- Width and height of padding must be in the range [0, 31]
- Width and height of stride must be in the range [1,8] for Convolution Layer and [1,32] for Deconvolution layer
- Number of output maps must be in the range [1, 8192]
- Axis must be 1
- Grouped and dilated convolution supported. Dilation values must be in the range [1,32]
- Pooling Layer
- Operations supported: kMIN, kMAX, kAVERAGE
- Width and height of the window size must be in the range [1, 8]
- Width and height of padding must be in the range [0, 7]
- Width and height of stride must be in the range [1, 16]
- Activation Layer
- Functions supported: ReLU, Sigmoid, Hyperbolic Tangent
- Negative slope not supported for ReLU
- ElementWise Layer
- Operations supported: Sum, Product, Max, and Min
- Scale Layer
- Mode supported: Uniform, Per-Channel, and Elementwise
- LRN (Local Response Normalization) Layer
- Window size is configurable to 3, 5, 7, or 9
- Normalization region supported is: ACROSS_CHANNELS
- Concatenation Layer
- DLA supports concatenation only along the channel axis
https://docs.nvidia.com/deeplearning/sdk/pdf/TensorRT-Release-Notes.pdf
DLA: Activiation, Concatenation, Convolution, Deconvolution, ElementWise, FullyConnected, LRN, Poolling, and Scale.
Alexnet, GoogleNet, ResNet-50, LeNet for MNIST.
trtexec, --useDLACore=N
0
or 1
, --fp16
MNIST에 대해서는
./trtexec --deploy=data/mnist/mnist.prototxt --output=prob --useDLACore=0 --
fp16 --allowGPUFallback
trtexec는 아직 ONNX model을 DLA에서는 지원하지 않음
AlexNet 실행 결과
With DLA
sudo ./trtexec --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16 --useDLACore=1
deploy: /usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
fp16
useDLACore: 1
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 10 runs is 6.7926 ms (host walltime is 6.98827 ms, 99% percentile time is 11.3653).
Average over 10 runs is 6.27773 ms (host walltime is 6.415 ms, 99% percentile time is 6.33238).
Average over 10 runs is 6.34808 ms (host walltime is 6.51361 ms, 99% percentile time is 6.4911).
Average over 10 runs is 6.44476 ms (host walltime is 6.61664 ms, 99% percentile time is 6.55571).
Average over 10 runs is 6.46809 ms (host walltime is 6.63325 ms, 99% percentile time is 6.5311).
Average over 10 runs is 6.39661 ms (host walltime is 6.61909 ms, 99% percentile time is 6.51162).
Average over 10 runs is 6.3873 ms (host walltime is 6.92348 ms, 99% percentile time is 6.56896).
Average over 10 runs is 6.3655 ms (host walltime is 6.60092 ms, 99% percentile time is 6.47475).
Average over 10 runs is 6.42326 ms (host walltime is 6.57476 ms, 99% percentile time is 6.58432).
Average over 10 runs is 6.4514 ms (host walltime is 6.60459 ms, 99% percentile time is 6.53622).
Without DLA
sudo ./trtexec --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16 --useDLACore=0
deploy: /usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
fp16
useDLACore: 0
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 10 runs is 6.37521 ms (host walltime is 6.55121 ms, 99% percentile time is 7.07267).
Average over 10 runs is 6.43402 ms (host walltime is 6.61367 ms, 99% percentile time is 6.53213).
Average over 10 runs is 6.59314 ms (host walltime is 6.7674 ms, 99% percentile time is 6.8311).
Average over 10 runs is 6.46851 ms (host walltime is 6.63689 ms, 99% percentile time is 6.53619).
Average over 10 runs is 6.49032 ms (host walltime is 6.67388 ms, 99% percentile time is 6.56176).
Average over 10 runs is 6.50446 ms (host walltime is 6.66483 ms, 99% percentile time is 6.76147).
Average over 10 runs is 6.47384 ms (host walltime is 6.66675 ms, 99% percentile time is 6.61805).
Average over 10 runs is 6.45161 ms (host walltime is 6.63162 ms, 99% percentile time is 6.62525).
Average over 10 runs is 6.59969 ms (host walltime is 6.9295 ms, 99% percentile time is 8.73366).
Average over 10 runs is 6.48038 ms (host walltime is 6.63838 ms, 99% percentile time is 6.52291).
GPU-INT8
sudo ./trtexec --avgRuns=100 --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --batch=8 --iterations=10 --int8 --useSpinWait
avgRuns: 100
deploy: /usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
batch: 8
iterations: 10
int8
useSpinWait
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 100 runs is 19.2004 ms (host walltime is 19.2577 ms, 99% percentile time is 27.2834).
Average over 100 runs is 4.89694 ms (host walltime is 4.91578 ms, 99% percentile time is 5.99552).
Average over 100 runs is 4.86918 ms (host walltime is 4.88175 ms, 99% percentile time is 4.96976).
Average over 100 runs is 4.87189 ms (host walltime is 4.8947 ms, 99% percentile time is 5.45152).
Average over 100 runs is 4.8832 ms (host walltime is 4.89491 ms, 99% percentile time is 6.77392).
Average over 100 runs is 4.87852 ms (host walltime is 4.89489 ms, 99% percentile time is 6.14464).
Average over 100 runs is 4.88434 ms (host walltime is 4.89542 ms, 99% percentile time is 6.13763).
Average over 100 runs is 4.86638 ms (host walltime is 4.87723 ms, 99% percentile time is 4.8999).
Average over 100 runs is 4.87463 ms (host walltime is 4.89783 ms, 99% percentile time is 5.61664).
Average over 100 runs is 4.87941 ms (host walltime is 4.89598 ms, 99% percentile time is 6.0657).
GPU-FP16
usr/src/tensorrt/bin$ ./trtexec --avgRuns=100 --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --batch=8 --iterations=10 --fp16 --useSpinWait
avgRuns: 100
deploy: /usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
batch: 8
iterations: 10
fp16
useSpinWait
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 100 runs is 19.9483 ms (host walltime is 19.9793 ms, 99% percentile time is 26.7212).
Average over 100 runs is 17.2459 ms (host walltime is 17.2622 ms, 99% percentile time is 26.3752).
Average over 100 runs is 5.80046 ms (host walltime is 5.81265 ms, 99% percentile time is 8.1817).
Average over 100 runs is 5.09872 ms (host walltime is 5.11733 ms, 99% percentile time is 6.23514).
Average over 100 runs is 5.10308 ms (host walltime is 5.11465 ms, 99% percentile time is 6.37568).
Average over 100 runs is 5.1078 ms (host walltime is 5.12036 ms, 99% percentile time is 6.38995).
Average over 100 runs is 5.09496 ms (host walltime is 5.11859 ms, 99% percentile time is 5.84192).
Average over 100 runs is 5.10623 ms (host walltime is 5.11759 ms, 99% percentile time is 6.30518).
Average over 100 runs is 5.08028 ms (host walltime is 5.09227 ms, 99% percentile time is 5.1617).
Average over 100 runs is 5.0951 ms (host walltime is 5.11809 ms, 99% percentile time is 5.82054).
Avgrun 100
으로 변경
DLA=1
nvidia@jetson-0423718017159:/usr/src/tensorrt/bin$ sudo ./trtexec --avgRuns=100 --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --batch=8 --iterations=10000 --fp16 --useDLACore=1
avgRuns: 100
deploy: /usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
batch: 8
iterations: 10000
fp16
useDLACore: 1
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 100 runs is 43.3741 ms (host walltime is 43.542 ms, 99% percentile time is 48.9472).
Average over 100 runs is 43.452 ms (host walltime is 43.6448 ms, 99% percentile time is 46.4364).
Average over 100 runs is 43.381 ms (host walltime is 43.5552 ms, 99% percentile time is 43.9859).
Average over 100 runs is 43.3211 ms (host walltime is 43.4982 ms, 99% percentile time is 43.649).
Average over 100 runs is 43.3026 ms (host walltime is 43.561 ms, 99% percentile time is 43.691).
Average over 100 runs is 43.2974 ms (host walltime is 43.4682 ms, 99% percentile time is 43.607).
Average over 100 runs is 43.2615 ms (host walltime is 43.4367 ms, 99% percentile time is 43.5681).
Average over 100 runs is 43.3321 ms (host walltime is 43.5066 ms, 99% percentile time is 43.6695).
Average over 100 runs is 43.3151 ms (host walltime is 43.4987 ms, 99% percentile time is 44.1478).
Average over 100 runs is 43.2724 ms (host walltime is 43.4773 ms, 99% percentile time is 43.5927).
DLA=0
nvidia@jetson-0423718017159:/usr/src/tensorrt/bin$ sudo ./trtexec --avgRuns=100 --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --batch=8 --iterations=10 --fp16 --useDLACore=0
avgRuns: 100
deploy: /usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
batch: 8
iterations: 10
fp16
useDLACore: 0
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 100 runs is 43.958 ms (host walltime is 44.1817 ms, 99% percentile time is 46.4005).
Average over 100 runs is 43.8673 ms (host walltime is 44.0831 ms, 99% percentile time is 44.4324).
Average over 100 runs is 43.82 ms (host walltime is 44.0434 ms, 99% percentile time is 44.3187).
Average over 100 runs is 43.8821 ms (host walltime is 44.1185 ms, 99% percentile time is 46.3073).
Average over 100 runs is 43.8871 ms (host walltime is 44.0772 ms, 99% percentile time is 50.0316).
Average over 100 runs is 43.8578 ms (host walltime is 44.0774 ms, 99% percentile time is 44.4027).
Average over 100 runs is 43.8086 ms (host walltime is 44.0573 ms, 99% percentile time is 45.2116).
Average over 100 runs is 43.8388 ms (host walltime is 44.0205 ms, 99% percentile time is 46.1998).
Average over 100 runs is 43.8302 ms (host walltime is 44.0014 ms, 99% percentile time is 44.331).
Average over 100 runs is 43.8718 ms (host walltime is 44.0618 ms, 99% percentile time is 46.211).
ResNet-50을 이용한 테스트
모델 다운로드
https://github.com/KaimingHe/deep-residual-networks
https://developer.nvidia.com/embedded/jetson-agx-xavier-dl-inference-benchmarks
For GPU
$ ./trtexec --avgRuns=100 --deploy=resnet50.prototxt --int8 --batch=8 --iterations=10000 --output=prob --useSpinWait
For DLA (Core 0)
$ ./trtexec --avgRuns=100 --deploy=resnet50.prototxt --fp16 --batch=8 --iterations=10000 --output=prob --useDLACore=0 --useSpinWait --allowGPUFallback
For DLA (Core 1)
$ ./trtexec --avgRuns=100 --deploy=resnet50.prototxt --fp16 --batch=8 --iterations=10000 --output=prob --useDLACore=1 --useSpinWait --allowGPUFallback
실험결과
GPU 실행 FP16
./trtexec --avgRuns=100 --deploy=../data/ResNet-50-deploy.prototxt --fp16 --batch=8 --iterations=10 --output=prob --useSpinWait
avgRuns: 100
deploy: ../data/ResNet-50-deploy.prototxt
fp16
batch: 8
iterations: 10
output: prob
useSpinWait
Input "data": 3x224x224
Output "prob": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 14.4263 ms (host walltime is 14.443 ms, 99% percentile time is 53.3525).
Average over 100 runs is 12.3128 ms (host walltime is 12.3288 ms, 99% percentile time is 14.1207).
Average over 100 runs is 12.3141 ms (host walltime is 12.3274 ms, 99% percentile time is 14.1016).
Average over 100 runs is 12.3157 ms (host walltime is 12.3302 ms, 99% percentile time is 14.3564).
Average over 100 runs is 12.2982 ms (host walltime is 12.3208 ms, 99% percentile time is 14.1487).
Average over 100 runs is 12.3076 ms (host walltime is 12.3231 ms, 99% percentile time is 14.0421).
Average over 100 runs is 12.2918 ms (host walltime is 12.3147 ms, 99% percentile time is 13.8296).
Average over 100 runs is 12.3104 ms (host walltime is 12.325 ms, 99% percentile time is 13.8336).
Average over 100 runs is 12.2957 ms (host walltime is 12.3222 ms, 99% percentile time is 13.9811).
Average over 100 runs is 12.3071 ms (host walltime is 12.3208 ms, 99% percentile time is 13.9845).
GPU 실행 INT8
nvidia@jetson-0423718017159:/usr/src/tensorrt/bin$ ./trtexec --avgRuns=100 --deploy=../data/ResNet-50-deploy.prototxt --int8 --batch=8 --iterations=10 --output=prob --useSpinWait
avgRuns: 100
deploy: ../data/ResNet-50-deploy.prototxt
int8
batch: 8
iterations: 10
output: prob
useSpinWait
Input "data": 3x224x224
Output "prob": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 8.23188 ms (host walltime is 8.24839 ms, 99% percentile time is 24.6174).
Average over 100 runs is 7.36899 ms (host walltime is 7.38346 ms, 99% percentile time is 8.7689).
Average over 100 runs is 7.36345 ms (host walltime is 7.38333 ms, 99% percentile time is 8.55971).
Average over 100 runs is 7.3687 ms (host walltime is 7.3814 ms, 99% percentile time is 8.87165).
Average over 100 runs is 7.34474 ms (host walltime is 7.36685 ms, 99% percentile time is 7.89891).
Average over 100 runs is 7.37816 ms (host walltime is 7.3919 ms, 99% percentile time is 9.41402).
Average over 100 runs is 7.35202 ms (host walltime is 7.36458 ms, 99% percentile time is 8.94925).
Average over 100 runs is 7.3375 ms (host walltime is 7.35933 ms, 99% percentile time is 7.93171).
Average over 100 runs is 7.36859 ms (host walltime is 7.38325 ms, 99% percentile time is 9.11626).
Average over 100 runs is 7.3404 ms (host walltime is 7.35386 ms, 99% percentile time is 7.42304).
useDLACore=0
./trtexec --avgRuns=100 --deploy=../data/ResNet-50-deploy.prototxt --fp16 --batch=8 --iterations=10 --output=prob --useDLACore=0 --useSpinWait --allowGPUFallback
avgRuns: 100
deploy: ../data/ResNet-50-deploy.prototxt
fp16
batch: 8
iterations: 10
output: prob
useDLACore: 0
useSpinWait
allowGPUFallback
Input "data": 3x224x224
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 49.0278 ms (host walltime is 49.155 ms, 99% percentile time is 50.5068).
Average over 100 runs is 48.9775 ms (host walltime is 49.0705 ms, 99% percentile time is 49.794).
Average over 100 runs is 48.9935 ms (host walltime is 49.0898 ms, 99% percentile time is 49.5647).
Average over 100 runs is 49.006 ms (host walltime is 49.1076 ms, 99% percentile time is 51.2941).
Average over 100 runs is 48.9827 ms (host walltime is 49.0781 ms, 99% percentile time is 49.2135).
Average over 100 runs is 48.9586 ms (host walltime is 49.0567 ms, 99% percentile time is 49.1777).
Average over 100 runs is 48.9577 ms (host walltime is 49.0548 ms, 99% percentile time is 49.793).
Average over 100 runs is 48.9677 ms (host walltime is 49.062 ms, 99% percentile time is 49.1632).
Average over 100 runs is 48.9815 ms (host walltime is 49.0734 ms, 99% percentile time is 50.0357).
Average over 100 runs is 48.9672 ms (host walltime is 49.0723 ms, 99% percentile time is 49.3834).
useDLACore=1
./trtexec --avgRuns=100 --deploy=../data/ResNet-50-deploy.prototxt --fp16 --batch=8 --iterations=10 --output=prob --useDLACore=1 --useSpinWait --allowGPUFallback
avgRuns: 100
deploy: ../data/ResNet-50-deploy.prototxt
fp16
batch: 8
iterations: 10
output: prob
useDLACore: 1
useSpinWait
allowGPUFallback
Input "data": 3x224x224
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 48.7231 ms (host walltime is 48.8172 ms, 99% percentile time is 53.6832).
Average over 100 runs is 48.6207 ms (host walltime is 48.7205 ms, 99% percentile time is 49.832).
Average over 100 runs is 48.5483 ms (host walltime is 48.6472 ms, 99% percentile time is 48.8971).
Average over 100 runs is 48.5748 ms (host walltime is 48.6652 ms, 99% percentile time is 49.6947).
Average over 100 runs is 48.603 ms (host walltime is 48.695 ms, 99% percentile time is 51.5523).
Average over 100 runs is 48.5622 ms (host walltime is 48.6576 ms, 99% percentile time is 49.5412).
Average over 100 runs is 48.5704 ms (host walltime is 48.6607 ms, 99% percentile time is 48.9002).
Average over 100 runs is 48.5734 ms (host walltime is 48.6943 ms, 99% percentile time is 49.1745).
Average over 100 runs is 48.5792 ms (host walltime is 48.6702 ms, 99% percentile time is 48.9667).
Average over 100 runs is 48.5813 ms (host walltime is 48.6755 ms, 99% percentile time is 48.8755).
포럼에서 나온 내용으로 NVDLA를 실행해본 결과
TensorRT 5 docs and examples -soloved
Jetson-Inference
를 통해서 DLA
를 실행하는 코드를 얻을 수 있다. dev brnach에 UFF SSD
에서 이러한 GPU/DLA
코드를 포함하는 작업을 수행한다.
예제는 공식문서 chapter6에서 확인할 수 있다.
AlexNet과 같은 정보들은 https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet 에서 모델을 다운로드할 수 있다.
- prototxt를 https://nvidia.box.com/shared/static/o0w0sl3obqxj21u09c0cwzw4khymz7hh.prototxt 이곳에서 다운로드 한다. 이유는 GPU fallback을 활성화 하지 않기 때문이다.
포럼에 나의 정보를 공유함. 다른 사람과 다르게 DLA의 유무에 상관없이 실행 시간이 비슷하게 나온다.
'AI > NVIDIA' 카테고리의 다른 글
Upgrade CUDA, cuDNN (1) | 2021.07.10 |
---|---|
TensorRT 개론 및 Docker기반 실행 (0) | 2021.02.04 |
NVDLA: NVIDIA Deep Learning Accelerator (DLA) 개론 (0) | 2019.02.08 |
NVIDIA AI Tech Workshop at NIPS 2018 -- Session3: Inference and Quantization (0) | 2019.02.06 |
DeepStream을 통한 low precision YOLOv3 실행 (0) | 2019.01.24 |