Benchmarking with TensorRT-LLM

October 10, 2024

Introduction

Here at ProX PC, we do a lot of hardware evaluation and testing that we freely publish and make available to the public. At the moment, most of our testing is focused on content creation workflows like video editing, photography, and game development. However, we’re currently evaluating AI/ML-focused benchmarks to implement into our testing suite to better understand how hardware choices affect the performance of these workloads. One of these benchmarks comes from NVIDIA in the form of TensorRT-LLM, and in this post, I’d like to talk about TensorRT-LLM and share some preliminary inference results from a selection of NVIDIA GPUs.

Here’s how NVIDIA describes TensorRT-LLM: “TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.”
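To give a sense of what that Python API looks like, here is a minimal sketch of text generation with TensorRT-LLM’s high-level interface. Note that this uses the LLM/SamplingParams API from releases newer than the v0.5.0 package we benchmarked (which relied on per-model example scripts), so treat it as illustrative rather than a record of our exact workflow.

```python
# Minimal sketch of the TensorRT-LLM high-level Python API (newer releases;
# the v0.5.0 package we tested used per-model example scripts instead).
from tensorrt_llm import LLM, SamplingParams

# Point at a Hugging Face model ID or local checkpoint; TensorRT-LLM builds
# an optimized TensorRT engine for the current GPU before generating.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

prompts = ["What does a TensorRT engine contain?"]
sampling = SamplingParams(max_tokens=100)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```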

Based on the name alone, it’s safe to assume that TensorRT-LLM performance will scale closely with Tensor Core performance. Since all of the GPUs I tested feature 4th-generation Tensor Cores, comparing Tensor Core counts should give a reasonable first estimate of each GPU’s relative performance (a quick ratio check follows the table below). However, as the results will soon show, there is more to an LLM workload than raw computational power: the width of a GPU’s memory bus, and more broadly its overall memory bandwidth, is an important variable to consider when selecting GPUs for machine learning tasks.

| GPU | VRAM (GB) | Tensor Cores | Memory Bus Width | Memory Bandwidth |
|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 24 | 512 | 384-bit | ~1000 GB/s |
| NVIDIA GeForce RTX 4080 SUPER | 16 | 320 | 256-bit | ~735 GB/s |
| NVIDIA GeForce RTX 4080 | 16 | 304 | 256-bit | ~715 GB/s |
| NVIDIA GeForce RTX 4070 Ti SUPER | 16 | 264 | 256-bit | ~670 GB/s |
| NVIDIA GeForce RTX 4070 Ti | 12 | 240 | 192-bit | ~500 GB/s |
| NVIDIA GeForce RTX 4070 SUPER | 12 | 224 | 192-bit | ~500 GB/s |
| NVIDIA GeForce RTX 4070 | 12 | 184 | 192-bit | ~500 GB/s |
| NVIDIA GeForce RTX 4060 Ti | 8 | 136 | 128-bit | ~290 GB/s |
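As a quick sanity check on that framing, the small snippet below computes how much faster the RTX 4090 “should” be than the RTX 4060 Ti depending on which spec from the table above you scale by; the numbers come straight from the table.

```python
# Quick ratio check using the spec table above: how much faster should the
# RTX 4090 be than the RTX 4060 Ti, depending on which spec you scale by?
specs = {
    "RTX 4090":    {"tensor_cores": 512, "bandwidth_gbps": 1000},
    "RTX 4060 Ti": {"tensor_cores": 136, "bandwidth_gbps": 290},
}

core_ratio = specs["RTX 4090"]["tensor_cores"] / specs["RTX 4060 Ti"]["tensor_cores"]
bw_ratio = specs["RTX 4090"]["bandwidth_gbps"] / specs["RTX 4060 Ti"]["bandwidth_gbps"]

print(f"Tensor Core ratio:      {core_ratio:.2f}x")  # ~3.76x
print(f"Memory bandwidth ratio: {bw_ratio:.2f}x")    # ~3.45x
```

As the Results section will show, the measured gap between these two cards ends up closer to 2x than to either of these ratios, which is a good reminder that paper specs only set an upper bound.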

NVIDIA was kind enough to send us a package for TensorRT-LLM v0.5.0 containing a number of scripts to simplify the installation of the dependencies, create virtual environments, and properly configure the environment variables. This is all incredibly helpful when you expect to run benchmarks on a great number of systems! Additionally, these scripts are intended to set TensorRT-LLM up on Windows, making it much easier for us to implement into our current benchmark suite.

However, although TensorRT-LLM supports tensor parallelism and pipeline parallelism, multi-GPU usage appears to be restricted to Linux, as the documentation states that “TensorRT-LLM is supported on bare-metal Windows for single-GPU inference.” Another limitation of this tool is that it can only test NVIDIA GPUs, leaving out CPU inference, AMD GPUs, and Intel GPUs. Still, given NVIDIA’s current dominance in this field, there is real value in a tool for comparing the capabilities and relative performance of NVIDIA GPUs.
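For reference, in the newer high-level API the tensor-parallel degree is a single constructor argument. Per the documentation quoted above this path is Linux-only, and everything in this post was run single-GPU on Windows, so the snippet below is purely illustrative.

```python
# Purely illustrative: requesting 2-way tensor parallelism with the newer
# high-level API (Linux-only per the docs; not used in this post's testing).
from tensorrt_llm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)
```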

Another consideration is that, like TensorRT for Stable Diffusion, an engine must be generated for each combination of LLM model and GPU. I was surprised to find, however, that an engine generated for one GPU would still complete the benchmark when run on a different GPU. Using mismatched engines did occasionally hurt performance depending on the test variables, so, as expected, the best practice is to generate a new engine for each GPU. I also suspect the generated output text would be meaningless with a mismatched engine, but these benchmarks don’t display any output.
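In practice, that means keeping one engine directory per model-and-GPU pair. Below is a rough sketch of how that bookkeeping could look using the newer LLM API’s save/load support; the v0.5.0 package we tested handled engine building through its own scripts, and the directory layout here is my own invention.

```python
# Sketch of building and caching one engine per (model, GPU) pair using the
# newer LLM API's save/load; the v0.5.0 package we tested wrapped this step
# in its own build scripts. Uses torch only to read the GPU name.
import os
import torch
from tensorrt_llm import LLM

gpu_name = torch.cuda.get_device_name(0).replace(" ", "_")
engine_dir = f"./engines/llama-2-7b/{gpu_name}"

if not os.path.isdir(engine_dir):
    # First run on this GPU: build the engine from the checkpoint and cache it.
    llm = LLM(model="meta-llama/Llama-2-7b-hf")
    llm.save(engine_dir)
else:
    # Reuse the engine that was built for this specific GPU.
    llm = LLM(model=engine_dir)
```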

Despite all these caveats, we look forward to seeing how different GPUs perform with this LLM package with TensorRT optimizations. We will start by only looking at NVIDIA’s GeForce line, but we hope to expand this testing to include the Professional RTX cards and a range of other LLM packages in the future.

Test Setup

  • CPU: AMD Threadripper PRO 5995WX 64-Core
  • CPU Cooler: Noctua NH-U14S TR4-SP3 (AMD TR4)
  • Motherboard: ASUS Pro WS WRX80E-SAGE SE WIFI
  • BIOS Version: 1201
  • RAM: 8x Micron DDR4-3200 16GB ECC Reg. (128GB total)
  • GPUs:
    • NVIDIA GeForce RTX 4090 24GB Founders Edition
    • NVIDIA GeForce RTX 4080 SUPER 16GB Founders Edition
    • NVIDIA GeForce RTX 4080 16GB Founders Edition
    • PNY GeForce RTX 4070 Ti SUPER Verto 16GB
    • NVIDIA GeForce RTX 4070 SUPER 12GB Founders Edition
    • Asus GeForce RTX 4070 Ti STRIX OC 12GB
    • NVIDIA GeForce RTX 4070 12GB Founders Edition
    • Asus GeForce RTX 4060 Ti TUF OC 8GB
  • Driver Version: 551.23 for all except NVIDIA GeForce RTX 4080 SUPER 16GB Founders Edition (Driver Version: 551.31)
  • PSU: Super Flower LEADEX Platinum 1600W
  • Storage: Samsung 980 Pro 2TB
  • OS: Windows 11 Pro 22H2 build 22621.3007
  • Software: TensorRT-LLM v0.5.0, TensorRT 9.1.0.4, cuDNN 8.9.5, CUDA 12

The TensorRT-LLM package we received was configured to use the Llama-2-7b model, quantized to a 4-bit AWQ format. Although TensorRT-LLM supports a variety of models and quantization methods, I chose to stick with this relatively lightweight model to test a number of GPUs without worrying too much about VRAM limitations.
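For context, newer TensorRT-LLM releases expose quantization choices such as 4-bit AWQ through a QuantConfig in the high-level API, roughly as sketched below. The package we benchmarked shipped an already-quantized checkpoint, so this is illustrative only, and the exact import paths and enum names may differ between versions.

```python
# Rough sketch of requesting 4-bit AWQ through the newer LLM API's QuantConfig.
# Illustrative only: the benchmark package shipped a pre-quantized Llama-2-7b,
# and import paths/enum names may differ across TensorRT-LLM versions.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)
llm = LLM(model="meta-llama/Llama-2-7b-hf", quant_config=quant_config)
```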

For each test configuration below, I ran five consecutive tests per GPU and averaged the results (a rough sketch of this measurement loop follows the list).

  • Input Length: 100, Output Length: 100, Batch Size: 1
  • Input Length: 100, Output Length: 100, Batch Size: 8
  • Input Length: 2048, Output Length: 1024, Batch Size: 1
  • Input Length: 2048, Output Length: 1024, Batch Size: 8
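The sketch below shows the general shape of such a measurement for one configuration: time a batched generate call and divide the generated tokens by the elapsed time. This is not NVIDIA’s benchmark script; the prompt construction and timing details are my own simplifications.

```python
# Rough sketch of a tokens/second measurement for one (input, output, batch)
# configuration. Not NVIDIA's benchmark script; prompts and timing are simplified.
import time
from tensorrt_llm import LLM, SamplingParams

INPUT_LEN, OUTPUT_LEN, BATCH_SIZE, RUNS = 100, 100, 8, 5

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Dummy prompts of roughly INPUT_LEN tokens (assumes ~1 token per repeated word).
prompts = ["benchmark " * INPUT_LEN] * BATCH_SIZE
# max_tokens caps the output; a real benchmark would force the full output length.
sampling = SamplingParams(max_tokens=OUTPUT_LEN)

throughputs = []
for _ in range(RUNS):
    start = time.perf_counter()
    llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start
    throughputs.append(BATCH_SIZE * OUTPUT_LEN / elapsed)  # generated tokens/sec

print(f"Average over {RUNS} runs: {sum(throughputs) / len(throughputs):.1f} tokens/second")
```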

Results

The table below shows the average of five runs for each test configuration, with throughput measured in tokens per second (a short normalization of these numbers against the RTX 4090 follows the table).

| GPU | Input Length | Output Length | Batch Size | Tokens/Second |
|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 100 | 100 | 1 | 1190 |
| NVIDIA GeForce RTX 4080 SUPER | 100 | 100 | 1 | 973 |
| NVIDIA GeForce RTX 4080 | 100 | 100 | 1 | 971 |
| NVIDIA GeForce RTX 4070 Ti SUPER | 100 | 100 | 1 | 908 |
| NVIDIA GeForce RTX 4070 Ti | 100 | 100 | 1 | 789 |
| NVIDIA GeForce RTX 4070 SUPER | 100 | 100 | 1 | 786 |
| NVIDIA GeForce RTX 4070 | 100 | 100 | 1 | 753 |
| NVIDIA GeForce RTX 4060 Ti | 100 | 100 | 1 | 610 |
| NVIDIA GeForce RTX 4090 | 100 | 100 | 8 | 8471 |
| NVIDIA GeForce RTX 4080 SUPER | 100 | 100 | 8 | 6805 |
| NVIDIA GeForce RTX 4080 | 100 | 100 | 8 | 6760 |
| NVIDIA GeForce RTX 4070 Ti SUPER | 100 | 100 | 8 | 6242 |
| NVIDIA GeForce RTX 4070 Ti | 100 | 100 | 8 | 5223 |
| NVIDIA GeForce RTX 4070 SUPER | 100 | 100 | 8 | 5101 |
| NVIDIA GeForce RTX 4070 | 100 | 100 | 8 | 4698 |
| NVIDIA GeForce RTX 4060 Ti | 100 | 100 | 8 | 3899 |
| NVIDIA GeForce RTX 4090 | 2048 | 1024 | 1 | 83.44 |
| NVIDIA GeForce RTX 4080 SUPER | 2048 | 1024 | 1 | 68.80 |
| NVIDIA GeForce RTX 4080 | 2048 | 1024 | 1 | 68.70 |
| NVIDIA GeForce RTX 4070 Ti SUPER | 2048 | 1024 | 1 | 63.67 |
| NVIDIA GeForce RTX 4070 Ti | 2048 | 1024 | 1 | 55.75 |
| NVIDIA GeForce RTX 4070 SUPER | 2048 | 1024 | 1 | 55.26 |
| NVIDIA GeForce RTX 4070 | 2048 | 1024 | 1 | 50.80 |
| NVIDIA GeForce RTX 4060 Ti | 2048 | 1024 | 1 | 41.64 |
| NVIDIA GeForce RTX 4090 | 2048 | 1024 | 8 | 664.38 |
| NVIDIA GeForce RTX 4080 SUPER | 2048 | 1024 | 8 | 517.71 |
| NVIDIA GeForce RTX 4080 | 2048 | 1024 | 8 | 516.24 |
| NVIDIA GeForce RTX 4070 Ti SUPER | 2048 | 1024 | 8 | 471.10 |
| NVIDIA GeForce RTX 4070 Ti | 2048 | 1024 | 8 | 405.20 |
| NVIDIA GeForce RTX 4070 SUPER | 2048 | 1024 | 8 | 403.12 |
| NVIDIA GeForce RTX 4070 | 2048 | 1024 | 8 | 366.90 |
| NVIDIA GeForce RTX 4060 Ti | 2048 | 1024 | 8 | 305.38 |
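One way to read this table is to normalize everything against the RTX 4090. The snippet below does that for the 2048/1024, batch-8 configuration and compares each card’s share of the 4090’s throughput with its share of the 4090’s memory bandwidth and Tensor Cores; all values are taken from the tables in this post (a subset of cards, for brevity).

```python
# Normalize the 2048/1024, batch-8 results against the RTX 4090 and compare each
# card's share of the 4090's throughput with its share of the 4090's specs.
# Values come from the tables in this post (subset of cards, for brevity).
results_tps = {   # tokens/second, Input 2048 / Output 1024 / Batch 8
    "RTX 4090": 664.38, "RTX 4080 SUPER": 517.71, "RTX 4070 Ti SUPER": 471.10,
    "RTX 4070": 366.90, "RTX 4060 Ti": 305.38,
}
bandwidth = {     # approximate GB/s, from the spec table
    "RTX 4090": 1000, "RTX 4080 SUPER": 735, "RTX 4070 Ti SUPER": 670,
    "RTX 4070": 500, "RTX 4060 Ti": 290,
}
tensor_cores = {
    "RTX 4090": 512, "RTX 4080 SUPER": 320, "RTX 4070 Ti SUPER": 264,
    "RTX 4070": 184, "RTX 4060 Ti": 136,
}

for gpu in results_tps:
    tps_share = results_tps[gpu] / results_tps["RTX 4090"]
    bw_share = bandwidth[gpu] / bandwidth["RTX 4090"]
    core_share = tensor_cores[gpu] / tensor_cores["RTX 4090"]
    print(f"{gpu:18s} throughput {tps_share:4.0%}  "
          f"bandwidth {bw_share:4.0%}  tensor cores {core_share:4.0%}")
```

In this configuration, each card’s throughput share sits a little above its bandwidth share and well above its Tensor Core share, with the RTX 4060 Ti the biggest outlier, which supports the point from the introduction that memory bandwidth deserves as much attention as raw compute for this workload.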

Conclusion

In summary, this benchmark provided valuable insight into how a range of NVIDIA GPUs perform when running large language models (LLMs). Throughput climbs steadily as we move up the RTX 40-series stack, with the RTX 4090 significantly outperforming the other models, and the relative gaps line up more closely with memory bandwidth than with Tensor Core count. When selecting a GPU for LLM tasks, it is also important to consider input length, output length, and batch size, as these factors greatly influence performance.

Understanding these variables will help users make informed decisions when choosing a GPU for their specific needs in deep learning applications.
