
NVIDIA GeForce RTX 4090 Vs RTX 3090 Deep Learning Benchmark

March 6, 2024

Released on October 12th, 2022, the NVIDIA GeForce RTX 4090 became the newest flagship GPU for gamers, content creators, and deep-learning researchers. Its arrival sparked immediate interest in how it stacks up against its predecessor, the NVIDIA GeForce RTX 3090, especially in the context of deep learning workloads. In this post, we dive into a detailed benchmark comparison of these two GPUs, focusing on their performance for deep learning model training.


By the end of this article, you'll understand the strengths and weaknesses of each GPU and be able to make an informed decision on which card is best suited for your deep learning needs.


NVIDIA RTX 4090 Highlights


The NVIDIA GeForce RTX 4090 brings several key improvements over the RTX 3090, making it a compelling option for deep learning:


  • Memory: Both GPUs come with 24 GB of GDDR6X memory, but the RTX 4090's training throughput and training throughput per dollar are significantly higher than the RTX 3090's across a variety of deep learning models. These models span use cases in computer vision, natural language processing, speech recognition, and recommendation systems.
  • Power Consumption: The RTX 4090 consumes 450W of power, which is notably higher than the 3090's 350W. Despite this, the training throughput per watt of the RTX 4090 is comparable to that of the RTX 3090.
  • Multi-GPU Training: Training scales reasonably well in multi-GPU setups, particularly in our tests using two RTX 4090 cards.


Let's now delve into the specific performance metrics, comparing both GPUs in terms of training throughput, cost-efficiency, and power efficiency.


PyTorch Training Throughput


The core metric for evaluating a GPU's performance in deep learning is its training throughput: how many samples the GPU can process per second while training a model. Here's a look at the training throughput of the RTX 3090 and RTX 4090 across several popular models, including ResNet50 (computer vision), SSD (object detection), BERT Base and TransformerXL (natural language processing), Tacotron2 (speech synthesis), and NCF (recommendation).


GPU / Precision | ResNet50 (images/sec) | SSD (images/sec) | BERT Base (tokens/sec) | TransformerXL (tokens/sec) | Tacotron2 | NCF (recommendations/sec)
RTX 3090, TF32  | 144 | 513  | 85  | 12101 | 25350 | 14714953
RTX 3090, FP16  | 236 | 905  | 172 | 22863 | 25018 | 25118176
RTX 4090, TF32  | 224 | 721  | 137 | 22750 | 32910 | 17476573
RTX 4090, FP16  | 379 | 1301 | 297 | 40427 | 32661 | 32192491
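
This post doesn't include the benchmarking harness itself, but numbers like these are typically collected by timing training steps on synthetic data, so that disk I/O doesn't skew the result. Below is a minimal PyTorch sketch of that pattern for ResNet50; the batch size, step counts, and TF32 flags are illustrative assumptions, not the exact settings behind the table.

```python
import time
import torch
import torchvision

# Illustrative settings -- not the exact configuration behind the table above.
BATCH_SIZE, WARMUP, STEPS = 64, 10, 50

torch.backends.cuda.matmul.allow_tf32 = True  # enable TF32 matmuls on Ampere/Ada
torch.backends.cudnn.allow_tf32 = True

device = torch.device("cuda")
model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# Synthetic data: throughput benchmarks usually avoid data-loading bottlenecks.
images = torch.randn(BATCH_SIZE, 3, 224, 224, device=device)
labels = torch.randint(0, 1000, (BATCH_SIZE,), device=device)

def step():
    optimizer.zero_grad(set_to_none=True)
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

for _ in range(WARMUP):       # let cuDNN pick kernels before timing
    step()
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(STEPS):
    step()
torch.cuda.synchronize()      # wait for queued GPU work before stopping the clock
elapsed = time.perf_counter() - start

print(f"{BATCH_SIZE * STEPS / elapsed:.1f} images/sec")
```

Absolute numbers from a script like this will vary with the data pipeline, model implementation, and driver/CUDA versions, which is why published benchmarks pin all of these down.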


Analysis of Results


Across all tested models, the RTX 4090 demonstrates a significant improvement in training throughput over the RTX 3090, particularly in FP16 precision, which is often used to accelerate training without sacrificing too much accuracy. For instance:


  • In the ResNet50 model, the RTX 4090 processes 379 images/second in FP16, compared to the RTX 3090’s 236 images/second — a 1.6x improvement.
  • Similarly, for BERT Base finetuning, the RTX 4090 delivers 297 tokens/second in FP16, compared to the RTX 3090’s 172 tokens/second, marking a 1.7x improvement.
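
FP16 results like the ones above are typically produced with PyTorch's automatic mixed precision: run the forward pass under autocast and scale the loss so that small FP16 gradients don't underflow. Here is a minimal sketch of that recipe, with a stand-in linear model and synthetic data as assumptions (this is not the benchmark's actual training script):

```python
import torch

device = torch.device("cuda")
model = torch.nn.Linear(512, 10).to(device)   # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()          # loss scaling avoids FP16 underflow

x = torch.randn(64, 512, device=device)      # synthetic batch
y = torch.randint(0, 10, (64,), device=device)

for _ in range(100):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # forward pass runs in FP16 where safe
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # unscales gradients, then steps
    scaler.update()                           # adapts the scale factor
```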


Overall, the RTX 4090 shows 1.3x to 1.9x higher training throughput than the RTX 3090 depending on the model and precision settings.


Training Throughput per Dollar


While performance is critical, cost-efficiency is another important factor, especially for researchers and students working on tight budgets. The price of the RTX 4090 is set at $1599, while the RTX 3090 costs $1400. When we normalize the results for training throughput per dollar, the RTX 4090 still leads in most cases.


Throughput/$ Results:


  • Depending on the model and precision, the RTX 4090 offers 1.2x to 1.6x higher training throughput per dollar compared to the RTX 3090.
  • This means that while the RTX 4090 is more expensive, it provides greater performance per dollar spent, making it a cost-effective solution for users who prioritize both budget and training speed.
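
As a quick sanity check of that range, here is the normalization for the ResNet50 FP16 numbers from the table, using the prices quoted above:

```python
# Throughput per dollar, ResNet50 FP16 (numbers from the table above)
rtx4090 = 379 / 1599   # ~0.237 images/sec per dollar
rtx3090 = 236 / 1400   # ~0.169 images/sec per dollar
print(f"RTX 4090 advantage: {rtx4090 / rtx3090:.2f}x")  # ~1.41x
```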


For individuals or institutions looking to maximize their return on investment, the RTX 4090 provides better long-term value despite the slightly higher initial cost.


Training Throughput per Watt


Power consumption is another factor to consider, especially for users operating in environments where energy efficiency is a concern. The RTX 4090's 450W power consumption is significantly higher than the RTX 3090’s 350W. Despite this, when normalized for training throughput per watt, the RTX 4090 remains competitive.


Power Efficiency Results:


  • Across various models, the RTX 4090 delivers 0.92x to 1.5x the training throughput per watt compared to the RTX 3090.
  • While it consumes more power, the RTX 4090 makes up for it with improved performance, making it an acceptable trade-off for users who need more training speed but want to maintain similar power efficiency to the RTX 3090.
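
The same normalization applied to power, again using the ResNet50 FP16 numbers and the two cards' rated board power:

```python
# Throughput per watt, ResNet50 FP16 (numbers from the table above)
rtx4090 = 379 / 450    # ~0.84 images/sec per watt
rtx3090 = 236 / 350    # ~0.67 images/sec per watt
print(f"RTX 4090 advantage: {rtx4090 / rtx3090:.2f}x")  # ~1.25x
```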


Multi-GPU Scaling


Multi-GPU setups are crucial for large-scale deep learning projects that need to keep training times manageable on large datasets. Although the RTX 4090 no longer supports NVLink (NVIDIA's high-bandwidth GPU interconnect), it still scales effectively in multi-GPU configurations over the PCIe Gen 4 interface.
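
Without NVLink, the gradient all-reduce in data-parallel training travels over PCIe. In PyTorch this is typically handled by DistributedDataParallel with the NCCL backend; the sketch below is a minimal two-GPU illustration with a stand-in model, not the actual script behind the results that follow.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(512, 10).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = torch.nn.CrossEntropyLoss()

    x = torch.randn(64, 512, device=rank)   # each rank trains on its own shard
    y = torch.randint(0, 10, (64,), device=rank)
    for _ in range(100):
        optimizer.zero_grad(set_to_none=True)
        criterion(model(x), y).backward()    # DDP all-reduces gradients (over PCIe here)
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)    # one process per GPU
```

How close a real workload gets to 2x depends on how much gradient traffic the interconnect has to move relative to compute, which is what the results below reflect.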


2x RTX 4090 Scaling Results:


  • In our tests with two RTX 4090s, most models achieved near 2x training throughput compared to a single RTX 4090. For instance, in ResNet50 FP16, two RTX 4090s processed nearly 758 images/second, almost double the throughput of a single card.
  • However, not all models scaled perfectly. For example, BERT Base fine-tuning with two RTX 4090s only achieved a 1.7x improvement, highlighting some inefficiencies in specific models when running in a multi-GPU setup.
  • Comparatively, two RTX 4090s outperformed two RTX 3090s across all tested models, demonstrating the improved multi-GPU efficiency of the RTX 4090 even without NVLink.


Key Considerations for the RTX 4090


Before purchasing the RTX 4090 for deep learning, there are a few factors to keep in mind:


  1. Size: The RTX 4090 is a very large card. The Founders Edition is a triple-slot design roughly 61 mm (2.4 inches) thick, and many partner cards occupy 3.5 slots or more. Make sure your motherboard and chassis have enough clearance to accommodate it.
  2. Power Supply: With a 450W power rating, NVIDIA recommends a minimum system power supply of 850W for a workstation with a single RTX 4090. Two cards alone can draw around 900W under load, so a dual-GPU build typically calls for a 1300W PSU or higher, or running the cards power-limited.


Conclusion


The NVIDIA GeForce RTX 4090 is a powerful GPU that offers substantial improvements over its predecessor, the RTX 3090, for deep learning workloads. With up to 1.9x higher training throughput, better cost-efficiency, and comparable power efficiency, the RTX 4090 is an excellent choice for deep learning practitioners, especially those looking to balance performance and budget.


While its size and power consumption may be drawbacks for some users, the performance gains are undeniable. Whether you’re a student, researcher, or creator working with machine learning models, the RTX 4090 provides the horsepower needed for faster training times and more complex models. Additionally, the card scales well in multi-GPU configurations, making it a solid option for large-scale deep learning projects.


In the future, we anticipate more comprehensive benchmarks, including FP8 performance and broader model tests, which will further solidify the RTX 4090’s position as a leader in the deep learning space.

For more info visit www.proxpc.com

