NVIDIA announces TensorRT-LLM for Windows that boosts LLMs by up to 4 times with RTX GPUs

NVIDIA is already the king of generative AI in terms of hardware. Its GPUs power the data centers that Microsoft, OpenAI, and others use to run AI services like Bing Chat, ChatGPT, and more. Today, NVIDIA announced a new software tool designed to boost the performance of large language models (LLMs) on local Windows PCs.

In a blog post, NVIDIA announced that its open-source TensorRT-LLM library, which was previously released for data centers, is now available for Windows PCs. The headline claim is that TensorRT-LLM lets LLMs run up to four times faster on Windows PCs equipped with NVIDIA GeForce RTX GPUs.

NVIDIA describes the benefits of TensorRT-LLM for both developers and end users in the post:

At higher batch sizes, this acceleration significantly improves the experience for more sophisticated LLM use — like writing and coding assistants that output multiple, unique auto-complete results at once. The result is accelerated performance and improved quality that lets users select the best of the bunch.

The blog post showed an example of TensorRT-LLM at work. The question "How does NVIDIA ACE generate emotional responses?" was put to the standard LLaMa 2 LLM, which failed to offer an accurate response.

However, when the LLM was paired with a vector library or vector database and then asked the same question, it not only gave an accurate answer, but also produced the response faster with the TensorRT-LLM library. TensorRT-LLM should be available soon on NVIDIA's developer site.
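For readers curious what pairing an LLM with a vector library looks like in practice, here is a minimal sketch of the general idea. It is not NVIDIA's demo and does not use the TensorRT-LLM API. The word-count "embedding", the function names, and the sample notes are all made up for illustration; a real setup would use a proper embedding model and vector database, and would pass the resulting prompt to an LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Prepend the retrieved text to the question before it is sent to an LLM."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical notes standing in for a local document collection.
NOTES = [
    "The build server restarts every night at 02:00.",
    "Release notes are published on the first Monday of each month.",
    "The staging database is refreshed from production weekly.",
]

if __name__ == "__main__":
    # The printed prompt is what would be handed to the language model.
    print(build_prompt("When does the build server restart?", NOTES))
```

The point of the extra lookup step is that the model sees relevant, up-to-date text alongside the question, which is why it can answer questions about material it was never trained on.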

NVIDIA also added some AI-based features in today's new GeForce driver update. They include version 1.5 of its RTX Video Super Resolution feature, which offers better upscaling and fewer compression artifacts when viewing online videos. The update also adds TensorRT AI acceleration for Stable Diffusion Web UI, allowing people with GeForce RTX GPUs to get images from the AI art creator faster than normal.