Easy, Cost-Effective MLOps with Google Kubernetes Engine and NVIDIA A100 Multi-Instance GPUs

The rapid growth of artificial intelligence is driving up the size of datasets as well as the size and complexity of networks. AI-enabled applications like e-commerce product recommendations, voice-based assistants, and contact center automation require dozens to hundreds of trained AI models. Inference serving helps infrastructure managers deploy, manage, and scale […]

Speeding Up Deep Learning Inference Using TensorFlow, ONNX, and NVIDIA TensorRT

This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates. In this post, you learn how to deploy TensorFlow-trained deep learning models using the new TensorFlow-ONNX-TensorRT workflow. This tutorial uses NVIDIA TensorRT 8.0.0.3 and provides two code samples, one for TensorFlow v1 and one for TensorFlow v2. TensorRT is an inference […]
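The workflow the post walks through can be sketched at the command line. This is a minimal illustration, not the post's exact commands; the paths, opset, and precision flag are assumptions, and it presumes tf2onnx and TensorRT's trtexec tool are installed:

```shell
# Export: convert a TensorFlow SavedModel to ONNX with tf2onnx
python -m tf2onnx.convert --saved-model ./saved_model --opset 13 --output model.onnx

# Build: compile the ONNX graph into a serialized TensorRT engine
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16

# Run: load the built engine and benchmark inference latency
trtexec --loadEngine=model.plan
```

Building the engine once and reloading the serialized plan avoids repeating the optimization step at every startup.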

Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT

Deep learning is revolutionizing the way that industries deliver products and services. These services include object detection, classification, and segmentation for computer vision, and text extraction, classification, and summarization for language-based applications. These applications must run in real time. Most models are trained in 32-bit floating-point arithmetic to take advantage of a […]
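The INT8 arithmetic that quantization-aware training calibrates for can be sketched as symmetric per-tensor quantize/dequantize. This is a minimal numerical illustration of the scheme, not TensorRT's implementation; the function names and sample values are illustrative:

```python
import numpy as np

def quantize_int8(x, scale):
    """Map float values to int8 codes by rounding x/scale and clipping."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale

x = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
scale = np.abs(x).max() / 127.0            # per-tensor scale from the max magnitude
x_hat = dequantize(quantize_int8(x, scale), scale)
max_err = float(np.abs(x - x_hat).max())   # bounded by scale/2 for in-range values
```

The rounding error is at most half a quantization step for values inside the clipping range, which is why choosing a good scale (via calibration or QAT) matters for accuracy.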

Accelerated model inference for machine learning in Google Cloud Dataflow with NVIDIA GPUs

In partnership with NVIDIA, Google Cloud announced today that Dataflow is introducing GPUs to the world of big data processing to open up new possibilities. With Dataflow GPU, users can now leverage the power of NVIDIA GPUs in their machine learning inference workflows. Here we show you how you can access these performance benefits with […]

NVIDIA announces TensorRT 8, which reduces BERT-Large inference to 1.2 milliseconds

NVIDIA today announced TensorRT 8.0, which reduces BERT-Large inference latency to 1.2 ms with new optimizations. This version also offers double the accuracy for INT8 precision with Quantization Aware Training and significantly higher performance with support for Sparsity, which was introduced in Ampere GPUs. TensorRT is a high-performance deep learning inference SDK that includes an […]

Scaling inference in high-energy particle physics at Fermilab with NVIDIA Triton Inference Server

Research in high-energy physics aims to understand the secrets of the universe by describing the basic constituents of matter and the interactions between them. Various experiments around the world aim to recreate the first moments of the universe. Two examples of the most complex experiments in the world are the Large Hadron Collider (LHC) […]

Simplify AI inference in production with NVIDIA Triton

AI and machine learning unlock groundbreaking applications in areas such as online product recommendations, image classification, chatbots, forecasting, and manufacturing quality checks. AI has two parts: training and inference. Inference is the production phase of AI. The trained model and associated code are deployed in the data center, in the public cloud, or at the edge […]

Minimizing Deep Learning Inference Latency with NVIDIA Multi-Instance GPU

Recently, NVIDIA unveiled the A100 GPU, based on the NVIDIA Ampere architecture. Ampere introduced many features, including Multi-Instance GPU (MIG), that play a special role for deep learning (DL) applications. MIG makes it possible to use a single A100 GPU as if it were multiple smaller GPUs, maximizing utilization for DL workloads and providing […]
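MIG partitioning is driven through nvidia-smi. A minimal sketch of carving an A100 into instances follows; the GPU index and profile name are illustrative (available profiles vary by device, so pick one from the listing), and the commands assume a driver with MIG support and root privileges:

```shell
# Enable MIG mode on GPU 0
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles the device supports
sudo nvidia-smi mig -lgip

# Create a GPU instance from a listed profile, plus its default compute instance (-C)
sudo nvidia-smi mig -cgi 1g.5gb -C

# Verify the resulting MIG devices
nvidia-smi -L
```

Each MIG device then appears as a separate GPU to CUDA applications and container runtimes, which is what lets several small inference workloads share one A100 with isolated memory and compute.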

Deploy AI deep learning models with NVIDIA Triton Inference Server

In the world of machine learning, models are trained using existing data sets and then used to draw inferences about new data. In a previous post, Simplifying and Scaling Inference Serving with NVIDIA Triton 2.3, we discussed the inference workflow and the need for an efficient inference serving solution. In that post we introduced Triton […]
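Triton serves models from a repository directory with one subdirectory per model. A minimal sketch of the expected layout and `config.pbtxt` is shown below; the model name, backend, and tensor names/shapes are illustrative assumptions, not values from the post:

```text
# model_repository/
# └── densenet_onnx/
#     ├── config.pbtxt
#     └── 1/                  <- numbered version directory
#         └── model.onnx

# config.pbtxt
name: "densenet_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "data_0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "fc6_1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```

Pointing the server at the repository (for example, `tritonserver --model-repository=/path/to/model_repository`) loads every model it finds and exposes them over HTTP and gRPC endpoints.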

Jumpstarting AI with a COVID-19 CT Inference Pipeline and the NVIDIA Clara Deploy QuickStart Virtual Machine

Getting AI up and running in hospitals has never been more important. Until recently, connecting an inference pipeline to perform analysis has had its challenges and limitations. There is a considerable amount of complexity in setting up and maintaining the hardware and software, deployment, configuration, and all workflow steps in an AI inference research pipeline. […]