Accelerated model inference for machine learning in Google Cloud Dataflow with NVIDIA GPUs
In partnership with NVIDIA, Google Cloud announced today that Dataflow is bringing GPUs to the world of big data processing to open up new possibilities. With Dataflow GPUs, users can now leverage the power of NVIDIA GPUs in their machine learning inference workflows. Here we show you how to access these performance benefits with BERT.
Google Cloud’s Dataflow is a managed service for performing a variety of data processing patterns, including streaming and batch analysis. Recently added GPU support can now accelerate machine learning inference workflows that run in Dataflow pipelines.
For more exciting new features, check out the Google Cloud introductory post. In this post, we demonstrate the performance benefits and TCO improvement of NVIDIA GPU acceleration using a Bidirectional Encoder Representations from Transformers (BERT) model fine-tuned on a question answering task in Dataflow. We first show TensorFlow inference in Dataflow on CPUs, then run the same code on GPUs with a significant increase in performance, and finally show the best performance after optimizing the model with NVIDIA TensorRT and deploying it with TensorRT’s Python API in Dataflow. Take a look at the NVIDIA sample code to try it out now.
Figure 1. Dataflow architecture and GPU runtime.
There are several steps that we will cover in this post. We’ll start by creating an environment on our local machine to run all of these Dataflow jobs. For more information, see the Dataflow Python Quick Start Guide.
Create an environment
We recommend creating a virtual environment for Python; here we use virtualenv:
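The exact virtualenv command is not shown above; as a sketch, the standard-library venv module creates an equivalent isolated environment (the directory name df-env is our choice, not from the sample repository):

```python
import venv

# Create an isolated Python environment in ./df-env, with pip available.
# Equivalent to the virtualenv command used in the post.
venv.create("df-env", with_pip=True)
print("created df-env")
```

Activate it afterwards with the usual `source df-env/bin/activate`.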
If you use Dataflow, you need to match the Python version in your development environment to the Python version of the Dataflow runtime. More specifically, use the same Python version and the same Apache Beam SDK version when running a Dataflow pipeline to avoid unexpected errors.
Now let’s activate the virtual environment.
One of the most important things to check before activating a virtual environment is that you are not already working inside another virtual environment, as this usually causes problems.
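A quick way to check whether the current interpreter is already inside a virtual environment is to compare the interpreter prefixes (a small helper of our own, not from the sample repository):

```python
import sys

def in_virtualenv():
    # In a venv/virtualenv interpreter, sys.prefix points at the
    # environment while sys.base_prefix points at the base install.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print(in_virtualenv())
```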
After activating our virtual environment, we are ready to install the required packages. Although our jobs are running on Dataflow, we still need a few packages locally so that Python doesn’t complain when we run our code locally.
pip install apache-beam[gcp]
pip install tensorflow==2.3.1
You can experiment with different versions of TensorFlow, but the key is to match the version you have locally with the version you will be using in the Dataflow environment. Apache Beam and its Google Cloud components are also required.
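A quick local sanity check for the version-matching advice above (the Beam SDK container referenced later in the Dockerfile is beam_python3.6_sdk, so the local interpreter should match):

```python
import sys

# Print the local interpreter version to compare against the
# Python version of the Dataflow runtime / Beam SDK container.
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
```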
Get the fine-tuned BERT model
NVIDIA NGC has a wealth of resources, ranging from GPU-optimized containers to fine-tuned models. We are exploring several NGC resources.
The first resource we’ll be using is a large BERT model, fine-tuned on the SQuAD2 question answering task, with 340 million parameters. The following command downloads the BERT model.
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_savedmodel_large_qa_squad2_amp_384/versions/19.03.0/zip -O bert_tf_savedmodel_large_qa_squad2_amp_384_19.03.0.zip
The BERT model you just downloaded was trained with Automatic Mixed Precision (AMP) and uses a sequence length of 384.
We also need a vocabulary file, which we take from a BERT checkpoint that can be obtained from NGC with the following command:
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_ckpt_large_qa_squad2_amp_128/versions/19.03.1/zip -O bert_tf_ckpt_large_qa_squad2_amp_128_19.03.1.zip
Now that we have these resources, all we have to do is uncompress them into our working folder. We will be using a custom Docker container, and these models will be included in our image.
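Extracting the archives can be scripted with the standard-library zipfile module (a small helper of our own; the post itself does not prescribe how to unzip):

```python
import zipfile

def unpack(archives, dest="."):
    """Extract each downloaded NGC zip archive into dest."""
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)

# Example, using the filenames from the wget commands above:
# unpack(["bert_tf_savedmodel_large_qa_squad2_amp_384_19.03.0.zip",
#         "bert_tf_ckpt_large_qa_squad2_amp_128_19.03.1.zip"])
```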
We’re going to use a custom Dockerfile derived from a GPU-optimized NGC TensorFlow container. NGC TensorFlow (TF) containers are the best option if you are accelerating TF models with NVIDIA GPUs.
We then add a few more steps to copy these models and the files we have. You can find the Dockerfile here; below is a snapshot of it.
FROM nvcr.io/nvidia/tensorflow:20.11-tf2-py3
RUN pip install --no-cache-dir apache-beam[gcp]==2.26.0 ipython pytest pandas && mkdir -p /workspace/tf_beam
COPY --from=apache/beam_python3.6_sdk:2.26.0 /opt/apache/beam /opt/apache/beam
ADD . /workspace/tf_beam
WORKDIR /workspace/tf_beam
ENTRYPOINT ["/opt/apache/beam/boot"]
The next step is to build the Docker image and push it to the Google Container Registry (GCR). You can do this with the following command, or use the build_and_push.sh script from our repository: set the project_id variable to your Google Cloud project and run bash build_and_push.sh.
If you’ve already authenticated your Google account, you can run the Python files provided here simply by calling the run_cpu.sh and run_gpu.sh scripts available in the same repository.
TensorFlow inference on CPUs in Dataflow (TF-CPU)
The bert_squad2_qa_cpu.py file in the repository was developed to answer questions based on a descriptive text document. The batch size is 16, which means each inference call answers 16 questions, and there are 16,000 questions in total (1,000 batches of questions). Note that BERT can be fine-tuned for other tasks for a given use case.
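The batching arithmetic above can be sketched with a plain Python helper (illustrative only; the sample pipeline uses Apache Beam transforms rather than this exact function):

```python
def batches(items, batch_size=16):
    """Yield fixed-size batches, mirroring the batch size of 16 used in
    the sample pipeline (1,000 batches x 16 = 16,000 questions)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

questions = [f"q{i}" for i in range(16_000)]
print(sum(1 for _ in batches(questions)))  # 1000
```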
By default, when a job runs in Dataflow it is automatically scaled based on real-time CPU usage. If you want to disable this behavior, set autoscaling_algorithm to NONE; that way, you can choose how many workers to use throughout the lifetime of your job. Alternatively, you can let Dataflow automatically scale your job and limit the maximum number of workers by setting the max_num_workers parameter.
We recommend setting a job name via the job_name parameter, rather than using the auto-generated name, to track your jobs more easily. This job name is the prefix of the compute instance that runs your job.
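The flags described above could be assembled along these lines (a hypothetical helper of our own, not part of the sample repository; the flag names themselves are standard Dataflow pipeline options):

```python
def dataflow_args(job_name, max_num_workers=None, disable_autoscaling=False):
    """Assemble the Dataflow pipeline flags discussed above."""
    args = ["--runner=DataflowRunner", f"--job_name={job_name}"]
    if disable_autoscaling:
        # Fixed worker count for the lifetime of the job.
        args.append("--autoscaling_algorithm=NONE")
    elif max_num_workers is not None:
        # Let Dataflow autoscale, but cap the number of workers.
        args.append(f"--max_num_workers={max_num_workers}")
    return args

print(dataflow_args("bert-qa-cpu", max_num_workers=4))
```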
Acceleration with GPU (TF-GPU)
To run the same TensorFlow inference job in Dataflow with GPU support, we need to set the following parameter. Refer to the Dataflow GPU documentation for more information.
--experiment "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"
The previous parameter attaches an NVIDIA T4 Tensor Core GPU to the Dataflow worker VM, which is visible as a Compute Engine VM instance running our job. Dataflow automatically installs the required NVIDIA drivers that support CUDA 11.
The bert_squad2_qa_gpu.py file is almost identical to the bert_squad2_qa_cpu.py file. This means that with little to no change, we can run our jobs on NVIDIA GPUs. In our examples we have some additional GPU settings, for example, enabling memory growth with the following code.
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
Inference with NVIDIA optimized libraries
NVIDIA TensorRT optimizes deep learning models for inference, delivering low latency and high throughput (more information). Here we apply NVIDIA TensorRT optimization to the BERT model and use it to answer questions in a Dataflow pipeline on GPUs at high speed. Users can follow the TensorRT demo BERT GitHub repository.
We also use Polygraphy, a high-level Python API for TensorRT, to load the TensorRT engine file and run inference. In the Dataflow code, the TensorRT model is encapsulated in a shared utility class so that all threads of a Dataflow worker process can make use of it.
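The shared-utility pattern mentioned above can be sketched as a process-wide, thread-safe holder (our own sketch of the pattern, not the code from the sample repository; `load_fn` stands in for deserializing the TensorRT engine):

```python
import threading

class SharedModel:
    """Holds a single model instance shared by all DoFn threads of a
    worker process, so the expensive engine load happens only once."""
    _lock = threading.Lock()
    _instance = None

    @classmethod
    def get(cls, load_fn):
        # Double-checked locking keeps the common path cheap.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = load_fn()
        return cls._instance
```

Each thread calls `SharedModel.get(load_engine)`; only the first call pays the load cost.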
Comparison of CPU and GPU runs
In the table below, we list total run times and resource usage for the sample runs. The final cost of a Dataflow job is a linear combination of total vCPU time, total memory time, and total disk usage; for the GPU case there is also a GPU time component.
| Framework | Machine | Worker count | Total execution time | Total vCPU time | Total memory time | Total HDD PD time | TCO improvement |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TF-GPU | n1-standard-4 + T4 | 1 | 0:35:51 | 2.25 | 8.44 | 140.64 | 9.2x |
| TensorRT | n1-standard-4 + T4 | 1 | 0:09:51 | 0.53 | 1.99 | 33.09 | 38x |
Table. Total runtime and resource usage for TF-CPU, TF-GPU, and TensorRT sample runs.
Note that the table above is based on a single run and the exact numbers may vary slightly, but in our experiments the proportions did not change much.
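The cost structure described above can be sketched as a simple linear combination (an illustrative helper of our own, not a Google-provided API; per-unit rates must be taken from the current Google Cloud price list and are deliberately not hard-coded here):

```python
def dataflow_cost(vcpu_hr, mem_gb_hr, pd_gb_hr, rates, gpu_hr=0.0):
    """Dataflow job cost as a linear combination of resource-time.
    `rates` maps each resource to its per-unit price."""
    return (vcpu_hr * rates["vcpu"]
            + mem_gb_hr * rates["mem"]
            + pd_gb_hr * rates["pd"]
            + gpu_hr * rates.get("gpu", 0.0))
```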
The total savings, including cost and run time, are more than 10x when our model is accelerated with NVIDIA GPUs (TF-GPU) compared to using CPUs (TF-CPU). This means that by using NVIDIA GPUs for inference on this task, we get faster runtimes and lower costs than running the model on CPUs only.
NVIDIA-optimized inference libraries such as TensorRT allow users to run more complex and larger models on GPUs in Dataflow. TensorRT runs the same job 3.6x faster than TF-GPU, resulting in a 4.2x cost saving. Compared to TF-CPU, TensorRT yields 17x less execution time and about a 38x reduction in cost.
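As a quick sanity check, the 3.6x speedup follows directly from the execution times in the table above:

```python
# Total execution times from the table.
tf_gpu_s = 35 * 60 + 51     # 0:35:51 (TF-GPU)
tensorrt_s = 9 * 60 + 51    # 0:09:51 (TensorRT)

speedup = tf_gpu_s / tensorrt_s
print(round(speedup, 1))  # 3.6
```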
In this post, we compared the inference performance of TF-CPU, TF-GPU, and TensorRT for the question answer task performed in Google Cloud Dataflow. Dataflow users can get great benefits by using GPU workers and NVIDIA optimized libraries.
Accelerating deep learning model inference with NVIDIA GPUs and NVIDIA software is easy. By adding or changing a few lines, we can run models with TF-GPU or TensorRT. We’ve provided scripts and source files here and here for reference.
We thank Shan Kulandaivel, Valentyn Tymofiieev, and Reza Rokni of the Google Cloud Dataflow team, and Jill Milton and Fraser Gardiner of NVIDIA for their support and valuable feedback.