If you've usedCloud Machine Learning (ML) Engine, you know that it can train and deploy any TensorFlow, scikit-learn, and XGBoost models at large scale in the cloud. But did you know that Cloud ML Engine also allows you to use TensorFlow’s profiling mechanisms that can help you analyze and improve your model's performance even further?
Whether you use low-level APIs such as tf.Graph and tf.Session or high-level APIs such astf.Estimator andtf.Dataset, it can sometimes be useful to understand how models perform at a lower level to tune your code for efficiency. For example, you might be interested in details about model architectures (e.g., device placement and tensor shapes) or about the performance of specific batch steps (e.g., execution time, memory consumption, or expensive operations).
In this post, we show you different tools that can help you gain useful insights into your Cloud ML Engine profiling information so that you can squeeze out that extra bit of performance for your models.
The examples presented in this post are based on this codelab and this notebook, which analyze a US natality dataset to predict newborns’ birth weights. While not necessary, you can follow the codelab first to familiarize yourself with the code. You can find the full code samples for this post and their prerequisites here.
Basic profiling
The simplest tool at your disposal for profiling the model training process is tf.train.ProfilerHook. ProfilerHook captures profile traces for individual steps that can give you an overview of the individual TensorFlow operations (i.e., the low-level API calls associated with the nodes in your TensorFlow graph), their dependencies and how they are attributed to hardware devices (CPUs, GPUs, and TPUs). In turn, `ProfilerHook` can help you identify possible bottlenecks so you can make targeted improvements to your pipeline and choose the right Cloud ML Engine cluster configuration.
If you already use TensorBoard to visualize your TensorFlow graph and store the profile traces in the same directory as the one used for your checkpoints, you will see two additional tabs named “Memory” and “Compute Time” in TensorBoard, at the bottom of the right sidebar. You will also see information about the total compute time, memory size, and tensor output shapes when clicking on a node, as described here.
Capturing traces for every step over the entire training process is often impractical, because that process can become resource-intensive, significantly increase your training times, and generate volumes of information that are too large to analyze. To reduce the volume of generated information, you can lower the sampling rate by using either the save_steps
or the save_secs
attributes to only save profile traces respectively every N steps or N seconds. Below is an example that captures traces every 10 steps: