Running deep neural networks using divide-and-conquer checkpointing and tensor streaming
Applying reverse-mode automatic differentiation (AD) to a differentiable program requires saving the intermediate outputs of each operation on a data structure called the tape for use during the reverse sweep. Because the tape must fit in the available memory, this bounds the length of the differentiable program or the depth of a deep neural network. The problem is more severe for programs that run on the GPU, where memory is extremely limited (only 12GB for most consumer GPUs). Further, for parameterized programs like neural networks, where the goal is to compute the gradients of a scalar loss value with respect to the model parameters, the model parameters must also be stored in the limited available memory. These parameters can grow to hundreds of GBs in size, preventing even the instantiation of such deep networks on the GPU. In this thesis research, I present Scorch, a new deep learning framework with built-in support for two key features: (1) divide-and-conquer checkpointing, i.e., rearranging the application of reverse-mode AD to trade off space against time by storing only parts of the tape at any given time at the cost of recomputing the other parts later, and (2) tensor streaming between CPU RAM and GPU RAM, synchronized with the execution of reverse-mode AD. Divide-and-conquer checkpointing lifts the memory bound imposed by the tape: a program that would typically have a huge tape can run on limited memory by keeping only a small portion of the tape live at any given time. Tensor streaming lifts the memory bound imposed by the size of the model parameters: the parameters are stored in CPU RAM and streamed seamlessly to the GPU as required. Together, these techniques allow us to run gradient descent on a long-running differentiable program or a deep neural network of arbitrary size.
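The space-time trade-off behind divide-and-conquer checkpointing can be illustrated with a minimal sketch. It assumes a toy program made of n identical scalar steps and uses a simple binary split. The names `step`, `step_vjp`, `grad_full_tape`, and `grad_checkpointed` are hypothetical, chosen for illustration only. This is not Scorch's API, and Scorch may choose its split points differently.

```python
import math


def step(x):
    """One forward step of a toy differentiable program (hypothetical)."""
    return x + 0.1 * math.sin(x)


def step_vjp(x, g):
    """Reverse step: multiply the incoming gradient g by d(step)/dx at x."""
    return g * (1.0 + 0.1 * math.cos(x))


def grad_full_tape(x, n):
    """Ordinary reverse mode: the tape holds the input of all n steps."""
    tape = []
    for _ in range(n):
        tape.append(x)
        x = step(x)
    g = 1.0
    for saved in reversed(tape):
        g = step_vjp(saved, g)
    return g


def grad_checkpointed(x, n, g=1.0):
    """Binary divide-and-conquer checkpointing over n steps starting at x.

    Each recursion level keeps a single checkpoint (its own x), so roughly
    log2(n) states are live at once, at the cost of recomputing forward steps.
    """
    if n == 0:
        return g
    if n == 1:
        return step_vjp(x, g)
    half = n // 2
    mid = x
    for _ in range(half):  # recompute forward to the midpoint
        mid = step(mid)
    g = grad_checkpointed(mid, n - half, g)  # reverse the second half first
    return grad_checkpointed(x, half, g)  # then reverse the first half


if __name__ == "__main__":
    print(grad_full_tape(0.3, 1000))
    print(grad_checkpointed(0.3, 1000))
```

Both functions compute the same derivative of the final state with respect to the initial state. The full-tape version stores n intermediate values, while the recursive version stores about log2(n) checkpoints and performs on the order of n log n forward steps.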
I evaluate Scorch on several large real-world examples and show that it can create and run gradient descent on the popular image classification network ResNet with a depth of up to 250,000 layers, the popular image segmentation network DraNet with a depth of up to 250,000 layers, and the popular language generation model GPT-3 with 175 billion parameters, all while maintaining highly efficient use of CPU and GPU resources.