This is generally achieved by utilizing the GPU as much as possible and thus filling GPU memory to its limit. Maximizing throughput (samples/second) leads to lower training cost.
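As a rough illustration of the throughput metric mentioned above, here is a minimal sketch of how one might time a training loop and compute samples/second; `step_fn` and `measure_throughput` are hypothetical names, and a real training step would run a forward and backward pass instead of the stand-in used here:

```python
import time

def measure_throughput(step_fn, batch_size, num_steps=10):
    """Run step_fn for num_steps iterations and return samples/second."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_steps * batch_size / elapsed

# Stand-in for one training step; substitute the actual forward/backward pass.
throughput = measure_throughput(lambda: time.sleep(0.01), batch_size=32)
print(f"{throughput:.1f} samples/second")
```

Comparing this number across batch sizes or optimizations shows directly whether a change improves training efficiency.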
When training large models, there are two aspects that should be considered at the same time:
If you have access to a machine with multiple GPUs, these approaches are still valid, and you can additionally leverage the methods outlined in the multi-GPU section.