The paper *Efficient Matrix-Free Approximations of Second-Order Information, with Applications to Pruning and Optimization*, written by Elias Frantar, Eldar Kurtic, and Assistant Professor Dan Alistarh of IST Austria, introduces the Matrix-Free Approximate Curvature (M-FAC) pruning method. M-FAC builds on advances from the WoodFisher pruning paper, using first-order information (gradients) to efficiently approximate the corresponding second-order information and determine the optimal weights to prune. The algorithm is shown to outperform magnitude pruning as well as other second-order pruning techniques on a variety of one-shot and gradual pruning tasks.
SparseML makes it easy to use the M-FAC pruning algorithm as part of sparsification recipes to improve pruning recovery by providing an `MFACPruningModifier`.
The `MFACPruningModifier` contains the same settings as the magnitude pruning modifiers, plus extra settings for the M-FAC algorithm: `num_grads`, `fisher_block_size`, and `available_gpus`. Ideal values will depend on the system the pruning will run on and the model to be pruned.
The following is an example `MFACPruningModifier` to be used in place of other pruning modifiers in a recipe:
```yaml
pruning_modifiers:
  - !MFACPruningModifier
    params: __ALL_PRUNABLE__
    init_sparsity: 0.05
    final_sparsity: 0.85
    start_epoch: 1.0
    end_epoch: 61.0
    update_frequency: 4.0
    num_grads: {0.0: 256, 0.5: 512, 0.75: 1024, 0.83: 1400}
    fisher_block_size: 10000
    available_gpus: ["cuda:0"]
```
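Once the recipe is saved to a file, it can be applied to a standard PyTorch training loop. The following is a minimal sketch assuming SparseML's `ScheduledModifierManager` API; the toy model, data, and the file name `recipe.yaml` are placeholders, and import paths may differ between SparseML versions.

```python
# Minimal sketch (not a definitive integration) of applying a recipe that
# contains an MFACPruningModifier via SparseML's ScheduledModifierManager.
# The model, data, and the file name "recipe.yaml" are placeholder assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset
from sparseml.pytorch.optim import ScheduledModifierManager

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
train_loader = DataLoader(
    TensorDataset(torch.randn(512, 128), torch.randint(0, 10, (512,))),
    batch_size=32,
)
criterion = torch.nn.CrossEntropyLoss()

# Load the recipe and wrap the optimizer so the pruning schedule
# (gradient collection and pruning steps) runs on optimizer.step().
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(train_loader))

for epoch in range(62):  # covers the recipe's end_epoch of 61.0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

manager.finalize(model)  # remove pruning hooks once training is complete
```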
The M-FAC algorithm approximates second-order information from first-order gradients. `num_grads` specifies the number of recent gradient samples of the model to store while training. The value can be an int, in which case that constant number of gradients is used throughout pruning. Alternatively, the value can be a dictionary mapping sparsity levels (floats between 0.0 and 1.0) to the number of gradients to store once that sparsity level is reached. If a dictionary is used, then 0.0 must be included as a key to set the base number of gradients to store (e.g. `{0.0: 64, 0.5: 128, 0.75: 256}`).
Storing gradients can be expensive: for a dense model, each additional gradient sample stored requires roughly as much memory as the entire model itself. This is why the dictionary option allows more gradients to be stored as the model becomes more sparse.
If an M-FAC pruning run is unexpectedly killed, the most likely reason is that the gradient storage requirements exceeded the system's RAM. A safe rule of thumb is to keep the initial number of gradients to no more than one quarter of the available CPU RAM divided by the model size.
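As an illustration of this rule of thumb, the sketch below estimates a safe base value for `num_grads` from the model's parameter count and the machine's currently available RAM. The helper name and the use of `psutil` are assumptions for this example and not part of SparseML.

```python
# Illustrative sketch of the rule of thumb above; the helper name and the use
# of psutil are assumptions for this example, not SparseML API.
import psutil
import torch


def safe_initial_num_grads(model: torch.nn.Module, bytes_per_value: int = 4) -> int:
    # Each stored gradient sample of a dense model takes roughly as much
    # memory as the model's parameters themselves.
    model_bytes = sum(p.numel() for p in model.parameters()) * bytes_per_value
    available_ram = psutil.virtual_memory().available
    # Keep total gradient storage under ~1/4 of the available CPU RAM.
    return max(1, int((available_ram / 4) // model_bytes))


model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.Linear(1024, 1000))
# The result could be used as the 0.0 entry of a num_grads dictionary.
print(safe_initial_num_grads(model))
```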
To limit the computational cost of calculating second-order information, the M-FAC algorithm can compute a block-diagonal matrix with a fixed block size that is sufficient for generating the information needed for pruning. `fisher_block_size` specifies this block size. If GPUs are used to perform the M-FAC computations, each GPU should have enough extra memory during training to hold `num_grads * fisher_block_size` values, so that each block can be stored and computed sequentially on a GPU. The default block size is 2000, and block sizes between 1000 and 10000 are generally ideal. If `None` is provided, the full matrix will be computed without blocks.
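For a quick sanity check of this GPU memory requirement, the back-of-the-envelope sketch below converts `num_grads * fisher_block_size` values into GiB, assuming 4-byte (fp32) values; the function is for illustration only and is not part of SparseML.

```python
# Back-of-the-envelope estimate of the extra GPU memory described above:
# roughly num_grads * fisher_block_size values, assumed here to be fp32.
def mfac_gpu_scratch_gib(num_grads: int, fisher_block_size: int, bytes_per_value: int = 4) -> float:
    return num_grads * fisher_block_size * bytes_per_value / 1024 ** 3


# With the example recipe's largest settings (num_grads 1400, block size 10000):
print(f"{mfac_gpu_scratch_gib(1400, 10000):.3f} GiB")  # about 0.052 GiB per block buffer
```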
`available_gpus` is a list of GPU device names to perform the M-FAC computations on. If not provided, the computation will be done on the CPU.
Tutorials for using M-FAC with SparseML are provided in the tutorials directory. Currently, tutorials are available for both one-shot and gradual pruning with M-FAC.
For Neural Magic support, sign up or log in to our Neural Magic Community Slack. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue.