This module implements common building blocks for larger neural network models in a Keras-like style. It does not implement a general autograd system, in order to emphasize conceptual understanding over flexibility.
Activations. Common activation nonlinearities. Includes:
- Rectified linear units (ReLU) (Hahnloser et al., 2000)
- Leaky rectified linear units (Maas, Hannun, & Ng, 2013)
- Hyperbolic tangent (tanh)
- Logistic sigmoid
- Affine
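For reference, the forward passes of these nonlinearities each reduce to a line or two of NumPy. The following is a minimal, standalone sketch; the function names and signatures are illustrative, not the module's actual API:

```python
import numpy as np

def relu(z):
    # max(0, z), applied elementwise
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.3):
    # like ReLU, but with a small slope `alpha` for negative inputs
    return np.where(z > 0, z, alpha * z)

def tanh(z):
    return np.tanh(z)

def sigmoid(z):
    # logistic sigmoid: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def affine(z, slope=1.0, intercept=0.0):
    # reduces to the identity when slope=1 and intercept=0
    return slope * z + intercept
```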
Losses. Common loss functions. Includes:
- Squared error
- Categorical cross entropy
- VAE Bernoulli loss (Kingma & Welling, 2014)
- Wasserstein loss with gradient penalty (Gulrajani et al., 2017)
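As a quick illustration, the two simplest losses above can be written directly in NumPy. This sketch assumes one-hot targets and is not the module's exact interface:

```python
import numpy as np

def squared_error(y, y_pred):
    # 0.5 * ||y - y_pred||^2, summed over all output dimensions
    return 0.5 * np.sum((y - y_pred) ** 2)

def cross_entropy(y, y_pred, eps=1e-12):
    # categorical cross entropy for one-hot targets `y` and predicted
    # class probabilities `y_pred`; `eps` guards against log(0)
    return -np.sum(y * np.log(y_pred + eps))
```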
Wrappers. Layer wrappers. Includes:
- Dropout (Srivastava, et al., 2014)
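One common formulation is "inverted" dropout, sketched below, which rescales surviving activations at training time so the forward pass needs no change at test time. This is a minimal sketch of the idea, not necessarily how the wrapper here is implemented:

```python
import numpy as np

def dropout_forward(X, p_keep=0.5, train=True):
    # zero each activation with probability (1 - p_keep) during training
    # and rescale the survivors by 1 / p_keep ("inverted" dropout)
    if not train:
        return X
    mask = (np.random.rand(*X.shape) < p_keep) / p_keep
    return X * mask
```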
Layers. Common layers / layer-wise operations that can be composed to create larger neural networks. Includes:
- Fully-connected
- Sparse evolutionary (Mocanu et al., 2018)
- Dot-product attention (Luong, Pham, & Manning, 2015; Vaswani et al., 2017)
- 1D and 2D convolution (with stride, padding, and dilation) (van den Oord et al., 2016; Yu & Koltun, 2016)
- 2D "deconvolution" (with stride and padding) (Zeiler et al., 2010)
- Restricted Boltzmann machines (with CD-n training) (Smolensky, 1986; Carreira-Perpiñán & Hinton, 2005)
- Elementwise multiplication
- Summation
- Flattening
- Softmax
- Max & average pooling
- 1D and 2D batch normalization (Ioffe & Szegedy, 2015)
- 1D and 2D layer normalization (Ba, Kiros, & Hinton, 2016)
- Recurrent (Elman, 1990)
- Long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997)
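To give a flavor of how such layers are typically structured without autograd, here is a minimal fully-connected layer with explicit forward and backward passes; the class and attribute names are illustrative only:

```python
import numpy as np

class FullyConnected:
    """Minimal dense layer: out = X @ W + b."""

    def __init__(self, n_in, n_out):
        # small random weights; see the Initializers section for better schemes
        self.W = 0.01 * np.random.randn(n_in, n_out)
        self.b = np.zeros(n_out)

    def forward(self, X):
        self.X = X  # cache the input for the backward pass
        return X @ self.W + self.b

    def backward(self, dLdY):
        # gradients of the loss w.r.t. the parameters and the layer input
        self.dW = self.X.T @ dLdY
        self.db = dLdY.sum(axis=0)
        return dLdY @ self.W.T
```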
Optimizers. Common modifications to stochastic gradient descent. Includes:
- SGD with momentum (Rumelhart, Hinton, & Williams, 1986)
- AdaGrad (Duchi, Hazan, & Singer, 2011)
- RMSProp (Tieleman & Hinton, 2012)
- Adam (Kingma & Ba, 2015)
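The update rules themselves are short; for example, SGD with momentum and Adam can be sketched as pure functions over (parameter, gradient, state) triples. The names and defaults below are illustrative assumptions, not the module's API:

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity, lr=0.01, momentum=0.9):
    # classical momentum: v <- momentum * v - lr * grad; param <- param + v
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: bias-corrected running estimates of the first and second moments
    # of the gradient; `t` is the 1-based timestep
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```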
Learning Rate Schedulers. Common learning rate decay schedules.
- Constant
- Exponential decay
- Noam/Transformer scheduler (Vaswani et al., 2017)
- King/Dlib scheduler (King, 2018)
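For example, the exponential and Noam schedules can be written as simple functions of the step count. The parameter names and defaults below are illustrative, not the module's:

```python
import numpy as np

def exponential_decay(step, lr0=0.01, decay=0.1, stage_length=500):
    # lr = lr0 * exp(-decay * floor(step / stage_length))
    return lr0 * np.exp(-decay * (step // stage_length))

def noam_schedule(step, d_model=512, warmup_steps=4000, scale=1.0):
    # the Transformer schedule from Vaswani et al. (2017): linear warmup
    # followed by inverse-square-root decay
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```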
Initializers. Common weight initialization strategies.
- Glorot/Xavier uniform and normal (Glorot & Bengio, 2010)
- He/Kaiming uniform and normal (He et al., 2015)
- Standard normal
- Truncated normal
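The Glorot and He schemes differ only in how they scale the sampling distribution by the layer's fan-in/fan-out. A minimal sketch for 2D weight matrices (illustrative signatures, not the module's API):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, gain=1.0):
    # U(-b, b) with b = gain * sqrt(6 / (fan_in + fan_out))
    b = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-b, b, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # N(0, sqrt(2 / fan_in)), suited to ReLU-family activations
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```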
Modules. Common multi-layer blocks that appear across many deep networks. Includes:
- Bidirectional LSTMs (Schuster & Paliwal, 1997)
- ResNet-style "identity" (i.e.,
same
-convolution) residual blocks (He et al., 2015) - ResNet-style "convolutional" (i.e., parametric) residual blocks (He et al., 2015)
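Structurally, the identity residual block is just "transform, add the input back, apply the nonlinearity." The sketch below captures that skeleton with arbitrary shape-preserving callables standing in for the batchnorm/same-convolution pairs; it illustrates the pattern rather than the block's actual implementation:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def identity_residual_block(X, transform1, transform2):
    # apply two shape-preserving transforms, then add the skip connection
    Y = relu(transform1(X))
    Y = transform2(Y)
    return relu(X + Y)

# toy usage: dense maps stand in for the conv layers of a real block
X = np.random.randn(2, 8)
W1, W2 = np.random.randn(8, 8), np.random.randn(8, 8)
out = identity_residual_block(X, lambda z: z @ W1, lambda z: z @ W2)
```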
- WaveNet-style residual block with dilated causal convolutions (van den Oord et al., 2016)
- Transformer-style multi-headed dot-product attention (Vaswani et al., 2017)
Models. Well-known network architectures. Includes:
- vae.py: Bernoulli variational autoencoder (Kingma & Welling, 2014)
- wgan_gp.py: Wasserstein generative adversarial network with gradient penalty (Gulrajani et al., 2017; Goodfellow et al., 2014)
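A key ingredient of the VAE is the reparameterization trick, which keeps the latent sampling step differentiable. A minimal sketch, assuming a diagonal Gaussian posterior (the function name is illustrative):

```python
import numpy as np

def reparameterize(mu, log_var):
    # sample z ~ N(mu, sigma^2) as mu + sigma * eps with eps ~ N(0, I),
    # so gradients can flow through mu and log_var
    eps = np.random.randn(*mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```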
Utils. Common helper functions, primarily for dealing with CNNs. Includes:
- im2col
- col2im
- conv1D
- conv2D
- dilate
- deconv2D
- minibatch
- Various weight initialization utilities
- Various padding and convolution arithmetic utilities
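As a small example of what such helpers do, a zero-insertion dilation routine and a minibatch index generator might look like the following; the shapes and signatures here are assumptions for illustration, not the module's exact interface:

```python
import numpy as np

def dilate(X, d):
    # insert `d` zeros between adjacent rows/columns of each feature map;
    # X is assumed to have shape (n_samples, height, width, channels)
    n, h, w, c = X.shape
    out = np.zeros((n, h + (h - 1) * d, w + (w - 1) * d, c), dtype=X.dtype)
    out[:, :: d + 1, :: d + 1, :] = X
    return out

def minibatch(X, batchsize=32, shuffle=True):
    # yield index arrays that partition the rows of X into minibatches
    N = X.shape[0]
    idx = np.arange(N)
    if shuffle:
        np.random.shuffle(idx)
    for i in range(0, N, batchsize):
        yield idx[i : i + batchsize]
```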