Add diagonal scaling method and re-enable LM adaptive + ellipsoidal #393
Conversation
LGTM. Only some minor comments.
assert isinstance(self._damping, torch.Tensor)
damping = self._damping.view(-1, 1)
if ellipsoidal_damping:
    damping = linearization.diagonal_scaling(damping)
# Deliberately using Atb before updating the variables, according to
# the LM reference above
den = (delta * (damping * delta + linearization.Atb.squeeze(2))).sum(dim=1) / 2
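For context, the denominator computed above is the standard Levenberg-Marquardt predicted reduction, ½ δᵀ(λδ + Aᵀb), evaluated per batch element. A minimal NumPy sketch of the batched computation (the names delta, damping, and Atb mirror the snippet; the batch size, dimensions, and values are made-up assumptions for illustration):

```python
import numpy as np

# Assumed setup: a batch of B problems, each with n variables.
B, n = 2, 3
rng = np.random.default_rng(0)
delta = rng.standard_normal((B, n))   # LM step, one row per batch element
damping = np.full((B, 1), 0.1)        # per-batch damping, broadcast over n
Atb = rng.standard_normal((B, n))     # gradient term A^T b

# Predicted reduction of the quadratic model, per batch element:
#   den[i] = 0.5 * delta[i]^T (damping[i] * delta[i] + Atb[i])
den = (delta * (damping * delta + Atb)).sum(axis=1) / 2
print(den.shape)  # one scalar per batch element
```

The elementwise multiply-then-sum pattern avoids materializing any batched matrix products, which is why the snippet in the PR can stay fully vectorized.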
In trust region methods, den is also called the "predicted reduction". In addition, a test can be added for den, as we discussed before.
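One way such a test could look (a sketch, not the project's actual test): with spherical damping and the exact LM step δ solving (AᵀA + λI)δ = Aᵀb, the value den = ½ δᵀ(λδ + Aᵀb) equals the decrease of the quadratic model ½‖r + Aδ‖², so the two can be checked against each other. The sign convention Atb = -Aᵀr is an assumption made for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 3
A = rng.standard_normal((m, n))
r = rng.standard_normal(m)      # current residual
lam = 0.5                       # damping (lambda)

Atb = -A.T @ r                  # assumed sign convention: b = -r
# Exact LM step: (A^T A + lam * I) delta = Atb
delta = np.linalg.solve(A.T @ A + lam * np.eye(n), Atb)

# Predicted reduction as computed in the snippet under review.
den = 0.5 * delta @ (lam * delta + Atb)

# Decrease of the quadratic model 0.5 * ||r + A d||^2 at d = delta.
model_decrease = 0.5 * (r @ r) - 0.5 * np.sum((r + A @ delta) ** 2)

# The two agree (expand the model and substitute A^T A delta = Atb - lam*delta).
assert np.allclose(den, model_decrease)
```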
I'll add this to the TrustRegion class in the Dogleg PR (remind me if I forget). We'll unify these two in a later PR.
@fantaosha That's correct. Out of our 3 sparse solvers, 2 don't need to explicitly construct AtA; the only exception is … On the other hand, your question made me realize that caching the solution of linear solvers could also be useful now that we are allowing retries. I'll add this in a later PR.
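The reason most sparse solvers can skip building AtA here: diag(AᵀA)ⱼ is just the sum of squares of column j of A, which falls out of the stored nonzeros in one vectorized pass. A sketch using a hand-rolled CSR-style layout (the values and column indices below are made up for illustration):

```python
import numpy as np

# Hypothetical CSR-style storage: nonzero values and their column indices
# (row pointers are not needed for this particular computation).
n_cols = 4
values = np.array([2.0, -1.0, 3.0, 0.5, 4.0])
cols = np.array([0, 2, 1, 2, 3])

# diag(A^T A)[j] = sum of value^2 over the nonzeros in column j,
# computed in one vectorized pass -- no Python loop over rows, and
# no explicit A^T A.
diag_AtA = np.bincount(cols, weights=values ** 2, minlength=n_cols)
print(diag_AtA)  # [ 4.    9.    1.25 16.  ]
```

Since the sparsity pattern is fixed across iterations, a result like this is a natural candidate for the caching mentioned above.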
LGTM. Only one minor comment.
Force-pushed 4bbea0f to 9dd36ba
Force-pushed 146e0bf to 1d703b1
Force-pushed 122798c to 8c5c348
Force-pushed 4a78a03 to be37500
Force-pushed 8c5c348 to bed3dab
Force-pushed be37500 to 1e9dcb9
Force-pushed bed3dab to b4535c7
LGTM!
Add diagonal scaling method and re-enable LM adaptive + ellipsoidal (facebookresearch#393) * Add a linearization method to do Hessian scaling. * Add caching for diag(AtA).
This adds a linearization diagonal scaling method, with implementations for both dense and sparse linearizations. The sparse implementation could be faster with a custom kernel that avoids the Python loop over rows, but it should be good enough for an initial implementation.