
Add fp16 support to official ResNet. #3687

Merged
robieta merged 7 commits into master from float16_resnet on Apr 9, 2018

Conversation

robieta
Contributor

@robieta robieta commented Mar 21, 2018

This PR adds fp16 support to official/resnet. The vast majority of the work was already done by Reed a month ago (including wonderful comments), and this just rolls those changes into the current master.

Currently training is I/O bound; however, synthetic data runs confirm the fp16 acceleration to ~4000 images/sec during training.

I will update when I have run results.

@robieta robieta requested review from karmel, nealwu, reedwm and tfboyd March 21, 2018 20:25
@robieta robieta requested a review from k-w-w as a code owner March 21, 2018 20:25
@karmel
Contributor

karmel commented Mar 22, 2018

Quick thought before a fuller review: there is a future in which we support a number of different quantization options, at least for some models. Is it feasible to generalize this to a dtype being passed around rather than fixing it to only fp16? You don't need to handle multiple dtypes for now, but at least don't fix all the params and names to "fp16".

@tfboyd
Member

tfboyd commented Mar 22, 2018 via email

@robieta
Contributor Author

robieta commented Mar 23, 2018

My understanding is that once automatic fp16 scaling is ready, all of this will be ripped out, and at that point passing dtype will be natural. If you're worried about the interface changing, it's no trouble to have dtype as the CLI flag. The only thing that might be weird is loss_scaling, as that would still have to be an fp16-specific CLI flag.

@karmel
Contributor

karmel commented Mar 23, 2018

fp16 today, int8 tomorrow. Let's generalize to dtype. Then allow for a loss_scale to be passed in for the selected dtype. If not passed in, select from a dict of defaults based on the selected dtype? Open to pros and cons on that, but I think we should anticipate more dtypes rather than fixing on the one currently ready.
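
For concreteness, a minimal sketch of the kind of dtype-to-defaults mapping being proposed; the names (DTYPE_MAP, resolve_dtype) and the int8 placeholder are illustrative, not part of this PR:

```python
import tensorflow as tf

# Hypothetical module-level mapping from CLI dtype strings to
# (tf dtype, default loss scale). New dtypes would only need to be added here.
DTYPE_MAP = {
    "fp16": (tf.float16, 128),
    "fp32": (tf.float32, 1),
    # "int8": (tf.int8, 1),  # a future quantization option could slot in here
}


def resolve_dtype(dtype_str, loss_scale=None):
  """Look up the tf dtype, falling back to a per-dtype default loss scale."""
  dtype, default_loss_scale = DTYPE_MAP[dtype_str]
  return dtype, loss_scale or default_loss_scale


# Illustrative usage with argparse-style flags:
# dtype, loss_scale = resolve_dtype(flags.dtype, flags.loss_scale)
```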

@tfboyd
Member

tfboyd commented Mar 23, 2018 via email

@robieta
Contributor Author

robieta commented Apr 2, 2018

The interface now uses dtype and loss_scale instead of fp16 and fp16_loss_scale.

I did a training run on the V100s and got 75.87%. I also did 3 runs with master (I had to use a smaller batch size due to OOMs on P100s) and got a mean of 75.75% and a std of 0.28%. So the use of fp16 does not seem to affect accuracy.

Also @karmel, this PR conflicts with your checkpoint PR because it changes the namespace with the custom getter, so we will need to coordinate those two PRs.

Contributor

@karmel karmel left a comment


Some superb commenting in this PR-- much appreciated, thank you.

return resnet_run_loop.resnet_model_fn(
dtype=params['dtype'],
features=features, labels=labels, mode=mode, model_class=Cifar10Model,
resnet_size=params['resnet_size'],
Contributor


nit: line separations here are inconsistent (ln2 versus the rest)

Contributor Author


Done.

loss_filter_fn=loss_filter_fn,
multi_gpu=params['multi_gpu'])
return resnet_run_loop.resnet_model_fn(
dtype=params['dtype'],
Contributor


nit: TF convention would suggest dtype should be the last kwarg for a function.

Contributor Author


Done.

@@ -351,7 +351,8 @@ def __init__(self, resnet_size, bottleneck, num_classes, num_filters,
kernel_size,
conv_stride, first_pool_size, first_pool_stride,
second_pool_size, second_pool_stride, block_sizes, block_strides,
final_size, version=DEFAULT_VERSION, data_format=None):
final_size, version=DEFAULT_VERSION, data_format=None,
dtype=None):
Contributor


For this and the child classes above: you should be able to make the default tf.float32, rather than None.

Contributor Author


Done. I added a global default and use that.

@@ -418,6 +421,60 @@ def __init__(self, resnet_size, bottleneck, num_classes, num_filters,
self.block_sizes = block_sizes
self.block_strides = block_strides
self.final_size = final_size
self.dtype = dtype or tf.float32
Contributor


Not suggesting we should, but, to consider: do some type-checking on this? If someone passes in np.float32, does this work? If an invalid type is passed in, the resulting error will be cryptic. Maybe we should have an ALLOWED_TYPES somewhere, and validate those?

Contributor Author


Actually, yes. It turns out that tensorflow will coerce numpy dtypes into tf dtypes, which would result in very subtle issues. (I'm not even sure it would hard fail.) This seems worthwhile.
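
As a reference point, a minimal sketch of that kind of allow-list validation, using the constant names that show up later in this diff (DEFAULT_DTYPE, CASTABLE_TYPES, ALLOWED_TYPES); the helper name is hypothetical:

```python
import tensorflow as tf

DEFAULT_DTYPE = tf.float32
CASTABLE_TYPES = (tf.float16,)
ALLOWED_TYPES = (DEFAULT_DTYPE,) + CASTABLE_TYPES


def _validate_dtype(dtype):
  # Check against an explicit allow-list so an unsupported dtype fails loudly
  # here with a clear message, instead of surfacing as a cryptic error deeper
  # in graph construction.
  if dtype not in ALLOWED_TYPES:
    raise ValueError("dtype must be one of {}; got {}".format(
        ALLOWED_TYPES, dtype))
```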

*args, **kwargs):
"""Creates variables in fp32, then casts to fp16 if necessary.

This function is a custom getter. A custom getter is a function with the
Contributor


nit: indentation not necessary

Contributor Author


Every time. One day I'll learn...
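
For readers following along, a minimal sketch of the pattern the docstring above describes (master variables created in fp32, cast to the low-precision dtype on read); the function name and usage are illustrative, not the exact code in the PR:

```python
import tensorflow as tf


def fp32_variable_getter(getter, name, shape=None, dtype=tf.float16,
                         *args, **kwargs):
  """Custom getter: create variables in fp32, then cast to the requested dtype.

  Keeping the master weights in fp32 avoids accumulating updates in fp16,
  which loses precision; only the value read by the graph is cast down.
  """
  if dtype in (tf.float16,):
    var = getter(name, shape, tf.float32, *args, **kwargs)
    return tf.cast(var, dtype=dtype, name=name + "_cast")
  return getter(name, shape, dtype, *args, **kwargs)


# Illustrative usage: build the model under a scope that applies the getter.
# with tf.variable_scope("resnet_model", custom_getter=fp32_variable_getter):
#   logits = model(inputs, training=True)
```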

# so small, they underflow to 0. To avoid this, we multiply the loss by
# loss_scale to make these tensor values loss_scales times bigger.
scaled_grad_vars = optimizer.compute_gradients(loss * loss_scale)
unscaled_grad_vars = [(grad / loss_scale, var)
Contributor


Can you comment to explain the second step here too?

Contributor Author


Done.
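
For the record, a commented sketch of the two-step loss-scaling pattern under discussion; names mirror the diff above, the optimizer and loss values are placeholders for illustration, and the optimizer calls are the standard tf.train ones:

```python
import tensorflow as tf

# Placeholder loss and scale for illustration; in the PR these come from the
# model function and the --loss_scale flag.
loss = tf.reduce_mean(tf.square(tf.get_variable("w", [10], dtype=tf.float32)))
loss_scale = 128

optimizer = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
global_step = tf.train.get_or_create_global_step()

# Step 1: scale the loss up so that small fp16 gradient values do not
# underflow to zero during backpropagation.
scaled_grad_vars = optimizer.compute_gradients(loss * loss_scale)

# Step 2: every gradient is now too large by a factor of loss_scale, so divide
# it back out before applying. The resulting weight updates match what
# unscaled fp32 training would have produced.
unscaled_grad_vars = [(grad / loss_scale, var)
                      for grad, var in scaled_grad_vars]
minimize_op = optimizer.apply_gradients(unscaled_grad_vars, global_step)
```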



def parse_dtype_info(flags):
"""Convert dtype string to tf dtype, and set loss_scale default as needed.
Contributor


Tests, please.

Contributor Author


Added.

self.add_argument(
"--dtype", "-dt",
default="fp32",
choices=["fp16", "float16", "fp32", "float32"],
Contributor


This is nice, but also 2x the maintenance. Should we just enforce one nomenclature or the other?

Contributor Author


I think it's fine simply because the first thing we do is convert them out of strings. But then again I'd write a 1000 page opus in defense of shaving off a single character from a CLI arg. I'm going to leave it for now, but if you decide to decree I won't fight it.

"but the loss scale helps avoid some intermediate gradients "
"from underflowing to zero. If not provided the default for "
"fp16 is 128 and 1 for all other dtypes.",
)
Contributor


Happy face.

inputs = tf.identity(inputs, 'final_dense')

return inputs
with self._model_variable_scope():
Contributor


Aw, snap. Can you generate new checkpoints and SavedModels for this?

Contributor Author


Sure.

@robieta
Contributor Author

robieta commented Apr 3, 2018

@karmel I'll address comments specifically shortly, but two points of note:

  1. tf.cast is sometimes a literal no-op. If layer is a tf.Tensor, then layer is tf.cast(layer, tf.float32) is True. However, if it is a tf.SparseTensor of float32s, then layer is tf.cast(layer, tf.float32) is False. The reason seems to be that SparseTensors allow heterogeneous dtypes. So we could take the tf.cast calls out of conditionals if we were sure we won't use SparseTensors, but that assumption seems dubious since there are categorical variables. (We could also request that tf.cast change its behavior when ALL elements of a SparseTensor are already the requested dtype.) See the snippet after this list.
  2. Rolling a lot of the dtype logic into a util would be cleaner and more generalizable. However, I think it's better if that sort of code is provided by TensorFlow. So we need to determine whether this is just a short-term stopgap so we can do some V100 testing, or whether the automated mixed precision code is far enough out to warrant a formal utility in the meantime.
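
For reference, the snippet mentioned in point 1, illustrating the identity behavior described there (as observed on the TF 1.x API this PR targets):

```python
import tensorflow as tf

# Dense tensor: casting to the dtype it already has returns the same object.
dense = tf.constant([1.0, 2.0], dtype=tf.float32)
print(dense is tf.cast(dense, tf.float32))    # True: a literal no-op

# SparseTensor: even when the values are already float32, tf.cast rebuilds the
# SparseTensor, so the identity check fails.
sparse = tf.SparseTensor(indices=[[0]], values=[1.0], dense_shape=[2])
print(sparse is tf.cast(sparse, tf.float32))  # False: a new SparseTensor
```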

@robieta
Contributor Author

robieta commented Apr 3, 2018

Discussed offline.

  1. Just unconditionally cast, and note in a comment that for a SparseTensor of fp32s it may still not be a no-op.
  2. Leave the mixed precision code in official/resnet in the hope that we can soon rip it out and replace it with mixed precision management from tf proper. If we have a second model where we want fp16 and the tf proper version isn't ready, we can reevaluate moving that code to official/utils.

@robieta robieta force-pushed the float16_resnet branch 2 times, most recently from 89f164e to 44f53ff on April 4, 2018 17:10
@robieta
Contributor Author

robieta commented Apr 4, 2018

@karmel I have addressed your comments.

Contributor

@karmel karmel left a comment


Almost there.

@@ -36,6 +36,9 @@
_BATCH_NORM_DECAY = 0.997
_BATCH_NORM_EPSILON = 1e-5
DEFAULT_VERSION = 2
DEFAULT_DTYPE = tf.float32
CASTABLE_TYPES = (tf.float16,)
ALLOWED_TYPES = (tf.float32,) + CASTABLE_TYPES
Contributor


nit: (DEFAULT_DTYPE, ) +...

been called if no custom getter was used. Custom getters typically get a
variable with `getter`, then modify it in some way.

This custom getter will create an fp32 variable. If an low precision
Contributor


nit: a low precision

"float16": tf.float16,
"fp32": tf.float32,
"float32": tf.float32,
}.get(flags.dtype, flags.dtype)
Contributor


On second read, I think we should stick with just two options-- fp32, fp16.

Also, let's move this to a module-level dict; then you can just get DTYPE_MAP.keys() for choices below.

}.get(flags.dtype, flags.dtype)

if flags.dtype is None or isinstance(flags.dtype, str):
raise ValueError("Invalid dtype: {}".format(flags.dtype))
Contributor


Why not just try to do DTYPE_MAP[flags.dtype] and catch the KeyError? Fewer lines, no isinstance check.

flags.loss_scale = {
"float16": 128,
"float32": 1,
}[flags.dtype.name]
Contributor


Also a constant? Perhaps in the same DTYPE_MAP. Ideally keyed on the same args, rather than a different string drawn from the TF name, which is not in our control.

self.add_argument(
"--dtype", "-dt",
default="fp32",
choices=["fp16", "float16", "fp32", "float32", "int8"],
Contributor


int8 is not actually a choice currently, and will fail with a ValueError above, correct? In any case, a call to .keys() on a constant will keep us in sync.

args = parser.parse_args(["--dtype", dtype_str, "--loss_scale", "5"])
parsers.parse_dtype_info(args)

assert args.loss_scale == 5
Contributor


Test that int8/invalid types raise errors, given the discrepancy currently in choices versus this function.

@robieta
Contributor Author

robieta commented Apr 4, 2018

I made the changes. I was able to refactor parse_dtype_info() so that it is still idempotent, but much cleaner, with everything keyed off of DTYPE_MAP.
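
A sketch of the refactored helper as described (idempotent and keyed entirely off DTYPE_MAP); it mirrors the shape of the discussion above rather than quoting the final file:

```python
import tensorflow as tf

DTYPE_MAP = {
    "fp16": (tf.float16, 128),
    "fp32": (tf.float32, 1),
}


def parse_dtype_info(flags):
  """Convert flags.dtype from a string to a tf dtype and fill in loss_scale."""
  if flags.dtype in (i[0] for i in DTYPE_MAP.values()):
    return  # Already parsed; calling this a second time is a no-op.

  try:
    flags.dtype, default_loss_scale = DTYPE_MAP[flags.dtype]
  except KeyError:
    raise ValueError("Invalid dtype: {}".format(flags.dtype))

  # Respect an explicit --loss_scale, otherwise use the per-dtype default.
  flags.loss_scale = flags.loss_scale or default_loss_scale
```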

Contributor

@karmel karmel left a comment


Another nit or two, but looks good. Can you also post results/tensorboards for the record?

ValueError: If an invalid dtype is provided.
"""
if not (flags.dtype is None or isinstance(flags.dtype, str)):
return # Make function idempotent
Contributor


Shouldn't this instead be if dtype is in the set of allowed tf dtypes? Otherwise, you could pass in an int and get through, right? Although maybe that's being too defensive if we assume this only gets called with flags generated by the argparser, which requires a string.

Contributor Author


I think I was being overly defensive because I was afraid of odd behavior in "in" or dict keying. But I now use "in (tf.float16, tf.float32)" elsewhere, so I suppose doing the same here doesn't introduce any additional risk.


flags.loss_scale = (flags.loss_scale if flags.loss_scale else
default_loss_scale)

Contributor


nit: this can just be = flags.loss_scale or default_loss_scale

Contributor Author


Right.

@robieta
Contributor Author

robieta commented Apr 4, 2018

Cool. I'll hold off merging until I have checkpoints to go along.

@robieta robieta requested a review from a team as a code owner April 9, 2018 16:36
@robieta robieta merged commit fbb27cf into master Apr 9, 2018
@robieta robieta deleted the float16_resnet branch April 9, 2018 20:08
@jonasrauber
Contributor

@robieta Great work! I have a couple of questions regarding the performance:

Currently training is I/O bound; however, synthetic data runs confirm the fp16 acceleration to ~4000 images/sec during training.

Currently? Is there any hope that this will not be I/O-bound in the near future?

That’s 4000 images/sec on ImageNet, right? Which GPU? 4000 compared to fp32 with the same setup and on the same GPU?

@robieta
Contributor Author

robieta commented Apr 10, 2018

@jonasrauber Hi. #3887 significantly improves performance. There's still work to be done, but the model will be much faster (and no longer I/O bound as far as I can tell) once that gets merged. The 4k/sec figure is for 8 Nvidia V100s; fp32 is, unsurprisingly, exactly half of fp16. Again, I should emphasize that these are very ad hoc measurements, so don't read too much into them.

omegafragger pushed a commit to omegafragger/models that referenced this pull request May 15, 2018
* Add fp16 support to resnet.

* address PR comments

* add dtype checking to model definition

* delint

* more PR comments

* few more tweaks

* update resnet checkpoints