Resnet distribution strategies #3887
Conversation
Can you add a short description of DistributionStrategy to the README? The first time I heard the term, it wasn't immediately obvious that it referred to multiple GPUs, and looking up "DistributionStrategy" on Google didn't really help.
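For the README, a minimal sketch of the idea (assuming the contrib-era API referenced later in this PR; my_model_fn is a stand-in name, not this repo's code):

import tensorflow as tf

# A DistributionStrategy tells the Estimator how to replicate computation
# across devices. MirroredStrategy copies the model onto each local GPU and
# aggregates gradients across the copies.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
run_config = tf.estimator.RunConfig(train_distribute=strategy)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)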
@@ -118,6 +118,7 @@ def __init__(self, add_help=False, data_dir=True, model_dir=True,
        metavar="<BS>"
    )

    # TODO(taylorrobie@): depricate and only use DistributionStrategies
nit: (sp) deprecate
fixed.
@@ -151,6 +152,7 @@ def __init__(self, add_help=False, num_parallel_calls=True, inter_op=True,
                     intra_op=True, use_synthetic_data=True, max_train_steps=True):
    super(PerformanceParser, self).__init__(add_help=add_help)

    # TODO(taylorrobie@): depricate and only use DistributionStrategies
nit: (sp) deprecate
fixed.
official/resnet/resnet_run_loop.py
Outdated
        tf.contrib.data.map_and_batch(
            lambda value: parse_record_fn(value, is_training),
            batch_size=per_device_batch_size,
            num_parallel_batches=1))
Why is num_parallel_batches set to 1 (wouldn't more improve performance)?
It still parallelizes the lambda call, so we probably only need >1 if there are stragglers (n_cores > batch_size).
If you want to make this more generic, the ideal thing is num_parallel_batches = num_cores / batch_size. Since usually batch_size > num_cores, I used 1. But yeah, you could do that division and take the ceiling or something?
Interestingly, when I had tried this with replicate_model_fn alone, it actually slowed performance. Same for tf_cnn. See tensorflow/benchmarks#137, and the abandoned branch in which I attempted it: https://github.com/tensorflow/models/compare/feat/contrib-data. This is just FYI, assuming that DistStrat gets better performance, but it's not a bad idea to take in num_cpus in any case; Transformer uses that as well, and it will be useful elsewhere.
Yes, that's definitely interesting. We saw a performance bump in both OneDeviceStrategy and MirroredStrategy with this approach. Agreed that it might still be worthwhile to take in the number of cores and use that to set num_parallel_batches.
I just tested this on a trivial imagenet resnet (so any limit is the input pipeline), and found no difference with num_parallel_batches. (This is 32 cores, 4xP100, batch_size=512.) I'm hard pressed to think of a case where num_cores < batch_size makes sense, so I'm inclined to leave it at 1 to keep the code simple.
That sounds good to me if you add a comment describing the other approach. CC @mrry in case you have thoughts.
Setting it to 1 seems like a fine way to go. Any larger would probably lead to congestion on the threadpool queues. The ideal number for a batch size of 512 is probably between 0 and 1, and there's an outstanding bug to support more precise control of the parallelism here. /cc @jsimsa
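For reference, a hedged sketch of the "more generic" option discussed above (not what this PR ships, which keeps num_parallel_batches=1); the helper name is mine:

import math
import multiprocessing

def choose_num_parallel_batches(per_device_batch_size):
  # Only exceeds 1 when there are more cores than examples in a batch,
  # i.e. the straggler case discussed above.
  num_cores = multiprocessing.cpu_count()
  return max(1, int(math.ceil(num_cores / float(per_device_batch_size))))

# Usage inside the input pipeline would then be roughly:
# dataset = dataset.apply(tf.contrib.data.map_and_batch(
#     lambda value: parse_record_fn(value, is_training),
#     batch_size=per_device_batch_size,
#     num_parallel_batches=choose_num_parallel_batches(per_device_batch_size)))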
official/resnet/resnet_run_loop.py
Outdated
    accuracy = tf.metrics.accuracy(
        tf.argmax(labels, axis=1), predictions['classes'])
  else:
    # Metrics are currently no compatible with distribution strategies
nit: not compatible
fixed.
official/resnet/resnet_run_loop.py
Outdated
      model_function,
      loss_reduction=tf.losses.Reduction.MEAN)
  # TODO(taylorrobie@): remove when per_device is no longer needed.
  assign_multi_gpu(flags)
How does this relate to per_device?
Also, it appears flags.multi_gpu is only used in the line warn_on_multi_gpu_export(flags.multi_gpu). This can be changed to use flags.use_distribution_strategy, and the multi_gpu flag can be completely removed.
It turns out we can probably remove multi_gpu altogether. Karmel just has to check that it doesn't break saved models.
Thanks @robieta! This looks great! Left some small comments and suggestions. Definitely a faithful port of the example, thank you.
official/resnet/cifar10_test.py
Outdated
@@ -110,14 +109,6 @@ def test_cifar10_model_fn_train_mode_v1(self):
  def test_cifar10_model_fn_trainmode__v2(self):
    self.cifar10_model_fn_helper(tf.estimator.ModeKeys.TRAIN, version=2)

  def test_cifar10_model_fn_train_mode_multi_gpu_v1(self):
Could we instead change these tests to run the distribution strategy version on multiple GPUs?
official/resnet/resnet_run_loop.py
Outdated
import tensorflow as tf  # pylint: disable=g-bad-import-order
# pylint: disable=g-bad-import-order
import tensorflow as tf
from tensorflow.contrib.distribute.python import mirrored_strategy
Actually you shouldn't need this. You can directly use them as tf.contrib.distribute.MirroredStrategy and tf.contrib.distribute.OneDeviceStrategy.
Excellent.
official/resnet/resnet_run_loop.py
Outdated
  # Operations between the final prefetch and the get_next call to the iterator
  # will happen synchronously during run time. We prefetch here again to
  # background all of the above processing work and keep it out of the
  # critical training path.
  dataset = dataset.prefetch(1)
  dataset.prefetch(buffer_size=tf.contrib.data.AUTOTUNE)
thanks!
Please add instructional comments-- what is autotune? What does it do?
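Roughly the kind of comment being asked for, shown on a standalone example (my wording, not the PR's):

import tensorflow as tf

# tf.contrib.data.AUTOTUNE is a sentinel value: instead of prefetching a fixed
# number of batches, the tf.data runtime tunes the buffer size dynamically at
# run time, which also lets DistributionStrategies fetch enough batches for
# however many devices are in use.
dataset = tf.data.Dataset.range(1024).batch(32)
dataset = dataset.prefetch(buffer_size=tf.contrib.data.AUTOTUNE)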
official/resnet/resnet_run_loop.py
Outdated
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    train_op = tf.group(optimizer.minimize(loss, global_step), update_ops)
  else:
    train_op = None

  accuracy = tf.metrics.accuracy(
      tf.argmax(labels, axis=1), predictions['classes'])
  if not distribute_lib.has_distribution_strategy():
this can be tf.contrib.distribute.has_distribution_strategy()
If we can check in code whether we have a dist strat via a tf function, why do we need to pass around use_distribution_strategies?
You're absolutely right @karmel. Anytime one is in the distribution strategy scope, the has_distribution_strategy check should work. I believe in our earlier code we didn't have the input function running inside the scope, so we had to pass around this boolean. But since then we've moved input processing under the distribution scope as well, and we can now use has_distribution_strategy() almost everywhere.
I just checked everywhere it's used. I think it can be replaced with has_distribution_strategy in all places except this one in resnet_main, and you already suggested removing that check from there entirely. So I think we can get rid of this flag entirely!
Indeed, that flag is gone.
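A rough sketch of the resulting pattern (the helper name is mine; the PR's actual model_fn differs): gate the metric on whether a strategy scope is active instead of threading a flag through.

import tensorflow as tf

def maybe_accuracy_metric(labels, predictions):
  # Metrics were not yet compatible with distribution strategies, so only
  # build them when no strategy scope is active.
  if not tf.contrib.distribute.has_distribution_strategy():
    return {'accuracy': tf.metrics.accuracy(
        tf.argmax(labels, axis=1), predictions['classes'])}
  return None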
from tensorflow.python.client import device_lib  # pylint: disable=g-import-not-at-top

local_device_protos = device_lib.list_local_devices()
num_gpus = sum([1 for d in local_device_protos if d.device_type == 'GPU'])
I wonder if this check is still useful in some form... e.g., should we check that the actual number of GPUs available >= gpus_for_distribution_strategy?
I feel like that check should live under the hood in DistStrat-- we had to break our own rules and import from the private API here. Is it feasible to do that on the DistStrat side? Or just ignore for now and let the error bubble up when it is hit...
+1 on living under DistStrat.
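A sketch of the availability check being discussed, assuming a num_gpus-style flag (the function name is mine); long term this would live inside DistributionStrategy rather than in the model code:

from tensorflow.python.client import device_lib  # private API, as noted above

def validate_num_gpus(requested_gpus):
  available = sum(1 for d in device_lib.list_local_devices()
                  if d.device_type == 'GPU')
  if requested_gpus > available:
    raise ValueError(
        'Requested {} GPUs for distribution, but only {} are visible.'.format(
            requested_gpus, available))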
  Args:
    add_help: Create the "--help" flag. False if class instance is a parent.
    batch_size: Create a flag to specify the batch size. (Instead of the one
what is this batch size here?
Global. I will make that more clear.
    )

    self.add_argument(
        "--gpus_for_distribution_strategy", "-gds",
Wonder if we should just rename gpus_for_distribution_strategy -> num_gpus everywhere?
Yes please. And I think we can also plausibly use num_gpus=0, num_gpus=1 to represent the fact that we want diststrat even in single-device cases. That would allow us to ditch the second arg, and roll this back into the main parser, where we already have multi_gpu.
+1
Done.
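A hedged sketch of the num_gpus convention proposed above (the helper name is mine, not necessarily the code that landed): 0 means CPU, 1 means a single GPU, and anything larger mirrors across GPUs.

import tensorflow as tf

def get_distribution_strategy(num_gpus):
  if num_gpus == 0:
    return tf.contrib.distribute.OneDeviceStrategy('/device:CPU:0')
  if num_gpus == 1:
    return tf.contrib.distribute.OneDeviceStrategy('/device:GPU:0')
  return tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)

# run_config = tf.estimator.RunConfig(
#     train_distribute=get_distribution_strategy(flags.num_gpus))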
official/resnet/cifar10_main.py
Outdated
@@ -104,20 +104,18 @@ def preprocess_image(image, is_training):


def input_fn(is_training, data_dir, batch_size, num_epochs=1,
             num_parallel_calls=1, multi_gpu=False):
             use_distribution_strategy=False,
             gpus_for_distribution_strategy=1):
I feel like we should generalize, rather than tie ourselves to the particular name of how we are distributing. Can we also reduce this to num_gpus, and then use dist strat if num_gpus > 1? That won't extend to future implementations, but we'll have to change this for the future implementations anyways.
(Ditto throughout.)
I think that's reasonable. I made the default 1 if tf.test.is_built_with_cuda() else 0. My justification is that if you don't specify --num_gpus and you have a GPU, people generally expect TensorFlow to do work there.
+1
Think of the params etc that we had as something done in crunch time :) so it would be great to change it to whatever makes most sense for a user.
official/resnet/imagenet_test.py
Outdated
  def test_resnet_model_fn_train_mode_multi_gpu_v2(self):
    self.resnet_model_fn_helper(tf.estimator.ModeKeys.TRAIN, version=2,
                                multi_gpu=True)
Removing tests without replacement? That seems unlike you, @robieta .
The thing is model_fn used to be device aware because of replicate_model_fn, and now it isn't. That's why I can safely remove that test. And we don't currently have the infrastructure for me to set up a multi-gpu end-to-end test.
official/resnet/resnet_run_loop.py
Outdated
from official.resnet import resnet_model
from official.utils.arg_parsers import parsers
from official.utils.export import export
from official.utils.logs import hooks_helper
from official.utils.logs import logger
# pylint: enable=g-bad-import-order
Once we move to top-level imports as @guptapriya notes above, we can just switch back to the single-line import order.
Indeed.
official/resnet/resnet_run_loop.py
Outdated
  Returns:
    Dataset of (image, label) pairs ready for iteration.
  """

  # TODO(taylorrobie@) remove when DistributionStrategies uses global batch size
Let's not leave TODOs in the public code; a comment explaining that this is only necessary for a short period because etc. etc. is sufficient. Also, a nit for future reference: the format is TODO(taylorrobie), and this is public code, so use robieta.
Duly noted.
official/resnet/resnet_run_loop.py
Outdated
    total_examples = num_epochs * examples_per_epoch
    dataset = dataset.take(batch_size * (total_examples // batch_size))
    dataset = dataset.take(
        per_device_batch_size * (total_examples // per_device_batch_size))
Is this still necessary? I would imagine not? This was originally a fix for the fact that replicate_model_fn would error out, as noted in the comment (which should also be updated in the unexpected case that this is still relevant). It would be great to remove this, because then we could avoid passing multi-GPU knowledge this far in, compute the batch size closer to the top of the processing, and pass the desired batch_size here without this func caring whether it's global or per-device.
Yes. Testing confirms that we can get rid of this entire section.
official/resnet/resnet_run_loop.py
Outdated
        'version': flags.version,
    })

  if flags.benchmark_log_dir is not None:
    benchmark_logger = logger.BenchmarkLogger(flags.benchmark_log_dir)
    benchmark_logger.log_run_info("resnet")
    benchmark_logger.log_run_info('resnet')
I swear I've changed this in about 5 branches now.
No kidding.
@@ -118,6 +118,7 @@ def __init__(self, add_help=False, data_dir=True, model_dir=True,
        metavar="<BS>"
    )

    # TODO(taylorrobie@): deprecate and only use DistributionStrategies
Ditto on Todos.
@@ -151,6 +152,7 @@ def __init__(self, add_help=False, num_parallel_calls=True, inter_op=True,
                     intra_op=True, use_synthetic_data=True, max_train_steps=True):
    super(PerformanceParser, self).__init__(add_help=add_help)

    # TODO(taylorrobie@): deprecate and only use DistributionStrategies
Right now, only MNIST uses, correct? Can you follow up with making that use DistStrat as well? And, in theory, WideDeep should be easy, because it's just estimators... is that true? Can you check and update as well if so?
"--gpus_for_distribution_strategy", "-gds", | ||
type=int, default=2, | ||
help="[default: %(default)s] How many GPUs to use with the " | ||
"DistributionStrategies API.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need links to the docs somewhere, and this seems like a good place. Maybe also in the comment noting that multi-GPU is experimental.
We have a reasonable README here that we could link to: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/distribute/README.md
I just checked with a saved_model exported from this PR, and it does indeed appear that this fixes the problem with replicate_model_fn, where you had to export the SavedModel on a single GPU even if training on multiple. CC @isaprykin and @k-w-w, who are interested parties. @robieta -- can you remove warn_on_multi_gpu as well?
Thanks for the thorough review @karmel ! We should definitely make this more usable than our version which was done in a hurry. I left my responses to some of your questions.
official/resnet/resnet_run_loop.py
Outdated
        'Found {} GPUs with a batch size of {}; try --batch_size={} instead.'
    ).format(num_gpus, batch_size, batch_size - remainder)
    raise ValueError(err)
  if use_distribution_strategy and gpus_for_distribution_strategy > 1:
+1
official/resnet/resnet_run_loop.py
Outdated
@@ -355,21 +370,35 @@ def resnet_main(flags, model_function, input_function, shape=None):
      allow_soft_placement=True)

  # Set up a RunConfig to save checkpoint and set session config.
  run_config = tf.estimator.RunConfig().replace(save_checkpoints_secs=1e9,
                                                session_config=session_config)
  if not flags.use_distribution_strategy:
I think that's a good idea. When we originally added these flags, we were not removing the previous multi-GPU approach, so we didn't want to enable it by default. But I think now it makes sense. OneDeviceStrategy should pretty much do what having no distribution strategy does.
official/resnet/resnet_run_loop.py
Outdated
    distribution = mirrored_strategy.MirroredStrategy(
        num_gpus=flags.gpus_for_distribution_strategy
    )
    run_config = tf.estimator.RunConfig(distribute=distribution).replace(
Ah yes, sorry, we just changed it after a discussion. You're right, it's called train_distribute now. Thanks for catching!
official/resnet/resnet_run_loop.py
Outdated
        num_gpus=flags.gpus_for_distribution_strategy
    )
    run_config = tf.estimator.RunConfig(distribute=distribution).replace(
        save_checkpoints_secs=1e9, session_config=session_config)
Hm, I am not 100% sure. We did test setting those as env variables during performance tuning but did not find significant benefits. But I don't believe we changed anything in dist strategy itself to support/not support them explicitly. @isaprykin perhaps can shed more light?
"--gpus_for_distribution_strategy", "-gds", | ||
type=int, default=2, | ||
help="[default: %(default)s] How many GPUs to use with the " | ||
"DistributionStrategies API.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a reasonable README here that we can link to?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/distribute/README.md
I've started addressing comments. I will do another round shortly.
official/resnet/resnet_run_loop.py
Outdated
  def input_fn(is_training, data_dir, batch_size,  # pylint: disable=unused-argument, missing-docstring
               use_distribution_strategy=False,
               gpus_for_distribution_strategy=1, *args, **kwargs):  # pylint: disable=unused-argument
    # TODO(taylorrobie@) cull DistributionStrategies uses global batch size
Yeah, once the input_fn is device blind there's a whole lot of variable passing that can be ripped out. So satisfying.
official/resnet/resnet_run_loop.py
Outdated
@@ -355,21 +370,35 @@ def resnet_main(flags, model_function, input_function, shape=None):
      allow_soft_placement=True)

  # Set up a RunConfig to save checkpoint and set session config.
  run_config = tf.estimator.RunConfig().replace(save_checkpoints_secs=1e9,
                                                session_config=session_config)
  if not flags.use_distribution_strategy:
Done.
official/resnet/resnet_run_loop.py
Outdated
                                                session_config=session_config)
  if not flags.use_distribution_strategy:
    run_config = tf.estimator.RunConfig().replace(
        save_checkpoints_secs=1e9, session_config=session_config)
Fine by me.
official/resnet/resnet_run_loop.py
Outdated
    distribution = mirrored_strategy.MirroredStrategy(
        num_gpus=flags.gpus_for_distribution_strategy
    )
    run_config = tf.estimator.RunConfig(distribute=distribution).replace(
Perils of working from contrib. ^^
Force-pushed from a819c2c to 4a1207a
Force-pushed from 4bcd195 to 667a760
It appears that some git tomfoolery has occurred. I will sort it out, and apologies to those of you who got dragged in as owners.
Force-pushed from 667a760 to 11192e9
@karmel I think we're ready for you to take another pass.
@@ -99,14 +99,14 @@ class BaseParser(argparse.ArgumentParser):
    model_dir: Create a flag for specifying the model file directory.
    train_epochs: Create a flag to specify the number of training epochs.
    epochs_between_evals: Create a flag to specify the frequency of testing.
    batch_size: Create a flag to specify the batch size.
    batch_size: Create a flag to specify the global batch size.
    multi_gpu: Create a flag to allow the use of all available GPUs.
Missing doc for num_gpu
  # critical training path. Setting buffer_size to tf.contrib.data.AUTOTUNE
  # allows DistributionStrategies to adjust how many batches to fetch based
  # on how many devices are present.
  dataset.prefetch(buffer_size=tf.contrib.data.AUTOTUNE)
official/resnet/resnet_run_loop.py
Outdated
@@ -122,7 +106,7 @@ def get_synth_input_fn(height, width, num_channels, num_classes):
    An input_fn that can be used in place of a real one to return a dataset
    that can be used for iteration.
  """
  def input_fn(is_training, data_dir, batch_size, *args):  # pylint: disable=unused-argument
  def input_fn(is_training, data_dir, batch_size, *args, **kwargs):  # pylint: disable=unused-argument,missing-docstring
This function should be short enough to not require a docstring (< 10 lines); are you sure this extra disable is necessary?
It was erroring on the Kokoro lint, but looks like it isn't now. It will forever remain a mystery.
@@ -355,21 +370,35 @@ def resnet_main(flags, model_function, input_function, shape=None):
      allow_soft_placement=True)
Tagging @isaprykin on this one too-- do we need allow_soft_placement still?
    multi_gpu: Create a flag to allow the use of all available GPUs.
    hooks: Create a flag to specify hooks for logging.
  """

  def __init__(self, add_help=False, data_dir=True, model_dir=True,
               train_epochs=True, epochs_between_evals=True, batch_size=True,
               multi_gpu=True, hooks=True):
               multi_gpu=True, num_gpu=True, hooks=True):
I know it will be ripped out in a coming PR, but, for now, let's set the default for multi_gpu=False
    if multi_gpu:
      self.add_argument(
          "--multi_gpu", action="store_true",
          help="If set, run across all available GPUs."
      )

    if num_gpu:
      self.add_argument(
          "--num_gpus", "-ng",
I feel like we are reaching some upper bound of abbreviation strings, but, then again, I am the type that always prefers the fully explicit versions.
U wil pry my CLI abr frm my cld, ded hnds.
      self.add_argument(
          "--num_gpus", "-ng",
          type=int,
          default=1 if tf.test.is_built_with_cuda() else 0,
Can you add a test to make sure this default gets set correctly in the arg parsers test? I think there's a way to force one mode or the other... if not, have you at least confirmed that this does work correctly on GPU/CPU?
I tested manually and confirmed. Probably not worth the effort to make a formal test simply because at that point it's more of a test of tf.test than the model garden.
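For reference, a formal version of the check being discussed might look like the sketch below (names assumed, not taken from the actual arg parsers test); as noted above, it mostly just exercises tf.test.is_built_with_cuda():

import argparse
import unittest

import tensorflow as tf

class NumGpusDefaultTest(unittest.TestCase):

  def test_num_gpus_default_matches_build(self):
    # Rebuild the flag the same way the parser does and confirm the default.
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', '-ng', type=int,
                        default=1 if tf.test.is_built_with_cuda() else 0)
    flags = parser.parse_args([])
    self.assertEqual(flags.num_gpus, 1 if tf.test.is_built_with_cuda() else 0)

if __name__ == '__main__':
  unittest.main()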
          type=int,
          default=1 if tf.test.is_built_with_cuda() else 0,
          help="[default: %(default)s] How many GPUs to use with the "
               "DistributionStrategies API.",
Can you add a note here that reflects the details in the readme? Specifically, that 0==CPU, 1==GPU, default is what you built TF with.
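One possible wording for that help string, extending the add_argument call quoted above (illustrative only, not the text that landed):

          help="[default: %(default)s] How many GPUs to use with the "
               "DistributionStrategies API. 0 runs on CPU and 1 runs on a "
               "single GPU; the default depends on whether TensorFlow was "
               "built with CUDA support.",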
Force-pushed from 68f04d5 to 84d6989
more changes to resnet_run_loop
use AUTOTUNE in prefetch
first pass at resnet with functional distribution strategies
fix syntax error
delint
aesthetic tweaks
delint and fix typos
rip multi_gpu flag out of resnet entirely. Subject to saved model load verification
update cifar10 and imagenet tests to reflect that the model function no longer need to know about multi_gpu
fix imagenet test
start addressing PR comments
more PR response work
Force-pushed from bebd187 to 2f41c72
* begin transfer from contrib fork
  more changes to resnet_run_loop
  use AUTOTUNE in prefetch
  first pass at resnet with functional distribution strategies
  fix syntax error
  delint
  aesthetic tweaks
  delint and fix typos
  rip multi_gpu flag out of resnet entirely. Subject to saved model load verification
  update cifar10 and imagenet tests to reflect that the model function no longer need to know about multi_gpu
  fix imagenet test
  start addressing PR comments
  more PR response work
* misc tweaks
* add a comment
* final pr tweaks
* fix parsers
This PR pulls the work of the distribution strategies team back into official/resnet. Specifically, it removes replicate_model_fn and TowerOptimizer, incorporates some changes to the dataset pipeline, and uses various distribution strategies in the estimators.
@guptapriya Would you be so kind as to check that this is a faithful port of your code?
So far I have only performed one run: ResNet_50_v2 on ImageNet. It converged to 76% in 2 days. I will be performing a full battery of runs in the coming days.