runs not logging separately in wandb.ai #1937
👋 Hello @thesauravs, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including $ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.
@thesauravs thanks for the bug report. I think I'm able to reproduce this. I've slightly updated the code to reproduce below (COCO128 will autodownload on first use, so you don't need to download it manually before training). @AyushExel this appears to be a similar --resume issue to the one I thought we fixed in PR #1852. I've verified this is reproducible on current master and will look into it.

```python
!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
%pip install -qr requirements.txt  # install dependencies

import torch
from IPython.display import Image, clear_output  # to display images

clear_output()
print('Setup complete. Using torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))

# Weights & Biases (optional)
%pip install -q wandb
!wandb login  # use 'wandb disabled' or 'wandb enabled' to disable or enable

# Train YOLOv5s on COCO128 for 2 epochs, batch size 16
!python train.py --img 640 --batch 16 --epochs 2 --data coco128.yaml --weights yolov5s.pt --nosave --cache --name bat_16

# Train YOLOv5s on COCO128 for 2 epochs, batch size 8
!python train.py --img 640 --batch 8 --epochs 2 --data coco128.yaml --weights yolov5s.pt --nosave --cache --name bat_8
```
I see a separate bug here as well. The training mosaics are plotted in daemon threads, and it appears that they may fail to render and save before the wandb.log() command is later run. I'll think of a fix for this second issue too.

```
  File "train.py", line 518, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 322, in train
    wandb.log({"Mosaics": [wandb.Image(str(x), caption=x.name) for x in save_dir.glob('train*.jpg')]})
  File "train.py", line 322, in <listcomp>
    wandb.log({"Mosaics": [wandb.Image(str(x), caption=x.name) for x in save_dir.glob('train*.jpg')]})
  File "/usr/local/lib/python3.6/dist-packages/wandb/data_types.py", line 1555, in __init__
    self._initialize_from_path(data_or_path)
  File "/usr/local/lib/python3.6/dist-packages/wandb/data_types.py", line 1625, in _initialize_from_path
    self._image = pil_image.open(path)
  File "/usr/local/lib/python3.6/dist-packages/PIL/Image.py", line 2862, in open
    "cannot identify image file %r" % (filename if filename else fp)
PIL.UnidentifiedImageError: cannot identify image file 'runs/train/bat_16/train_batch2.jpg'
```
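One way to sidestep this kind of race (a sketch only, not necessarily how the repository fixes it) is to pass mosaic files to wandb.Image only once they exist on disk and PIL can actually parse them:

```python
# Sketch: skip mosaic images that the plotting daemon threads have not finished
# writing yet, so wandb.Image never sees a missing or partially written file.
# Assumes wandb.init() has already been called, as it is during training.
from pathlib import Path
from PIL import Image, UnidentifiedImageError
import wandb

def is_readable(path: Path) -> bool:
    """True if the file exists and PIL can parse it as an image."""
    try:
        with Image.open(path) as im:
            im.verify()  # integrity check without decoding the full image
        return True
    except (UnidentifiedImageError, OSError):  # missing or half-written file
        return False

save_dir = Path('runs/train/bat_16')  # run directory from the traceback above
mosaics = [x for x in save_dir.glob('train*.jpg') if is_readable(x)]
if mosaics:
    wandb.log({"Mosaics": [wandb.Image(str(x), caption=x.name) for x in mosaics]})
```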
I've merged a fix for the mosaic daemon bug, which was the secondary error I observed above. I've retested after this fix and verified that the secondary issue is resolved, but that the primary issue of resuming wandb runs remains. The runs are not resumed locally, only on wandb. @AyushExel do you have ideas what might be happening?
@glenn-jocher I noticed something that might be useful for debugging. Somehow, the checkpoint from the first training run's folder is being loaded on every subsequent run, so each new run resumes the same wandb run. I fixed that problem by loading a checkpoint only if
This solution works perfectly, but I'm not sure why this occurs in the first place! Any ideas?
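The exact condition isn't shown above, but a minimal sketch of the kind of guard being described might look like the following; the names opt.resume and the 'wandb_id' checkpoint key are assumptions taken from this thread, not the actual patch:

```python
# Sketch: reuse a wandb run id stored in a checkpoint only when the user is
# explicitly resuming; a fresh training should always start a fresh W&B run.
# Run from inside the yolov5 repo so the pickled model in the .pt file resolves.
import torch
import wandb

def init_wandb_run(weights, opt):
    ckpt = torch.load(weights, map_location='cpu')
    run_id = ckpt.get('wandb_id') if getattr(opt, 'resume', False) else None
    return wandb.init(project='YOLOv5', resume='allow', id=run_id)
```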
@AyushExel @thesauravs ok I found the problem. The official models in https://github.com/ultralytics/yolov5/releases were still carrying wandb_id keys from their own trainings. I thought I'd stripped these and updated the models, but apparently I didn't complete the process correctly. I've repeated this and reuploaded all 4 models, now stripped of their wandb_ids, so this should be solved. @thesauravs if you delete your local models and rerun your commands (allowing the updated models to autodownload), I believe everything will work correctly. Can you test this out and verify that the problem is solved on your side?
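For anyone checking their own downloaded weights, a hypothetical one-off cleanup along these lines shows whether a checkpoint still carries a wandb_id and removes it (run it from inside the yolov5 repo so the pickled model unpickles; the key name comes from the comment above):

```python
# Hypothetical cleanup: inspect a downloaded checkpoint and drop any stored
# wandb_id so trainings started from it do not resume an old W&B run.
import torch

path = 'yolov5s.pt'  # any of the released models
ckpt = torch.load(path, map_location='cpu')
if ckpt.get('wandb_id') is not None:
    print(f"{path} carries wandb_id {ckpt['wandb_id']}, stripping it")
    ckpt['wandb_id'] = None
    torch.save(ckpt, path)
else:
    print(f"{path} has no wandb_id, nothing to do")
```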
🐛 Bug
Every time a new run is performed, wandb.ai logs the new run into the existing one.
To Reproduce (REQUIRED)
Input:
Output:
Expected behaviour
A separate log for each run should be available on wandb.ai.
Environment