
fix: Fixed GPU RAM estimation #64

Merged: 8 commits into main from negative-ram, Jun 15, 2022
Conversation

@frgfm (Owner) commented Jun 13, 2022

This PR fixes the GPU RAM estimation problem by:

  • adding a second option to retrieve the GPU RAM information when parsing the nvidia-smi output fails (see the sketch after this description)
  • adding a safeguard in the crawler

What this PR will not solve:

  • when several models sit on the same GPU, their RAM usage is blended together, and for now there is no viable way to tell them apart. It's thus recommended to run torchscan when nothing other than your model is on the GPU.

Closes #63

cc @joonas-yoon
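For illustration, a minimal sketch of the fallback idea (the helper name and the torch.cuda fallback are assumptions here, not the exact diff of this PR):

import subprocess
import torch

def get_gpu_ram_used(device_index: int = 0) -> int:
    """Illustrative sketch: used GPU RAM in bytes, with a fallback when nvidia-smi parsing fails."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            encoding="utf-8",
        )
        # nvidia-smi reports MiB, one line per GPU
        return int(out.splitlines()[device_index]) * 1024 ** 2
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        # Second option: PyTorch's own allocator statistics (covers this process only)
        return torch.cuda.memory_reserved(device_index)

With such a fallback in place, the crawler-side safeguard can then avoid reporting a negative overhead when the measured usage is smaller than expected.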

@frgfm added labels bug (Something isn't working) and module: process (Related to process) on Jun 13, 2022
@frgfm added this to the 0.1.2 milestone on Jun 13, 2022
@frgfm self-assigned this on Jun 13, 2022
@codecov bot commented Jun 13, 2022

Codecov Report

Merging #64 (76aca8b) into main (f11e201) will decrease coverage by 1.42%.
The diff coverage is 40.00%.


@@            Coverage Diff             @@
##             main      #64      +/-   ##
==========================================
- Coverage   94.35%   92.93%   -1.43%     
==========================================
  Files          10       10              
  Lines         656      665       +9     
==========================================
- Hits          619      618       -1     
- Misses         37       47      +10     
Impacted Files                  Coverage Δ
torchscan/crawler.py            84.32% <ø> (ø)
torchscan/process/memory.py     39.13% <40.00%> (-32.30%) ⬇️

@frgfm mentioned this pull request on Jun 13, 2022
@joonas-yoon (Contributor) commented Jun 14, 2022

Checked out this branch first, and installed it in the notebook with the following commands:

import sys
!{sys.executable} -m pip uninstall torchscan -y
!{sys.executable} -m pip install torchscan/.

Got this result:

Processing ./torchscan
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Requirement already satisfied: torch>=1.5.0 in /home/jupyter/.conda/envs/joonas/lib/python3.9/site-packages (from torchscan==0.1.2.dev0) (1.11.0)
Requirement already satisfied: typing-extensions in /home/jupyter/.conda/envs/joonas/lib/python3.9/site-packages (from torch>=1.5.0->torchscan==0.1.2.dev0) (3.7.4.3)
Building wheels for collected packages: torchscan
Building wheel for torchscan (PEP 517) ... done
Created wheel for torchscan: filename=torchscan-0.1.2.dev0-py3-none-any.whl size=30391 sha256=9fb4bc758c8f16683bdef0ec1cf9cd684a9a6d15d04eac11f02ab15cd39cb0da
Stored in directory: /tmp/pip-ephem-wheel-cache-te9qtths/wheels/73/72/2c/7aef77450243410db62e4ec62b085f39cdaaf84259bda8aef1
Successfully built torchscan
Installing collected packages: torchscan
Successfully installed torchscan-0.1.2.dev0

But it still prints negative values:

Model size (params + buffers): 13.65 Mb
Framework & CUDA overhead: -24.21 Mb
Total RAM usage: -10.56 Mb

@frgfm (Owner, Author) commented Jun 14, 2022

> Checked out this branch first, and installed it in the notebook with the following commands:
>
> import sys
> !{sys.executable} -m pip uninstall torchscan -y
> !{sys.executable} -m pip install torchscan/.

Thanks, but are you positive this is the snippet you used to install it?
If so, apart from checking out the branch, you need to install from the folder, which is called "torch-scan", not "torchscan". So I think it should be:

!{sys.executable} -m pip install -e torch-scan/.

Let me know if that fixes the problem :)

@joonas-yoon (Contributor) commented Jun 15, 2022

D'oh! I missed the -e option, I will try again.

The reason for "torchscan" is that it's the name of the directory I unzipped into.

Thanks for letting me know :)
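For context, the difference between the two install modes (standard pip behavior, not specific to this PR):

# Standard install: builds a wheel and copies it into site-packages, so later
# branch checkouts in the source folder do not affect the installed package
!{sys.executable} -m pip install torch-scan/.

# Editable install: links the environment to the source tree, so switching
# branches in torch-scan/ immediately changes what gets imported
!{sys.executable} -m pip install -e torch-scan/.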

@joonas-yoon (Contributor) commented

Script
Script

from torchscan import summary  # Generator, Discriminator, nz and device are defined earlier in the notebook

netG = Generator().to(device)
summary(netG, (nz, 1, 1))
netD = Discriminator().to(device)
summary(netD, (3, 64, 64))

Output

______________________________________________________________
Layer        Type               Output Shape         Param #  
==============================================================
generator    Generator          (-1, 3, 64, 64)      0        
├─main       Sequential         (-1, 3, 64, 64)      0        
|    └─0     ConvTranspose2d    (-1, 512, 4, 4)      819,200  
|    └─1     BatchNorm2d        (-1, 512, 4, 4)      2,049    
|    └─2     ReLU               (-1, 512, 4, 4)      0        
|    └─3     ConvTranspose2d    (-1, 256, 8, 8)      2,097,152
|    └─4     BatchNorm2d        (-1, 256, 8, 8)      1,025    
|    └─5     ReLU               (-1, 256, 8, 8)      0        
|    └─6     ConvTranspose2d    (-1, 128, 16, 16)    524,288  
|    └─7     BatchNorm2d        (-1, 128, 16, 16)    513      
|    └─8     ReLU               (-1, 128, 16, 16)    0        
|    └─9     ConvTranspose2d    (-1, 64, 32, 32)     131,072  
|    └─10    BatchNorm2d        (-1, 64, 32, 32)     257      
|    └─11    ReLU               (-1, 64, 32, 32)     0        
|    └─12    ConvTranspose2d    (-1, 3, 64, 64)      3,072    
|    └─13    Tanh               (-1, 3, 64, 64)      0        
==============================================================
Trainable params: 3,576,704
Non-trainable params: 0
Total params: 3,576,704
--------------------------------------------------------------
Model size (params + buffers): 13.65 Mb
Framework & CUDA overhead: 1914.35 Mb
Total RAM usage: 1928.00 Mb
--------------------------------------------------------------
Floating Point Operations on forward: 857.74 MFLOPs
Multiply-Accumulations on forward: 428.96 MMACs
Direct memory accesses on forward: 432.46 MDMAs
______________________________________________________________

________________________________________________________________
Layer            Type             Output Shape         Param #  
================================================================
discriminator    Discriminator    (-1, 1, 1, 1)        0        
├─main           Sequential       (-1, 1, 1, 1)        0        
|    └─0         Conv2d           (-1, 64, 32, 32)     3,072    
|    └─1         LeakyReLU        (-1, 64, 32, 32)     0        
|    └─2         Conv2d           (-1, 128, 16, 16)    131,072  
|    └─3         BatchNorm2d      (-1, 128, 16, 16)    513      
|    └─4         LeakyReLU        (-1, 128, 16, 16)    0        
|    └─5         Conv2d           (-1, 256, 8, 8)      524,288  
|    └─6         BatchNorm2d      (-1, 256, 8, 8)      1,025    
|    └─7         LeakyReLU        (-1, 256, 8, 8)      0        
|    └─8         Conv2d           (-1, 512, 4, 4)      2,097,152
|    └─9         BatchNorm2d      (-1, 512, 4, 4)      2,049    
|    └─10        LeakyReLU        (-1, 512, 4, 4)      0        
|    └─11        Conv2d           (-1, 1, 1, 1)        8,192    
================================================================
Trainable params: 2,765,568
Non-trainable params: 0
Total params: 2,765,568
----------------------------------------------------------------
Model size (params + buffers): 10.56 Mb
Framework & CUDA overhead: 1923.74 Mb
Total RAM usage: 1934.30 Mb
----------------------------------------------------------------
Floating Point Operations on forward: 208.47 MFLOPs
Multiply-Accumulations on forward: 104.11 MMACs
Direct memory accesses on forward: 106.95 MDMAs
________________________________________________________________

Installed the version at commit 76aca8b.

There are no more negative values 👍
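For a quick sanity check, torchscan's numbers can be compared against PyTorch's own allocator view of the same process:

import torch

# RAM tracked by the CUDA caching allocator for this process, in Mb
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1024**2:.2f} Mb")
print(f"Reserved:  {torch.cuda.memory_reserved(0) / 1024**2:.2f} Mb")

The device-level usage (as seen by nvidia-smi) minus the model size roughly corresponds to the "Framework & CUDA overhead" line above.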

@frgfm (Owner, Author) commented Jun 15, 2022
frgfm commented Jun 15, 2022

Ah perfect :)

@frgfm merged commit 7c269b7 into main on Jun 15, 2022
@frgfm deleted the negative-ram branch on Jun 15, 2022 at 12:22
Linked issue: Negative RAM usage #63