
fix: Fixed GPU RAM estimation #64

Merged: 8 commits into main from negative-ram, Jun 15, 2022
Conversation

@frgfm (Owner) commented Jun 13, 2022

This PR fixes the GPU RAM estimation problem by:

  • adding a second option to retrieve the GPU RAM information when parsing the nvidia-smi output fails (see the sketch after this description)
  • adding a safeguard in the crawler

What this PR will not solve:

  • when several models sit on the same GPU, their RAM usage is blended together, and for now there is no viable way to tell them apart. It's thus recommended to run torchscan when nothing other than your model is on the GPU.

Closes #63

cc @joonas-yoon
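For illustration, a minimal sketch of the fallback idea (the helper name and the torch.cuda fallback are assumptions here, not the exact diff of this PR):

import subprocess
import torch

def get_gpu_ram_used(device_index: int = 0) -> int:
    """Illustrative sketch: used GPU RAM in bytes, with a fallback when nvidia-smi parsing fails."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            encoding="utf-8",
        )
        # nvidia-smi reports MiB, one line per GPU
        return int(out.splitlines()[device_index]) * 1024 ** 2
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        # Second option: PyTorch's own allocator statistics (covers this process only)
        return torch.cuda.memory_reserved(device_index)

With such a fallback in place, the crawler-side safeguard can then avoid reporting a negative overhead when the measured usage is smaller than expected.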

@frgfm added labels bug (Something isn't working) and module: process (Related to process) on Jun 13, 2022
@frgfm added this to the 0.1.2 milestone on Jun 13, 2022
@frgfm self-assigned this on Jun 13, 2022
@codecov bot commented Jun 13, 2022

Codecov Report

Merging #64 (76aca8b) into main (f11e201) will decrease coverage by 1.42%.
The diff coverage is 40.00%.


@@            Coverage Diff             @@
##             main      #64      +/-   ##
==========================================
- Coverage   94.35%   92.93%   -1.43%     
==========================================
  Files          10       10              
  Lines         656      665       +9     
==========================================
- Hits          619      618       -1     
- Misses         37       47      +10     
Impacted Files                  Coverage Δ
torchscan/crawler.py            84.32% <ø> (ø)
torchscan/process/memory.py     39.13% <40.00%> (-32.30%) ⬇️

@frgfm mentioned this pull request on Jun 13, 2022
@joonas-yoon (Contributor) commented Jun 14, 2022

Checked out this branch first, and installed it in the notebook with the following commands:

import sys
!{sys.executable} -m pip uninstall torchscan -y
!{sys.executable} -m pip install torchscan/.

Got this result:

Processing ./torchscan
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Requirement already satisfied: torch>=1.5.0 in /home/jupyter/.conda/envs/joonas/lib/python3.9/site-packages (from torchscan==0.1.2.dev0) (1.11.0)
Requirement already satisfied: typing-extensions in /home/jupyter/.conda/envs/joonas/lib/python3.9/site-packages (from torch>=1.5.0->torchscan==0.1.2.dev0) (3.7.4.3)
Building wheels for collected packages: torchscan
Building wheel for torchscan (PEP 517) ... done
Created wheel for torchscan: filename=torchscan-0.1.2.dev0-py3-none-any.whl size=30391 sha256=9fb4bc758c8f16683bdef0ec1cf9cd684a9a6d15d04eac11f02ab15cd39cb0da
Stored in directory: /tmp/pip-ephem-wheel-cache-te9qtths/wheels/73/72/2c/7aef77450243410db62e4ec62b085f39cdaaf84259bda8aef1
Successfully built torchscan
Installing collected packages: torchscan
Successfully installed torchscan-0.1.2.dev0

But it still prints negative values:

Model size (params + buffers): 13.65 Mb
Framework & CUDA overhead: -24.21 Mb
Total RAM usage: -10.56 Mb

@frgfm (Owner, Author) commented Jun 14, 2022

> Checked out this branch first, and installed it in the notebook with the following commands:
>
> import sys
> !{sys.executable} -m pip uninstall torchscan -y
> !{sys.executable} -m pip install torchscan/.

Thanks, but are you positive this is the snippet you used to install it?
If so, apart from checking out the branch, you need to install from the folder, which is called "torch-scan", not "torchscan". So I think it should be:

!{sys.executable} -m pip install -e torch-scan/.

Let me know if that fixes the problem :)

@joonas-yoon (Contributor) commented Jun 15, 2022

D'oh! I missed the -e option, I will try again.

The reason for "torchscan" is that it's the name of the directory I unzipped into.

Thanks for letting me know :)
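For context, the difference between the two install modes (standard pip behavior, not specific to this PR):

# Standard install: builds a wheel and copies it into site-packages, so later
# branch checkouts in the source folder do not affect the installed package
!{sys.executable} -m pip install torch-scan/.

# Editable install: links the environment to the source tree, so switching
# branches in torch-scan/ immediately changes what gets imported
!{sys.executable} -m pip install -e torch-scan/.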

@joonas-yoon (Contributor) commented

Script
Script

from torchscan import summary  # Generator, Discriminator, nz and device are defined earlier in the notebook

netG = Generator().to(device)
summary(netG, (nz, 1, 1))
netD = Discriminator().to(device)
summary(netD, (3, 64, 64))

Output

______________________________________________________________
Layer        Type               Output Shape         Param #  
==============================================================
generator    Generator          (-1, 3, 64, 64)      0        
├─main       Sequential         (-1, 3, 64, 64)      0        
|    └─0     ConvTranspose2d    (-1, 512, 4, 4)      819,200  
|    └─1     BatchNorm2d        (-1, 512, 4, 4)      2,049    
|    └─2     ReLU               (-1, 512, 4, 4)      0        
|    └─3     ConvTranspose2d    (-1, 256, 8, 8)      2,097,152
|    └─4     BatchNorm2d        (-1, 256, 8, 8)      1,025    
|    └─5     ReLU               (-1, 256, 8, 8)      0        
|    └─6     ConvTranspose2d    (-1, 128, 16, 16)    524,288  
|    └─7     BatchNorm2d        (-1, 128, 16, 16)    513      
|    └─8     ReLU               (-1, 128, 16, 16)    0        
|    └─9     ConvTranspose2d    (-1, 64, 32, 32)     131,072  
|    └─10    BatchNorm2d        (-1, 64, 32, 32)     257      
|    └─11    ReLU               (-1, 64, 32, 32)     0        
|    └─12    ConvTranspose2d    (-1, 3, 64, 64)      3,072    
|    └─13    Tanh               (-1, 3, 64, 64)      0        
==============================================================
Trainable params: 3,576,704
Non-trainable params: 0
Total params: 3,576,704
--------------------------------------------------------------
Model size (params + buffers): 13.65 Mb
Framework & CUDA overhead: 1914.35 Mb
Total RAM usage: 1928.00 Mb
--------------------------------------------------------------
Floating Point Operations on forward: 857.74 MFLOPs
Multiply-Accumulations on forward: 428.96 MMACs
Direct memory accesses on forward: 432.46 MDMAs
______________________________________________________________

________________________________________________________________
Layer            Type             Output Shape         Param #  
================================================================
discriminator    Discriminator    (-1, 1, 1, 1)        0        
├─main           Sequential       (-1, 1, 1, 1)        0        
|    └─0         Conv2d           (-1, 64, 32, 32)     3,072    
|    └─1         LeakyReLU        (-1, 64, 32, 32)     0        
|    └─2         Conv2d           (-1, 128, 16, 16)    131,072  
|    └─3         BatchNorm2d      (-1, 128, 16, 16)    513      
|    └─4         LeakyReLU        (-1, 128, 16, 16)    0        
|    └─5         Conv2d           (-1, 256, 8, 8)      524,288  
|    └─6         BatchNorm2d      (-1, 256, 8, 8)      1,025    
|    └─7         LeakyReLU        (-1, 256, 8, 8)      0        
|    └─8         Conv2d           (-1, 512, 4, 4)      2,097,152
|    └─9         BatchNorm2d      (-1, 512, 4, 4)      2,049    
|    └─10        LeakyReLU        (-1, 512, 4, 4)      0        
|    └─11        Conv2d           (-1, 1, 1, 1)        8,192    
================================================================
Trainable params: 2,765,568
Non-trainable params: 0
Total params: 2,765,568
----------------------------------------------------------------
Model size (params + buffers): 10.56 Mb
Framework & CUDA overhead: 1923.74 Mb
Total RAM usage: 1934.30 Mb
----------------------------------------------------------------
Floating Point Operations on forward: 208.47 MFLOPs
Multiply-Accumulations on forward: 104.11 MMACs
Direct memory accesses on forward: 106.95 MDMAs
________________________________________________________________

Installed the version at commit 76aca8b.

There are no more negative values 👍
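For a quick sanity check, torchscan's numbers can be compared against PyTorch's own allocator view of the same process:

import torch

# RAM tracked by the CUDA caching allocator for this process, in Mb
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1024**2:.2f} Mb")
print(f"Reserved:  {torch.cuda.memory_reserved(0) / 1024**2:.2f} Mb")

The device-level usage (as seen by nvidia-smi) minus the model size roughly corresponds to the "Framework & CUDA overhead" line above.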

@frgfm (Owner, Author) commented Jun 15, 2022
frgfm commented Jun 15, 2022

Ah perfect :)

@frgfm merged commit 7c269b7 into main on Jun 15, 2022
@frgfm deleted the negative-ram branch on Jun 15, 2022 at 12:22
Linked issue: Negative RAM usage #63