
add Huggingface2CoreML convert file #18

Merged
merged 1 commit into mazzzystar:main on Sep 22, 2023

Conversation

jxiong22
Contributor

This file offers a Hugging Face version, clip-vit-base-patch32, with separate text and image encoders. It also fixes image encoder precision errors.
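For readers who don't open the notebook, here is a minimal sketch of what such an export can look like, assuming the openai/clip-vit-base-patch32 checkpoint, torch.jit.trace, and coremltools' unified converter; the wrapper classes, input names, and output paths are illustrative and may differ from the PR's actual notebook:

```python
# Minimal sketch (not the PR's exact notebook): export the Hugging Face CLIP
# text and image encoders as two separate Core ML models.
import numpy as np
import torch
import coremltools as ct
from transformers import CLIPModel

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()

class TextEncoder(torch.nn.Module):
    """Wrap CLIP so the traced graph returns only the text embedding."""
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, input_ids, attention_mask):
        return self.clip_model.get_text_features(
            input_ids=input_ids, attention_mask=attention_mask)

class ImageEncoder(torch.nn.Module):
    """Wrap CLIP so the traced graph returns only the image embedding."""
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, pixel_values):
        return self.clip_model.get_image_features(pixel_values=pixel_values)

# Example inputs for tracing; 77 tokens and 224x224 pixels are the standard
# CLIP ViT-B/32 settings (76 vs 77 is exactly what is debated below).
example_ids = torch.ones(1, 77, dtype=torch.int32)
example_mask = torch.ones(1, 77, dtype=torch.int32)
example_pixels = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    traced_text = torch.jit.trace(TextEncoder(clip), (example_ids, example_mask))
    traced_image = torch.jit.trace(ImageEncoder(clip), example_pixels)

ct.convert(
    traced_text,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="input_ids", shape=example_ids.shape, dtype=np.int32),
            ct.TensorType(name="attention_mask", shape=example_mask.shape, dtype=np.int32)],
).save("TextEncoder.mlpackage")

ct.convert(
    traced_image,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="pixel_values", shape=example_pixels.shape)],
).save("ImageEncoder.mlpackage")
```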

@mazzzystar
Owner

@jxiong22
Thanks for your great work! I'll check out the model & precision asap : )

@mazzzystar
Owner

I've run your scripts, and this is what I've found:

Text Encoder

max_seq_length=76
PyTorch TextEncoder ckpt out for "a photo of a cat":
>>> tensor([ 0.1555,  0.0733, -0.2448, -0.2212, -0.1934,  0.2052, -0.3175, -0.7824,
        -0.1816,  0.1943], grad_fn=<SliceBackward0>)

CoreML TextEncoder ckpt out for "a photo of a cat":
>>> [ 0.15597123  0.07306742 -0.24520203 -0.22147842 -0.1928216   0.20461065
 -0.31734398 -0.7811562  -0.1815224   0.19413628]
max_seq_length=77
PyTorch TextEncoder ckpt out for "a photo of a cat":
>>> tensor([ 0.1555,  0.0733, -0.2448, -0.2212, -0.1934,  0.2052, -0.3175, -0.7824,
        -0.1816,  0.1943], grad_fn=<SliceBackward0>)

CoreML TextEncoder ckpt out for "a photo of a cat":
>>> [ 0.15597123  0.07306742 -0.24520203 -0.22147842 -0.1928216   0.20461065
 -0.31734398 -0.7811562  -0.1815224   0.19413628]

The output seems to be the same whether max_seq_length is set to 76 or 77.
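For anyone reproducing these numbers, a comparison along these lines prints the first ten components and the mean L1 distance; the saved .mlpackage path and the way the single output is fetched are assumptions rather than the PR's actual names, and max_length=77 is used here purely for illustration:

```python
# Sketch of the PyTorch-vs-Core ML text encoder check above. The Core ML
# output name varies per conversion, so the single output is fetched by value.
import numpy as np
import torch
import coremltools as ct
from transformers import CLIPModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
mlmodel = ct.models.MLModel("TextEncoder.mlpackage")  # path is an assumption

batch = tokenizer("a photo of a cat", padding="max_length",
                  max_length=77, return_tensors="pt")
with torch.no_grad():
    pt_out = clip.get_text_features(**batch).numpy()

coreml_out = mlmodel.predict({
    "input_ids": batch["input_ids"].numpy().astype(np.int32),
    "attention_mask": batch["attention_mask"].numpy().astype(np.int32),
})
ml_out = np.asarray(next(iter(coreml_out.values()))).reshape(pt_out.shape)

print(pt_out[0, :10])   # first ten components, as quoted above
print(ml_out[0, :10])
print("avg L1 distance:", np.abs(pt_out - ml_out).mean())
```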

Image Encoder

I've compared my original hard-coded norm against transformers.Normalize(), and the latter performs far better in L1-distance precision : )

Using hard-coded norm:

PyTorch ImageEncoder ckpt out for IMG_2115.jpg:
>>> tensor([ 0.2361, -0.0980, -0.0022,  0.2364,  0.1279, -0.1041,  0.3530,  0.0853,
        -0.0293,  0.0784], grad_fn=<SliceBackward0>)

CoreML ImageEncoder ckpt out for IMG_2115.jpg:
>>> [ 0.350561    0.025289   -0.13452446  0.1267291  -0.1897895  -0.14739564
  0.05819088  0.30193368 -0.2142085   0.27992135]

The average L1 distance is 0.173860315

Using transformers.Normalize():

PyTorch ImageEncoder ckpt out for jpg:
>>> tensor([ 0.3709,  0.0213, -0.0549,  0.1092, -0.2229, -0.2131,  0.0909,  0.2031,
        -0.2159,  0.2603], grad_fn=<SliceBackward0>)

CoreML ImageEncoder ckpt out for jpg:
>>> [ 0.34969723  0.02547197 -0.134884    0.12836841 -0.19035722 -0.14880568
  0.05857127  0.30046463 -0.21491489  0.2801294 ]

The average L1 distance is 0.037187212.

Thanks a lot for the PR! This will help me improve the current image encoder performance. As for the text encoder, the precision is already okay, so I don't think it's a big issue.

Could you check how max_seq_length affects precision on your side? Once we get the same result, I'll merge this PR : )
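For context on what is being compared: "transformers.Normalize()" above reads like torchvision's transforms.Normalize, and the original hard-coded norm is not visible in this thread, so both variants in the sketch below are only illustrative; the mean/std constants are CLIP ViT-B/32's published image_mean/image_std:

```python
# Illustrative sketch of the two preprocessing variants being compared.
# Constants are openai/clip-vit-base-patch32's published image_mean/image_std.
import torch
from PIL import Image
from torchvision import transforms

CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

# Variant 1: a "hard-coded" channel-wise normalization applied by hand
# (only a stand-in; the repo's original hard-coded norm is not shown here).
def hardcoded_norm(chw_uint8: torch.Tensor) -> torch.Tensor:
    x = chw_uint8.float() / 255.0
    mean = torch.tensor(CLIP_MEAN).view(3, 1, 1)
    std = torch.tensor(CLIP_STD).view(3, 1, 1)
    return (x - mean) / std

# Variant 2: the torchvision pipeline, mirroring CLIPImageProcessor's steps
# (bicubic resize to 224, center crop, scale to [0, 1], then Normalize).
preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])

img = Image.open("IMG_2115.jpg").convert("RGB")
pixel_values = preprocess(img).unsqueeze(0)  # (1, 3, 224, 224) image encoder input
```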

@jxiong22
Contributor Author

That's weird. On my end, I can't figure out why I'm getting bad results when I set max_seq_length to 77. If 77 works well for you, that's even better. What versions of Python, coremltools, and torch are you using?

max_seq_length = 76:
PyTorch TextEncoder ckpt out for "a photo of a cat":
>>> tensor([ 0.1555,  0.0733, -0.2448, -0.2212, -0.1934,  0.2052, -0.3175, -0.7824,
        -0.1816,  0.1943], grad_fn=<SliceBackward0>)

CoreML TextEncoder ckpt out for "a photo of a cat":
>>> [ 0.15560171  0.0732335  -0.24512495 -0.22117633 -0.19336982  0.20523793
 -0.3182205  -0.78206545 -0.18144566  0.19457956]
max_seq_length = 77:
PyTorch TextEncoder ckpt out for "a photo of a cat":
>>> tensor([ 0.1555,  0.0733, -0.2448, -0.2212, -0.1934,  0.2052, -0.3175, -0.7824,
        -0.1816,  0.1943], grad_fn=<SliceBackward0>)

CoreML TextEncoder ckpt out for "a photo of a cat":
>>> [-0.066312    0.17878246  0.40718645 -0.08806399  0.26841    -0.22685118
  0.2679821  -1.7103907  -0.33836532  0.28941655]
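One hypothetical way to narrow this down (not part of the PR): confirm on the PyTorch side that padding to 76 vs 77 tokens gives identical text features, which would localize the discrepancy to the tracing/conversion step rather than the model itself:

```python
# Hypothetical probe: does padding length alone change the PyTorch output?
# Both participants report identical PyTorch tensors above, so this should print ~0.0.
import torch
from transformers import CLIPModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()

features = {}
for max_len in (76, 77):
    batch = tokenizer("a photo of a cat", padding="max_length",
                      max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        features[max_len] = clip.get_text_features(**batch)

print((features[76] - features[77]).abs().max().item())
```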

@mazzzystar
Owner

My envs are:

Python 3.9.15
accelerate-0.23.0
transformers-4.33.2
coremltools (latest, 7.0)
torch 1.13.1

@jxiong22
Contributor Author

I tried your environment, but there was no improvement. The results are still unsatisfactory with max_seq_length = 77 on my side.

@mazzzystar
Owner

Okay, then I'll try to verify this issue and report back here if there's any progress.

@jxiong22
Contributor Author

Thank you. If 77 works for you, how about setting it as the default value and adding a comment for future reference, in case others run into the same issue I did?

@mazzzystar
Owner

Is it okay if I merge this PR now and add a note in the README, so others can refer to your notebook for export?

@jxiong22
Contributor Author

jxiong22 commented Sep 22, 2023 via email

mazzzystar merged commit 0cc1953 into mazzzystar:main on Sep 22, 2023