Fix heap buffer overflow caused by prefetch. #459

Open · wants to merge 4 commits into master
Conversation

kishorenc (Contributor)

The prefetching goes past `size` since `datal` is initialized 1 index past `data`.

I uncovered this issue under ASAN when SSE is enabled. It was also spotted earlier here: #107

cc @dyashuni

kishorenc and others added 2 commits May 5, 2023 19:01
The prefetching goes past `size` since `datal` is initialized 1 index past `data`. I uncovered this issue on ASAN when SSE is enabled.

Also spotted earlier here: nmslib#107
```diff
- _mm_prefetch((char *) (visited_array + *(datal + j + 1)), _MM_HINT_T0);
- _mm_prefetch(getDataByInternalId(*(datal + j + 1)), _MM_HINT_T0);
+ _mm_prefetch((char *) (visited_array + *(datal + j)), _MM_HINT_T0);
+ _mm_prefetch(getDataByInternalId(*(datal + j)), _MM_HINT_T0);
```
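For context, here is a minimal self-contained sketch of the pattern ASAN flags. The names `datal`, `size`, and `visited_array` mirror the hnswlib search loop, but this is an illustration only, not the actual code:

```cpp
// Build with: g++ -O1 -fsanitize=address sketch.cpp
#include <xmmintrin.h>   // _mm_prefetch
#include <cstdio>
#include <vector>

int main() {
    // Stand-in for one node's neighbour list: `datal` points at `size`
    // internal ids (in hnswlib, datal is set one element past data, which
    // holds the count).
    std::vector<unsigned int> neighbour_ids = {3, 1, 4, 1, 5};
    unsigned int *datal = neighbour_ids.data();
    size_t size = neighbour_ids.size();

    // Stand-in for the visited list indexed by internal id.
    std::vector<unsigned short> visited_array(16, 0);

    for (size_t j = 0; j < size; j++) {
        // The overflow is the read of datal[j + 1] needed to compute the
        // prefetch address: on the last iteration it loads datal[size], one
        // element past the end of the allocation. The prefetch instruction
        // itself never faults, which is why this only shows up under ASAN.
        _mm_prefetch((char *) (visited_array.data() + *(datal + j + 1)), _MM_HINT_T0);

        std::printf("visiting neighbour %u\n", datal[j]);
    }
    return 0;
}
```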
Member

I think that fix defeats the purpose of the prefetch, as it needs to be done for the next element.
Is there any evidence this prefetch affects the program behavior?
If so, I guess we should probably come up with a different fix, e.g. min(j + 1, max_value).

kishorenc (Contributor, Author), May 13, 2023

My understanding is that a cache line is at least 64 bytes, so does the extra +1 really matter? What matters is that the loop keeps moving forward as j increments, so we keep fetching ahead. We should be able to confirm this via a benchmark; I don't think there will be any degradation in performance.

There are actually a couple of other places where the same issue happens, which I just noticed. (Edit: updated the PR to cover these as well.)

> Is there any evidence this prefetch affects the program behavior?

There is a buffer overflow here from reading past the array length. Whether that causes a segfault will depend on the hardware and memory layout.

kishorenc (Contributor, Author), May 13, 2023

Also, I think the proper way to actually use `_mm_prefetch` is to fetch a full 64 bytes (one cache line) ahead, for example as it's done here by fetching 8 doubles ahead:

https://stackoverflow.com/questions/73012072/why-such-address-is-used-with-mm-prefetch

But this is quite tricky to implement and test because modern processors have gotten really good at prefetching without any hints.
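For reference, a toy version of the "one cache line ahead" pattern from the linked answer. It operates on sequential data, unlike the scattered neighbour lookups in the graph search, and is purely illustrative:

```cpp
#include <xmmintrin.h>   // _mm_prefetch
#include <cstddef>

// Sum a contiguous array while prefetching one cache line (64 bytes,
// i.e. 8 doubles) ahead of the element currently being processed.
double sum_with_prefetch(const double *a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; i++) {
        // Stop prefetching near the end of the array; the remaining tail
        // fits in the cache lines already fetched.
        if (i + 8 < n)
            _mm_prefetch((const char *) (a + i + 8), _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```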

Member

> `_mm_prefetch((char *) (visited_array + *(datal + j + 1)), _MM_HINT_T0);`

The +1 fetches the visited bit for the next item, which can be anywhere in memory compared to the previous element. It is almost certainly on a different cache line and page, and at least Skylake CPUs were struggling without the prefetches; that is why it ended up in the code despite the obvious issue.

Yeah, it would be great if you could profile the performance on a few different x64 platforms, maybe improve the prefetching performance, and hopefully fix this out-of-bounds issue that has made people uneasy.

I think the proper way to test the performance is to run search on low-dimensional data to focus on memory sparsity (I used d=4) and to keep two implementations of the search function (with and without the change), so the performance does not depend on random memory layout changes in the data. That should allow measuring the performance to within 1-2% accuracy.
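A sketch of the kind of harness described above, assuming hnswlib's C++ API (`L2Space`, `HierarchicalNSW`, `addPoint`, `setEf`, `searchKnn`) and the usual header path; this is not the author's actual benchmark code:

```cpp
#include "hnswlib/hnswlib.h"

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int dim = 4;            // low-dimensional, as suggested above
    const size_t n = 1000000;
    const size_t n_queries = 10000;
    const size_t k = 10;

    // Random data and queries.
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> data(n * dim), queries(n_queries * dim);
    for (float &x : data) x = dist(rng);
    for (float &x : queries) x = dist(rng);

    // Build the index once; only the search loop is timed, so memory-layout
    // differences between runs matter less.
    hnswlib::L2Space space(dim);
    hnswlib::HierarchicalNSW<float> index(&space, n, /*M=*/16, /*ef_construction=*/200);
    for (size_t i = 0; i < n; i++)
        index.addPoint(data.data() + i * dim, i);
    index.setEf(100);

    size_t found = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t q = 0; q < n_queries; q++) {
        auto result = index.searchKnn(queries.data() + q * dim, k);
        found += result.size();  // keep the result live
    }
    auto t1 = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("searched %zu queries, %zu results, %lld ms\n",
                n_queries, found, (long long) ms);
    return 0;
}
```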

kishorenc (Contributor, Author)

@yurymalkov

> The +1 fetches the visited bit for the next item, which can be anywhere in memory compared to the previous element.

Ah, sorry, I missed the extra pointer dereferencing in the code and assumed a sequential lookup.

I benchmarked with/without prefetching by searching over 1M 4-dimensional points.

  • with prefetching: 6700 ms
  • without prefetching: 7800 ms

So, I've now updated the PR to use std::min() to limit the index of the next prefetch. This is comparable in performance to the current version without bounds checking.
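For readers following along, the clamped variant looks roughly like this (same toy setup as the earlier sketch; it illustrates the idea, not necessarily the exact lines in the PR):

```cpp
#include <xmmintrin.h>   // _mm_prefetch
#include <algorithm>     // std::min
#include <cstdio>
#include <vector>

int main() {
    std::vector<unsigned int> neighbour_ids = {3, 1, 4, 1, 5};
    unsigned int *datal = neighbour_ids.data();
    size_t size = neighbour_ids.size();
    std::vector<unsigned short> visited_array(16, 0);

    for (size_t j = 0; j < size; j++) {
        // Clamp the lookahead index: on the last iteration this re-prefetches
        // the current element instead of reading datal[size] out of bounds,
        // while every earlier iteration still prefetches the next element.
        size_t next = std::min(j + 1, size - 1);
        _mm_prefetch((char *) (visited_array.data() + *(datal + next)), _MM_HINT_T0);

        std::printf("visiting neighbour %u\n", datal[j]);
    }
    return 0;
}
```

The extra comparison is cheap next to the memory traffic it guards, which is consistent with the comparable timings reported above.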

yurymalkov (Member)

Hi @kishorenc,
Thanks for the change! We would need to double-check the performance.

yurymalkov (Member)

@kishorenc can you please share the benchmarking code?
Also, what do you think about the other `*(data + 1)` prefetches in the code? Those have the same out-of-bounds issue.

kishorenc (Contributor, Author)

Here's the code for the benchmark. I just ran this with/without the changes in the PR and compared.

> Also, what do you think about the other `*(data + 1)` prefetches in the code? Those have the same out-of-bounds issue.

Yes, those are out-of-bounds as well, but I didn't address them because I wasn't sure what their actual sizes were. In the places I've changed, there was a clear `size` variable available to go by.
