Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash function cuasing false postives. (Nasty) #17

Open
TheRook opened this issue Dec 23, 2011 · 6 comments
Open

Hash function cuasing false postives. (Nasty) #17

TheRook opened this issue Dec 23, 2011 · 6 comments
Assignees

Comments

@TheRook
Copy link

TheRook commented Dec 23, 2011

I noticed some bad behavior with the pybloomfiler to I wrote a simple set of tests. A ~22% failure rate is really bad. As a quick fix i am using the hash() function prior to passing the data to pyboomfiler. Usually the false positive rate with this function is 0, sometimes its really close to 0.

My simple test:
http://pastebin.com/C8KwWFxR

The output:
('false positive', 0.2251722)
('false negitive', 0.0)
with hash()
('false positive', 1e-07)
('false negitive', 0.0)
with md5()
('false positive', 0.0)
('false negitive', 0.0)
with md5() hexdigigest
('false positive', 0.0)
('false negitive', 0.0)

@axiak
Copy link
Owner

axiak commented Jan 3, 2012

Hey, now that the holidays are over I'm looking into this. Note that false positives are expected with bloom filters --- false negatives are the true nasty behavior. If you say 0.0001 for the error rate and you are below the capacity (which you are), you should expect less than 0.0001 error rate. I am looking into why it's broken.

@ghost ghost assigned axiak Jan 3, 2012
@xlfe
Copy link

xlfe commented May 3, 2012

Just to confirm, I've had serious problems with both the default SHA512 hash (a large memory leak) and a really bad false positive rate using the (i think) djb hash. I've just tried the superfast hash though and that works very well. Thanks.

(python 2.6 on ubuntu 64bit)

@pbutler
Copy link
Contributor

pbutler commented Jul 7, 2012

I just submitted a fix for the memory leaks referred to in this issue

@zzc3266825
Copy link

hi axiak, i want to know how is this issue going on? i think "using the hash() function prior to passing the data to pyboomfiler" may increase the false positive rate when the number of input items is large, and the test also proved it.

@dbishop
Copy link
Contributor

dbishop commented Feb 24, 2014

I think this is a duplicate of Issue #46

@andresriancho
Copy link

Seems fixed in latest release 0.3.14:

without hash()
('false positive', 1.04e-05)
('false negitive', 0.0)
with hash()
('false positive', 1.02e-05)
('false negitive', 0.0)
with md5()
('false positive', 1.03e-05)
('false negitive', 0.0)
with md5() hexdigigest
('false positive', 8.7e-06)
('false negitive', 0.0)

What really matters is the first result:

without hash()
('false positive', 1.04e-05)
('false negitive', 0.0)

IMHO this ticket should be closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants