Index of /archives/CPAN/modules/by-authors/id/P/PA/PALVARO
Name                        Last modified     Size  Description
Parent Directory                              -
Bloom-Faster-1.3.readme     2007-02-23 14:45  2.5K
Bloom-Faster-1.3.meta       2007-02-23 14:48  310
Bloom-Faster-1.3.tar.gz     2007-02-23 14:51  477K
README                      2007-02-24 02:54  2.5K
Bloom-Faster-1.3.1.readme   2007-03-10 08:16  2.5K
Bloom-Faster-1.4.readme     2007-03-10 08:16  2.5K
Bloom-Faster-1.3.1.meta     2007-03-17 04:08  312
Bloom-Faster-1.3.1.tar.gz   2007-03-17 15:25  8.5M
Bloom-Faster-1.4.meta       2007-03-17 16:44  310
Bloom-Faster-1.4.tar.gz     2007-03-17 16:48  602K
Bloom-Faster-1.6.2.readme   2009-06-22 09:19  2.5K
Bloom-Faster-1.7.readme     2009-06-22 09:19  2.5K
Bloom-Faster-1.6.readme     2009-06-22 09:31  2.5K
Bloom-Faster-1.6.meta       2009-06-23 11:41  307
Bloom-Faster-1.6.tar.gz     2009-06-23 11:42  22K
Bloom-Faster-1.6.2.meta     2010-06-13 06:05  312
Bloom-Faster-1.6.2.tar.gz   2010-06-13 06:16  21K
Bloom-Faster-1.7.meta       2010-06-13 07:06  310
Bloom-Faster-1.7.tar.gz     2010-06-13 07:17  21K
CHECKSUMS                   2021-11-22 07:55  4.5K
NAME
Bloom::Faster - Perl extension for the C library libbloom.
INSTALLATION
See INSTALL.
SYNOPSIS
use Bloom::Faster;

# m = ideal vector size.
# k = number of hash functions to use.
my $bloom = Bloom::Faster->new({m => 1000000, k => 5});

# This gives us very tight control of memory usage (a function of m)
# and performance (a function of k). But in most applications we won't
# know the optimal values of either of these. For these cases, it is
# much easier to supply:
#
# n = number of expected elements to check for duplicates,
# e = acceptable error rate (probability of a false positive)
#
# my $bloom = Bloom::Faster->new({n => 1000000, e => 0.00001});

while (<>) {
    chomp;
    # add() returns true when the value is a (probable) duplicate.
    if ($bloom->add($_)) {
        print "DUP: $_\n";
    }
}
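To give a feel for how n and e relate to m and k, the following is a minimal sketch using the textbook Bloom filter sizing formulas. The function name bloom_params is hypothetical, and this is an assumption about the general technique, not a description of how libbloom derives its values; the library may round or size the vector differently.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Textbook sizing (assumption: libbloom may compute these differently):
#   m = -n * ln(e) / (ln 2)^2     bits in the vector
#   k = (m / n) * ln 2            number of hash functions
sub bloom_params {
    my ($n, $e) = @_;
    my $ln2 = log(2);
    my $m = ceil(-$n * log($e) / ($ln2 ** 2));
    my $k = int(($m / $n) * $ln2 + 0.5);    # round to the nearest integer
    $k = 1 if $k < 1;                       # always use at least one hash
    return ($m, $k);
}

my ($m, $k) = bloom_params(1_000_000, 0.00001);
printf "m = %d bits (about %.1f MB), k = %d\n", $m, $m / 8 / 1e6, $k;
```

Under these formulas, tightening e grows both the vector and the number of hash functions, which is the memory and performance trade-off the comments above describe.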
DESCRIPTION
Bloom filters are a lightweight, probabilistic duplicate-detection
structure proposed by Burton Bloom
(http://portal.acm.org/citation.cfm?id=362692&dl=ACM&coll=portal), with
applications in stream data processing, among others. Bloom filters are
a very cool thing. Where occasional false positives are acceptable,
Bloom filters give us the ability to detect duplicates in a fast and
resource-friendly manner.
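For readers new to the idea, here is a toy pure-Perl Bloom filter. It is a sketch for illustration only: the ToyBloom class, its use of MD5, and the double-hashing scheme are all assumptions made here, and none of it reflects how Bloom::Faster or libbloom is implemented.

```perl
use strict;
use warnings;

# Toy Bloom filter: m bits and k bit positions derived from one MD5 digest.
# Hypothetical class for illustration; not the Bloom::Faster implementation.
package ToyBloom;
use Digest::MD5 qw(md5);

sub new {
    my ($class, %args) = @_;
    my $self = { m => $args{m}, k => $args{k}, bits => '' };
    vec($self->{bits}, $self->{m} - 1, 1) = 0;    # pre-size the bit vector
    return bless $self, $class;
}

# Double hashing: position_i = (h1 + i * h2) mod m.
sub _positions {
    my ($self, $value) = @_;
    my ($h1, $h2) = unpack 'N2', md5($value);
    $h2 |= 1;                                     # keep the stride non-zero
    return map { ($h1 + $_ * $h2) % $self->{m} } 0 .. $self->{k} - 1;
}

# Sets the value's bits; returns true if all of them were already set.
sub add {
    my ($self, $value) = @_;
    my $seen = 1;
    for my $pos ($self->_positions($value)) {
        $seen = 0 unless vec($self->{bits}, $pos, 1);
        vec($self->{bits}, $pos, 1) = 1;
    }
    return $seen;
}

package main;

my $bloom = ToyBloom->new(m => 10_000, k => 5);
for my $word (qw(apple banana apple cherry banana)) {
    print "DUP: $word\n" if $bloom->add($word);
}
```

A repeated value always finds its bits already set, so true duplicates are never missed. A false positive happens when a new value's bits were all set by earlier, different values.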
The allocation of memory for the bit vector is handled in the C layer,
but Perl's OO capability handles the garbage collection. When a
Bloom::Faster object goes out of scope, the vector pointed to by the C
structure will be free()d. To do this manually, the DESTROY builtin
method can be called.
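The scope-based cleanup described above is ordinary Perl destructor behaviour. The sketch below shows it with a hypothetical stand-in class, ToyHandle, that counts DESTROY calls; it assumes nothing about Bloom::Faster internals beyond what this README states.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an XS-backed object. Per the README, the real
# module free()s its C vector at this point; here we only bump a counter.
package ToyHandle;

our $freed = 0;

sub new     { my ($class) = @_; return bless {}, $class }
sub DESTROY { $freed++ }    # runs when the last reference goes away

package main;

{
    my $handle = ToyHandle->new;
    # ... use $handle ...
}   # $handle leaves scope here, so DESTROY runs
print "freed after scope exit: $ToyHandle::freed\n";

my $other = ToyHandle->new;
undef $other;               # dropping the last reference also runs DESTROY
print "freed after undef: $ToyHandle::freed\n";
```

In general Perl, dropping the last reference (for example with undef) is a common way to release an object early, since Perl calls DESTROY again when an object is finally collected even if it was already called by hand.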
A Bloom filter Perl module is currently available on CPAN, but it is
profoundly slow and cannot handle large vectors. This alternative uses a
more efficient C library which can handle very large vectors, up to the
maximum value of a "long long" datatype (at least 9223372036854775807
on supported systems).
EXPORT
None by default.
Exportable constants
HASHCNT
PRIME_SIZ
SIZ
SEE ALSO
libbloom.so
AUTHOR
Peter Alvaro and Dmitriy Ryaboy, <palvaro@ask.com>
COPYRIGHT AND LICENSE
Copyright (C) 2006 by Peter Alvaro and Dmitriy Ryaboy
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself, either Perl version 5.8.5 or, at
your option, any later version of Perl 5 you may have available.