Index of /archives/CPAN/authors/id/P/PA/PALVARO

Name                          Last modified      Size
Parent Directory                                 -
README                        2007-02-24 02:54   2.5K
CHECKSUMS                     2021-11-22 07:55   4.5K
Bloom-Faster-1.7.tar.gz       2010-06-13 07:17   21K
Bloom-Faster-1.7.readme       2009-06-22 09:19   2.5K
Bloom-Faster-1.7.meta         2010-06-13 07:06   310
Bloom-Faster-1.6.tar.gz       2009-06-23 11:42   22K
Bloom-Faster-1.6.readme       2009-06-22 09:31   2.5K
Bloom-Faster-1.6.meta         2009-06-23 11:41   307
Bloom-Faster-1.6.2.tar.gz     2010-06-13 06:16   21K
Bloom-Faster-1.6.2.readme     2009-06-22 09:19   2.5K
Bloom-Faster-1.6.2.meta       2010-06-13 06:05   312
Bloom-Faster-1.4.tar.gz       2007-03-17 16:48   602K
Bloom-Faster-1.4.readme       2007-03-10 08:16   2.5K
Bloom-Faster-1.4.meta         2007-03-17 16:44   310
Bloom-Faster-1.3.tar.gz       2007-02-23 14:51   477K
Bloom-Faster-1.3.readme       2007-02-23 14:45   2.5K
Bloom-Faster-1.3.meta         2007-02-23 14:48   310
Bloom-Faster-1.3.1.tar.gz     2007-03-17 15:25   8.5M
Bloom-Faster-1.3.1.readme     2007-03-10 08:16   2.5K
Bloom-Faster-1.3.1.meta       2007-03-17 04:08   312

NAME
    Bloom::Faster - Perl extension for the C library libbloom.

INSTALLATION
    See INSTALL.

SYNOPSIS
      use Bloom::Faster;
  
      # m = ideal vector size.
      # k = number of hash functions to use.

      my $bloom = Bloom::Faster->new({m => 1000000, k => 5});

      # Setting m and k directly gives very tight control of memory usage
      # (a function of m) and performance (a function of k). In most
      # applications, though, we won't know the optimal values of either.
      # In those cases it is much easier to supply:
      #
      # n = number of expected elements to check for duplicates,
      # e = acceptable error rate (probability of a false positive)
      #
      # my $bloom = Bloom::Faster->new({n => 1000000, e => 0.00001});

      while (<>) {
            chomp;
            # add() returns true when the value appears to be a duplicate
            # (occasional false positives are possible).
            if ($bloom->add($_)) {
                    print "DUP: $_\n";
            }
      }
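
    The relationship between (n, e) and (m, k) mentioned in the comments
    above can be sketched with the standard textbook formulas. This is an
    illustration only: bloom_params is a hypothetical helper, not part of
    Bloom::Faster, and libbloom may round or size its vector differently.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Textbook Bloom filter sizing (an assumption, not taken from libbloom):
#   m = -n * ln(e) / (ln 2)^2    bits in the vector
#   k = (m / n) * ln 2           number of hash functions
sub bloom_params {
    my ($n, $e) = @_;
    my $ln2 = log(2);
    my $m = ceil(-$n * log($e) / ($ln2 ** 2));
    my $k = int(($m / $n) * $ln2 + 0.5);    # round to the nearest integer
    return ($m, $k);
}

my ($m, $k) = bloom_params(1_000_000, 0.00001);
print "m = $m bits, k = $k hash functions\n";
```

    For n = 1000000 and e = 0.00001 this works out to roughly 24 million
    bits (about 3 MB) and 17 hash functions.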

DESCRIPTION
    Bloom filters are a lightweight duplicate detection algorithm proposed
    by Burton Bloom
    (http://portal.acm.org/citation.cfm?id=362692&dl=ACM&coll=portal), with
    applications in stream data processing, among others. Bloom filters are
    a very cool thing. Where occasional false positives are acceptable,
    Bloom filters let us detect duplicates quickly and with modest memory
    use.
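
    To make the idea concrete, here is a minimal pure-Perl sketch of a
    Bloom filter. ToyBloom is a hypothetical class written for this
    example; it is not how libbloom is implemented, and the choice of MD5
    with double hashing is an assumption made to keep the code short.

```perl
use strict;
use warnings;
use Digest::MD5 ();

package ToyBloom;    # hypothetical, for illustration only

sub new {
    my ($class, %args) = @_;
    my $self = { m => $args{m}, k => $args{k}, bits => '' };
    vec($self->{bits}, $args{m} - 1, 1) = 0;    # pre-size the bit vector
    return bless $self, $class;
}

# Sets the k bits for $key; returns true if all of them were already set.
sub add {
    my ($self, $key) = @_;
    # Derive two 32-bit values from the digest and combine them
    # (double hashing) to get k bit positions.
    my ($h1, $h2) = unpack 'N2', Digest::MD5::md5($key);
    my $seen = 1;
    for my $i (0 .. $self->{k} - 1) {
        my $pos = ($h1 + $i * $h2) % $self->{m};
        $seen = 0 unless vec($self->{bits}, $pos, 1);
        vec($self->{bits}, $pos, 1) = 1;
    }
    return $seen;
}

package main;

my $bloom = ToyBloom->new(m => 10_000, k => 5);
for my $word (qw(apple pear apple)) {
    print "DUP: $word\n" if $bloom->add($word);
}
```

    The second "apple" is always reported, because the same key maps to
    the same bits. A key that was never added is reported only when all k
    of its bits happen to have been set by other keys, which is the false
    positive case.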

    Memory for the bit vector is allocated in the C layer, but Perl's
    object system takes care of releasing it: when a Bloom::Faster object
    goes out of scope, the vector pointed to by the C structure is
    free()d. To release it manually, the DESTROY method can be called.
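
    The scope-based cleanup described above relies on Perl's ordinary
    destructor mechanism. The sketch below shows that mechanism with a
    hypothetical stand-in class, ScopedResource; it only counts DESTROY
    calls, whereas an XS-backed module would free its C memory at that
    point.

```perl
use strict;
use warnings;

package ScopedResource;    # hypothetical stand-in for an XS-backed object

our $freed = 0;

sub new     { my ($class) = @_; return bless {}, $class }
sub DESTROY { $freed++ }   # an XS module would free() its C memory here

package main;

{
    my $obj = ScopedResource->new;
    # ... use $obj ...
}   # the last reference goes away here, so Perl calls DESTROY

print "freed: $ScopedResource::freed\n";    # prints "freed: 1"
```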

    A Bloom filter Perl module is currently available on CPAN, but it is
    profoundly slow and cannot handle large vectors. This alternative uses a
    more efficient C library which can handle arbitrarily large vectors, up
    to the maximum value of a "long long" datatype (at least
    9223372036854775807 on supported systems).

  EXPORT
    None by default.

  Exportable constants
      HASHCNT
      PRIME_SIZ
      SIZ

SEE ALSO
    libbloom.so

AUTHOR
    Peter Alvaro and Dmitriy Ryaboy, <palvaro@ask.com>

COPYRIGHT AND LICENSE
    Copyright (C) 2006 by Peter Alvaro and Dmitriy Ryaboy

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself, either Perl version 5.8.5 or, at
    your option, any later version of Perl 5 you may have available.