Created
April 19, 2012 19:01
-
-
Save nikodemus/2423090 to your computer and use it in GitHub Desktop.
this was what i had in mind
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| (defconstant +foo-count+ (ash (sb-int:power-of-two-ceiling | |
| (* 8 sb-vm::*backend-page-bytes*)) | |
| -1)) | |
| (defconstant +foo-mask+ (1- +foo-count+)) | |
| (defun foo-duplicates (list) | |
| (let ((bitmap (make-array +foo-count+ :element-type 'bit)) | |
| (results nil)) | |
| (declare (dynamic-extent bitmap)) | |
| (dolist (elt list) | |
| (let ((hash (logand +foo-mask+ (sxhash elt)))) | |
| (cond ((zerop (bit bitmap hash)) | |
| (setf (bit bitmap hash) 1) | |
| (push elt results)) | |
| ((member elt results)) | |
| (t | |
| (push elt results))))) | |
| results)) | |
| (defparameter *list* (let (all) | |
| (loop repeat 10000 | |
| do (push (random 100000) all)) | |
| all)) | |
| (progn | |
| (time (foo-duplicates *list*)) | |
| nil) | |
| (let ((list (copy-list *list*))) | |
| (time (delete-duplicates list)) | |
| nil) | |
| (progn | |
| (time (remove-duplicates *list*)) | |
| nil) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Regarding implementation strategies that use constant space:
If one uses a bit array I think Nikodemus' way is best.
I have read a bit more about bloom filters and currently I think that
using them would even degrade earlier (that is, for smaller lists).
I think I found a way to use a constant size hash table to get the
runtime down by a constant factor for huge inputs.
That is, with n the size of input, m the size of output,
from n * m down to n * (m / hash-table-size).
For small inputs this would behave just like a dynamically adjusted hash table.
I don't want to code that currently, but it goes like this
(keeping the first of multiply-occurring elements, as that is simpler):
Walk through the list, putting each not yet seen element
(tested via the hash table) into the result and in the hash table.
Once the hash table is full, remember the place in the list we are at,
and walk through the rest of the list deleting all elements from it
that are in the hash table. Empty the hash table and repeat from
the remembered place on.