Given a non-negative integers array a with different len(a) (scale) and different max(a) (amplitude), what is the most efficient way in Python to count the unique elements in a?
In this experiment, we fix the scale len(a) to be 20000, and focus specifically on the amplitude max(a) of the array.
Here is the code:
def get_runtime_stat(AMPLITUDE:int):
def _test1():
counts = np.zeros((AMPLITUDE + 1, ), dtype=np.int32)
# np.int16 is already enough for sample size 20000 < 32767
for ind in a: counts[ind] += 1
counts = counts[counts != 0] # this is because some value in range may be missing
return counts
def _test2():
uniq_vals, uniq_cnts = np.unique(a, return_counts=True)
return uniq_cnts # uniq_vals are sorted
def _test3():
counts = np.bincount(a, minlength=AMPLITUDE+1)
counts = counts[counts != 0]
return counts
TIMES = [[], [], []]
for repeat in range(10):
SCALE = 20000
a = np.random.randint(0, AMPLITUDE, (SCALE,))
tic = time.time(); ct1 = _test1(); tac = time.time() # or use tracemalloc.get_traced_memory() to get memory peak
time1 = tac - tic
tic = time.time(); ct2 = _test2(); tac = time.time()
time2 = tac - tic
tic = time.time(); ct3 = _test3(); tac = time.time()
time3 = tac - tic
assert np.all(ct1 == ct2) and np.all(ct1 == ct3)
del ct1, ct2, ct3
TIMES[0].append(time1)
TIMES[1].append(time2)
TIMES[2].append(time3)
return TIMESFor AMPLITUDE in [5, 10, 50, 100, ..., 1e10], here is the results of time usage and memory usage:
So in my case, when AMPLITUDE is small (<1e5) I should use np.bincount, and otherwise np.unique.

