Here are the new benchmarks for setkey, updated for v1.9.2. Let's generate some data.
require(data.table)
set.seed(1)
N <- 2e7 # size of DT
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e5, foo())
ch <- unique(ch)
DT <- data.table(a = as.numeric(sample(c(NA, Inf, -Inf, NaN, rnorm(1e6)*1e6), N, TRUE)),
                 b = as.numeric(sample(rnorm(1e6), N, TRUE)),
                 c = sample(c(NA_integer_, 1e5:1e6), N, TRUE),
                 d = sample(ch, N, TRUE))
print(object.size(DT), units="MB")
# 538.9 Mb

We'll copy DT to another object, DT.copy, so that we can benchmark setkey(DT, .) on different columns (and combinations of columns) and then use DT.copy to restore the unsorted DT between runs.
DT.copy = copy(DT)
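As an aside, the time-then-restore pattern used below could be wrapped in a small helper. This is just a sketch of mine (bench_setkey is not part of the original post), using setkeyv() so the key columns can be passed as a character vector:

## sketch: time setkeyv() on the given columns, then restore the unsorted DT
bench_setkey <- function(cols) {
  tm <- system.time(setkeyv(DT, cols))  # setkeyv takes a character vector of column names
  DT <<- copy(DT.copy)                  # put the unsorted copy back for the next run
  tm
}
## e.g. bench_setkey("a"); bench_setkey(c("a", "b")); bench_setkey("d")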
## on numeric column 'a'
> system.time(setkey(DT, a))
   user  system elapsed
  4.861   0.334   5.316
## reset by DT = copy(DT.copy)
## on integer column 'c'
> system.time(setkey(DT, c))
   user  system elapsed
  3.432   0.325   3.889
## reset again
## on numeric, numeric column 'a,b'
> system.time(setkey(DT, a, b))
   user  system elapsed
  6.321   0.229   6.872
## reset again
## on character column 'd'
> system.time(setkey(DT, d))
   user  system elapsed
  3.992   0.182   4.253

## reset again
DT = copy(DT.copy)
## grouping: mean of 'b' by 'c'
system.time(ans <- DT[, mean(b), by=c])
#   user  system elapsed
#  2.943   0.234   3.237

Next, load reshape2 and call melt() on DT. Since DT is a data.table, the call dispatches to data.table's melt.data.table method rather than reshape2's melt.data.frame.
require(reshape2)
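As a quick sanity check (my own addition, not part of the original timings), you can see why the dispatch goes data.table's way: DT carries the "data.table" class ahead of "data.frame", so the S3 generic melt() picks melt.data.table.

class(DT)        # "data.table" "data.frame"
# methods(melt)  # should list melt.data.table among the registered S3 methods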
> system.time(melt(DT, id="d", measure=1:2))
   user  system elapsed
  1.117   0.534   1.677

Note that the older version of reshape2 took about 190 seconds to accomplish this. But Kevin Ushey recently implemented a C++ version of melt in reshape2, which has also been available on CRAN for a short while now. Here are the timings with that new version:
> system.time(reshape2:::melt.data.frame(DT, id="d", measure=1:2))
   user  system elapsed
  3.445   0.587   4.095

It's much faster than the previous version of reshape2's melt, but still a bit slower than data.table's melt here (I suspect the character column with so many unique strings accounts for much of the difference, though I'm not sure why the gap is this large).
In addition, melt.data.table's implementation of the na.rm argument is quite efficient: it checks for and removes NAs on the C side while building the result, which avoids making another copy. Here's a comparison using na.rm=TRUE on the same data set (a sketch of the two-step alternative it avoids follows just below).
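For reference, the two-step route would look like this (my own sketch, not among the original timings): melt everything first, then drop the NA rows, which materialises a full-size molten result and then copies it again while filtering.

## sketch of the two-step alternative that na.rm=TRUE avoids
ans <- melt(DT, id="d", measure=1:2)  # full molten copy, NAs included
ans <- ans[!is.na(value)]             # second pass + another allocation to drop NAs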
## melt.data.table
> system.time(melt(DT, id="d", measure=1:2, na.rm=TRUE))
   user  system elapsed
  2.072   0.587   2.722
## reshape2:::melt
> system.time(reshape2:::melt.data.frame(DT, id="d", measure=1:2, na.rm=TRUE))
   user  system elapsed
 27.316   4.000  39.465

We'll add one more column 'e' first:
smple <- sample(letters[1:10], 2e7, TRUE)
set(DT, i=NULL, j="e", value=smple)

Let's run dcast.data.table now:
## data.table version
system.time(dcast.data.table(DT, d ~ e, value.var="b", fun=sum))
   user  system elapsed
  6.926   0.517   7.599

reshape2's dcast has no new implementation yet (unlike melt, which got Kevin Ushey's rewrite), so it is quite slow and doesn't scale as well. For this example, it takes 76.738 seconds.
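For reference, the reshape2 call behind that 76.738-second figure would presumably look something like this (a sketch only, not re-run here; as.data.frame just keeps it on reshape2's data.frame path, and fun.aggregate is reshape2's argument name):

## reshape2 equivalent (sketch; the timing above is the author's)
# system.time(reshape2::dcast(as.data.frame(DT), d ~ e, value.var="b", fun.aggregate=sum))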
Either way, data.table definitely has more machinery inside to make sure all these things happen as quickly as possible. :)