Here's the assignment:
Download this raw statistics dump from Wikipedia (360 MB unzipped):
http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141029-230000.gz
Write a simple script in your favourite programming language that:
- Gets all page views for the English Wikipedia (these lines are prefixed by "en ")
- Limits those articles to the ones with at least 500 views
- Sorts them by number of views, highest first, and prints the first ten articles
- Measures the time this takes and prints that as well
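The steps above can be sketched in Python roughly as follows. This is only an illustration, not one of the versions in the repository: the names `top_articles` and `main` are made up for this example, and it assumes each line of the dump is whitespace-separated with the project code first, the page title second and the view count third.

```python
import sys
import time


def top_articles(lines, project="en", min_views=500, limit=10):
    """Return up to `limit` (title, views) pairs for one project, most viewed first."""
    prefix = project + " "  # the trailing space keeps out codes such as "en.d"
    popular = []
    for line in lines:
        if not line.startswith(prefix):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # malformed line: no title or no view count
        try:
            views = int(fields[2])
        except ValueError:
            continue  # view count is not a number
        if views >= min_views:  # "at least 500 views"
            popular.append((fields[1], views))
    # Sort by view count, highest first, and keep the first `limit` entries.
    return sorted(popular, key=lambda item: item[1], reverse=True)[:limit]


def main(path):
    """Time the whole query (reading included) and print it in the expected format."""
    start = time.perf_counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        top = top_articles(handle)
    elapsed = time.perf_counter() - start
    print(f"Query took {elapsed:.2f} seconds")
    for title, views in top:
        print(f"{title} ({views})")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved under a name of your choice (say `wmstats.py`, also just an assumption), it would be run as `python wmstats.py pagecounts-20141029-230000`. Sorting only the articles that pass the 500-view filter keeps the sort small compared to the full dump.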
Right now we've got versions in JavaScript (Node.js), PHP, Go, Python, Ruby, Bash (awk/sed/grep), Groovy and Java, the latter in both Java 8 functional style and 'old school' style.
The Bash, Groovy and Java versions were written by @breun, the Ruby version was written by @tieleman, the others by yours truly.
Some measurements on my machine (2011 MacBook Pro, no SSD). Each line gives the average, followed by the individual runs in parentheses:
- C: 1.63s (1.58, 1.73, 1.59, 1.62, 1.63)
- Go: 2.36s (2.31, 2.42, 2.36, 2.33, 2.36)
- Java (old school): 4.77s, or 2.66s when the first measurement is excluded (13.15, 2.59, 2.48, 3.08, 2.58, 2.58)
- Groovy: 4.33s (4.16, 4.27, 4.55, 4.42, 4.27)
- Node.js: 7.10s (7.56, 7.18, 7.01, 6.89, 6.88)
- PHP: 7.44s (7.25, 7.35, 7.28, 7.37, 7.97)
- Python: 7.45s (6.59, 7.28, 6.81, 8.99, 7.59)
- Ruby: 8.85s (9.3, 9.38, 8.6, 8.37, 8.61)
- Bash: 12.34s (12.62, 12.22, 12.78, 12.80, 11.29)
- Lua: 22.81s (24.08, 22.65, 22.11, 21.53, 23.70)
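For anyone who wants to collect comparable numbers, here is a small sketch of a harness that times several runs and reports the mean both with and without the first run. The helper names `time_runs` and `summarize` are hypothetical, and this is not necessarily how the figures above were gathered.

```python
import statistics
import time


def time_runs(task, runs=5):
    """Call task() `runs` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean of all runs, mean excluding the first run)."""
    mean_all = statistics.mean(durations)
    # With a single run there is nothing left after dropping the first one.
    mean_rest = statistics.mean(durations[1:]) if len(durations) > 1 else mean_all
    return mean_all, mean_rest


if __name__ == "__main__":
    go_runs = [2.31, 2.42, 2.36, 2.33, 2.36]  # the Go timings listed above
    mean_all, mean_rest = summarize(go_runs)
    print(f"{mean_all:.2f}s / {mean_rest:.2f}s excluding the first run")
```

Reporting both means matters mostly for runtimes with a slow first run, as the Java line shows.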
Your output should look like this:
Query took 7.56 seconds
Main_Page (394296)
Malware (51666)
Loan (45440)
Special:HideBanners (40771)
Y%C5%AB_Kobayashi (34596)
Special:Search (18672)
Glutamate_flavoring (17508)
Online_shopping (16310)
Chang_and_Eng_Bunker (14956)
Dance_Moms (8928)
Here is a quick version in Haskell:

{-# LANGUAGE OverloadedStrings #-}

import Data.List (sortBy)
import qualified Data.ByteString.Char8 as B8

main :: IO ()
main = B8.readFile "pagecounts-20141029-230000" >>=
    mapM_ display . take 10 . sortBy (flip compare) . filter popular
      . map parse . filter en . map B8.words . B8.lines
  where
    -- Keep only lines whose first field is the English Wikipedia code.
    en :: [B8.ByteString] -> Bool
    en ("en":_) = True
    en _ = False

    -- Extract (views, title); malformed lines become (0, "") and are filtered out later.
    parse :: [B8.ByteString] -> (Int, B8.ByteString)
    parse (_:t:vs:_) = case B8.readInt vs of
      Just (v, "") -> (v, t)
      _ -> (0, "")
    parse _ = (0, "")

    -- "At least 500 views", as the assignment asks.
    popular :: (Int, B8.ByteString) -> Bool
    popular (v, _) = v >= (500 :: Int)

    display :: (Int, B8.ByteString) -> IO ()
    display (v, t) = B8.putStrLn $ B8.concat [t, " (", B8.pack $ show v, ")"]

To compile:
ghc -Wall -O2 wmstats.hs
To time:
time ./wmstats
On my machine, the Go version takes ~2.040s and this Haskell version takes ~1.232s.