Data Sets

The data sets are the result of three crawls of the mobile web, performed in May 2018, as documented in our ACM CCS paper. Two of the crawls were performed from the University of Illinois (US1 and US2) and the third from a data center in Frankfurt (EU1). The crawls visited 100,000 websites taken from the Alexa top sites list. The list of sites used and their corresponding ranks is included as site-list.csv.
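For example, the site list can be loaded as follows. This is a minimal sketch; the assumption that each row pairs a rank with a site, in that order and without a header line, is ours and should be checked against the actual file.

```python
# Minimal sketch: load site-list.csv into a rank -> site mapping.
# Assumed layout (not documented above): two comma-separated columns,
# rank first, site second, no header line.
import csv

rank_to_site = {}
with open("site-list.csv", newline="") as f:
    for rank, site in csv.reader(f):
        rank_to_site[int(rank)] = site

print(len(rank_to_site), "sites loaded")
```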

File Contents

Raw OpenWPM files

The files <crawl>-crawl-data.sqlite.xz and <crawl>-javascript.ldb.tar.xz (e.g., US1-crawl-data.sqlite.xz) contain the raw data generated by OpenWPM, as described in https://github.com/citp/OpenWPM#output-format. The crawl data file contains an sqlite3 database (compressed using xz) with instrumentation data from each web page load, and the javascript database contains all of the scripts fetched while loading a site, stored in a LevelDB instance (archived using tar and compressed with xz).
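For example, the sqlite database can be unpacked and queried with the Python standard library alone. This is a minimal sketch, assuming the US1 file has been downloaded to the working directory; http_requests is one of the standard OpenWPM instrumentation tables (see the output-format link above).

```python
# Minimal sketch: decompress the xz-compressed sqlite database and run a
# query against it. US1 is used as the example crawl.
import lzma
import shutil
import sqlite3

# Decompress US1-crawl-data.sqlite.xz to a plain sqlite file.
with lzma.open("US1-crawl-data.sqlite.xz") as src, \
        open("US1-crawl-data.sqlite", "wb") as dst:
    shutil.copyfileobj(src, dst)

conn = sqlite3.connect("US1-crawl-data.sqlite")
# http_requests is part of the standard OpenWPM schema.
for (url,) in conn.execute("SELECT url FROM http_requests LIMIT 5"):
    print(url)
conn.close()
```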

Feature files

The file <crawl>-features.csv lists the features of each script, as documented in Section 3.4 of our paper. It is formatted as tab-separated values with a header line that lists the features; there is one row for each script, and each feature is represented by a 1 (feature present) or 0 (feature absent).
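A feature file can be loaded directly into a pandas DataFrame; this minimal sketch assumes only the format described above, with US1 as the example crawl.

```python
# Minimal sketch: load the tab-separated feature matrix with pandas.
import pandas as pd

features = pd.read_csv("US1-features.csv", sep="\t")

# Each row is one script; each feature column holds 1 (present) or 0 (absent).
print(features.shape)
print(features.sum(numeric_only=True))  # how often each feature occurs
```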

The file <crawl>-script-visit-id.csv is a tab-separated value file that, for each script, gives a comma-separated list of the site ranks (as listed in site-list.csv) where that script was used.
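This file can be parsed into a mapping from script to site ranks; in this minimal sketch, the assumption that the script identifier is in the first column and the rank list in the second is ours.

```python
# Minimal sketch: parse US1-script-visit-id.csv into script -> [ranks].
# Assumed column order: script identifier first, comma-separated ranks second.
import csv

script_sites = {}
with open("US1-script-visit-id.csv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        script, ranks = row[0], row[1]
        script_sites[script] = [int(r) for r in ranks.split(",")]

print(len(script_sites), "scripts indexed")
```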

Clustering results

The file US1-cluster-labels.csv is a comma-separated value file that records which cluster each script is assigned to, using the methodology described in Section 5 of our paper. The mapping of cluster IDs to use cases can be found in Table 7 of our paper.
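For example, the distribution of scripts over clusters can be tallied as follows. Since the exact column layout is not specified here, treating the last column as the cluster ID is an assumption in this sketch.

```python
# Minimal sketch: count how many scripts fall into each cluster.
# Assumed: the cluster ID is the last comma-separated column of each row.
import collections
import csv

counts = collections.Counter()
with open("US1-cluster-labels.csv", newline="") as f:
    for row in csv.reader(f):   # comma-separated, per the description above
        counts[row[-1]] += 1    # assumed: cluster ID in the last column

print(counts.most_common())
```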

Sensor-accessing scripts and domains

The files <crawl>-sensor-domains.csv, <crawl>-sensor-script.csv, and <crawl>-sensor-sites.csv list where the four types of sensors we track (motion, orientation, light, and proximity) are accessed, aggregated in three different ways. All three are formatted as tab-separated values. In <crawl>-sensor-script.csv there is one row for each (sensor type, script URL) pair where a script at the given URL accesses the specified sensor: the first column is the sensor type, the second is the script URL, and the third is a comma-separated list of sites (specified by rank) where the script was loaded and accessed the sensor. <crawl>-sensor-domains.csv aggregates these rows by the domain that served the script (using [public suffix list](https://publicsuffix.org)+1 domain grouping), which forms the second column; the first and third columns remain the same. <crawl>-sensor-sites.csv performs the aggregation based on the sites where the sensor access occurs: the first column remains the sensor type and the second is a comma-separated list of all such sites.
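For example, <crawl>-sensor-script.csv can be parsed into a per-sensor index following the three-column layout described above; this is a minimal sketch using US1 as the example crawl.

```python
# Minimal sketch: build sensor type -> [(script URL, site ranks)] from the
# three tab-separated columns described above.
import collections
import csv

by_sensor = collections.defaultdict(list)
with open("US1-sensor-script.csv", newline="") as f:
    for sensor, script_url, sites in csv.reader(f, delimiter="\t"):
        ranks = [int(r) for r in sites.split(",")]
        by_sensor[sensor].append((script_url, ranks))

for sensor, scripts in by_sensor.items():
    print(sensor, len(scripts), "scripts")
```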
