Working with the Hugging Face dataset

Some tips for dealing with https://huggingface.co/datasets/commaai/commaCarSegments

For context, this dataset contains raw log data for a lot of drives.

I wanted to learn about the CAN bus messages on certain vehicle types. What follows are some tips for working with the dataset.

Prereqs

- These tips work on Linux and macOS.
- You can compile code in the openpilot repo:
  - https://github.com/commaai/openpilot/tree/master/tools#native-setup-on-ubuntu-2404-and-macos
  - Can you build and run Cabana?
- You are willing to write some Python.

Mining the dataset for certain car types

Take inspiration from:
- https://github.com/commaai/openpilot/blob/master/tools/car_porting/examples/find_segments_with_message.ipynb
- https://github.com/commaai/openpilot/blob/master/tools/lib/tests/test_comma_car_segments.py
Run your script by first entering the virtualenv with ./tools/op.sh venv and then running python /path/to/script.py.
You don't even need to download the segments dataset for this!

Example:

from openpilot.tools.lib.comma_car_segments import get_comma_car_segments_database, get_url
from openpilot.tools.lib.logreader import LogReader
from openpilot.tools.lib.route import SegmentRange
import random

database = get_comma_car_segments_database()

# Get all segments in db for Forester
fp = "SUBARU_FORESTER"
forester_segments = database[fp]

# Print all known segments. Example output:
# [...]
# 9a3079fb5c491ea5/2023-10-29--09-26-58/19/s
# e12c50f77a67608c/2023-12-07--14-01-43/3/s
# 306dc508a8dbd532/2023-12-30--17-34-15/45/s
# ace607260543d257/2024-01-05--14-02-25/1/s
# 382f4ac8109f707f/2023-11-17--15-00-16/18/s
# [...]
for s in forester_segments:
  print(s)

# More ways to interact with a segment: pick one at random and load its log
segment = random.choice(forester_segments)
sr = SegmentRange(segment)
url = get_url(sr.route_name, sr.slice)
lr = LogReader(url)
CP = lr.first("carParams")
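
Since the goal here is to learn about CAN traffic, a natural next step is to tally which CAN addresses show up in a segment. Here is a minimal sketch of that. `count_can_addresses` is a hypothetical helper, not part of the openpilot tooling, and it assumes each log event exposes `which()` and that `can` events carry a list of frames with `src` (bus) and `address` fields.

```python
from collections import Counter


def count_can_addresses(events):
    """Count CAN frames per (bus, address) across an iterable of log events.

    Hypothetical helper: assumes each event has which(), and that "can"
    events carry a list of frames with .src (bus) and .address fields.
    """
    counts = Counter()
    for event in events:
        if event.which() != "can":
            continue  # skip everything that is not raw CAN traffic
        for frame in event.can:
            counts[(frame.src, frame.address)] += 1
    return counts
```

Assuming the `lr` from the example above can be iterated event by event, something like `count_can_addresses(lr).most_common(10)` should list the busiest (bus, address) pairs in that segment.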

Playing segments locally

In my case, I wanted to play some segments from the Hugging Face dataset in Cabana. We need to download the segments locally, since there's no easy way to tell Cabana about the segments hosted on Hugging Face.

The dataset contains more than 200 GB of data, which sucks if you're running on a laptop like me. Luckily, there's a way to deal with this.

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/commaai/commaCarSegments

Setting GIT_LFS_SKIP_SMUDGE=1 prevents Git LFS from downloading all large files during cloning. Instead, it creates pointer files, letting you selectively download only what you need and saving disk space.

Now we can do git lfs pull -I path/to/segment/** to selectively download logs.
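
After a clone with smudging skipped, it is easy to lose track of which logs are real and which are still pointer stubs. A small sketch for checking that, relying on the fact that a Git LFS pointer file is a short text file whose first line starts with `version https://git-lfs.github.com/spec/v1`. The helper names `is_lfs_pointer` and `unresolved_logs` are made up for this example.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this line
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at path is still a Git LFS pointer stub."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unresolved_logs(root):
    """List rlog.bz2 files under root that have not been pulled yet."""
    return sorted(p for p in Path(root).rglob("rlog.bz2") if is_lfs_pointer(p))
```

For example, `unresolved_logs("segments/306dc508a8dbd532")` would return the logs for that dongle_id that still need a `git lfs pull`.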

For example, for dongle_id 306dc508a8dbd532, there are many routes:

commaCarSegments $ tree segments/306dc508a8dbd532
segments/306dc508a8dbd532
├── 2023-12-11--11-33-59
│   └── 7
│       └── rlog.bz2
├── 2023-12-11--14-47-17
│   ├── 28
│   │   └── rlog.bz2
│   ├── 32
│   │   └── rlog.bz2
│   └── 40
│       └── rlog.bz2
├── 2023-12-12--07-28-08
│   └── 23
│       └── rlog.bz2
├── 2023-12-12--11-27-44
│   └── 3
│       └── rlog.bz2
# [... truncated for brevity]

To resolve the pointers for segments in route 2023-12-11--11-33-59, we simply do:

git lfs pull -I "segments/306dc508a8dbd532/2023-12-11--11-33-59/**"
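
To go from a segment name printed by the database script to a pattern for `git lfs pull -I`, something like the following could work. It is a sketch under assumptions: `lfs_include_pattern` is a hypothetical helper, and the mapping from `DONGLE_ID/ROUTE/SEGMENT/s` to `segments/DONGLE_ID/ROUTE/SEGMENT/` is inferred from the example output and directory trees in this write-up, not from any documented API.

```python
def lfs_include_pattern(segment_name):
    """Map 'DONGLE_ID/ROUTE/SEGMENT/s' to a `git lfs pull -I` pattern.

    Assumption: the dataset stores each segment under
    segments/DONGLE_ID/ROUTE/SEGMENT/, as in the trees shown here.
    """
    parts = segment_name.strip("/").split("/")
    if parts[-1] == "s":  # drop the trailing 's' component seen in the database output
        parts = parts[:-1]
    if len(parts) != 3:
        raise ValueError(f"unexpected segment name: {segment_name!r}")
    dongle_id, route, segment = parts
    return f"segments/{dongle_id}/{route}/{segment}/**"
```

The returned string can be pasted straight into `git lfs pull -I "..."` to fetch just that one segment.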

OK, so now we have the data, but the problem is that Cabana's --data_dir option expects a different directory structure for the logs.

The structure currently looks like this:

 % tree segments/306dc508a8dbd532/2023-12-11--14-47-17
segments/306dc508a8dbd532/2023-12-11--14-47-17
├── 28
│   └── rlog.bz2
├── 32
│   └── rlog.bz2
└── 40
    └── rlog.bz2

--data_dir is actually looking for a structure like this:

commaCarSegments % tree segments/0a3e89f78b1d0071
segments/0a3e89f78b1d0071
├── 2023-11-16--13-50-33--10
│   └── rlog.bz2
├── 2023-11-16--13-50-33--15
│   └── rlog.bz2
├── 2023-11-16--13-50-33--2
│   └── rlog.bz2
├── 2023-11-16--13-50-33--9
│   └── rlog.bz2

Dataset layout: segments/DONGLE_ID/ROUTE/SEGMENT_NUMBER/rlog.bz2

Layout expected by --data_dir: segments/DONGLE_ID/ROUTE--SEGMENT_NUMBER/rlog.bz2

So here's a script that you can run directly in your commaCarSegments repo that will pull files for a given dongle_id & route, then symlink them to ~/segments:

#!/bin/bash
set -e

# Check if the required arguments are provided
if [ $# -lt 2 ]; then
  echo "Usage: $0 DONGLE_ID ROUTE_DATE"
  echo "Example: $0 0a3e89f78b1d0071 2023-12-31--21-30-21"
  exit 1
fi

DONGLE_ID="$1"
ROUTE_DATE="$2"
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
# Use $HOME rather than ~, since a tilde inside double quotes is not expanded
OUTPUT_DIR="$HOME/segments/$DONGLE_ID"
SEGMENTS_DIR="segments/${DONGLE_ID}/${ROUTE_DATE}"
FULL_SEGMENTS_DIR="${SCRIPT_DIR}/${SEGMENTS_DIR}"

set -x
git lfs pull -I "${SEGMENTS_DIR}/*"
mkdir -p "$OUTPUT_DIR"
set +x

# Create symlinks with flattened directory structure
for dir in "$FULL_SEGMENTS_DIR"/*; do
  if [ -d "$dir" ]; then
    num=$(basename "$dir")
    new_dir="$OUTPUT_DIR/${ROUTE_DATE}--${num}"
    # -f and -n replace an existing link, so the script can be rerun safely
    ln -sfn "${dir}" "${new_dir}"
    echo "Created symlink for $dir in $new_dir"
  fi
done
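
If you would rather do the symlink step from Python, the flattening could be sketched like this. `flatten_route` is a hypothetical helper that only covers the renaming from ROUTE/SEGMENT_NUMBER to ROUTE--SEGMENT_NUMBER. It does not run `git lfs pull`, so the logs need to be pulled already.

```python
from pathlib import Path


def flatten_route(route_dir, output_dir):
    """Symlink ROUTE/SEGMENT_NUMBER dirs as ROUTE--SEGMENT_NUMBER in output_dir."""
    route_dir = Path(route_dir).resolve()  # absolute targets keep the links valid
    output_dir = Path(output_dir).expanduser()
    output_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for seg_dir in sorted(route_dir.iterdir()):
        if not seg_dir.is_dir():
            continue
        link = output_dir / f"{route_dir.name}--{seg_dir.name}"
        if not link.is_symlink():  # leave links from an earlier run alone
            link.symlink_to(seg_dir, target_is_directory=True)
        links.append(link)
    return links
```

Run from the commaCarSegments repo, `flatten_route("segments/306dc508a8dbd532/2023-12-11--14-47-17", "~/segments/306dc508a8dbd532")` would create one link per segment directory of that route.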

Finally, we can run Cabana like this:

cabana --data_dir ~/segments/0a3e89f78b1d0071 2023-12-31--21-30-21

aubsw/hugging_face_datasets.md
