Some tips for dealing with https://huggingface.co/datasets/commaai/commaCarSegments
For context, this dataset contains raw log data for a lot of drives.
I wanted to learn about the CAN bus messages on certain vehicle types. What follows are some tips for working with the dataset.
- These tips work on Linux and macOS.
- You'll need to be able to build code in the openpilot repo:
  - https://github.com/commaai/openpilot/tree/master/tools#native-setup-on-ubuntu-2404-and-macos
- Can you build and run Cabana?
- Are you willing to write some Python?
- Take inspiration from:
- Run your script with `./tools/op.sh venv` and then `python /path/to/script.py`. You don't even need to download the segments dataset for this!
- Example:

```python
import random

from openpilot.tools.lib.comma_car_segments import get_comma_car_segments_database, get_url
from openpilot.tools.lib.logreader import LogReader
from openpilot.tools.lib.route import SegmentRange

database = get_comma_car_segments_database()

# Get all segments in the database for the Subaru Forester
fp = "SUBARU_FORESTER"
forester_segments = database[fp]

# Print all known segments, e.g. output:
# [...]
# 9a3079fb5c491ea5/2023-10-29--09-26-58/19/s
# e12c50f77a67608c/2023-12-07--14-01-43/3/s
# 306dc508a8dbd532/2023-12-30--17-34-15/45/s
# ace607260543d257/2024-01-05--14-02-25/1/s
# 382f4ac8109f707f/2023-11-17--15-00-16/18/s
# [...]
for s in forester_segments:
    print(s)

# Example of more ways to interact with a segment:
segment = random.sample(forester_segments, k=1)[0]
sr = SegmentRange(segment)
url = get_url(sr.route_name, sr.slice)
lr = LogReader(url)
CP = lr.first("carParams")
```

In my case, I wanted to play some segments from the Hugging Face dataset in Cabana. We need to download the segments locally, since there's no easy way to point Cabana at the segments hosted on Hugging Face.
The dataset contains more than 200 GB of data, which sucks if you're running on a laptop like me. Luckily, there's a way to deal with this.
```bash
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/commaai/commaCarSegments
```

Setting `GIT_LFS_SKIP_SMUDGE=1` prevents Git LFS from downloading all the large files during cloning. Instead, it leaves small pointer files in place, letting you selectively download only what you need and saving disk space.
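For reference, each skipped file is just a small text stub rather than the actual log. If you `cat` one before pulling it, you'll see something along these lines (digest and size elided here, since they're specific to each file):

```
$ cat segments/306dc508a8dbd532/2023-12-11--11-33-59/7/rlog.bz2
version https://git-lfs.github.com/spec/v1
oid sha256:<64-character digest>
size <size in bytes>
```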
Now we can do `git lfs pull -I path/to/segment/**` to selectively download logs.
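To check what has actually been fetched, `git lfs ls-files` marks downloaded objects with a `*` and not-yet-downloaded pointers with a `-`. A quick sketch, grepping for the route to keep the listing manageable:

```bash
# "*" means the real file is on disk, "-" means it's still just an LFS pointer
git lfs ls-files | grep "306dc508a8dbd532/2023-12-11--11-33-59"
```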
For example, for dongle_id 306dc508a8dbd532, there are many routes:
```
commaCarSegments $ tree segments/306dc508a8dbd532
segments/306dc508a8dbd532
├── 2023-12-11--11-33-59
│   └── 7
│       └── rlog.bz2
├── 2023-12-11--14-47-17
│   ├── 28
│   │   └── rlog.bz2
│   ├── 32
│   │   └── rlog.bz2
│   └── 40
│       └── rlog.bz2
├── 2023-12-12--07-28-08
│   └── 23
│       └── rlog.bz2
├── 2023-12-12--11-27-44
│   └── 3
│       └── rlog.bz2
[... truncated for brevity]
```

To resolve the pointers for the segments in route 2023-12-11--11-33-59, we simply do:
```bash
git lfs pull -I "segments/306dc508a8dbd532/2023-12-11--11-33-59/**"
```

OK, so now we have the data, but there's a problem: Cabana's `--data_dir` option expects a different directory structure for the logs!
The structure currently looks like this:
```
% tree segments/306dc508a8dbd532/2023-12-11--14-47-17
segments/306dc508a8dbd532/2023-12-11--14-47-17
├── 28
│   └── rlog.bz2
├── 32
│   └── rlog.bz2
└── 40
    └── rlog.bz2
```
`--data_dir` is actually looking for a structure like this:
```
commaCarSegments % tree segments/0a3e89f78b1d0071
segments/0a3e89f78b1d0071
├── 2023-11-16--13-50-33--10
│   └── rlog.bz2
├── 2023-11-16--13-50-33--15
│   └── rlog.bz2
├── 2023-11-16--13-50-33--2
│   └── rlog.bz2
├── 2023-11-16--13-50-33--9
│   └── rlog.bz2
[...]
```
Old format: `segments/DONGLE_ID/ROUTE/SEGMENT_NUMBER/rlog.bz2`
New format: `segments/DONGLE_ID/ROUTE--SEGMENT_NUMBER/rlog.bz2`
So here's a script that you can run directly in your commaCarSegments checkout: it pulls the files for a given dongle_id and route, then symlinks them into `~/segments` with the flattened layout:
```bash
#!/bin/bash
set -e

# Check if the required arguments are provided
if [ $# -lt 2 ]; then
  echo "Usage: $0 DONGLE_ID ROUTE_DATE"
  echo "Example: $0 0a3e89f78b1d0071 2023-12-31--21-30-21"
  exit 1
fi

DONGLE_ID="$1"
ROUTE_DATE="$2"

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
OUTPUT_DIR="$HOME/segments/$DONGLE_ID"   # use $HOME: ~ does not expand inside quotes
SEGMENTS_DIR="segments/${DONGLE_ID}/${ROUTE_DATE}"
FULL_SEGMENTS_DIR="${SCRIPT_DIR}/${SEGMENTS_DIR}"

set -x
git lfs pull -I "${SEGMENTS_DIR}/*"
mkdir -p "$OUTPUT_DIR"
set +x

# Create symlinks with a flattened directory structure
for dir in "$FULL_SEGMENTS_DIR"/*; do
  if [ -d "$dir" ]; then
    num=$(basename "$dir")
    new_dir="$OUTPUT_DIR/${ROUTE_DATE}--${num}"
    ln -s "$dir" "$new_dir"
    echo "Created symlink for $dir in $new_dir"
  fi
done
```

Finally, we can run Cabana like this:
```bash
cabana --data_dir ~/segments/0a3e89f78b1d0071 2023-12-31--21-30-21
```
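Putting it together with the example route from earlier (assuming the script above was saved as `pull_and_link.sh` in the root of the commaCarSegments checkout; the filename is just a placeholder):

```bash
# Pull one route's rlogs and symlink them into ~/segments/306dc508a8dbd532
./pull_and_link.sh 306dc508a8dbd532 2023-12-11--11-33-59

# Then open it in Cabana
cabana --data_dir ~/segments/306dc508a8dbd532 2023-12-11--11-33-59
```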