```python
"""
Extract the contents of a `run-{id}.wandb` database file.

These database files are stored in a custom binary format. Namely, the
database is a sequence of wandb 'records' (each representing a logging event
of some type, including compute stats, program outputs, experimental metrics,
wandb telemetry, and various other things). Within these records, some data
values are encoded with JSON. Each record is encoded with protobuf and stored
over one or more blocks in a LevelDB log. The result is the binary .wandb
database file.

Note that the wandb SDK internally contains all the code necessary to read
these files. It parses the LevelDB chunked records in the process of syncing
to the cloud, and will show you the results if you run the command
`wandb sync --view --verbose run-{id}.wandb`.
The protobuf schema is included with the SDK (for serialisation), so it can
also be used for deserialisation.

This is a simple script that leans on the wandb internals to invert the
encoding process and render the contents of the .wandb database as a pure
Python object. From here, it should be possible to, for example, aggregate
all of the metrics logged across a run and plot your own learning curves,
without having to go through the excruciating (and hereby redundant) process
of downloading this data from the cloud via the API.

Notes:

* The protobuf schema for the retrieved records is defined here:
  [https://github.com/wandb/wandb/blob/main/wandb/proto/wandb_internal.proto].
  It is useful for understanding the structure of the Python object returned
  by this tool.
* The LevelDB log format is documented here:
  [https://github.com/google/leveldb/blob/main/doc/log_format.md],
  but note that the W&B SDK does not depend on the LevelDB SDK; it just uses
  the same file structure. The reason for this seems to be a different choice
  of checksum algorithm from the LevelDB standard.
* The starting point for this tool was the discussion at this issue:
  https://github.com/wandb/wandb/issues/1768
  supplemented by my study of the wandb SDK source code.
* I've mostly been studying `wandb/wandb/sdk/internal/datastore.py` to
  understand the writer, but I recently noticed that if one uses wandb core,
  it seems to use a different implementation.

Caveats:

* The technique depends on some internals of the wandb library that are not
  guaranteed to be preserved across versions. This script might therefore
  break at any time. Hopefully before then the wandb developers will provide
  an official way for users to access their own offline experimental data
  without roundtripping through their cloud platform.
* The script doesn't include error handling. It's plausible that it will
  break on corrupted or old databases. You'll have to deal with that when the
  time comes.
* OK, the time has come for me to deal with it, because lots of my data is
  apparently corrupted (or was for some reason written without respecting the
  LevelDB format properly). I will try to patch the SDK so parsing can
  recover from the next valid protobuf record.

MIT License
-----------

Copyright (c) 2025 Matthew Farrugia-Roberts

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""
import collections
import dataclasses
import json

import google.protobuf.json_format as protobuf_json
import wandb


@dataclasses.dataclass(frozen=True)
class WeightsAndBiasesDatabase:
    history: tuple = ()
    summary: tuple = ()
    output: tuple = ()
    config: tuple = ()
    files: tuple = ()
    stats: tuple = ()
    artifact: tuple = ()
    tbrecord: tuple = ()
    alert: tuple = ()
    telemetry: tuple = ()
    metric: tuple = ()
    output_raw: tuple = ()
    run: tuple = ()
    exit: tuple = ()
    final: tuple = ()
    header: tuple = ()
    footer: tuple = ()
    preempting: tuple = ()
    noop_link_artifact: tuple = ()
    use_artifact: tuple = ()
    request: tuple = ()


def load_wandb_database(path: str) -> WeightsAndBiasesDatabase:
    """
    Convert a wandb database at `path` into a stream of records of each type.
    """
    d = collections.defaultdict(list)
    # use wandb's internal reader class to parse
    ds = wandb.sdk.internal.datastore.DataStore()
    # point the reader at the .wandb file
    ds.open_for_scan(path)
    # iteratively call scan_data(), which aggregates leveldb records into a
    # single protobuf record, or returns None at the end of the file
    while True:
        record_bin = ds.scan_data()
        if record_bin is None:
            break  # end of file
        # once we have the data, we need to parse it into a protobuf struct
        record_pb = wandb.proto.wandb_internal_pb2.Record()
        record_pb.ParseFromString(record_bin)
        # convert to a python dictionary
        record_dict = protobuf_json.MessageToDict(
            record_pb,
            preserving_proto_field_name=True,
        )
        # strip away the aux fields (num, control, etc.) and get to the data
        record_type = record_pb.WhichOneof("record_type")
        data_dict = record_dict[record_type]
        # replace any [{key: k, value_json: json(v)}] with {k: v}
        for field, items in data_dict.items():
            if isinstance(items, list) and items and 'value_json' in items[0]:
                mapping = {}
                for item in items:
                    if 'key' in item:
                        key = item['key']
                    else:  # 'nested_key' in item
                        key = '/'.join(item['nested_key'])
                    assert key not in mapping
                    value = json.loads(item['value_json'])
                    mapping[key] = value
                data_dict[field] = mapping
        # append this record to the appropriate list for that record type
        d[record_type].append(data_dict)
    return WeightsAndBiasesDatabase(**{k: tuple(v) for k, v in d.items()})


if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        print(
            "usage: parse_wandb_database.py path/to/run.wandb",
            file=sys.stderr,
        )
        sys.exit(1)
    path = sys.argv[1]
    print(f"loading wandb database from {path}...")
    wdb = load_wandb_database(path)
    print("loaded!")
    for record_type, record_data in dataclasses.asdict(wdb).items():
        print(f"  {record_type + ':':21} {len(record_data): 6d} records")
```
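The `value_json` flattening step inside the loop above (turning protobuf's repeated `{key, value_json}` items into a plain dict) can be exercised on its own with synthetic data. A minimal standalone sketch, with record shapes mimicking `MessageToDict` output:

```python
import json

def flatten_items(items: list[dict]) -> dict:
    """Convert wandb-style [{'key': k, 'value_json': json(v)}] lists to {k: v}."""
    mapping = {}
    for item in items:
        # records carry either a flat 'key' or a 'nested_key' path
        key = item['key'] if 'key' in item else '/'.join(item['nested_key'])
        assert key not in mapping
        mapping[key] = json.loads(item['value_json'])
    return mapping

# synthetic history items, shaped like MessageToDict output
items = [
    {'key': 'loss', 'value_json': '0.25'},
    {'nested_key': ['eval', 'accuracy'], 'value_json': '0.9'},
]
print(flatten_items(items))  # {'loss': 0.25, 'eval/accuracy': 0.9}
```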
Most people are probably looking for history-type records, which contain the time series of metrics logged during experiments (though this script extracts all types of records). Example to access the time series for a metric 'foo':

```python
wandb_database = load_wandb_database('path/to/run.wandb')
for record in wandb_database.history:
    if 'foo' in record['item']:
        t = record['step']['num']   # t value used for `wandb.log` call
        x = record['item']['foo']   # logged value of 'foo' metric at this step
```
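Building on that loop, collecting the series into parallel lists ready for plotting might look like this (the record shapes here are assumptions based on the example above, not a documented API, so it is sketched against synthetic records):

```python
def extract_series(history: list[dict], metric: str) -> tuple[list, list]:
    """Collect parallel (step, value) lists for `metric` from history records."""
    steps, values = [], []
    for record in history:
        if metric in record.get('item', {}):
            steps.append(record['step']['num'])
            values.append(record['item'][metric])
    return steps, values

# synthetic history records, shaped like the flattened output of the script
history = [
    {'step': {'num': 0}, 'item': {'foo': 1.0}},
    {'step': {'num': 1}, 'item': {'bar': 2.0}},
    {'step': {'num': 2}, 'item': {'foo': 0.5}},
]
steps, values = extract_series(history, 'foo')
print(steps, values)  # [0, 2] [1.0, 0.5]
```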
I am finding that a lot of my data is 'corrupted' in ways that seem unlikely to be due to filesystem corruption, but more plausibly due to systematic errors in file writing, or incomplete writes from crashed experiments. I am working on the script to try to recover as much data as possible in these cases too.
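For recovery, the relevant layout is the 7-byte LevelDB-style record header described in the log_format.md document linked in the script's docstring (resyncing means scanning forward for a header that decodes sensibly). A minimal decoder sketch of just the header layout, assuming the documented LevelDB field order; note wandb uses a different checksum algorithm, so the checksum value itself is not validated here:

```python
import struct

# LevelDB log constants (per doc/log_format.md); wandb reuses this layout
BLOCK_SIZE = 32768
HEADER_SIZE = 7  # 4-byte checksum + 2-byte length + 1-byte type
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4  # record types for chunked records

def read_header(chunk: bytes) -> tuple[int, int, int]:
    """Decode one 7-byte record header: (checksum, data_length, record_type)."""
    checksum, length, rtype = struct.unpack('<IHB', chunk[:HEADER_SIZE])
    return checksum, length, rtype

# a synthetic FULL-record header for 5 bytes of payload, checksum left as 0
header = struct.pack('<IHB', 0, 5, FULL)
print(read_header(header))  # (0, 5, 1)
```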
The corrupted databases were made with wandb-core's writer with wandb 0.17.5. The new backend was fixed in wandb/wandb#8088 and the fix was included from wandb 0.17.6 onwards; I was just unlucky to be using core with the buggy version of wandb for a while.
I wrote a new log parser that (1) is robust to errors and (2) has an option to parse the databases as if they were generated by versions of wandb with that bug. I packaged this parser as its own library here: https://github.com/matomatical/wunderbar.
Thank you for this. Could you please include a license at the top of the code please? I would like to use it in my work.
Glad it could be of use. I would recommend using the more established library version https://github.com/matomatical/wunderbar which is available under an MIT license. Otherwise, I have added a comment releasing this gist under the same license.
My solution to wandb/wandb#1768