@lmmx
Last active November 6, 2025 00:57
Loaded tree-sitter node types into Parquet via polars-genson (see https://gist.github.com/lmmx/ed3dd70ea7997f27efa1ff31b625c0b1)
lmmx commented Nov 3, 2025

```python
>>> import polars as pl
>>> import polars_genson
>>> from pathlib import Path
>>> json = Path("node-types.json").read_text()
>>> df = pl.DataFrame({"node_types": json})
>>> transformed = df.genson.normalise_json("node_types", wrap_root="node_types").explode("*").unnest("*")
>>> transformed.write_parquet("node_types.parquet")
```
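For context, a tree-sitter `node-types.json` file is a bare top-level JSON array of node descriptors, which is why `wrap_root="node_types"` is needed to give the array a key before normalising it into a column. A minimal stand-in sketch (the two descriptors here are hypothetical, not from the real file):

```python
import json

# A minimal stand-in for tree-sitter's node-types.json: a top-level
# array of node descriptors, each with at least "type" and "named".
node_types = [
    {"type": "function_item", "named": True},
    {"type": ";", "named": False},
]

raw = json.dumps(node_types)
parsed = json.loads(raw)

# The file is a bare array with no top-level key, hence wrap_root.
print(type(parsed).__name__)  # list
print([n["type"] for n in parsed])
```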
*(Screenshot from 2025-11-03 01-32-37)*

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-08-02)*

```python
df.filter(pl.col("subtypes").is_not_null()).explode("subtypes")
```

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-13-05)*

```python
df.filter(pl.col("subtypes").is_not_null()).explode("subtypes").select(pl.col("subtypes")).unnest("*")
```

This shows that all of the entities that appear as subtypes are named, except for `_`.
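That check can be sketched in plain Python over a hypothetical slice of the exploded subtype rows (stand-in data, not the real file):

```python
# Hypothetical subtype entries: each has a "type" name and a "named"
# flag, mirroring the exploded "subtypes" column.
subtypes = [
    {"type": "function_item", "named": True},
    {"type": "struct_item", "named": True},
    {"type": "_", "named": False},
]

# Collect the types of the unnamed subtype entries.
unnamed = [s["type"] for s in subtypes if not s["named"]]
print(unnamed)  # ['_']
```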

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-16-20)*

```python
df.filter(pl.col("children").is_null())
```

This shows the 176 nodes with no children (most of which also have no fields).

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-17-10)*

```python
df.filter(pl.col("children").is_not_null())
```

This shows the 104 nodes with children (most of which have fields).
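The two complementary filters can be mimicked in plain Python over a hypothetical handful of node rows (stand-in data to illustrate the null/non-null split):

```python
# Hypothetical node rows: "children" is either a struct or None,
# mirroring pl.col("children").is_null() and its complement.
nodes = [
    {"type": "function_item", "children": {"multiple": True}},
    {"type": ";", "children": None},
    {"type": "identifier", "children": None},
]

without_children = [n for n in nodes if n["children"] is None]
with_children = [n for n in nodes if n["children"] is not None]
print(len(without_children), len(with_children))  # 2 1
```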

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-20-38)*

```python
df.filter(pl.col("extra")).explode("fields").unnest("fields").unnest("value").explode("types")
```

(Slightly difficult to access)

The symbols which are "extra" (comments), which can appear as doc, inner, or outer comments, are never multiple or required (i.e. they are always optional singletons). In the innermost `types` column you can see they have other names (like "inner doc comment marker").
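The "always optional singletons" claim amounts to checking `multiple` and `required` on each exploded field value; a stand-in sketch with hypothetical rows:

```python
# Hypothetical exploded field rows for the "extra" (comment) symbols:
# the claim is that every one has multiple=False and required=False.
extra_fields = [
    {"symbol": "line_comment", "key": "doc", "multiple": False, "required": False},
    {"symbol": "block_comment", "key": "inner", "multiple": False, "required": False},
]

all_optional_singletons = all(
    not f["multiple"] and not f["required"] for f in extra_fields
)
print(all_optional_singletons)  # True
```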

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-26-50)*

```python
df.filter(pl.col("extra")).explode("fields").unnest("fields").unnest("value").explode("types").with_columns(pl.col("types").struct.field("type").alias("name")).drop("types")
```

The AST nodes with fields have a value whose `types` struct (object) contains a `type` string; for the named types this is the name, such as `doc_comment`.

lmmx commented Nov 6, 2025

If you unpack ("explode") the `fields` array of each symbol and look at the values of its key-value pairs:

*(Screenshot from 2025-11-06 00-32-57)*

```python
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields")
```

*(Screenshot from 2025-11-06 00-30-54)*

```python
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value")
```

Each field's value has three parts:

  • multiple
  • required
  • types

This part seems very information-dense and central to the dataset.
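A sketch of that three-part value shape (the concrete values here are hypothetical, following tree-sitter's node-types field format):

```python
# Each field value is a struct of three parts: multiple (can the field
# hold a list of nodes?), required (is it always present?), and types
# (the node types allowed in this field).
field_value = {
    "multiple": False,
    "required": True,
    "types": [{"type": "block", "named": True}],
}

print(sorted(field_value))  # ['multiple', 'required', 'types']
```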

```python
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").write_parquet("node_field_types.parquet")
```

*(Screenshot from 2025-11-06 00-37-09)*

```python
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").drop("named", "subtypes")
```

We don't need the `named`/`subtypes` columns here.

lmmx commented Nov 6, 2025

Let's just take a look at the function-related things

*(Screenshot from 2025-11-06 00-38-49)*

```python
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").drop("named", "subtypes").filter(pl.col("type").str.starts_with("function"))
```

The field keys are:

```python
>>> funcs = df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").drop("named", "subtypes").filter(pl.col("type").str.starts_with("function"))
>>> print("\n".join(funcs.get_column("key").to_list()))
body
name
parameters
return_type
type_parameters
name
parameters
return_type
type_parameters
parameters
return_type
trait
```

I would guess this is how tree-sitter splits all the bits of the function up into its individual parts...
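Grouping those keys by their symbol makes the decomposition visible. The assignment of keys to the three function symbols below is my reconstruction from the list above, not verified against the real data:

```python
# Hypothetical reconstruction of the function-related (symbol, field key)
# rows; grouping by symbol shows how each function form is decomposed.
rows = [
    ("function_item", "body"),
    ("function_item", "name"),
    ("function_item", "parameters"),
    ("function_item", "return_type"),
    ("function_item", "type_parameters"),
    ("function_signature_item", "name"),
    ("function_signature_item", "parameters"),
    ("function_signature_item", "return_type"),
    ("function_signature_item", "type_parameters"),
    ("function_type", "parameters"),
    ("function_type", "return_type"),
    ("function_type", "trait"),
]

by_symbol: dict[str, list[str]] = {}
for symbol, key in rows:
    by_symbol.setdefault(symbol, []).append(key)

print(by_symbol["function_item"])
# ['body', 'name', 'parameters', 'return_type', 'type_parameters']
```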

lmmx commented Nov 6, 2025

We can look into that a bit more easily if we rename the struct fields (so the field type doesn't clash with the symbol type)

*(Screenshot from 2025-11-06 00-45-33)*

```python
funcs.explode("types").rename({"type": "symbol"}).with_columns(pl.col("types").struct.rename_fields(["field_type", "named_field"])).unnest("types")
```

If we look at the values, they are basically one-to-one with the unpacked key-value pair's key name (i.e. the field name of the symbol), with some exceptions, e.g.:

  • the function_item "body" field's type ("block") doesn't exactly match the field name, but it's the same idea: this is the part containing the main function content
  • the function_item "name" field has 2 possible types ("identifier" and "metavariable")
  • there are optional parts like return_type and type_parameters (required = false)
    • return_type is present in all 3 of the symbols: function_item, function_signature_item, and function_type

There is obviously more here.

In terms of reliable targets, I would expect that fields with required: true would be useful, because we could always find them when looking at some semantic object (e.g. here, the block type in the body field).

We would then be able to write a program that conditionally extends our match range to the other, optional parts of the AST, based on a check for them in the surrounding nodes.
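A sketch of that program idea, with hypothetical field specs: which fields are required here is an assumption for illustration, except that return_type and type_parameters are optional, as noted above.

```python
# Hypothetical field specs for a function_item-like node: anchor on the
# required fields, then extend the match to optional fields only when
# they are actually present in the node.
field_specs = {
    "body": {"required": True},
    "name": {"required": True},
    "return_type": {"required": False},
    "type_parameters": {"required": False},
}

def match_fields(node_fields: dict) -> list[str]:
    """Return the field names a matcher would cover for this node."""
    # Required fields are always reliable anchors.
    matched = [k for k, spec in field_specs.items() if spec["required"]]
    # Conditionally extend to optional fields found in the actual node.
    matched += [
        k for k, spec in field_specs.items()
        if not spec["required"] and k in node_fields
    ]
    return matched

print(match_fields({"body": "...", "name": "f", "return_type": "-> i32"}))
# ['body', 'name', 'return_type']
```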

lmmx commented Nov 6, 2025

Looking more closely at that function_item:

```python
funcs_types = funcs.explode("types").rename({"type": "symbol"}).with_columns(pl.col("types").struct.rename_fields(["field_type", "named_field"])).unnest("types")
```

*(Screenshot from 2025-11-06 00-56-39)*

```python
funcs_types.rename({"multiple": "_multiple", "required": "_required"}).filter(pl.col("symbol") == "function_item").unnest("children")
funcs_types.rename({"multiple": "_multiple", "required": "_required"}).filter(pl.col("symbol") == "function_item").unnest("children").explode("types")
funcs_types.rename({"multiple": "_multiple", "required": "_required"}).filter(pl.col("symbol") == "function_item").unnest("children").explode("types").unnest("types")
```
