@lmmx
Last active November 6, 2025 00:57
Loaded tree-sitter node types into Parquet via polars-genson (see https://gist.github.com/lmmx/ed3dd70ea7997f27efa1ff31b625c0b1)
lmmx commented Nov 3, 2025

```python
>>> import polars as pl
>>> import polars_genson
>>> from pathlib import Path
>>> json = Path("node-types.json").read_text()
>>> df = pl.DataFrame({"node_types": json})
>>> transformed = df.genson.normalise_json("node_types", wrap_root="node_types").explode("*").unnest("*")
>>> transformed.write_parquet("node_types.parquet")
```
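For context, a tree-sitter `node-types.json` file is a bare top-level JSON array of node descriptors, which is why `wrap_root="node_types"` is needed to give the array a key before normalising it into a column. A minimal stand-in sketch (the two descriptors here are hypothetical, not from the real file):

```python
import json

# A minimal stand-in for tree-sitter's node-types.json: a top-level
# array of node descriptors, each with at least "type" and "named".
node_types = [
    {"type": "function_item", "named": True},
    {"type": ";", "named": False},
]

raw = json.dumps(node_types)
parsed = json.loads(raw)

# The file is a bare array with no top-level key, hence wrap_root.
print(type(parsed).__name__)  # list
print([n["type"] for n in parsed])
```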
*(Screenshot from 2025-11-03 01-32-37)*

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-08-02)*

```python
df.filter(pl.col("subtypes").is_not_null()).explode("subtypes")
```

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-13-05)*

```python
df.filter(pl.col("subtypes").is_not_null()).explode("subtypes").select(pl.col("subtypes")).unnest("*")
```

This shows that all of the entities that appear as subtypes are named, except for `_`.
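That check can be sketched in plain Python over a hypothetical slice of the exploded subtype rows (stand-in data, not the real file):

```python
# Hypothetical subtype entries: each has a "type" name and a "named"
# flag, mirroring the exploded "subtypes" column.
subtypes = [
    {"type": "function_item", "named": True},
    {"type": "struct_item", "named": True},
    {"type": "_", "named": False},
]

# Collect the types of the unnamed subtype entries.
unnamed = [s["type"] for s in subtypes if not s["named"]]
print(unnamed)  # ['_']
```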

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-16-20)*

```python
df.filter(pl.col("children").is_null())
```

This shows the 176 nodes with no children (most of which also have no fields).

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-17-10)*

```python
df.filter(pl.col("children").is_not_null())
```

This shows the 104 nodes with children (most of which have fields).
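The two complementary filters can be mimicked in plain Python over a hypothetical handful of node rows (stand-in data to illustrate the null/non-null split):

```python
# Hypothetical node rows: "children" is either a struct or None,
# mirroring pl.col("children").is_null() and its complement.
nodes = [
    {"type": "function_item", "children": {"multiple": True}},
    {"type": ";", "children": None},
    {"type": "identifier", "children": None},
]

without_children = [n for n in nodes if n["children"] is None]
with_children = [n for n in nodes if n["children"] is not None]
print(len(without_children), len(with_children))  # 2 1
```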

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-20-38)*

```python
df.filter(pl.col("extra")).explode("fields").unnest("fields").unnest("value").explode("types")
```

(Slightly difficult to access)

The symbols which are "extra" (comments), which can appear as doc, inner, or outer comments, are never multiple or required (i.e. they are always optional singletons). In the innermost `types` column you can see they have other names (like "inner doc comment marker").
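The "always optional singletons" claim amounts to checking `multiple` and `required` on each exploded field value; a stand-in sketch with hypothetical rows:

```python
# Hypothetical exploded field rows for the "extra" (comment) symbols:
# the claim is that every one has multiple=False and required=False.
extra_fields = [
    {"symbol": "line_comment", "key": "doc", "multiple": False, "required": False},
    {"symbol": "block_comment", "key": "inner", "multiple": False, "required": False},
]

all_optional_singletons = all(
    not f["multiple"] and not f["required"] for f in extra_fields
)
print(all_optional_singletons)  # True
```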

lmmx commented Nov 6, 2025

*(Screenshot from 2025-11-06 00-26-50)*

```python
df.filter(pl.col("extra")).explode("fields").unnest("fields").unnest("value").explode("types").with_columns(pl.col("types").struct.field("type").alias("name")).drop("types")
```

The AST nodes with fields have a value whose `types` struct (object) contains a `type` string; for the named types this is the name, such as `doc_comment`.

lmmx commented Nov 6, 2025

If you unpack ("explode") the `fields` array of each symbol and look at the values of its key-value pairs:

*(Screenshot from 2025-11-06 00-32-57)*

```python
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields")
```

*(Screenshot from 2025-11-06 00-30-54)*

```python
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value")
```

Each field's value has three parts:

  • multiple
  • required
  • types

This part seems very information-dense and central to the dataset.
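A sketch of that three-part value shape (the concrete values here are hypothetical, following tree-sitter's node-types field format):

```python
# Each field value is a struct of three parts: multiple (can the field
# hold a list of nodes?), required (is it always present?), and types
# (the node types allowed in this field).
field_value = {
    "multiple": False,
    "required": True,
    "types": [{"type": "block", "named": True}],
}

print(sorted(field_value))  # ['multiple', 'required', 'types']
```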

```python
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").write_parquet("node_field_types.parquet")
```

*(Screenshot from 2025-11-06 00-37-09)*

```python
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").drop("named", "subtypes")
```

We don't need the `named`/`subtypes` columns here.

lmmx commented Nov 6, 2025

Let's just take a look at the function-related things

*(Screenshot from 2025-11-06 00-38-49)*

```python
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").drop("named", "subtypes").filter(pl.col("type").str.starts_with("function"))
```

The field keys are:

```python
>>> funcs = df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").drop("named", "subtypes").filter(pl.col("type").str.starts_with("function"))
>>> print("\n".join(funcs.get_column("key").to_list()))
body
name
parameters
return_type
type_parameters
name
parameters
return_type
type_parameters
parameters
return_type
trait
```

I would guess this is how tree-sitter splits all the bits of the function up into its individual parts...
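Grouping those keys by their symbol makes the decomposition visible. The assignment of keys to the three function symbols below is my reconstruction from the list above, not verified against the real data:

```python
# Hypothetical reconstruction of the function-related (symbol, field key)
# rows; grouping by symbol shows how each function form is decomposed.
rows = [
    ("function_item", "body"),
    ("function_item", "name"),
    ("function_item", "parameters"),
    ("function_item", "return_type"),
    ("function_item", "type_parameters"),
    ("function_signature_item", "name"),
    ("function_signature_item", "parameters"),
    ("function_signature_item", "return_type"),
    ("function_signature_item", "type_parameters"),
    ("function_type", "parameters"),
    ("function_type", "return_type"),
    ("function_type", "trait"),
]

by_symbol: dict[str, list[str]] = {}
for symbol, key in rows:
    by_symbol.setdefault(symbol, []).append(key)

print(by_symbol["function_item"])
# ['body', 'name', 'parameters', 'return_type', 'type_parameters']
```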

lmmx commented Nov 6, 2025

We can look into that a bit more easily if we rename the struct fields (so the field type doesn't clash with the symbol type)

*(Screenshot from 2025-11-06 00-45-33)*

```python
funcs.explode("types").rename({"type": "symbol"}).with_columns(pl.col("types").struct.rename_fields(["field_type", "named_field"])).unnest("types")
```

If we look at the values, they are basically one-to-one with the unpacked key-value pair's key name (i.e. the field name of the symbol), with some exceptions, e.g.:

  • the function_item "body" field's type ("block") doesn't exactly match the field name, but it's the same idea: this is the part containing the main function content
  • the function_item "name" field has 2 possible types ("identifier" and "metavariable")
  • there are optional parts like return_type and type_parameters (required = false)
    • return_type is present in all 3 of the symbols: function_item, function_signature_item, and function_type

There is obviously more here.

In terms of reliable targets, I would expect that fields with required: true would be useful, because we could always find them when looking at some semantic object (e.g. here, the block type in the body field).

We would then be able to write a program that conditionally extends our match range to the other, optional parts of the AST, based on a check for them in the surrounding nodes.
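A sketch of that program idea, with hypothetical field specs: which fields are required here is an assumption for illustration, except that return_type and type_parameters are optional, as noted above.

```python
# Hypothetical field specs for a function_item-like node: anchor on the
# required fields, then extend the match to optional fields only when
# they are actually present in the node.
field_specs = {
    "body": {"required": True},
    "name": {"required": True},
    "return_type": {"required": False},
    "type_parameters": {"required": False},
}

def match_fields(node_fields: dict) -> list[str]:
    """Return the field names a matcher would cover for this node."""
    # Required fields are always reliable anchors.
    matched = [k for k, spec in field_specs.items() if spec["required"]]
    # Conditionally extend to optional fields found in the actual node.
    matched += [
        k for k, spec in field_specs.items()
        if not spec["required"] and k in node_fields
    ]
    return matched

print(match_fields({"body": "...", "name": "f", "return_type": "-> i32"}))
# ['body', 'name', 'return_type']
```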

lmmx commented Nov 6, 2025

Looking more closely at that function_item:

```python
funcs_types = funcs.explode("types").rename({"type": "symbol"}).with_columns(pl.col("types").struct.rename_fields(["field_type", "named_field"])).unnest("types")
```

*(Screenshot from 2025-11-06 00-56-39)*

```python
funcs_types.rename({"multiple": "_multiple", "required": "_required"}).filter(pl.col("symbol") == "function_item").unnest("children")
funcs_types.rename({"multiple": "_multiple", "required": "_required"}).filter(pl.col("symbol") == "function_item").unnest("children").explode("types")
funcs_types.rename({"multiple": "_multiple", "required": "_required"}).filter(pl.col("symbol") == "function_item").unnest("children").explode("types").unnest("types")
```
