> `df.filter(pl.col("extra")).explode("fields").unnest("fields").unnest("value").explode("types")`
(Slightly difficult to access)
The symbols marked "extra" (comments) have fields that can be any of doc, inner, or outer. These are never multiple or required (i.e. always optional singletons), and in the innermost types column you can see they have other names (like "inner doc comment marker").
df.filter(pl.col("extra")).explode("fields").unnest("fields").unnest("value").explode("types").with_columns(pl.col("types").struct.field("type").alias("name")).drop("types")
The AST nodes with fields have a value whose types struct (object) contains a type string for the named types, which is the name, such as doc_comment.
If you unpack ("explode") the fields array of each symbol, you can look at the values of its fields' key-value pairs:
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields")
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value")
The value has 3 parts:
- multiple
- required
- types
This seems very "info dense" and important to the dataset
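To make the shape of that value concrete, here is a minimal plain-Python sketch of what the explode/unnest chain does to one symbol. The `toy_symbol` entry is hand-written for illustration (it is not copied from the real grammar file), `flatten_fields` is a hypothetical helper, and the symbol's own `type` is stored under `symbol` so it doesn't clash with the field types.

```python
# Hand-written toy entry shaped like one row of the frame (illustrative values only).
toy_symbol = {
    "type": "function_item",
    "fields": [
        {"key": "body", "value": {"multiple": False, "required": True,
                                  "types": [{"type": "block", "named": True}]}},
        {"key": "name", "value": {"multiple": False, "required": True,
                                  "types": [{"type": "identifier", "named": True},
                                            {"type": "metavariable", "named": True}]}},
        {"key": "type_parameters", "value": {"multiple": False, "required": False,
                                             "types": [{"type": "type_parameters", "named": True}]}},
    ],
}


def flatten_fields(symbol):
    """Yield one flat row per field: the key plus the 3 parts of its value."""
    for field in symbol["fields"] or []:
        # Spreading the value dict mirrors unnest("value"): multiple, required, types
        yield {"symbol": symbol["type"], "key": field["key"], **field["value"]}


rows = list(flatten_fields(toy_symbol))
for row in rows:
    print(row["key"], row["multiple"], row["required"], [t["type"] for t in row["types"]])
```

Each row still carries a list in `types`, which is why the later steps explode that column as well.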
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").write_parquet("node_field_types.parquet")
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").drop("named", "subtypes")
We don't need the named/subtypes columns here
Let's just take a look at the function-related things
df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").drop("named", "subtypes").filter(pl.col("type").str.starts_with("function"))
There are:
funcs = df.filter(pl.col("fields").is_not_null()).explode("fields").unnest("fields").unnest("value").drop("named", "subtypes").filter(pl.col("type").str.starts_with("function"))
pprint(funcs.get_column("key").to_list())
>>> print("\n".join(funcs.get_column("key").to_list()))
body
name
parameters
return_type
type_parameters
name
parameters
return_type
type_parameters
parameters
return_type
trait
I would guess this is how tree-sitter splits all the bits of the function up into its individual parts...
We can look into that a bit more easily if we rename the struct fields (so the field type doesn't clash with the symbol type)
funcs.explode("types").rename({"type":"symbol"}).with_columns(pl.col("types").struct.rename_fields(["field_type","named_field"])).unnest("types")
If we look at the values, the field type is basically one-to-one with the key name of the unpacked field key-value pair (i.e. the field name of the symbol), with some exceptions and other things to note, e.g.:
- function_item "body": the field type ("block") doesn't exactly match the key but same idea - this is the bit with the main function content in
- function_item "name" has 2 field types ("identifier" and "metavariable")
- there are optional parts like return_type and type_parameters (required = false)
- the return_type is present in all 3 of the symbols: function_item, function_signature_item, and function_type
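As a quick check on which field names recur, the printed key list can be tallied directly. This is a small stdlib sketch: the `keys` list is retyped from the output above, and it assumes the three function symbols each list a given key at most once, so a count of 3 means "present on all three".

```python
from collections import Counter

# Keys retyped from the printed output of funcs.get_column("key")
keys = [
    "body", "name", "parameters", "return_type", "type_parameters",
    "name", "parameters", "return_type", "type_parameters",
    "parameters", "return_type", "trait",
]

counts = Counter(keys)
# Assumption: 3 function symbols, each key at most once per symbol
shared_by_all = sorted(k for k, n in counts.items() if n == 3)
seen_once = sorted(k for k, n in counts.items() if n == 1)

print(shared_by_all)  # ['parameters', 'return_type']
print(seen_once)      # ['body', 'trait']
```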
There is obviously more here.
In terms of reliable targets, I would expect fields with required: true to be useful, because we could always find them when looking at some semantic object (e.g. here the block field type in the body key).
We would then be able to write a program to conditionally extend our match range to the other, optional, parts of the AST based on a check for them in the surrounding nodes.
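A rough sketch of that idea in plain Python, under stated assumptions: `field_rows` is a hand-written stand-in for the function_item rows (only the required flags mentioned above are used), `present` is a hypothetical set of field names found on a concrete node, and `fields_to_match` is an illustrative helper rather than anything from tree-sitter.

```python
# Toy rows for function_item; required flags follow the notes above, values are illustrative.
field_rows = [
    {"key": "body", "required": True},
    {"key": "return_type", "required": False},
    {"key": "type_parameters", "required": False},
]


def fields_to_match(rows, present):
    """Always target required fields; add optional ones only if the node has them."""
    required = [r["key"] for r in rows if r["required"]]
    optional = [r["key"] for r in rows if not r["required"]]
    return required + [key for key in optional if key in present]


# A function with a return type but no generics
print(fields_to_match(field_rows, present={"body", "return_type"}))
# A bare function: only the reliable anchor is left
print(fields_to_match(field_rows, present={"body"}))
```

The required fields act as the anchor for the match, and the optional ones only widen it when the check on the node succeeds.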
Looking more closely at that function_item
funcs_types = funcs.explode("types").rename({"type":"symbol"}).with_columns(pl.col("types").struct.rename_fields(["field_type","named_field"])).unnest("types")
funcs_types.rename({"multiple":"_multiple","required":"_required"}).filter(pl.col("symbol") == "function_item").unnest("children")
funcs_types.rename({"multiple":"_multiple","required":"_required"}).filter(pl.col("symbol") == "function_item").unnest("children").explode("types")
funcs_types.rename({"multiple":"_multiple","required":"_required"}).filter(pl.col("symbol") == "function_item").unnest("children").explode("types").unnest("types")


