I spent dozens of minutes searching for the exact sizes of the different SQuAD2.0 datasets (dev, val and train).
Here are the sizes as of the 30th of March 2025 (I did not find a validation dataset):

Dev Dataset Summary:
Number of articles: 35
Total number of questions: 11873

Train Dataset Summary:
Number of articles: 442
Total number of questions: 130319

The script I used looks like this:

from pathlib import Path

import pandas as pd

if __name__ == "__main__":
    # The top-level "data" list becomes one row per Wikipedia article
    df = pd.read_json(Path("data/train-v2.0.json"))

    # Display a summary of the dataset
    print("Dataset Summary:")
    print(f"Number of articles: {len(df)}")
    print(f"Columns: {df.columns.tolist()}")

    # Count the number of questions in the dataset:
    # each article has paragraphs, and each paragraph has a "qas" list
    num_questions = sum(
        len(paragraph["qas"])
        for entry in df["data"]
        for paragraph in entry["paragraphs"]
    )
    print(f"Total number of questions: {num_questions}")

Feel free to modify, improve and share again.
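
One possible modification, as a minimal sketch rather than a definitive version: the same counts can be computed with only the standard library, for both files in one run. The helper name `summarize_squad` and the `data/` directory are my own assumptions. The sketch also assumes each question carries an `is_impossible` flag marking unanswerable questions, and treats a missing flag as answerable.

```python
import json
from pathlib import Path


def summarize_squad(squad: dict) -> dict:
    """Count articles, paragraphs and questions in a parsed SQuAD-style dict."""
    articles = squad["data"]
    paragraphs = [p for article in articles for p in article["paragraphs"]]
    questions = [qa for p in paragraphs for qa in p["qas"]]
    # Assumption: "is_impossible" marks unanswerable questions;
    # questions without that field are counted as answerable.
    unanswerable = sum(1 for qa in questions if qa.get("is_impossible", False))
    return {
        "articles": len(articles),
        "paragraphs": len(paragraphs),
        "questions": len(questions),
        "unanswerable": unanswerable,
    }


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever the files were downloaded.
    for path in (Path("data/train-v2.0.json"), Path("data/dev-v2.0.json")):
        if not path.exists():
            print(f"{path}: not found, skipping")
            continue
        with path.open(encoding="utf-8") as f:
            print(path.name, summarize_squad(json.load(f)))
```

Keeping the counting in a function that takes the already-parsed dictionary makes it easy to check on a tiny hand-written sample before running it on the full files.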