@bittlingmayer
Last active April 23, 2020 14:18
Split a file for train and test (randomly but without shuffling it or otherwise changing the order)

Split a file without shuffling

Often we want a random sample for test. Usually that's done by shuffling, but occasionally we want to preserve the order in train.

This script removes a random sample without otherwise changing the order. It shuffles a copy of the original, takes the first test_size lines of that copy as the test sample, and then removes from train every line that occurs in the sample. (See https://stackoverflow.com/questions/4366533/how-to-remove-the-lines-which-appear-on-file-b-from-another-file-a)
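
The removal step can be tried in isolation on a toy file. The sketch below is only an illustration: the directory and file names are made up, and the "sample" is written by hand rather than drawn at random.

```shell
# Toy demonstration of the line-removal step.
# All names here are hypothetical; files live in a temporary directory.
demo_dir=$(mktemp -d)
printf '%s\n' alpha beta gamma delta epsilon > "${demo_dir}/original.txt"

# Pretend the random sample picked these two lines for test.
printf '%s\n' delta beta > "${demo_dir}/test.txt"

# -F: patterns are fixed strings, not regular expressions
# -x: a pattern must match the whole line
# -v: keep the lines that do NOT match
# -f: read the patterns from a file
grep -Fvxf "${demo_dir}/test.txt" "${demo_dir}/original.txt" > "${demo_dir}/train.txt"

# Prints alpha, gamma, epsilon: the original order is preserved.
cat "${demo_dir}/train.txt"
```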

Usage (the script must be sourced, because it exits early with `return`):

. split.sh example.txt

If there are duplicates in the original and one of them ends up in the test sample, all others will be removed from train, so the total number of lines in train and test may be less than in the original.
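
If losing duplicates matters, one possible workaround is to sample by line number instead of by line content. The sketch below is not part of the original script: `split_by_line_number` is a hypothetical helper, and it uses `awk` selection sampling rather than `shuf` plus `grep`. Unlike the original, it writes the test lines in their original order too.

```shell
# Hypothetical helper, not part of the original script: split by line
# number instead of by line content, so duplicate lines are never lost.
split_by_line_number() {
  input_file=$1
  sample_size=$2
  total=$(wc -l < "${input_file}" | awk '{print $1}')

  # Start from empty outputs so a rerun never sees stale files.
  : > "${input_file}.test"
  : > "${input_file}.train"

  # Selection sampling: line NR goes to test with probability
  # (lines still needed) / (lines still unread), which yields exactly
  # sample_size test lines when sample_size <= total.
  # srand() seeds from the clock, so two runs within the same second
  # may pick the same sample.
  awk -v need="${sample_size}" -v total="${total}" \
      -v test_out="${input_file}.test" -v train_out="${input_file}.train" '
    BEGIN { srand() }
    {
      if (rand() * (total - NR + 1) < need) { print > test_out; need-- }
      else { print > train_out }
    }' "${input_file}"
}
```

Calling `split_by_line_number example.txt 100` would write `example.txt.test` and `example.txt.train`, whose line counts always add up to the line count of the original.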

Note that increasing test_size increases the computation time, because grep has to check each line of the original against more patterns.

original_file=$1
test_size=100
min_len=1000
line_count=$(wc -l < "${original_file}" | awk '{print $1}')
if [[ ${line_count} -le ${min_len} ]]; then
  echo "Not splitting for test. Too few lines in ${original_file}"
  # `return` works here only because the script is sourced (`. split.sh`)
  return
fi
shuffled_file=${original_file}.shuf
test_file=${original_file}.test
train_file=${original_file}.train
echo "Shuffling ${original_file}"
# on Mac there is no `shuf` by default, so using gshuf instead
# or `sort` as a last resort. (`sort -R` is not real shuffle, and is slow)
if command -v shuf > /dev/null; then shuf "${original_file}" > "${shuffled_file}"
# gshuf is GNU shuf installed with a g prefix; like shuf, it has no -R option
elif command -v gshuf > /dev/null; then gshuf "${original_file}" > "${shuffled_file}"
elif command -v sort > /dev/null; then sort -R "${original_file}" > "${shuffled_file}"
fi
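head -n "${test_size}" "${shuffled_file}" > "${test_file}"
# This is the trick: drop every line of the original that exactly matches (-x)
# a fixed-string (-F) line of the test file; the remaining lines keep their order.
grep -Fvxf "${test_file}" "${original_file}" > "${train_file}"
wc -l "${test_file}" "${train_file}"
rm "${shuffled_file}"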