Often we want a random sample to use as a test set. Usually that's done by shuffling the data. But occasionally we want to preserve the original order in train.
This script removes a random sample without otherwise changing the order. It shuffles the original, takes a random sample for test, and then removes all lines that occur in the sample from train. (See https://stackoverflow.com/questions/4366533/how-to-remove-the-lines-which-appear-on-file-b-from-another-file-a)
. split.sh example.txt
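
The steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of split.sh: the function name `split_preserving_order`, the `.test`/`.train` output names, and the default sample size are assumptions made for the example, and it relies on GNU `shuf` and `grep` being available.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the real split.sh: all names and defaults are assumptions.

split_preserving_order() {
    local input="$1"
    local test_size="${2:-100}"   # assumed default; the real script may differ

    # Shuffle the input and keep test_size random lines as the test sample.
    shuf -n "$test_size" "$input" > "$input.test"

    # Drop every input line that exactly equals a sampled line:
    #   -F  treat patterns as fixed strings, not regular expressions
    #   -x  match whole lines only
    #   -v  invert the match (keep the lines that do NOT match)
    #   -f  read the patterns from the sample file
    # grep reads the input top to bottom, so the remaining lines keep their order.
    grep -Fvxf "$input.test" "$input" > "$input.train"
}
```

With the function defined, `split_preserving_order example.txt 10` would write `example.txt.test` and `example.txt.train` next to the input. The `-x` flag matters here: without it, a sampled line such as `1` would also remove `11` and `21` from train.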
If the original contains duplicate lines and one copy ends up in the test sample, all other copies are removed from train as well, so the total number of lines in train and test may be less than in the original.
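
The duplicate effect can be reproduced by hand. The snippet below is only an illustration: the file names are hypothetical, the "sample" is fixed rather than random, and it assumes lines are removed with an exact whole-line match such as `grep -Fvxf`.

```shell
#!/usr/bin/env bash
# Illustration only: hypothetical file names, hand-picked sample instead of a random one.
workdir=$(mktemp -d)

printf '%s\n' a b a c a d > "$workdir/original.txt"   # "a" occurs three times
printf '%s\n' a > "$workdir/test.txt"                 # pretend one "a" was sampled

# Removing by content drops every copy of "a", not just the sampled one.
grep -Fvxf "$workdir/test.txt" "$workdir/original.txt" > "$workdir/train.txt"

wc -l < "$workdir/original.txt"   # 6 lines
wc -l < "$workdir/test.txt"       # 1 line
wc -l < "$workdir/train.txt"      # 3 lines (b, c, d)
```

Here train and test together hold 4 lines, although the original had 6.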
Note that increasing test_size will increase the computation time.