@bittlingmayer
Created June 2, 2017 16:17
Shplit.py - shuffle+split a data file
# Shplit.py - shuffle+split a data file
#
# Positional Arguments:
# 1: the filename
# 2: the split factor n, an integer (1/n of the lines go to the test file)
#
# The filename must have an extension.
#
# Example:
#
# python shplit.py data.txt 4
#
# Writes 3/4 of the lines to data.train.txt and 1/4 to data.test.txt
# (the test count is rounded down when the line count is not a multiple of 4)
#
# The rows are selected at random.
def shplit(f, f2, f3, n):
    from random import shuffle
    with open(f, 'r') as infile:
        lines = infile.readlines()
    # Append a newline in case the last line didn't end with one
    # (skipped for an empty file, where lines[-1] would raise IndexError)
    if lines:
        lines[-1] = lines[-1].rstrip('\n') + '\n'
    # Shuffle the data
    shuffle(lines)
    # Write (n-1)/n to train and 1/n to test
    split = len(lines) // n
    with open(f2, 'w') as train:
        train.writelines(lines[split:])
    with open(f3, 'w') as test:
        test.writelines(lines[:split])

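import os
import sys

f = sys.argv[1]
n = int(sys.argv[2])
# Split off only the final extension, so the same text elsewhere in the path is left alone
# (f.replace(ext, ...) would rewrite every occurrence, e.g. in 'data.txt.txt')
base, ext = os.path.splitext(f)
f2 = base + '.train' + ext
f3 = base + '.test' + ext
shplit(f, f2, f3, n)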
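
The script shuffles with the global random state, so each run produces a different split. Below is a minimal sketch of a reproducible variant that works on a list of lines in memory instead of files. The `split_lines` name and its `seed` parameter are assumptions made for illustration and are not part of the original script; the split arithmetic (`len(lines) // n` lines to test, the rest to train) is the same.

```python
import random


def split_lines(lines, n, seed=None):
    """Shuffle a copy of lines and return (train, test).

    Hypothetical helper: `seed` is not an argument of the original script.
    """
    rng = random.Random(seed)  # private generator; the global random state is untouched
    shuffled = list(lines)     # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    split = len(shuffled) // n  # floor division: test gets len // n lines
    return shuffled[split:], shuffled[:split]


lines = ["row %d\n" % i for i in range(10)]
train, test = split_lines(lines, 4, seed=0)
print(len(train), len(test))  # 8 2
```

With 10 lines and a split factor of 4, `10 // 4` is 2, so 2 lines go to test and the remaining 8 to train. Passing the same seed again yields the same split.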