Split Train data into Training and Validation when using ImageDataGenerator in Keras
Keras comes bundled with many utility functions and classes that handle common tasks in machine learning projects. One commonly used class is the ImageDataGenerator. As explained in the documentation:
Generate batches of tensor image data with real-time data augmentation. The data will be looped over (in batches).
If you already keep your training and validation images in separate folders, that approach still works.
Until recently, you were on your own to put together your training and validation datasets, for instance by creating two separate folder structures for your images to be used in conjunction with the flow_from_directory function.
For example, the old way looks something like this:
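from keras.preprocessing.image import ImageDataGenerator

# Two separate folder trees, each with one subfolder per class
TRAIN_DIR = './datasets/training'
VALIDATION_DIR = './datasets/validation'
datagen = ImageDataGenerator(rescale=1./255)
train_generator = datagen.flow_from_directory(TRAIN_DIR)
val_generator = datagen.flow_from_directory(VALIDATION_DIR)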
Lately, however (here’s the pull request, if you’re interested), a new validation_split parameter was added to the ImageDataGenerator. It lets you set aside a subset of your training data as a validation set by specifying the fraction (a float between 0 and 1) you want to allocate to validation:
datagen = ImageDataGenerator(validation_split=0.3, rescale=1./255)
Then, when you call flow_from_directory, you pass the subset parameter to specify which set you want:
train_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='training'
)
val_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='validation'
)
Note that both generators load from TRAIN_DIR; the only difference is that one uses the training subset and the other uses the validation subset.
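To make the split concrete, here is a minimal pure-Python sketch of how a validation_split fraction could partition the files of a single class folder. It is an illustration, not Keras's own code: the helper split_class_files and the file names are hypothetical, and it assumes the split is taken by position in the sorted file list (the first fraction goes to validation, the rest to training) rather than by shuffling.

```python
def split_class_files(filenames, validation_split):
    """Split one class folder's files into (training, validation) lists.

    Assumption (not taken from the Keras source): the first
    `validation_split` fraction of the sorted files is used for
    validation and the remainder for training.
    """
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be strictly between 0 and 1")
    ordered = sorted(filenames)
    cut = int(validation_split * len(ordered))  # number of validation files
    return ordered[cut:], ordered[:cut]


# Hypothetical file names for a single class folder.
cat_files = [f"cat_{i:02d}.jpg" for i in range(10)]
train_files, val_files = split_class_files(cat_files, validation_split=0.3)

print(len(train_files), "training files")
print(len(val_files), "validation files")
```

Under that assumption, the two subsets never overlap and together cover every file, which is why both generators should be created from the same ImageDataGenerator with the same validation_split value.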
And that’s all.