When training a neural network model for image recognition, the training images are usually resized to one common size, e.g. 512x512 or 224x224.
For a fully connected network, the number of neurons in each layer is fixed, so the input images need to be the same size to fit the network.
A convolutional neural network, on the other hand, does not actually require all the images to be the same size.
The convolution operation simply applies filters to an image and outputs a feature map. If the input image is bigger, the output feature map is bigger too. Convolution and pooling operations are not bound to one particular image size.
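A minimal sketch of this, assuming PyTorch: the same conv + pool layers accept inputs of any size, and only the size of the output feature map changes.

```python
# The same conv/pool stack applied to two different input sizes:
# only the output feature-map size differs.
import torch
import torch.nn as nn

layers = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # 3-channel input, 8 filters; padding keeps H and W
    nn.MaxPool2d(kernel_size=2, stride=2),      # halves height and width
)

small = torch.randn(1, 3, 224, 224)
large = torch.randn(1, 3, 512, 512)
print(layers(small).shape)  # torch.Size([1, 8, 112, 112])
print(layers(large).shape)  # torch.Size([1, 8, 256, 256])
```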
However, training needs to compute a loss. If the training images have different sizes and thus produce outputs of different sizes, make sure the loss function can handle that, e.g. comparing a ground truth of dimensions m by n against a prediction of dimensions w by h.
In practice, people usually still train on same-size images, for a few reasons:
1. A loss function typically expects the ground truth and the prediction to have the same size.
2. Mini-batch training expects every image in a mini-batch to have the same size, so that they can be stacked into one tensor.
3. Resizing is required anyway to make the input size divisible by some number, depending on how many pooling layers there are. E.g. with five 2x2 max-pool layers of stride 2, the feature map shrinks to 1/32 of the input size, so the input size should be divisible by 32.
4. Random cropping is common anyway: people randomly crop images into smaller, fixed-size patches to enrich the training dataset.
5. If there is a fully connected layer, it takes a fixed number of input neurons, so the input image size must match.
6. The input image must be big enough that it is not reduced to zero size after a few convolution / pooling layers.
Considering all of the above, it is simply easier and neater to preprocess the training images into one fixed size.
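The divisibility check from reason 3 is a quick calculation; the helper name below is just for illustration.

```python
# Five 2x2 max-pool layers with stride 2 shrink each spatial dimension
# by a factor of 2**5 = 32, so the input size should be divisible by 32.
def size_after_pools(size: int, num_pools: int = 5) -> int:
    """Spatial size after repeated 2x2 stride-2 pooling (floor division)."""
    for _ in range(num_pools):
        size //= 2
    return size

print(size_after_pools(512))  # 16  (512 is divisible by 32)
print(size_after_pools(500))  # 15  (the remainder is silently dropped)
print(500 % 32)               # 20  -> 500 is not divisible by 32
```

If the size is not divisible, each pooling layer floors the odd dimensions, and the mismatch shows up later when upsampled feature maps or predictions no longer line up with the input.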
At test time, if there is no fully connected layer to constrain the size, any sufficiently large image works. A bigger input image simply produces a bigger prediction (e.g. a bigger text recognition score map).
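This can be sketched with a tiny fully convolutional model, again assuming PyTorch; the two-layer "score map" head here is a made-up stand-in for a real detector.

```python
# A fully convolutional model has no fixed-size fully connected layer,
# so at test time any sufficiently large input works; the single-channel
# score map simply grows with the input.
import torch
import torch.nn as nn

score_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 1, kernel_size=1),  # 1x1 conv -> single-channel score map
)

with torch.no_grad():
    print(score_net(torch.randn(1, 3, 256, 256)).shape)  # [1, 1, 128, 128]
    print(score_net(torch.randn(1, 3, 640, 480)).shape)  # [1, 1, 320, 240]
```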