Recall that convolutional neural networks (CNNs) make the explicit assumption that inputs are images, which are just three-dimensional arrays with a width, a height, and a depth (e.g. an RGB image has a depth of 3). CNNs use convolutional layers to take advantage of this structural property to extract and learn features (curved edges, color, etc.) of image data.
The first layer of any CNN is a convolutional layer. Its primary purpose is to convolve (think of it as sliding) a filter over an image. The area of the image that the filter currently covers is called the receptive field. Mathematically, all the filter is doing is performing element-wise multiplication with the receptive field, and then summing all these values into a single number (i.e. a dot product). This is repeated over the whole image to get what is called a feature map or activation map.
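To make the sliding-and-dot-product idea concrete, here is a minimal NumPy sketch of convolving a single filter over a grayscale image (the function name `convolve2d` and the toy sizes are illustrative, not taken from any particular library):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, taking a dot product at each position
    (no padding, stride 1)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out_h, out_w = ih - kh + 1, iw - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            receptive_field = image[y:y + kh, x:x + kw]
            # Element-wise multiply, then sum: a dot product with the filter.
            feature_map[y, x] = np.sum(receptive_field * kernel)
    return feature_map

image = np.random.rand(8, 8)           # a toy 8x8 grayscale "image"
kernel = np.array([[1., 0., -1.],      # a simple vertical-edge filter
                   [1., 0., -1.],
                   [1., 0., -1.]])
print(convolve2d(image, kernel).shape)  # (6, 6) feature map
```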
It is important to note that the depths of the filter and input have to match. This ensures that the dot product collapses every depth channel into a single number, so each filter produces an output with one depth channel.
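As a quick sketch of the depth-matching point, assuming a (depth, height, width) layout for a single patch of an RGB image:

```python
import numpy as np

# A 3x3 filter over a depth-3 input must itself have depth 3, i.e. shape
# (3, 3, 3). The dot product runs over all three channels and still
# collapses to one scalar, so the resulting feature map has depth 1.
receptive_field = np.random.rand(3, 3, 3)    # (depth, height, width) patch
filt = np.random.rand(3, 3, 3)               # filter depth matches input depth
activation = np.sum(receptive_field * filt)  # a single number
print(activation)
```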
Now you can run any number of filters over the input to get different feature maps. This is desirable because it helps the network learn different properties of the input data. These feature maps can then be stacked along the depth dimension to form a new volume, which becomes the input to the next layer. As a result, the depth of the output equals the number of filters used, as the sketch below illustrates.
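Putting the pieces together, here is a sketch of a full convolutional layer that applies several filters and stacks the resulting feature maps along the depth axis (again with hypothetical names and toy sizes, not a reference implementation):

```python
import numpy as np

def conv_layer(image, filters):
    """Apply every filter to `image` and stack the resulting feature maps
    along the depth axis (no padding, stride 1)."""
    d, ih, iw = image.shape
    n, fd, kh, kw = filters.shape
    assert fd == d, "filter depth must match input depth"
    out_h, out_w = ih - kh + 1, iw - kw + 1
    output = np.zeros((n, out_h, out_w))  # output depth = number of filters
    for f in range(n):
        for y in range(out_h):
            for x in range(out_w):
                output[f, y, x] = np.sum(image[:, y:y + kh, x:x + kw] * filters[f])
    return output

rgb_image = np.random.rand(3, 32, 32)        # depth 3, 32x32 input
filters = np.random.rand(8, 3, 5, 5)         # 8 filters, each 3x5x5
print(conv_layer(rgb_image, filters).shape)  # (8, 28, 28): output depth is 8
```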