We can hear sound from roughly 20 Hz to 20,000 Hz, with the highest sensitivity between 2,000 Hz and 5,000 Hz, where much of speech is located.
To record such sound we need a system that can accurately measure oscillations of up to 20 kHz.
If we want a digital recording, we need to sample the signal from a microphone.
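To represent such a signal without aliasing, the sampling rate must be more than twice the highest frequency in the signal (the Nyquist criterion), that is, above 40,000 Hz. A common choice is 44,100 Hz, the rate used for audio CDs.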
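The effect of sampling too slowly can be illustrated with a short Python sketch. It assumes an ideal sampler and a pure tone; the function name `alias_frequency` is hypothetical and used only for illustration.

```python
def alias_frequency(f_signal_hz, f_sample_hz):
    """Apparent frequency of a pure tone after ideal sampling (illustrative helper)."""
    # Sampling cannot tell apart frequencies that differ by a multiple of the
    # sampling rate, so first reduce the tone into the range [0, f_sample_hz).
    folded = f_signal_hz % f_sample_hz
    # Frequencies above half the sampling rate mirror back below it.
    return min(folded, f_sample_hz - folded)


SAMPLE_RATE_HZ = 44_100

print(alias_frequency(20_000, SAMPLE_RATE_HZ))  # 20000: below 22,050 Hz, preserved
print(alias_frequency(30_000, SAMPLE_RATE_HZ))  # 14100: above 22,050 Hz, aliased
```

A 20 kHz tone survives sampling at 44,100 Hz, while a 30 kHz tone would show up as a false 14.1 kHz tone. This is why recording systems usually filter out content above half the sampling rate before sampling.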
The dynamic range (from the quietest to the loudest sound) of human hearing is about 120 dB, which means the ratio between the largest and the smallest signal amplitude we can distinguish is 10^(120/20) = 1,000,000 to 1.
To cover such a range we need at least 20 bits (2^20 = 1,048,576 levels). Since digital systems work with multiples of 8 bits (8, 16, 24, 32, etc.), 24 bits is the most commonly used bit depth for high-quality sound.
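The step from decibels to bit depth can be checked with a few lines of Python. This is a minimal sketch of the arithmetic above; the helper names are hypothetical.

```python
import math


def amplitude_ratio(dynamic_range_db):
    """Amplitude ratio for a dynamic range in dB (20 dB per factor of 10)."""
    return 10 ** (dynamic_range_db / 20)


def bits_needed(dynamic_range_db):
    """Smallest number of bits whose level count covers the amplitude ratio."""
    return math.ceil(math.log2(amplitude_ratio(dynamic_range_db)))


def round_up_to_whole_bytes(bits):
    """Round a bit count up to the next multiple of 8."""
    return 8 * math.ceil(bits / 8)


bits = bits_needed(120)
print(bits)                           # 20
print(round_up_to_whole_bytes(bits))  # 24
```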
If you want to play such sound in real time you need to transmit about 1 megabit per second (44,100 x 24 = 1,058,400 bits per second), and about 2 megabits per second if it is stereo.
The typical Ethernet speed in a university office in 1990 was 10 megabits per second. Today we are approaching 10 gigabits per second.
One hour of stereo recording takes about 950 megabytes of data (3,600 seconds x 44,100 samples per second x 24 bits x 2 channels / 8 bits per byte), and the storage of your notebook computer can easily fit 50 hours of sound.
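The bit rate and storage figures above follow directly from the sampling parameters. The sketch below reproduces the arithmetic for uncompressed audio; the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 44_100
BIT_DEPTH = 24


def bit_rate(sample_rate_hz, bit_depth, channels):
    """Uncompressed bit rate in bits per second."""
    return sample_rate_hz * bit_depth * channels


def storage_bytes(seconds, sample_rate_hz, bit_depth, channels):
    """Uncompressed storage in bytes for a recording of the given length."""
    return bit_rate(sample_rate_hz, bit_depth, channels) * seconds // 8


print(bit_rate(SAMPLE_RATE_HZ, BIT_DEPTH, 1))  # 1058400 bits/s, mono
print(bit_rate(SAMPLE_RATE_HZ, BIT_DEPTH, 2))  # 2116800 bits/s, stereo

one_hour = storage_bytes(3600, SAMPLE_RATE_HZ, BIT_DEPTH, 2)
print(one_hour)       # 952560000 bytes, about 950 megabytes
print(50 * one_hour)  # 47628000000 bytes, about 48 gigabytes
```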
It turns out we can compress the sound data using AAC (Advanced Audio Coding) or other MPEG (Moving Picture Experts Group) audio formats such as MP3. These formats were developed so that with roughly 30 times less data we can still recreate the sound accurately as far as human perception is concerned. This approach was developed in the 1990s.
Using the same principles we can encode 1080p video at 2.5 megabits per second, and the sound takes only 64 kilobits per second. This method was developed in the 2000s (in part at the University of Arizona) and is commonly used for Skype, Zoom, etc.
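To put these compressed rates in perspective, the sketch below compares them with the uncompressed rates. The audio figures come from the text. The raw video figures are assumptions not stated above: 1920 x 1080 pixels, 24 bits per pixel, and 30 frames per second. The function names are hypothetical.

```python
def compression_ratio(raw_bits_per_s, compressed_bits_per_s):
    """How many times smaller the compressed stream is than the raw one."""
    return raw_bits_per_s / compressed_bits_per_s


def raw_video_bit_rate(width, height, bits_per_pixel, frames_per_s):
    """Uncompressed video bit rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_s


# Audio: 24-bit stereo at 44,100 Hz versus a 64 kilobit per second stream.
raw_audio = 44_100 * 24 * 2
print(round(compression_ratio(raw_audio, 64_000), 1))  # 33.1

# Video: assumed 1080p, 24 bits per pixel, 30 frames per second (not from the text).
raw_video = raw_video_bit_rate(1920, 1080, 24, 30)
print(raw_video)                                       # 1492992000 bits/s
print(round(compression_ratio(raw_video, 2_500_000)))  # 597
```

Under these assumptions the audio ratio is close to the factor of 30 quoted above, while video is compressed by a factor of several hundred.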