Data Analysis

In radio astronomy the method of analysis depends on the nature of the signal sought.  In its simplest form a square-law detector output voltage can be digitised and plotted.  In days gone past this plotting was done in real-time using a chart recorder with pens drawing on scroll of paper.  This 'total power' technique is still used in simple systems to detect the broadband signals from many sky objects.   Even the detection of pulsars (apparent pulsing of a broadband signal) can be done, in its simplest form, with a square-law detector.

The nature of the Hydrogen line signal, however, requires a different method of detection and analysis.

The Nature of the Hydrogen Line Signal

The hydrogen line signal is not a broadband signal (at least not in radio astronomy terms where typical bandwidths can be 100's of MHz), but is essentially a narrowband sinusoidal radio frequency signal of a fixed, precise, frequency of 1420.4075 MHz.   To separate this weak narrowband signal from the background continuum radio signals requires the use of narrow filters.  In fact, if the hydrogen cloud producing the signal was not moving away or towards us, a single narrowband filter tuned to 1420.4075 MHz would suffice.

However, for the most part, the hydrogen clouds of high interest are moving relative to a topocentric observer.  The movement of the hydrogen produces a doppler shift which is added to the shift produced by the velocity of the topocentric observer (due to Earth's rotation and its orbital speed around the Sun).  In addition to the bulk doppler shift produced by movement towards/away from the observer, many hydrogen line targets have differential velocities within themselves. An example would be a rotating galaxy - where, in the observer's line of sight, one side is rotating away, while the other is rotating towards.  The overall effect is a bulk doppler shift, but not just at a single frequency, but a band of frequencies. These doppler shifts mean a single filter is not sufficient and so an array of filters are required. 

This doppler effect 'inconvenience', is in fact, extremely useful for it allows calculation of the relative velocity of the various clouds and have allowed the plotting of the spiral arms of our home Galaxy.

The need for a multitude of narrowband hardware filters would difficult and expensive to build.  Fortunately there is a way to implement the filter bank digitally in software.

The Digital Equivalent to the Hardware Filter Bank - the Discrete Fourier Transform

The Fourier theorem states that any time series can be transformed into a summation of a sinusoidal series.   That is, signals in the time domain can be transformed into the frequency domain.  This simply means that time-varying data (i.e., data collected by a receiver) can be separated into values spread across a frequency spectrum in frequency 'bins'.  The 'bins' are equivalent to an individual filter of specified centre frequency and bandwidth...

As an example of the transform, a 1 Hz sine wave (with a sinusoidal shape in the time domain) is transformed into a single spike in the frequency domain as shown on the right - termed a spectrum.

The Nature of the Hydrogen Line Signal - Revisited

If the hydrogen line signal was analysed by this method it would appear as a single, narrow spike like the 1 Hz signal - albeit at 1420.4075 MHz instead of 1 Hz.  If the hydrogen line was doppler shifted it would appear to the left or right of the nominal 1420.4075 Mhz position, depending on the sign of the doppler shift.

The width of the spectrum spike observed in some direction depends on how similar in velocity the various parts of the hydrogen gas in that direction are.  In some directions w.r.t to the Galactic centre, the signal from the hydrogen clouds might be narrow and doppler-shifted as shown in the velocity graph below...

...which indicates, in this case, that the gas is moving away.  However, in many cases the various parts of the gas cloud are moving at different velocities, or the view includes several separate clouds - resulting in a a wider velocity distribution such as this...

...where, in the line of sight of the antenna beam, there is hydrogen gas moving away (+50 km/s position), and another cloud moving towards (-50 km/s position).  This is a simple description of what actually could be a more complex reality, but the point is: hydrogen line signals can show an infinite variety of spectral curves, representative of the quantity and relative movement of neutral hydrogen gas in the line of sight of the antenna beam.

It is clear that a number of frequency bins (filters) are needed to represent this structure.  The Discrete Fourier Transform provides this in software.

The Fast Fourier Transform - (FFT)

The processing load for calculating the Discrete Fourier Transform can be reduced by employing one of the various Fast Fourier Transform algorithms.  In its basic form the number of input points (n) must be a power of two   n = 2x.

The data collected from an extended observation period can be quite large (n is large).   The number of output frequency bins (b) is equal to n, so for long data files the number of frequency bins is large - with very narrow bandwidths each.   For a single tone frequency this is an advantage as this improves the S/N.   However, almost all hydrogen line observations will see a spread of frequencies to some extent and so extreme frequency resolution is not necessary.   The long data file can be decimated in time into blocks (power of two length) and an FFT calculated for each which are then summed (averaged).   So the frequency resolution is coarser, but the S/N is improved by the averaging.  The number of time samples analysed in each FFT have, for the software written by the author, been chosen to be selectable in values of 64, 128, 256 and 512.  These values result in the same number of frequency bins for each.

The Relationship Between Observed Doppler Shift and Radial Velocity

It is not possible to measure total velocity via doppler, as velocities transverse to the field of view produce no doppler shift - only radial velocities (directly towards or away) can be measured.  The doppler shift produced by such radial velocity is the same fraction of the observation frequency as the radial velocity is of the speed of light.  That is...

Δf/fobs = Vr/C

...and so for a nominal observation frequency of 1420.40575 MHz, a doppler shift of 1 MHz corresponds to a radial velocity of 211.06 km/s. Therefore, the small bump in the HI profile above placed at about -50 km/s corresponds to an observed doppler shift of about +237 kHz.

Note that the convention for displaying velocities is increasingly positive from left to right.   Velocities are defined as positive when directed away from the observer (because velocity is dS/dT).  Therefore the corresponding doppler shift frequency scale would be increasingly negative from left to right.  That is, negative velocities correspond to a doppler shift frequency above 1420.40575 MHz, and vice versa (as shown in the HI profile above).

For a nominal receiving bandwidth of 1.8 MHz (the practical operating bandwidth of the RTL dongle for a sampling rate of 2.4 Msps), the total range in velocity that is covered by this bandwidth is ~380 km/s.  This sufficient for most intra-galactic hydrogen line observations.

The range of FFT bins (64, 128, 256 and 512) splits the full span for 2.4 Msps (506.544 km/s) into velocity bins of resolution of 7.91, 3.96, 1.98 and 0.99 km/s respectively.  For most citizen scientist level observations these levels of resolution are adequate.

Extended Sources Versus Point Sources

One should be aware that the spectrum received is dependent on the angular extent of the source and the beamwidth of the receive antenna. Some point sources are in a direction where there is little 'background' HI signal - whilst others are in the midst of high levels of 'background' HI signal. Some examples of the difference between the results obtained from a large aperture antenna (left column) versus those from a small aperture antenna (right column).

Extracting the Hydrogen Line Signal

The strength of the hydrogen line signal is very low and so combined with the high noise of a simple system, finding the HI signal is like looking for a pimple on an elephant's back.

To make matters worse the passband response of the RTL dongle is not flat across the passband...

The left and right endpoints of the response line are the limits set by the sampling rate.  In the case of a sampling rate of 2.4 Msps, these endpoints are -250 km/s to +250 km/s (-1.2 MHz to +1.2 MHz).  However, it can be seen that the response rolls off well before those endpoints are reached and so the useable range is about -190 km/s to + 190 km/s (-0.9 Mhz to +0.9 MHz).

The hydrogen line signals are sitting on top of this gross dongle passband ripple (about 1 dB pk-pk). 

The amplitude of the HI signals are well below that ripple amplitude in this system.

Dark Frame Subtraction

In order to see small signals on top of the large system noise signal, a technique can be borrowed from astrophotography called 'dark frame subtraction'.  In the application here that involves recording the response of the system when the target is not being received (the 'dark frame') and subtracting this from a recording when the target is being received.  The difference should reveal the target signal while cancelling out the system response variations.

The 'dark frame' can be recorded by feeding broadband noise into the system - preferably before the first LNA so as to include the response right from the antenna feed output through to the data recording point.

However, a much better technique is to record the 'dark frame' off-source from actual sky signals.  Here the recording is done of signals coming into the system in the same configuration as target recordings.  An extra requirement for this technique is identifying the best 'off-source' pointing, which depends on the target source.  Some examples follow...

Another good reason for using sky signals for recording the 'dark frame' is that any constant terrestrial RFI signals are cancelled out during the subtraction operation.  Also, including the whole system (from sky to dongle) in the 'dark frame' captures all the system response - including the frequency response of the antenna feed and the antenna itself.

Selection of the 'Dark Frame' Method

Which variation of 'dark frame' method used will depend on the target object(s) and the local system parameters and environment.

For this project the method selected is the position-switched method as described in a document from the Jodrell Bank Centre for Astrophysics...

"Position-switched observations in which no frequency offset is made but the reference coordinates are shifted by several degrees from the target direction.. This provides a spectrum of HI in a neighbouring line of sight to that to the target and when it is subtracted can be used to minimise the contribution of emission from the Milky Way when making observations of other galaxies."

Apart from the issue that this project is about observing other galaxies (the Magellanic Clouds), the local environment provides a high degree of RFI and the position-switched method provides a better rejection of RFI than any other method.  Hopefully this can be appreciated by examining the following...

First, the raw data returned by the system used at HawkRAO from a run looking for HI line signals from the galactic plane.

This data reveals the response across the passband of the RTL dongle hardware.

Note the high level of RFI spikes across the passband.  These RFI spikes need to be cancelled out by some means to reveal the sought-after HI line signals.

The HI line signals are barely discernible (circled in the graphic on the right).

It should be obvious that using a 'dark frame' reference derived from a local reference passive load or other RFI-free noise source will not allow the cancellation of the RFI from the raw data.

Likewise, using a 'dark frame' derived from moving the HI line signals out of the passband (frequency-offset) will obviously also not allow cancellation, as the RFI pattern would be completely unrelated.

The 'dark frame' reference, therefore, used at HawkRAO for the purpose of baseline correction is derived from a position switch.

NOTE: although usually 'position-switched' references are obtained by moving the antenna 'off-source', at HawkRAO the antenna is not moved at all.  The 'dark frame' reference is acquired by allowing the Earth's rotation to move the target (the Magellanic Clouds) out of the antenna beam.  In this way any RFI signals remain fixed in position w.r.t. to the antenna beam, largely maintaining the correlation between target and 'dark frame' RFI patterns.

After the target has drifted out of the beam of the antenna, another data acquisition run is done - the 'dark frame'.  The result of such a run is shown on the right.

Hopefully all the responses - other than the target which is no longer in the antenna beam - remain the same.   As the two data runs are done at different times this is unlikely to be 100% true.  Note that while the RFI spikes look largely identical in both runs, there is one spike that is obviously much stronger in the 'dark frame' compared to the target data plot (encircled in the graph on the right).

Any spike or signal which is stronger in the 'dark frame' will cause a negative spike in the corrected result - and vice versa.

After the target raw data has had the 'dark frame' base line correction applied (each target data FFT bin divided by its 'dark frame' FFT bin twin), we can see the good cancellation of the RFI spikes, leaving the HI line signals to appear in the output result shown below.

 Note how the RFI spike which is much larger in the 'dark frame' than the target data has over-corrected and resulted in a large negative spike.

Note also that almost all of the RFI spikes are greater in the 'dark frame' than the target data, resulting in the RFI upward spikes in the raw being replaced by downward spikes in the result.

Some averaging on the output result can be done to reduce the residual effects of the RFI spikes.  A degree of median filtering would perhaps be better as the RFI spikes are outliers to the adjacent noise level.

This, hopefully, gives some insight as to why this particular method of baseline correction is used at HawkRAO - which is a relatively high RFI environment.  It is entirely possible that another method might be more suitable for other targets and other RFI environments.

It is instructive to realise that over a 2 MHz bandwidth the gain of 3 metre dish antenna (50% efficiency) varies by just over 0.01 dB.  The frequency response of the feed used in this project will most likely have a much larger slope.  How much of this slope appears in the result is dependent on the ratio of sky noise to system noise.

All in all, the 'dark frame' subtraction process cancels out most, but not all, system responses.  Compensation for the slope can be done in the final display graph via controls in the data analysis GUI.

Note: the 'dark frame' recordings are analysed at different bin numbers and individual 'dark frame' data files are saved in .CSV form for later recall by the analysis software.  These 'dark frame' files must be produced for each different configuration change (i.e., adding/subtracting any component in the sky-to-dongle path - even including just a connector change) as small discrepancies compromise the quality of the 'dark frame' subtraction process.   This also applies to different settings of the dongle gain for the same reasons.  For this reason all data files are saved with details of the conditions under which they are recorded and can therefore be matched with the corresponding correct 'dark frame' file data.

The Data Analysis Software

After a recording is made by the data acquisition software, the data file can be analysed.  Typical files are very large (>10 GB) and because the resolution requirements are not stringent, the file is decimated in time by reading in blocks selectable in size (64, 128, 256 or 512 samples) and a complex FFT individually performed on each block of IQ data.   The results of each block are summed after being 'dark frame' corrected and the final result displayed.

Results Display Options

The data analysis GUI allows configuration of the display graph including slope compensation, offset and scale as well as velocity display limits and bin range cropping.   Selectable degrees of moving average filtering are provided.

A typical result is shown below...

Results can be exported to other programs via the clipboard or to a graphic file in various formats.

RTL Dongle ADC Offset

The IQ data saved in the acquisition file is in the form of unsigned 8-bit bytes where integer values range from 0 to 255 (00H to FFH).  The RF signal is bi-polar (AC) with +ve and -ve excursions, but the RTL dongle ADC outputs values from 0 to 255 where a value around 127 (128) represents 0V.  This offset needs to be removed from the data before FFT analysis is done otherwise a big spike at DC (the centre frequency in the complex FFT output) will contaminate the spectrum display.  As most FFT algorithms using floating-point input values the value of 127.5 is usually subtracted from the unsigned byte data read from the IQ data file - but that doesn't address the small, but significant, hardware residual offset in the RTL ADC.

To determine the combination of the binary offset  and the residual hardware offset, the incoming IQ data values can be averaged.   The data analysis software assumes an offset (say, 127.5) and then calculates a mean value for each incoming block of data and dynamically updates the offset value to be subtracted.  By this method virtually no DC spike remains in the result at the end of the process.  The offset is fairly accurately determined in the first block (say, 64 samples) and so an accurate adjustment largely happens in the first 100 uS of the data file for a sampling rate of 2.4 Msps - so even though the measured offset can be saved for the next analysis, this action is not necessary.

Note that it is not valid to cast from unsigned to signed in the code - a mathematical conversion is necessary.