Designing a Data Matrix
It is vital that you get your data into the computer in a form that can be analyzed. What makes this tricky is that the format appropriate for the computer is generally not the one that is best for presenting your raw data as a neat table. At the least, this means that you have to put aside your ideas on how data should appear when they are neatly presented, and focus on how to structure your data so that they can be analyzed easily.
You should not worry at this stage about getting your data printed in the form of a well-structured report. We will address that as a problem later.
A word of caution is in order. When you have your data structured in a good input format, it may seem that you are being very inefficient. You are likely to have a lot of duplication, such as a plot number being repeated over and over as it appears with each observation from a plot. Such repetition generally is a necessary part of the data structure and should not be eliminated in an attempt to find a shortcut way to enter your data.
Data Matrix Concepts and Terminology
The basic form in which data are stored for analysis is called a data matrix (or data array). This simply means that you have a group of data values organized in a rectangular form that has a certain amount of consistency.
Related items are found in similar parts of the data matrix.
Each horizontal line of the data matrix is called a row, and the values in it make up one observation.
Each vertical set of values is called a column and its values correspond to a particular variable.
In this example, a name has been placed on the top of each column to identify what information is recorded in the column.
Observations
Observations are the smallest units of sampling in a study. They correspond to each individual or item in the study. This is not necessarily the same as a datum (which is the smallest measurement recorded).
Picture the following: if you have an item and make a number of measurements on it, this entire set comprises an observation. Ordinarily, your observation will include different kinds of measurements, for example the mass, length, temperature and color of the item. The characteristic that makes the measurements an observation is that they all relate to the same item.
An observation is recorded as a row in a data matrix. How many values occur across the data matrix depends on the number of values needed to identify the observation and record all its characteristics.
Variables
Variables are made up of two kinds of information. One kind consists of the information that you use to identify your observations. These might be the plot number and date, sample number, or pond identification number as shown in the previous example. The other kind of variable consists of your measurements, such as the size, mass or yield of the item being observed.
You organize the data matrix so that values for each variable are found stacked one below another. If you are interested in the values for a particular variable, say temperature, for all observations, then you look down a single column of the data matrix. All the values in this column should be temperature.
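As a rough illustration of rows and columns, the following Python sketch stores a small data matrix as a list of rows and reads down one column. The column names and all values are invented for illustration; they are not data from any study described here.

```python
# A data matrix as a header plus a list of rows (one row per observation).
# Column names and values are hypothetical, for illustration only.
columns = ["plot", "date", "temperature", "mass"]
rows = [
    [1, "2020-05-01", 18.5, 12.1],
    [1, "2020-05-08", 19.0, 12.9],
    [2, "2020-05-01", 17.2, 10.4],
]


def column(name):
    """Return every value of one variable by reading down its column."""
    index = columns.index(name)
    return [row[index] for row in rows]


temperatures = column("temperature")
print(temperatures)  # [18.5, 19.0, 17.2]
```

The first two columns identify each observation and the last two hold measurements, so the plot number is repeated on every row that belongs to that plot.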
Organizing A Data Matrix
Arrange everything in a rectangular form.
Once you decide what you are measuring and how to classify each observation, you can decide the columns for your data matrix. Make sure that every value in a column holds the same type of information. For example, the plot number identification column should contain only plot numbers. It is unlikely that you will have two columns with the same type of information. This means that if you are measuring temperature, all of your temperature values will most likely be in the same column in your data matrix. There are valid data matrix designs that are exceptions to this, but it is better to assume that the no-duplication pattern is more appropriate.
Your observations get recorded as rows. As you collect more data, you add new rows to your data matrix. This is like adding lines to a page.
Once you have designed your data matrix, you never add or remove columns as you add observations to the data matrix. Otherwise, you would produce a non-rectangular matrix.
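The rule that new observations add rows but never columns can be checked mechanically. The sketch below is one possible way to do so in Python; the function names `is_rectangular` and `add_observation` are hypothetical helpers written for this illustration, not part of any library.

```python
def is_rectangular(columns, rows):
    """True when every row has exactly one value per column."""
    return all(len(row) == len(columns) for row in rows)


def add_observation(columns, rows, values):
    """Append one observation as a new row, refusing rows of the wrong width."""
    if len(values) != len(columns):
        raise ValueError(
            f"expected {len(columns)} values, got {len(values)}"
        )
    rows.append(list(values))


columns = ["tree", "weight"]
rows = [[1, 12.5], [1, 14.5]]
add_observation(columns, rows, [1, 9.1])
print(is_rectangular(columns, rows))  # True
```

Rejecting a row of the wrong width at entry time is usually easier than repairing a non-rectangular matrix afterwards.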
Be generous in including identification columns in the data matrix. It may seem that you are entering the same information over and over. But that's good, not bad. Remember that you are trying to get information into the computer for analysis, not to save typing.
Be neat and tidy at this stage of the analysis. It is very hard to recover from a poor data matrix design or sloppy recording of values.
Design your data matrix as soon as possible in your study. If you do it before you record values, so much the better. When in doubt about whether you have a proper data matrix, try a design and see if it can be used to produce the desired analyses. Do this with a few observations, not all of them. This will give you a chance to redesign your data matrix, if necessary, without wasting much typing.
Some Examples
A study is being made of the relationship between leaf length and width for a particular plant. Leaves are collected and the length and width of each (measured in millimeters) are recorded.
A further study is made to determine whether the length-width relationship for leaves changes, depending on the location within the plant from which the leaves are collected. Here, locations are identified by the four cardinal compass directions. Lengths and widths are measured in millimeters. Note that all the leaf lengths and widths for a compass direction are not recorded in the same row. Instead, each pair of measurement values and its identifying compass direction has its own row.
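A minimal Python sketch of this design follows. The leaf measurements are made-up numbers used only to show the layout, and the helper `rows_by_direction` is a hypothetical name. Because every row carries its own compass direction, the rows can be regrouped by direction whenever an analysis needs it.

```python
from collections import defaultdict

# One row per leaf: (direction, length in mm, width in mm).
# All values are hypothetical, for illustration only.
leaf_rows = [
    ("N", 52, 21),
    ("N", 48, 19),
    ("E", 55, 24),
    ("S", 60, 25),
    ("W", 47, 20),
]


def rows_by_direction(rows):
    """Collect the (length, width) pairs recorded for each direction."""
    groups = defaultdict(list)
    for direction, length, width in rows:
        groups[direction].append((length, width))
    return dict(groups)


grouped = rows_by_direction(leaf_rows)
print(grouped["N"])  # [(52, 21), (48, 19)]
```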
This study is similar to the previous one, but here it is thought that the relative age of the leaf may influence its length/width relationship. Age is not something that can be measured directly in these plants; instead, the order of leaf emergence on a branch will be used. Each leaf is given a sequence number to indicate its relative position, starting from the apex and working back to the older leaves. This number does not have measurement units. Question: How would you modify this design if you had pairs of leaves originating at each point on the stem? Answer: you would keep the same data matrix design but double the number of observations.
A study has been designed to see if different trees have different average fruit weights. Three fruits are selected randomly from each tree, weighed (in grams) and recorded with the identification number for each tree. One possibility for the data matrix is:
TREE        1     2     3
WEIGHT-1  12.5  23.4  15.4
WEIGHT-2  14.5  18.7  19.2
WEIGHT-3   9.1  21.4  18.7
It would be better, however, to use a data matrix with the weights recorded like this:
TREE  WEIGHT
1     12.5
1     14.5
1      9.1
2     23.4
2     18.7
2     21.4
3     15.4
3     19.2
3     18.7
This design retains more flexibility for analysis, even though it is harder for you to compare the data values for the trees directly from the data matrix.
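The flexibility of the second design can be sketched in Python. The weights below are the ones given in the text; the dictionary layout for the first design and the helper `mean_weight_by_tree` are assumptions made for this illustration.

```python
from collections import defaultdict

# First design: the three weights for each tree kept side by side.
wide = {
    1: [12.5, 14.5, 9.1],
    2: [23.4, 18.7, 21.4],
    3: [15.4, 19.2, 18.7],
}

# Second design: one row per fruit, with the tree number repeated.
long_rows = [
    (tree, weight) for tree, weights in wide.items() for weight in weights
]


def mean_weight_by_tree(rows):
    """Average the WEIGHT column within each value of the TREE column."""
    groups = defaultdict(list)
    for tree, weight in rows:
        groups[tree].append(weight)
    return {tree: sum(ws) / len(ws) for tree, ws in groups.items()}


means = mean_weight_by_tree(long_rows)
for tree, mean in means.items():
    print(tree, round(mean, 2))
```

With one row per fruit, the same function keeps working if a tree contributes two fruits or five, whereas the first design would need a column added or left partly empty.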
Measurement Scales
There are three measurement scales that are commonly used:
Classification data (also known as attribute or nominal data). These are values that are qualitative, such as data consisting of names. The tree positions that were used in Example 2 belong to this measurement scale. Virtually all of the identification information in your data matrix consists of classification data. (Nomen = name).
Ordinal data (also known as ranked data). These are values for which we know the order, but not the precise distance between values. In Example 3, leaf emergence consists of ordinal values since we only know the relative time of emergence of leaves for each branch sampled, not the exact time between the emergence of successive leaves. (Ordinal = to order).
Measurement data (or interval data). These are data where we have numerical values and can place each value precisely on a continuous scale. Anything that you measure with metric units (actually S.I. units) is measurement data.
The measurement scale associated with a particular variable has an important role in determining what type of analysis can be performed. This is not an easy match-up process, but an understanding of your analysis goal and the measurement scales of your variables provides a guideline. With increasing experience, you will find these characteristics of considerable importance.
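One way to keep track of the scale of each variable is to record it next to the column name. The sketch below does this in Python for a leaf study like the ones described earlier. The column names are hypothetical, and placing compass direction and emergence sequence in the same matrix is an assumption made for illustration.

```python
from enum import Enum


class Scale(Enum):
    CLASSIFICATION = "classification"  # names or labels, no order
    ORDINAL = "ordinal"                # order known, distances unknown
    MEASUREMENT = "measurement"        # numeric values on a continuous scale


# Hypothetical columns for a leaf study, each tagged with its scale.
leaf_study_scales = {
    "direction": Scale.CLASSIFICATION,
    "sequence": Scale.ORDINAL,
    "length_mm": Scale.MEASUREMENT,
    "width_mm": Scale.MEASUREMENT,
}


def variables_on(scale, scales):
    """List the column names recorded on the given measurement scale."""
    return sorted(name for name, s in scales.items() if s is scale)


print(variables_on(Scale.MEASUREMENT, leaf_study_scales))
# ['length_mm', 'width_mm']
```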
Self-tests and Exercises
Provide a data matrix design for each of the following problems and identify the measurement scale appropriate to each variable in the matrix.
On a series of monthly trips to an experimental site, the weights of three species of rodents will be recorded from animals caught in a set of traps. Sometimes, several animals are found in a single trap. More often, a trap is empty. The investigator wants to determine whether there are any obvious trends in the live weights of the species over the year during which she will take samples.
A laboratory study will determine the lethal temperature of a species of aquatic plant. A water bath will be used to control the environmental temperature (reading in degrees Fahrenheit). The experimental procedure is to put a container with a small number of plants into the water bath at a particular temperature for 24 hours. At the end of this period, each of the plants is tested to see if it is alive or dead. It is not possible to use the same number of plants in each experimental temperature. One way that the results will be examined is as the percentage survival at each temperature.
A yield trial is to be run on a new grain variety. The experimental plot will be divided into units that have no added fertilizer, a breeder-recommended fertilizer level, and an extension-agent recommended level. There will be one harvest, at the end of the growing season, to see if there are differences in the three experimental treatments. Evaluation will be based on total plant weight and the weight of the plant's grain yield.
The nutritional quality of five fruit varieties will be investigated. The characteristic being studied is the fruit's fiber content. Fiber content is determined by measuring the total fruit weight and the total fiber weight. Twelve fruits will be examined for each tree variety. Since it is suspected that there may be differences between trees of the same variety, four different trees will be used for each variety, each contributing three fruits. Therefore, a total of twenty trees will be used in the study.