Once data are collected, we often need to summarize them numerically and graphically to aid interpretation. This lab begins with using an online climate data exploration tool to visualize climate change trends since 1895 in New York City's Gramercy neighborhood where Baruch College is located. Then we analyze these data using a spreadsheet to estimate the central tendency (mean, median, and mode) and make graphs with appropriate elements to estimate quantitative trends. This lab activity summarizing long-term climate data prepares us for future lab activities where we evaluate links between climate change and the location, abundance, and behavior of organisms. Quantifying the variation in datasets (e.g., distributions, variance, confidence intervals) is covered later in the Population Statistics lab activity.
Statistical background reading can be found in an introductory stats book (see one here), but we'll only focus on basics for this lab and overall course.
For instructions on how to produce visuals in Google Sheets, check out this page.
Students will be able to use an online data exploration tool to obtain climatic data and visualize long-term trends
Students should be able to take a dataset and:
Produce estimates of the minimum, maximum, and range
Produce estimates of the mean, median, and mode
Produce graphs with appropriate elements (title, axes labels and unit markers, data points)
Interpret graphs to describe long-term trends in climatic parameters
Data are recorded information, usually in the form of numbers, that can be examined to discover patterns and trends. Ecologists collect a wide variety of data in order to understand the abundance and distribution of organisms on the planet as well as their relationships with each other and their environments. While data are typically collected from individual organisms (e.g., individual height), we often can't measure every single organism in an entire population. As a result, we collect and summarize data from a sample of individuals and use it to estimate traits about the population (e.g., average height in a population). The science of statistics then allows us to determine how good (accurate) our population estimates are (confidence).
Today we'll use a long-term climate data set to practice data analysis and summary skills. This process of analyzing and summarizing data lies at the heart of understanding ecological patterns and relationships. Data collected by scientists can be summarized in two main ways - numerical and visual summaries. Numerical summaries include information on the distribution of data such as the mean (average) and amount of variation. Visual summaries include graphs and charts that use lines and symbols to represent trends in the data.
Numerical Summaries
Numerical summaries are used to help us understand general trends in the data. A common example is the mean, which is the average value for some trait in a population. If we have n pieces of data, we can add them together and divide by n to get the mean. Another example is the median. The median is the piece of data that falls in the center of the dataset if they are arranged in order. Note that for data sets with an even number of elements (pieces), the median is the average of the two pieces closest to the center.
If there are any extremely large or small numbers in the data set (outliers), they will raise or lower the mean value of the population. In this case, the mean value doesn't give us the best representation of the population because it's swayed by outliers. Alternatively, the median value is not influenced by outliers in the data set. You may have heard news reports on the "median home price" or "median household income" - the median is used here to make sure the numerical summary isn't influenced by extremely high values.
Visual Summaries
Visual summaries take numbers and turn them into images. Just like numerical summaries, they distill data so that general trends and points are easier to understand. One common form of visual summary is a graph. Different types of graphs exist (e.g., bar, line, histogram), but all share some general characteristics. Graphs should contain:
Axes labels with units - Each axis (side of the graph) should be labeled. What is it (length, mass, time, etc.) and how was it measured (metric unit)?
Visual representation of the data - Points, bars, or other shapes that represent the data.
A descriptive title - Not just "variable 1 vs variable 2"! This is the default title in Google Sheets and Excel, and it's awful. It tells us nothing of the relationship between these variables, only that they're graphed... but we already know this just by looking at the axis titles! Instead, our titles should always state the relationship between the variables to aid in interpretation. A good template to practice would be the form of "Y-Trend-With-X", where you would state the y-axis variable, describe the type of trend, then state the x-axis variable. Let's try: if we were to make a graph showing faster plant growth at higher temperatures, we might title it "Plant growth increases with temperature". To make the title even better, you might quantify how much plant growth increases: "Plant growth increases 25% with each additional degree Celcius".
Other options that may be included are legends (useful if you are displaying data from different groups), lines of best fit (trendlines), confidence intervals, and other factors.
Linking Numerical & Visual Summaries: Trendlines & R2 values
When making visual summaries of data, we should also include basic mathematical descriptions of the visual trends. This is easily accomplished by fitting a trendline to the data and evaluating both its equation and R2 value (pronounced R-square). A trendline is a line drawn through a cloud of data points that best "fits" the data (i.e., minimizes the mathematical distance between each data point and the line). In other words, the trendline visually depicts the average relationship between x and y in a graph. The mathematical relationship between x and y is then described by the trendline equation and the R2 value:
Trendline equation - Recall the generalized formula for a linear line, y=mx+b, where m is the slope, x is each unit on the x-axis, and b is the y-intercept. A positive slope is indicated by a positive value of m, and a negative slope is indicated by a negative value of m. We can use this equation to calculate a y-value for any value of x. For example, using the equation y=2(x)+5, if x=2 then y=9. You'll use this technique to evaluate and predict climate patterns below.
R2 value - This value tells you how well the trendline fits the data (i.e., the strength of the relationship between the x and y variables). R2 values range from 0 to 1. A value of 1 means a perfect fit (every point falls right on the trendline), while a value of 0 indicates there is no relationship among the variables. The higher the R2 value, the stronger the fit between x and y. As an example, if R2 = 0.65 then we would say "x explains 65% of the variation in y".
In this activity you'll produce scatterplots (scattercharts), fit trendlines to the data, and then use trendline equations and R2 values to quantify and interpret the trends. Scatterplots are very useful, and widely used for demonstrating the relationship between two variables. For instructions on how to produce visuals in Google Sheets, check out this page.
Ecologists often evaluate how living systems respond to climatic variability. This requires access to climate data as well as the ability to analyze them. To get started with accessing and visualizing climate data, we’ll head to the PRISM Climate Group's Data Explorer. This is a website run by climate scientists at Oregon State University and serves as one of the primary data portals for downloading climatic data. On this site you’ll be able to quickly visualize climate trends for NYC going back to 1895.
Go to the PRISM Climate Group's Data Explorer. Familiarize yourself to the four main areas of this tool: Location; Data Settings; Controls; and the map. On the map, notice how there is a grid of rectangles—climate data are available for each individual grid. For example, you could look at lower Manhattan climate trends separately from those in midtown or upper Manhattan.
At the bottom of the Data Settings options, click on the option for “Interpolate grid cell values”. This tells the database to average several nearby grids in a particular way so that the climate data can be estimated for one particular point within one grid.
Go back up to the Location options, click on Coordinates, and enter these numbers [Latitude: 40.7394] [Longitude: -73.9846; make sure you include the negative sign]
Click on “Zoom to location”, and you should see the map adjust with a red circle appearing in Manhattan, at the corner of Lexington Avenue and 23rd St in Manhattan (the location of Baruch’s 17-Lex building).
In the Data Settings box, highlight only Precipitation and Mean Temp (deselect any others). The Precipitation data are defined as the total amount of precipitation accumulated for each year. The Mean Temp data are defined as the annual average of daily average temperature.
Select Annual Values, and set them like this: [Start: 1895] [End: 2024] with [Units: SI (metric)]. Set the Resolution to 800m. Check to make sure you've already selected "Interpolate Grid Cell Values".
Under Controls, click “Retrieve Time Series”. A graph will appear below, showing values for precipitation and temperature in Baruch’s neighborhood, spanning the time frame from January 1895 to December 2024. Answer the following questions, and note that you don't need to download the time series because we're providing you with a data set for this location that's already been prepared for you (But feel free to use this Prism Data Explorer for any other region in the US!).
Using these graphs as visual summaries of the data, do your best to estimate the average precipitation and average temperature over the 1895-2024 period. Record your estimates here, being sure to report proper units in your answer.
Average precipitation 1895-2024:
Average temperature 1895-2024:
Do you see any long-term trends in the graphs? For example, does it seem like precipitation might be decreasing, increasing, or staying the same? How about for temperature?
Look specifically at the variation in precipitation and temperature in the graphs, which you can see as changes in values across years and decades. Do your best to estimate whether there are any visible patterns of variation in the long-term data set, and describe those patterns here.
Visually compare between precipitation and temperature trends. Do you think these climate factors might be related to each other (or correlated)? In other words, do high precipitation years tend to occur during high or low temperature years?
Looking back at your answers for 1-4, how confident would you rate yourself in your ability to detect patterns and trends in the visual summaries? And, what are some limitations of visual summaries?
On their own, visual summaries don’t allow us to fully understand our data. Numerical summaries, while helpful for quantifying exact rates of change, don’t allow us to intuitively understand trends as efficiently as visual summaries. In the rest of this lab activity, you will blend numerical and visual summaries to develop a powerful perspective on data analysis. Let’s explore these opportunities while building basic spreadsheet skills.
For the rest of this activity, we’ll use a spreadsheet to make numerical and visual summaries of NYC climate data. We’ve already downloaded the climate data from PRISM and prepared it for you. However, we haven’t arranged the data in any particular order (called “sorting” the data) so that you can practice this basic spreadsheet skill. Make a copy of our NYC climate data and save it in your Google Drive.
Metadata
Metadata are the description of the data set (data about data!). Typically, metadata include a description of each column in a spreadsheet. Here are metadata for our NYC climate data:
Location: NYC Gramercy & Baruch College neighborhood (Latitude: 40.7394, Longitude: -73.9846)
Year (1895-2024)
Precipitation (mm): Total amount of accumulated precipitation per year. Example: If there were 100 days of precipitation in one year, and each of those days had 10mm, the accumulated precipitation for the year would be 1,000mm.
Tmean (°C): Mean daily temperature, averaged over each year. Example: If today’s coldest overnight temperature (Tmin) were 10°C, and the warmest temperature (Tmax) were 20°C, the mean daily temperature would be 15°C. When the daily Tmean value is averaged across the entire year, that value is what appears in our data set.
Essential formulas for spreadsheet calculations
In this data analysis activity, as well as in many subsequent lab activities, we'll use spreadsheets to calculate numerical summaries. Following are some of the formulas we'll use throughout the semester:
Name Formula Function
Mean =AVERAGE() Calculates the average value of a data range
Median =MEDIAN() Calculates the median value of a data range
Min =MIN() Identifies the minimum value of a data range
Max =MAX() Identifies the maximum value of a data range
Sum =SUM() Calculates the sum of a data range
Organizing your spreadsheet working space & managing data
Your lab instructor will guide you through basic navigation of spreadsheets in Google Sheets, including cells, rows, columns, and worksheets. In addition, your instructor will guide you through how to sort data.
To sort the data by Year in ascending order: Click the upper-left corner of the spreadsheet (where the rows & columns meet), which will highlight all cells on the worksheet. Then, select Data > Sort Range > Advanced Range Sorting Options. A pop-up window appears, select “Data has header row”, then select which column you’d like to sort by (Year) and whether you’d like it sorted in ascending order (A->Z) or descending order (Z->A). Notice that you can add additional sorting rows, but for today’s activity just sort by Year in ascending order.
Your lab instructor may guide you through additional sorting practice to build this skill. When sorting data, always make sure you’ve highlighted the entire data set! If you highlight just the Year column to sort (without selecting all other columns), it will disassociate Year from its corresponding precipitation and temperature values.
Once you’ve practiced managing data in the spreadsheet, continue to answer the questions below.
Create numerical & visual summaries
6. Identify what year(s) each of these climate extremes occurred, using either the sorting technique or the MIN and MAX functions:
Driest year on record
Wettest year on record
Coldest year on record
Warmest year on record
7. Earlier in this lab, you estimated average precipitation and temperature for the 1895-2024 period using a visualiztion. Let’s improve our accuracy by calculating the central tendency of the climate data using spreadsheet functions. Be sure to include appropriate units in each of your answers.
Mean precipitation 1895-2024:
Median precipitation 1895-2024:
Mean Tmean 1895-2024:
Median Tmean 1895-2024:
8. In your answers for the previous question, were the median values similar to the mean values? Which of these measures is most likely to be influenced by extreme values, the mean or median? You may want to refer to the Introduction section for this lab activity.
9. Your instructor will guide you through the process of creating a scatterplot. Then create a scatterplot of Tmean vs. Year (Tmean on y-axis, Year on x-axis). Add a trendline, including the equation and R^2 (R-squared) value. Make sure your x-axis and y-axis have titles and units, and make sure your chart title accurately interprets your graph results (a graph title of “Tmean vs Year” is not sufficient; please use the format “Y-trend-with-X” discussed in the Introduction section). Include a copy of your graph with your answer. Based on the trendline and its equation, how much has Tmean changed since 1895 in NYC?
How to calculate the amount of change using a trendline equation: Go to an empty cell, type in an equal sign (this is how the spreadsheet knows you’re calculating an equation), and type in the trendline equation, substituting 1895 for x. In another cell, repeat for year 2024. Find the difference between these two values.
10. Following up on the previous trendline calculation—why would we use the trendline equation to estimate the amount of change, rather than use the individual data points from 1895 and 2024?
11. Create another scatterplot of Precipitation vs. Year, including the trendline equation and R-squared value. Title the graph. Include a copy of your graph with your answer. Based on the trendline and its equation, how much has Precipitation changed since 1895 in NYC?
12. Congratulations on building your first set of data summaries that combine visual & numerical summaries! Let’s now compare our two scatterplots. Which climate variable (Tmean or Precipitation) has a stronger correlation with Year? How can you determine that?
13. For our last graph, let’s use the long-term climate data in a different way. Instead of evaluating trends over time, let’s evaluate how change in one climate factor might influence a change in another climate factor. In other words, answer this question: How might rising temperatures influence precipitation patterns in NYC? Create your own graph to answer this question, including all of the necessary components (units, labels, title, trendline, R-square), and provide a brief interpretation or summary. Include a copy of your graph for your answer.
14. Point to Ponder: Take a moment to reflect on the overarching outcomes of your analysis. Now that you’ve quantified some basic rates of climate change in Baruch’s neighborhood, what do these results mean to you? This is an open-ended question with no correct or incorrect answer; please reflect on what you learned in this activity and what it means to you.