Below you will find copies or links to our most important supplemental documents.
The tool that we created is a data analysis program coded in Python to determine the relationships between airborne pollen counts and environmental factors such as climate and air pollution. The full code can be found on Github: https://github.com/trschaeffer/PollenTool
We have also written a detailed User’s Guide including video tutorials so that anyone can learn how to use the tool. It can be accessed from this link:
https://docs.google.com/document/d/1AO7KPz4ybl1cILyjgckGVQ38X49zzYsnGo60tWddggA/edit?usp=sharing
The development of the pollen tool began as a segmented process where the individual functions were written before implementation into the user interface (UI). The three main tabs of the UI: Data Entry, Correlation, and Plotting, became three separate python files which are imported into the main UI file. The goal of the data entry file is reading and merging data that the user inputs to the master Excel file using the openpyxl library. The correlation functions are able to filter the data into a format that can be analyzed using Spearman’s correlation methods. The plotting functions are able to process the data and generate plots using the python library MatPlotLib, including plotting of various forms of regression.
Our data analysis tool was built with a set of robust logic so that it would be flexible and applicable to as many situations as possible. Below are some of the decisions that we made to ensure that the tool could be used for all of our collaborator’s data.
We have created flexibility in the user’s ability to import data. The tool will accept .xls and .xlsx files, and the format in which data is stored in these files can range. Additionally, the user has the option to import a new data file that they have created and add it to their master data set, or load in the master data set. The master data set consists of all of the data they have imported in the past. This allows all of the different types of data, such as pollution, climate, and pollen, to be stored all in one place for easy access by the tool. Since the daily pollen counts are currently collected manually, we have also created the option to enter individual data points to the master data set using the tool and calculate the total pollen per day.
The tool is also able to calculate correlations and generate graphs based on various conditions the user can opt to use. The user may filter the data to include only certain dates, specify the pollen season, calculate monthly or yearly averages, and plot and extend regression lines into the future.
D.2.1 Data entry Logic
One of the innovative features of our tool is its ability to continually be updated with data over time, allowing more accuracy in its data processing over time. The challenge lies in the variance of formatting styles, thus, our tool must be flexible to receive and understand these differences. We received files with three different formats and two different file types. While these different formats initially provided a challenge, it allowed us to enhance the adaptability of our program. A .xlsx format with the dates in mm/dd/yyyy format was used in the spreadsheet that will contain all data, a spreadsheet we refer to as the “mastersheet”. When importing data to the mastersheet, the tool tries to identify the format of the data file by checking cells ‘A2’, ‘A3’, and ‘B1’ for a date, whether it be a written out month or a formatted date. If written-out months are given, the year is extracted from ‘A1’. The program will then use the orientation of the dates to continue to extract the category information and data out of the file. With this, as well as several functions to merge, and filter data by date and category, we were able to get all data into a single format and onto the mastersheet. This mastersheet is loaded and updated whenever a new file is merged into it.
D.2.2 Merging, filtering, and maintaining data integrity
When a spreadsheet is created it contains three key elements that our program must be able to handle. It contains categories, data, and dates. These categories are filled using the above techniques. Categories will be a list N long, dates will be a list M long, and data will be a list of lists, N lists of M length.
The first operation performed on this dataset is to remove any categories and dates with no data in them. Any cells that are blank, contain a “-”, “--”, or “nan” are considered to be empty and are turned into NoneType. It is best to remove these empty sets so the user will not be able to select an empty category to be analyzed, which would result in absence of results. The dates containing no data were also removed so that the program could run faster and not increase file sizes unnecessarily. Whenever a date or category is deleted from the two lists, special care is taken that the main data list is also resized so that it can continue to align its index values to the indexes of the dates and the categories. Whenever a category or date is deleted, a string is generated describing this action and can be reported to the user.
The second main operation that is performed after selecting a file to browse is to merge this file into the master data set. The idea behind this is it will give the user the ability to interface with all available data at once. Similar categories are brought together, as well as similar dates. To properly merge categories together the category with the shorter name is compared against the longer category, concatenated to the same length. Any data within these categories is also merged, with the data from the new file overwriting the data from the masterfile. This is done so that if there were to be a mistake in the master spreadsheet, one could simply fix the original spreadsheet then import the data once again, as opposed to manually editing the mastersheet as well as the original file. “Total pollen” is a special category that will always overwrite all data in its category. This is done so that if a former pollen category were made a non-pollen category it would not allow any prior data to remain.
If a new category must be added, it is simply put on the end of the list of categories. While this process could be done alphabetically, for us it makes more sense this way because pollen allergies can be loaded into the master sheet first, followed by other factors, allowing the most important data to always appear first in drop-down menus. Upon merging files, the interface will generate a prompt asking the user if the new categories are pollen categories. This is done so that total pollen per day can be calculated based on the pollen count of any given day. To note a pollen category, a marker, “(p)” is added to the end of its name. The marker is always filtered out before presenting data to the user. To view and edit what is a pollen category, the user is able to list all categories along with their status as a pollen category. Dates are merged chronologically, and future dates are appended to the ends of the categories. Special care is taken throughout these processes to ensure that the data is kept associated with the correct dates and categories.
D.2.3 Special treatment of “Pollen Categories”
There are several differences in the way that data is represented between our collaborator and the other institutes whose data we used. When the pollen count is zero, that category is left blank. When weather data is not taken, the cell is left blank or with a ‘-’. This would cause some issues for the tool because it raises the question, “how should a blank or ‘-’ be dealt with to keep the data honest?” The resolution our group created was by making a distinction between the treatment of pollen and non-pollen categories.
When the user merges a file, just before the new categories are appended to the end of the master data set, the user is given a checklist of the new categories, and asked to check off which are pollen types. This begins the special treatment of the pollen category. A ‘(p)’ is added to the end of the category’s name to mark it as a pollen category, and -10000 is added in every blank slot of the data. The -10000 is treated as zero by the UI, and simply marks that the data was added artificially. This allows the removal of pollen categories, along with any zeros the tool has added to the data. There is a button in the Data entry section of the UI allowing the user to switch categories between being a pollen or non-pollen category using the same dialog as before.Having the user perform this task also allows the creation of a total pollen category, which is the sum of all pollen categories. There is a seperate button to calculate this category initially, and whenever a pollen category is modified, it will automatically be recalculated.
The program creates an array for each variable using “numpy.array”, then uses the function “spearmanr”, available from the SciPy library in Python, to calculate Spearman’s correlation coefficients. We used the “polyfit” function of the in the numpy library to create linear regression for each relationship. Polynomial regression applies the same “polyfit” function with some different parameters. LOESS regression is done using the “loess” function of the statsmodel library.
In order to make the data analysis functions of the tool more flexible, we decided to implement some optional advanced settings for organizing the data. Since there is such a large number of data points available to cover the many years of data collection, the graphs can easily become very messy and difficult to interpret. As a means of simplifying the analysis process, we have made it an option for the user to graph the averages of the data across years or months and to graph these averages. This option is also available for the correlation calculations so the user can determine the appropriate spearman’s correlation coefficient values for the data they wish to analyze. If used, when the user prompts for the correlation calculation or creation of the graph, the data is read in by the tool and it calculates averages of the data points. The tool goes through every date in the file, unless a specific date range is specified (see section below), groups the dates and corresponding data point for each time period, calculates the average for each group of data, and reports them in a chronological array used for the graphing function or correlation calculation. This makes it much easier to visualize and analyze the data over a period of several years. Simplifying the data in this way may also bring to light some patterns in yearly or monthly meteorological or pollution trends that influence the pollen season.
Other options available to the user are selecting to correlate or graph data within only certain dates in the pollen season or a certain date range. If the user chooses to select these options, then only data within the specified date range will be included in the calculations or graphs. To do this, the tool reads in each date, determines if it is within the specified date range, makes arrays of only the corresponding data points within that range, and then completes the calculations and graphing functions using these filtered arrays. Any combination of optional settings is able to be processed by the tool. These features are useful because the user may want to look at only certain dates within the pollen season, or a certain month, week, or year.
Python is not a language that is made to produce deliverable products to those with little technical experience. This shortcoming is exasperated by our remote presence, meaning that our team could not be present to assist any installation or technical guidance. Any user loading python packages is typically expected to install Python, collect dependencies from the Python terminal, and finally launch the file through the same terminal. As an alternative to this, we use the Pyinstaller package, a tool that is made to create a single executable file with all dependencies, even Python, included. While it creates a large file that may take a while to load and be scanned by antivirus softwares, this is the best option for our work, as this file can be run on any computer with a single click.
Getting Pyinstaller to properly work was challenging. The process involved finding all of the missing “hidden” imports. The steps taken to get the pyinstaller tool to work were documented and put into a text file so that this process may be repeated in the future. Unfortunately, we found pyinstaller to be extremely unreliable in transferring from computer to computer, so results may vary with our process of using it to successfully create a .exe.
Full responses can be found here: https://docs.google.com/spreadsheets/d/1Uz006amzxgDYhebPlFsI1P-A7xcQ21FSuYZek_bERhM/edit?usp=sharing
Full analytics: https://docs.google.com/forms/d/16W9Y1x7yro82tOM7g26Kdhv2QR3pUJpnJN3tqQotyl4/viewanalytics