Generating Faux Data

Overview:

In this class, you may use faux data as long as it is (a) plausible given the parameters of the system and (b) plausible given the current state of the research literature. These instructions will walk you through the creation of faux data that could be used for a dashboard to present a snapshot in time. If you are interested in a dashboard that will demonstrate common demographic differences, I suggest you start here and work backwards to then generate the kinds of data files that you would use to generate a live dashboard. Instructions here are given in Excel, but the process would be very similar in other systems and the functions/formulas will also be very similar in Google Sheets.

Step 1: Consider what similar data might look like.

What learning metrics are you hoping to model, and what are their typical values? What kind of data is typically used to measure these constructs?
- For example, if you're generating faux data from an established survey that uses a Likert scale, your faux numbers should reflect that!
- You must pick appropriate measure types for the kind of thing you're trying to measure. For example, no heart-rate sensor will tell you what a student is thinking, so you should consider more plausible uses for that kind of measure (e.g., emotional valance).
- Make sure you're familiar with which values are typically indicative of good/strong performance, and what values are common when students needs are not being met?
Once you have chosen your primary measures, consider what other metrics might these interact with. Is there research out there to back this up? How strong are these relationships typically?
What demographic factors might influence these metrics? Consider going beyond the typical census data (which are proxies for other types of information about your students), but since these are often preferred/required in more global it's good to include them as well. What other student-level characteristics might you want to use to generate and analyze your data (e.g., neurodivergence, disability, regionality, etc.)
You can check some of the online data sources we have already provided, and you might even blend these with your own unique data. FOR THIS CLASS, that is fine, as long as you appropriately cite your data.
Decide whether you'll be doing a live dashboard or a snapshot.

Step 2: Decide how you will deal with population-related issues

Decide how many students you are going to include. You might consider a power analysis.
Decide how balanced you would like your data to be for demographic categories. In general, statisticians and data scientists want full representation, but that isn't always possible in the real world. If you want to emulate this issue so that your dashboard can demonstrate different approaches to this issue, that is a completely justifiable thing to do. (For example, you might want to show how the dashboard handles reporting of things like standard deviations when there are only 3 kids in a class or other relevant sub-population.)
Decide how much variation you want in your data overall, including variation that you might choose to introduce based on demographics. Remember that we rarely see cases where "girls always do X, and boys always do Y." There is often considerable overlap between demographic groups even when there are demographic-based trends. However, something like colorblindness might be more prevalent in boys than in girls. If you think that might haveMy fr an impact on their grades in art class, for example, generate a category for colorblindness before you proceed with your generation of art grades.

Step 3: Consider data type and normalization issues

Do you have both categorical and continuous data? (yes)
If you are converting categorical data to dummy variables, be sure to document that.
How might different scales affect your metric generation or feature extraction processes? But then decide how to display to novices if you had to z-score items

Step 4: Get familiar with Excel functions that might help you

General Hints:
Basic Excel formulas for generating and exploring relationships in your data:
More complicated formulas:
- PivotTables:
- VLookup:
- HLookup:
- Xlookup:
- If vs. VLookup vs. XLookup:
- Index and Match:
- Unpivot Data (1/3):
- Indirect Function (Pulling the same data from nearly identical sheets):
- 14 Newer functions (TextSplit, TextBefore, TextAfter, Vstack, HStack, TOROW, TOCOL, WrapRows, WrapCol, TAKE, DROP, ChooseRows, ChooseCols, Expand):
- Random Sampling:
Conditional formatting:
Creating Charts (you should check your work as you go!)
- https://www.youtube.com/watch?v=lUjZa4YBixI&pp=ygUZRXhjZWwgQ2FtcHVzIC0gSm9uIHJhbmRvbQ%3D%3D

Step 5: Consider scale, interpretability, and data normalization issues

Remember that humans struggle with really large and really small numbers.
If you're dealing with metrics of wildly different scales, you may need to normalize that data.
The units and measurements you use for preliminary calculations may not be meaningful to your audience. For example, telling a teacher that the average student performance on a test is 85 doesn't help them decide what to do if there is a bimodal distribution in those scores. Likewise, decimal places applied to people might also be problematic (e.g., "on average, 3.1 of the students in your class are likely to have autism"). If your audience doesn't understand standard deviation (one way to normalize data), you might use the SD for your calculations but present it more meaningful units in the dashboard.
You can (and should!) come back and deal with some of these issues after your initial data generation, but it helps to plan ahead for these kinds of issues

Step 6: Generating Naturalistic Data using a combination of deliberate labels and randomization formulas

If you're hoping to generate differences based on demographic categories, build these categories first. For ease in formula calculations later, you may consider having both a text column and a redundant numerical variable for these categories. Be sure to consider the degree of overlap you want for demographic categories and for other student level characteristics (e.g., race and SES, or SES and ADA categories).
Once these are generated, randomly assign a unique student ID to each student.
- There are many ways to do this, using one of Excel's formulas for randomization can help. For example, if you want insert a column using the RANDBETWEEN function, it will prove you with a random number between any two integers you specifiy.
- However, every time you do something on the sheet, the RANDBETWEEN function will update these numbers. So one thing to do is to copy the RANDBETWEEN column and then use the "Paste Special" function and select "values" when you paste it into a new column.
- You can then sort by these randomlly variables, BUT, you won't know for sure that there are not duplicates in this column, especially if you're working with a very large number of students. I recommend creating a 3rd column and using autofill to create something like "Student#001, Student#002, etc." to generate you final unique ID.
After assigning students a unique ID, pick a variable you would like to design first.
- If your first learning variable that would not be affected by demographics, use one of Excel's random functions to assign values.
  - For example, you could use RANDBETWEEN(400-750) to assign a student a random score on the Verbal part of the SAT test and you would probably get a reasonable amount of variation
  - If you want most of you students to be within a certain range (say 500-600), but a smaller proportion of them to be in a slightly higher or slightly lower range, you could randomly assign three groups (using the RANDBETWEEN(1,3) function) in a first column, then you could use If-functions to assign ranges of 400-600 to group 1, 500-600 to group 2, 500-750 to group 3. You might play with the number of groups and the range you assign until you get natural looking distributions.
  - Or, you could pick a less obvious place to start. Something like "age basic reading was mastered" to generate a value of 5-9 years old is not a typical measurement included in school reports (especially at the college level), but could be used to generate noise in more typical learning measures since this kind of data is known to be predictive.
- If your first learning metric is something that you would like to be affected by demographics, there are several ways to ensure that you create that variation without producing fully siloed (not realistic data)
  - If your data set is relatively small and you're only dealing with a binary demographic variable (e.g., free and reduced price lunch, an indicator of SES), you might just sort the data and apply different RANDARRAY functions to the two categories of students.
  - If your data is larger, a combination of if-functions and RANDARRAY functions can help you to generate slightly different ranges for different populations. For example, you might typically see an advantage in students GPA if they are not dealing with poverty, so students with a
  - See videos on how to apply these formulas. You might be able to embed them within one another, but in some cases, it might be easier to generate several columns, each with its own formula, to get to your final goal.

Checklist

Do your measures align closely with the design of the learning environment you have chosen?
Do any additional measures (those not closely aligned with the learning environment design) follow closely with what is possible to detect from the literature?
Are there appropriate relationships (as justified by the literature) in your data?
Is your data (and each variable) at an appropriate scale?
Is your data a valid measurement of what you propose to visualize/present?
IF USING EXCEL: When you're happy with the relationships with your data, did you COPY&PASTE-SPECIAL, selecting the "VALUES" only C&P option? You want to do this or your data will continue to change every time you make an edit to your spreadsheet.

Page updated

Google Sites

Report abuse