This page contains some sample datasets that you may wish to use for your mini-project. The datasets are divided into two categories. Click on any of these links to go to the relevant part of the page.
Pedagogy and assessment (student data)
Student MCQ responses
Student survey responses
Scientific content knowledge (data from lab experiments)
Heating of an oil bath
Atwood machine
Hypothesis testing for data with a random spread
Nuclear radiation with bananas
Genetics and inheritance in corn
The egg problem
Medical trends in Singapore
The following dataset contains modified data from students' answers to a set of questions on SLS:
Some data cleaning and pre-processing may be required:
Some students did not do the exercise.
Using the answer key, can you write a spreadsheet formula to mark each student's response and sum up the score for each student?
Hint: You should have 10 columns with a score of 1 if the answer is equal to the answer in the answer key, and 0 otherwise; as well as an 11th column that sums up these 10 columns.
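For readers who prefer a script to a spreadsheet, the marking logic in the hint can be sketched in Python as follows. The answer key, student labels and responses here are made up for illustration, and the layout of the real dataset may differ.

```python
# Hypothetical answer key for 10 questions; replace with the real key.
ANSWER_KEY = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]


def mark_response(answers, key=ANSWER_KEY):
    """Return a list of 1/0 marks (one per question) and the total score."""
    marks = [1 if given == correct else 0 for given, correct in zip(answers, key)]
    return marks, sum(marks)


def mark_class(responses, key=ANSWER_KEY):
    """Mark every student, skipping those who left every answer blank."""
    results = {}
    for student, answers in responses.items():
        if not any(answers):  # student did not do the exercise
            continue
        results[student] = mark_response(answers, key)
    return results


# Made-up responses keyed by a hypothetical student label
responses = {
    "S01": ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"],
    "S02": ["A", "B", "B", "D", "C", "B", "C", "A", "A", "B"],
    "S03": [""] * 10,  # did not do the exercise
}

scores = mark_class(responses)
for student, (marks, total) in scores.items():
    print(student, marks, total)
```

The ten 1/0 marks correspond to the ten score columns in the hint, and the total corresponds to the 11th column.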
Some possible directions of investigation:
Are there any significant differences between the classes in terms of their responses and scores?
Are there any correlations between questions (e.g. is a student who answers "A" to one question more likely to answer "C" to another question)?
Hint: you may want to use a chi-square test of independence or a similar test.
Evaluate the discrimination index and difficulty of each question.
It may also be useful to generate a plot or heatmap in which each column represents one question (sorted by difficulty) and each row represents one student.
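One possible way to compute the difficulty and discrimination index of each question is sketched below. It assumes the responses have already been marked as 1 or 0, and it uses the common convention of comparing the top and bottom 27% of students by total score. That fraction is an assumption, not something specified by the dataset, and the marks shown are made up.

```python
def item_statistics(marks, group_fraction=0.27):
    """Return a (difficulty, discrimination) pair for each question.

    marks: one list of 0/1 marks per student, one entry per question.
    difficulty: proportion of all students who answered correctly
                (a higher value means an easier question).
    discrimination: proportion correct in the top-scoring group minus
                    the proportion correct in the bottom-scoring group.
    """
    n_students = len(marks)
    n_questions = len(marks[0])
    ranked = sorted(marks, key=sum, reverse=True)  # best total score first
    group_size = max(1, round(n_students * group_fraction))
    upper, lower = ranked[:group_size], ranked[-group_size:]

    stats = []
    for q in range(n_questions):
        difficulty = sum(row[q] for row in marks) / n_students
        p_upper = sum(row[q] for row in upper) / group_size
        p_lower = sum(row[q] for row in lower) / group_size
        stats.append((difficulty, p_upper - p_lower))
    return stats


# Made-up marks for 8 students on 3 questions
marks = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
]

stats = item_statistics(marks)
for number, (difficulty, discrimination) in enumerate(stats, start=1):
    print(f"Q{number}: difficulty={difficulty:.2f}, discrimination={discrimination:.2f}")
```

Students with tied total scores are placed in the groups in their original order, so with small classes the index can shift depending on how ties fall.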
Taking this to a higher level:
Consider how you might develop question sets that make deliberate use of MCQ options to identify misconceptions in students.
The following dataset contains modified data from a course survey:
Some possible directions of investigation:
How do the three classes compare on their responses to each survey question?
Are there any correlations between the various questions?
You may face the following challenges:
Many survey questions use Likert or Likert-like scales. The resulting data are discrete, with only a few possible values per question, so many points overlap and are difficult to distinguish on a scatterplot.
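One way around overlapping points is to count how many respondents gave each combination of answers, and then plot those counts (for example as bubble sizes or heatmap colours). A minimal sketch, using made-up responses to two hypothetical 5-point questions:

```python
from collections import Counter


def pair_counts(x_responses, y_responses):
    """Count how many respondents gave each (x, y) combination of answers."""
    return Counter(zip(x_responses, y_responses))


# Made-up 5-point Likert responses to two hypothetical survey questions
q1 = [4, 5, 4, 3, 4, 5, 2, 4, 3, 4]
q2 = [4, 4, 4, 3, 5, 4, 2, 4, 3, 4]

counts = pair_counts(q1, q2)

# Print a small grid: rows are q2 (5 at the top), columns are q1 (1 to 5)
for y in range(5, 0, -1):
    print(y, " ".join(f"{counts.get((x, y), 0):2d}" for x in range(1, 6)))
```

The same table of counts is also the starting point for a chi-square test of independence between two questions.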
Taking this to a higher level:
You may want to check for mediating or moderating variables, or consider whether any pair(s) or group(s) of variables can be condensed into a single factor (e.g. using Principal Component Analysis).
The following dataset contains modified data taken for a standard lab practical. Students are expected to be able to use such data to determine the heat capacity of oil.
Some background theory may be required:
You will need to derive the formula to model the heating of the system.
Some possible directions of investigation:
Plot an appropriate scatterplot to evaluate the proposed model.
Estimate the expected range of error in the measured and calculated data, and insert appropriate error bars.
Does this expected error fully account for the deviations from the model? If not, attempt to model the deviation appropriately. Do the values obtained make sense from a real-world perspective?
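To check whether the estimated error accounts for the deviations, one option is to fit a straight line and express each residual in units of the estimated uncertainty. The sketch below assumes a linear model and a single uncertainty value for all readings. The variable names, readings and the 0.5 degree uncertainty are all made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def normalised_residuals(x, y, sigma_y):
    """Residuals from the fitted line, divided by the estimated uncertainty."""
    slope, intercept = fit_line(x, y)
    return [(yi - (slope * xi + intercept)) / sigma_y for xi, yi in zip(x, y)]


# Made-up readings: time in seconds and temperature in degrees Celsius,
# with an assumed uncertainty of 0.5 degrees on each temperature reading
time_s = [0, 60, 120, 180, 240, 300]
temp_c = [25.0, 31.5, 37.0, 41.5, 45.0, 47.5]

residuals = normalised_residuals(time_s, temp_c, sigma_y=0.5)
outside = sum(1 for r in residuals if abs(r) > 1)

for t, r in zip(time_s, residuals):
    print(f"t = {t:3d} s, residual = {r:+.2f} error bars")
print(f"{outside} of {len(residuals)} points lie more than one error bar from the line")
```

If the residuals were purely random, most would lie within about one error bar and their signs would be scattered. In this made-up example they are negative at both ends and positive in the middle, which is the kind of pattern that points to a systematic effect missing from the model.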
Bringing this back to your classroom:
Consider how you might use these insights to explain non-random errors in standard practical experiments to your students.
The following dataset contains data collected from a class of students performing the same experiment. Students are expected to be able to take 6 data points to determine the magnitude of gravitational acceleration g.
Some background theory may be required:
You will need to derive the formula to model the acceleration of the system.
Some possible directions of investigation:
Plot an appropriate scatterplot to evaluate the proposed model.
Estimate the expected range of error in the measured and calculated data, and insert appropriate error bars. In this case, you may wish to estimate the error from the actual spread of measurements in the data.
Does this expected error fully account for the deviations from the model? If not, attempt to model the deviation appropriately. Do the values obtained make sense from a real-world perspective?
If you are stuck and unable to proceed with modeling the deviation, you may wish to read this short primer on theory.
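Estimating the error from the actual spread of measurements can be sketched as below: for each experimental setting, take the mean of the repeated readings and the standard error of that mean. The setting labels and acceleration values are made up, and the grouping of the real dataset may differ.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean, s / sqrt(n)."""
    return mean(values), stdev(values) / sqrt(len(values))


# Made-up repeated acceleration measurements (m/s^2), keyed by a
# hypothetical label for each combination of masses
measurements = {
    "setting 1": [0.48, 0.52, 0.50, 0.47, 0.53],
    "setting 2": [0.95, 1.02, 0.98, 1.01, 0.99],
}

summary = {
    label: mean_and_standard_error(values)
    for label, values in measurements.items()
}
for label, (average, error) in summary.items():
    print(f"{label}: {average:.3f} +/- {error:.3f} m/s^2")
```

The standard error is a natural choice for the error bar on each plotted mean, while the sample standard deviation describes the spread of individual readings.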
Bringing this back to your classroom:
Consider how you might use these insights to explain non-random errors in standard practical experiments to your students.
The following dataset contains data taken from sensors measuring a capacitor's discharge.
Some background theory may be required:
You will need to derive the formula to model the discharge of the capacitor.
You may encounter some difficulties:
When plotting the scatterplot and inserting the trendline,
Some possible directions of investigation:
Plot an appropriate scatterplot to evaluate the proposed model.
Estimate the expected range of error in the measured and calculated data, and insert appropriate error bars.
Does this expected error fully account for the deviations from the model?
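If the discharge is assumed to follow the standard exponential decay V = V0 exp(-t / tau), taking the natural logarithm of the voltage turns the model into a straight line with gradient -1 / tau. The sketch below fits that line by least squares. The data are synthetic and noise-free, generated from assumed values of V0 = 5 V and tau = 2 s, so the fit should simply recover those values.

```python
import math


def fit_time_constant(times, voltages):
    """Fit ln(V) = ln(V0) - t / tau by least squares and return (tau, V0)."""
    logs = [math.log(v) for v in voltages]  # requires every voltage > 0
    n = len(times)
    mean_t, mean_l = sum(times) / n, sum(logs) / n
    stt = sum((t - mean_t) ** 2 for t in times)
    stl = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    slope = stl / stt
    intercept = mean_l - slope * mean_t
    return -1.0 / slope, math.exp(intercept)


# Synthetic data generated from assumed values V0 = 5 V and tau = 2 s
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
voltages = [5.0 * math.exp(-t / 2.0) for t in times]

tau, v0 = fit_time_constant(times, voltages)
print(f"tau = {tau:.3f} s, V0 = {v0:.3f} V")
```

With real sensor data, readings near the end of the discharge may be zero or negative because of noise. The logarithm is undefined for such values, so they would need to be excluded before fitting.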
Bringing this back to your classroom:
Consider how you might use these insights to explain non-random errors in standard practical experiments to your students.
The following dataset contains data taken with a GM counter, using banana peels as a low-level radioactive source. Part of the data was taken in the absence of the radioactive source to provide a measurement of background radiation.
The dataset has two tabs. You can use the "Processed data" tab. This tab includes a filter function where you can indicate what proportion of the original raw data you want to include in your analysis.
Some possible directions of investigation:
Is there a significant difference in radiation count with and without the source?
How much radiation does the source provide, on average?
Does changing the number of readings (e.g. filter only 50% of the raw data) affect the statistical significance? Think about what this means in the context of statistics and evidence.
Select multiple random samples, each a fraction of the total dataset, and obtain the mean of each sample. Are these sample means distributed as the central limit theorem predicts?
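The questions above can also be explored outside the spreadsheet. The sketch below uses a permutation test, which is one of several possible significance tests and is chosen here only because it needs no statistical tables, and it draws repeated random subsamples to look at the distribution of their means. The counts are made up and the real GM-counter data will differ.

```python
import random
from statistics import mean, stdev


def permutation_p_value(sample_a, sample_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign readings to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


def subsample_means(data, sample_size, n_samples=1000, seed=0):
    """Means of repeated random subsamples drawn without replacement."""
    rng = random.Random(seed)
    return [mean(rng.sample(data, sample_size)) for _ in range(n_samples)]


# Made-up counts per counting interval
background = [12, 15, 11, 14, 13, 10, 16, 12, 13, 14, 11, 15]
with_source = [15, 18, 14, 17, 16, 19, 15, 17, 18, 16, 14, 17]

p = permutation_p_value(background, with_source)
print(f"difference in means = {mean(with_source) - mean(background):.2f}, p = {p:.4f}")

means = subsample_means(with_source, sample_size=6)
print(f"spread of individual readings = {stdev(with_source):.2f}")
print(f"spread of subsample means     = {stdev(means):.2f}")
```

Repeating the test on a smaller fraction of the readings shows how the p-value depends on the amount of data. A histogram of the subsample means should look roughly bell-shaped and narrower than the spread of individual readings.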
Bringing this back to your classroom:
Consider these insights when discussing stochastic experiments (e.g. radiation counts, thermal physics).
Consider how you might design a STEM experiment for students using data.
The following dataset contains data representing the number of corn kernels partitioned by colour and texture.
The data is taken from https://bhcc.digication.com/igor_popovich/Lab_report_4._Corn_Genetics. This page also includes some ideas for conducting this as a lesson.
Some possible directions of investigation:
Evaluate whether the two variables (colour and texture) are independent.
Evaluate whether the distribution follows Mendel's laws (for example, the 9:3:3:1 ratio expected from a dihybrid cross).
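Both questions can be approached with chi-square tests. The sketch below assumes a dihybrid cross with an expected 9:3:3:1 ratio, uses made-up kernel counts with assumed category names, and applies no continuity correction. The p-value function uses closed-form expressions that hold only for 1 and 3 degrees of freedom, which are the two cases needed here.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_value(statistic, dof):
    """Upper-tail p-value; closed forms valid only for dof = 1 or 3."""
    z = statistic / 2
    if dof == 1:
        return math.erfc(math.sqrt(z))
    if dof == 3:
        return math.erfc(math.sqrt(z)) + 2 * math.sqrt(z / math.pi) * math.exp(-z)
    raise ValueError("this sketch only handles dof = 1 or 3")


# Made-up counts, in the assumed order:
# purple smooth, purple wrinkled, yellow smooth, yellow wrinkled
observed = [270, 95, 90, 25]
total = sum(observed)

# Goodness of fit against the assumed 9:3:3:1 ratio (3 degrees of freedom)
expected_ratio = [total * r / 16 for r in (9, 3, 3, 1)]
fit_stat = chi_square_statistic(observed, expected_ratio)
fit_p = chi_square_p_value(fit_stat, dof=3)

# Independence of colour and texture in a 2x2 table (1 degree of freedom)
rows = [observed[0] + observed[1], observed[2] + observed[3]]  # by colour
cols = [observed[0] + observed[2], observed[1] + observed[3]]  # by texture
expected_indep = [r * c / total for r in rows for c in cols]
indep_stat = chi_square_statistic(observed, expected_indep)
indep_p = chi_square_p_value(indep_stat, dof=1)

print(f"9:3:3:1 fit:  chi2 = {fit_stat:.3f}, p = {fit_p:.3f}")
print(f"independence: chi2 = {indep_stat:.3f}, p = {indep_p:.3f}")
```

A large p-value means the counts are consistent with the hypothesis being tested. It does not prove the hypothesis.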
Bringing this back to your classroom:
Consider how you might design a STEM experiment for students using data, or use this as a learning experience for students.
The following dataset contains data describing the number of eggs found in fleas of various sizes.
The data is taken from here: http://www.biostathandbook.com/linearregression.html
Some possible directions of investigation:
Is the weight of the fleas normally distributed? Is the number of eggs normally distributed?
The hypothesis suggests that the number of eggs is proportional to the weight of the flea.
- Does the data support this hypothesis? What is the ratio of explained to unexplained variance?
- Are the data points randomly/normally scattered around the model?
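The ratio of explained to unexplained variance can be obtained from a least-squares fit, as sketched below. The weights and egg counts are made up for illustration and are not taken from the linked dataset.

```python
def regression_summary(x, y):
    """Least-squares fit of y = slope * x + intercept, with variance split."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_unexplained = sum(r ** 2 for r in residuals)
    ss_explained = ss_total - ss_unexplained
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": ss_explained / ss_total,
        "explained_to_unexplained": ss_explained / ss_unexplained,
        "residuals": residuals,
    }


# Made-up weights and egg counts
weight = [5.0, 6.0, 7.0, 8.0, 9.0]
eggs = [20, 24, 25, 30, 31]

result = regression_summary(weight, eggs)
print(f"eggs = {result['slope']:.2f} * weight + {result['intercept']:.2f}")
print(f"R^2 = {result['r_squared']:.3f}")
print(f"explained / unexplained = {result['explained_to_unexplained']:.1f}")
```

A strictly proportional relationship would pass through the origin, so the fitted intercept is worth inspecting alongside R^2. The returned residuals can be plotted against weight to judge whether the points scatter randomly around the line.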
There are many public datasets for medical statistics in Singapore, for example:
Consider studying the relationship between these datasets and other datasets, such as:
Vaccination statistics: https://www.moh.gov.sg/covid-19/statistics
Singapore Tourism Analytics Network: Monthly visitor arrivals - https://stan.stb.gov.sg/content/stan/en/tourism-statistics.html
Dept. of Statistics, Singapore: International visitor arrivals - https://www.singstat.gov.sg/publications/reference/ebook/industry/tourism