Sample datasets

Sample Datasets for hands-on projects

This page contains some sample datasets that you may wish to use for your mini-project. The datasets are divided into two categories. Click on any of these links to go to the relevant part of the page.

Pedagogy and assessment (student data)

Student MCQ responses
Student survey responses

Scientific content knowledge (data from lab experiments)

Modeling systematic error:

- Heating of an oil bath
- Atwood machine
- Capacitor discharge

Hypothesis testing for data with a random spread:

- Nuclear radiation with bananas
- Genetics and inheritance in corn
- The egg problem

Open investigation:

- Medical trends in Singapore

Pedagogy and assessment

Student MCQ responses

The following dataset contains modified data from students' answers to a set of questions on SLS:

Student MCQ data

Some data cleaning and pre-processing may be required:

- Some students did not do the exercise.
- Using the answer key, can you write a spreadsheet formula to mark each student's response and sum up the score for each student?
  Hint: You should have 10 columns, each scoring 1 if the student's answer matches the answer key and 0 otherwise, plus an 11th column that sums these 10 columns.
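The marking step in the hint can also be sketched outside a spreadsheet. The Python below is a minimal illustration only: the answer key, student names, and responses are made-up placeholders, not values from the dataset, and treating a row of blank answers as "did not do the exercise" is an assumption.

```python
# Hypothetical answer key and responses; replace with the real spreadsheet export.
ANSWER_KEY = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]

responses = {
    "student_01": ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"],
    "student_02": ["A", "B", "B", "D", "C", "B", "C", "A", "A", "B"],
    "student_03": [""] * 10,  # assumed representation of a student who did not attempt
}


def mark(answers, key):
    """Return one mark per question: 1 if the answer matches the key, else 0."""
    return [1 if given == correct else 0 for given, correct in zip(answers, key)]


def attempted(answers):
    """Treat a student as having attempted the exercise if any answer is non-blank."""
    return any(a.strip() for a in answers)


# The 10 per-question columns, for students who attempted the exercise only.
marks = {name: mark(ans, ANSWER_KEY) for name, ans in responses.items() if attempted(ans)}

# The 11th column: total score per student.
totals = {name: sum(row) for name, row in marks.items()}

for name, total in totals.items():
    print(name, marks[name], total)
```

Dropping non-attempts before scoring keeps them from being counted as zero scores, which would otherwise pull down class averages.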

Some possible directions of investigation:

- Are there any significant differences between the classes in terms of their responses and scores?
- Are there any correlations between questions (e.g. is a student who answers "A" to one question more likely to answer "C" to another)?
  Hint: You may want to use a chi-square test of independence or a similar test.
- Evaluate the discrimination index and difficulty of each question.
  It may also be useful to generate a plot or heatmap where each column represents one question, ordered by difficulty, and each row represents one student.
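One common way to compute these two item statistics is sketched below. It assumes the responses have already been marked into rows of 1s and 0s. It takes difficulty as the proportion of students answering correctly, and discrimination as the difference in proportion correct between the top and bottom scoring groups. The 27% group size is a widely used convention rather than something this page prescribes, and the marks shown are placeholders.

```python
def difficulty_index(score_rows, q):
    """Proportion of students who answered question q correctly (higher = easier)."""
    return sum(row[q] for row in score_rows) / len(score_rows)


def discrimination_index(score_rows, q, fraction=0.27):
    """Proportion correct in the top-scoring group minus that in the bottom group."""
    ranked = sorted(score_rows, key=sum, reverse=True)  # rank students by total score
    n = max(1, round(len(ranked) * fraction))           # size of each comparison group
    upper, lower = ranked[:n], ranked[-n:]
    return (sum(r[q] for r in upper) - sum(r[q] for r in lower)) / n


# Placeholder marks: one row per student, one column per question.
score_rows = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

n_questions = len(score_rows[0])

# Column order for a heatmap: easiest question first.
order = sorted(range(n_questions), key=lambda q: difficulty_index(score_rows, q), reverse=True)

for q in order:
    print(q, round(difficulty_index(score_rows, q), 2), discrimination_index(score_rows, q))
```

A question with a discrimination index near zero (or negative) is answered about equally well by strong and weak students, which may be worth a closer look.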

Taking this to a higher level:

Consider how you might develop question sets that make deliberate use of MCQ options to identify misconceptions in students.

Student survey responses

The following dataset contains modified data from a course survey:

Student survey data

Some possible directions of investigation:

How do the three classes compare on their responses to each survey question?
Are there any correlations between the various questions?

You may face the following challenges:

Many survey questions use Likert or Likert-like scales. The resulting data is coarse and discrete, so many responses land on the same few values and overlap when plotted on a scatterplot.
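One possible workaround is to count how many students gave each pair of responses and display the counts as a table, heatmap, or bubble plot. The sketch below assumes responses coded 1 to 5; the two response lists are placeholders, not survey data.

```python
from collections import Counter

# Placeholder responses to two survey questions, coded 1 (low) to 5 (high).
q1 = [4, 5, 4, 3, 4, 2, 5, 4]
q2 = [4, 4, 5, 3, 4, 2, 5, 3]

# Count how many students gave each (q1, q2) combination.
pair_counts = Counter(zip(q1, q2))

# Print the counts as a contingency table: rows are q1 levels, columns are q2 levels.
levels = range(1, 6)
print("q1\\q2 " + " ".join(f"{b:>3}" for b in levels))
for a in levels:
    print(f"{a:>5} " + " ".join(f"{pair_counts[(a, b)]:>3}" for b in levels))
```

The same counts can set the marker size in a bubble plot, so that overlapping points become visible.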

Taking this to a higher level:

You may want to check for mediating or moderating variables, or consider whether any pairs or groups of variables can be condensed into a single factor (e.g. using Principal Component Analysis).

Modeling systematic error

Oil bath heating experiment

The following dataset contains modified data taken for a standard lab practical. Students are expected to be able to use such data to determine the heat capacity of oil.

Heat capacity experiment data

Some background theory may be required:

You will need to derive the formula to model the heating of the system.

Some possible directions of investigation:

Plot an appropriate scatterplot to evaluate the proposed model.

Estimate the expected range of error in the measured and calculated data, and insert appropriate error bars.

Does this expected error fully account for the deviations from the model? If not, attempt to model the deviation appropriately. Do the values obtained make sense from a real-world perspective?
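One way to check whether the expected error accounts for the deviations is to fit the proposed model and inspect the residuals. The sketch below assumes a straight-line model of temperature against time, which would follow from constant heating power with no heat loss. The readings and the 0.1 °C reading uncertainty are placeholders, not values from the dataset.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder readings (time in s, temperature in deg C), not the real dataset.
time_s = [0, 60, 120, 180, 240, 300]
temp_c = [25.0, 29.8, 34.2, 38.2, 41.8, 45.0]
reading_error_c = 0.1  # assumed thermometer uncertainty

slope, intercept = fit_line(time_s, temp_c)

# Residual = measured value minus the value predicted by the fitted line.
residuals = [t - (slope * x + intercept) for x, t in zip(time_s, temp_c)]

for x, r in zip(time_s, residuals):
    flag = "exceeds error bar" if abs(r) > reading_error_c else ""
    print(f"{x:>4} s  residual {r:+.3f}  {flag}")
```

Residuals that scatter randomly within the error bars suggest the model is adequate. A smooth pattern, such as negative at both ends and positive in the middle as in these placeholder values, points to a systematic effect that the model leaves out.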

Bringing this back to your classroom:

Consider how you might use these insights to explain non-random errors in standard practical experiments to your students.

Atwood machine experiment

The following dataset contains data collected from a class of students performing the same experiment. Students are expected to be able to take 6 data points to determine the magnitude of gravitational acceleration g.

Atwood Machine data

Some background theory may be required:

You will need to derive the formula to model the acceleration of the system.
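As a starting point, the ideal-case derivation can be sketched as follows. It assumes a massless, frictionless pulley and a light inextensible string. The symbols are introduced here for illustration: $m_1 > m_2$ are the two hanging masses, $T$ is the string tension, and $a$ is the magnitude of the acceleration.

```latex
\begin{align}
m_1 g - T &= m_1 a \\
T - m_2 g &= m_2 a \\
\Rightarrow\quad a &= \frac{m_1 - m_2}{m_1 + m_2}\, g
\end{align}
```

Under these assumptions, a plot of $a$ against $(m_1 - m_2)/(m_1 + m_2)$ should be a straight line through the origin with gradient $g$.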

Some possible directions of investigation:

Plot an appropriate scatterplot to evaluate the proposed model.

Estimate the expected range of error in the measured and calculated data, and insert appropriate error bars. In this case, you may wish to estimate the error from the actual spread of measurements in the data.

Does this expected error fully account for the deviations from the model? If not, attempt to model the deviation appropriately. Do the values obtained make sense from a real-world perspective?
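Estimating the error from the spread of repeated measurements can be sketched as below. The sketch assumes the class data can be grouped by experimental setting. The dictionary keys (hypothetical values of the mass ratio) and the acceleration values are placeholders, not data from the sheet.

```python
from math import sqrt
from statistics import mean, stdev

# Placeholder: accelerations (m/s^2) measured by several students at the same setting.
# Keys are hypothetical values of (m1 - m2) / (m1 + m2).
measurements = {
    0.05: [0.47, 0.45, 0.50, 0.46],
    0.10: [0.93, 0.97, 0.95, 0.91],
}

summary = {}
for setting, values in measurements.items():
    centre = mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(number of readings).
    std_error = stdev(values) / sqrt(len(values))
    summary[setting] = (centre, std_error)

for setting, (centre, std_error) in summary.items():
    print(f"{setting}: {centre:.3f} +/- {std_error:.3f}")
```

The standard error is a natural choice for the error bar on each mean. The sample standard deviation is the better choice if individual readings are plotted instead.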

If you are stuck and unable to proceed with modeling the deviation, you may wish to read this short primer on theory.

Bringing this back to your classroom:

Consider how you might use these insights to explain non-random errors in standard practical experiments to your students.

Capacitor discharge experiment

The following dataset contains data taken from sensors measuring a capacitor's discharge.

RC discharge data

Some background theory may be required:

You will need to derive the formula to model the discharge of the capacitor.
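A sketch of the standard result follows. It assumes a capacitor of capacitance $C$ discharging through a single resistance $R$, with $V_0$ the voltage at $t = 0$. These symbols are introduced here for illustration, and the actual circuit used for the dataset may differ.

```latex
\begin{align}
\frac{dV}{dt} &= -\frac{V}{RC} \\
V(t) &= V_0 \, e^{-t/(RC)} \\
\ln V(t) &= \ln V_0 - \frac{t}{RC}
\end{align}
```

Under these assumptions, a plot of $\ln V$ against $t$ should be a straight line with gradient $-1/(RC)$ and intercept $\ln V_0$.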

You may encounter some difficulties:

When plotting the scatterplot and inserting the trendline,

Some possible directions of investigation:

Plot an appropriate scatterplot to evaluate the proposed model.

Estimate the expected range of error in the measured and calculated data, and insert appropriate error bars.

Does this expected error fully account for the deviations from the model?

Bringing this back to your classroom:

Consider how you might use these insights to explain non-random errors in standard practical experiments to your students.

Hypothesis testing for data with a random spread

Nuclear radiation with bananas

The following dataset contains data taken with a GM counter, using banana peels as a low-level radioactive source. Part of the data was taken in the absence of the radioactive source to provide data for background radiation.

Copy of Provided data: Nuclear banana

The dataset has two tabs. You can use the "Processed data" tab. This tab includes a filter function where you can indicate what proportion of the original raw data you want to include in your analysis.

Some possible directions of investigation:

- Is there a significant difference in radiation count with and without the source?
- How much radiation does the source provide, on average?
- Does changing the number of readings (e.g. filtering only 50% of the raw data) affect the statistical significance? Think about what this means in the context of statistics and evidence.
- Select multiple random samples, each a fraction of the total dataset, and obtain the mean of each sample. Is the distribution of these sample means consistent with the central limit theorem?
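Two of these investigations can be sketched in a few lines of Python. The counts below are placeholders rather than values from the dataset. Welch's t statistic is one reasonable choice for comparing the two means, not necessarily the test the course intends. Turning the statistic into a p-value needs the t distribution, which is not computed here.

```python
import random
from math import sqrt
from statistics import mean, stdev, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def sample_means(data, sample_size, n_samples, rng):
    """Means of repeated random subsamples, each drawn without replacement."""
    return [mean(rng.sample(data, sample_size)) for _ in range(n_samples)]


# Placeholder counts per counting interval, not the real readings.
background = [12, 15, 11, 14, 13, 12, 16, 13, 12, 14]
with_source = [15, 18, 16, 17, 14, 19, 16, 15, 18, 17]

t_stat = welch_t(with_source, background)
net_rate = mean(with_source) - mean(background)  # average contribution of the source
print(f"t = {t_stat:.2f}, net count per interval = {net_rate:.2f}")

# Spread of the sample mean for two subsample sizes; a fixed seed keeps runs repeatable.
rng = random.Random(0)
means_small = sample_means(with_source, 3, 500, rng)
means_large = sample_means(with_source, 6, 500, rng)
print(f"spread of means: n=3 -> {stdev(means_small):.2f}, n=6 -> {stdev(means_large):.2f}")
```

A histogram of `means_small` and `means_large` should show the sample means clustering more tightly, and looking more bell-shaped, as the subsample size grows.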

Bringing this back to your classroom:

Consider these insights when discussing stochastic experiments (e.g. radiation counts, thermal physics).
Consider how you might design a STEM experiment for students using data.

Genetics and inheritance with corn

The following dataset contains data representing the number of corn kernels partitioned by colour and texture.

Corn inheritance

The data is taken from https://bhcc.digication.com/igor_popovich/Lab_report_4._Corn_Genetics. This page also includes some ideas for conducting this as a lesson.

Some possible directions of investigation:

Evaluate whether the two variables (colour and texture) are independent.
Evaluate whether the distribution follows Mendel's law.
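Both questions can be approached with a chi-square statistic, sketched below. The kernel counts and their category order are placeholders. The 9:3:3:1 expected ratio assumes a dihybrid cross between two heterozygous parents, which should be checked against the source lab report. The critical values are the standard 5% values for 3 and 1 degrees of freedom.

```python
def chi_square_gof(observed, ratio):
    """Chi-square goodness-of-fit statistic against an expected ratio."""
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_independence(table):
    """Chi-square statistic for independence of rows and columns in a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total  # expected count if independent
            stat += (obs - exp) ** 2 / exp
    return stat


# Placeholder counts; assumed order: purple smooth, purple wrinkled,
# yellow smooth, yellow wrinkled.
counts = [96, 28, 27, 9]

gof = chi_square_gof(counts, [9, 3, 3, 1])       # 4 categories -> 3 degrees of freedom
table = [counts[:2], counts[2:]]                 # rows: colour, columns: texture
independence = chi_square_independence(table)    # 2x2 table -> 1 degree of freedom

CRITICAL_5PCT_DF3 = 7.815
CRITICAL_5PCT_DF1 = 3.841
print(f"goodness of fit: {gof:.3f} (reject 9:3:3:1 at 5% if above {CRITICAL_5PCT_DF3})")
print(f"independence: {independence:.3f} (reject independence at 5% if above {CRITICAL_5PCT_DF1})")
```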

Bringing this back to your classroom:

Consider how you might design a STEM experiment for students using data, or use this as a learning experience for students.

The egg problem

The following dataset contains data describing the number of eggs found in fleas of various sizes.

Egg relationship

The data is taken from here: http://www.biostathandbook.com/linearregression.html

Some possible directions of investigation:

Is the weight of the fleas normally distributed? Is the number of eggs normally distributed?
The hypothesis suggests that the number of eggs is proportional to the weight of the flea.

- Does the data support this hypothesis? What is the ratio of explained to unexplained variance?

- Are the data points randomly/normally scattered around the model?
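The variance question can be explored with a least-squares fit, sketched below. The weights and egg counts are placeholders, not the dataset values. The fit includes an intercept; a strictly proportional model would force the line through the origin, so the size of the fitted intercept is itself worth examining.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder data: flea weight (arbitrary units) and number of eggs.
weight = [5.0, 6.0, 7.0, 8.0, 9.0]
eggs = [18, 22, 23, 29, 30]

slope, intercept = fit_line(weight, eggs)
residuals = [e - (slope * w + intercept) for w, e in zip(weight, eggs)]

mean_eggs = sum(eggs) / len(eggs)
ss_total = sum((e - mean_eggs) ** 2 for e in eggs)  # total variation in egg count
ss_unexplained = sum(r ** 2 for r in residuals)     # variation left after the fit
ss_explained = ss_total - ss_unexplained

r_squared = ss_explained / ss_total
ratio = ss_explained / ss_unexplained
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"R^2 = {r_squared:.3f}, explained/unexplained = {ratio:.1f}")
```

Plotting `residuals` against `weight` gives a quick visual check of whether the points scatter randomly around the fitted line.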

Open investigation

Medical trends in Singapore

There are many public datasets for medical statistics in Singapore, for example:

Consider studying the relationship between these datasets and other datasets, such as:

Vaccination statistics: https://www.moh.gov.sg/covid-19/statistics
Singapore Tourism Analytics Network: Monthly visitor arrivals - https://stan.stb.gov.sg/content/stan/en/tourism-statistics.html
Dept. of Statistics, Singapore: International visitor arrivals - https://www.singstat.gov.sg/publications/reference/ebook/industry/tourism
