Georgia Tech Research High School Course
with Dr. Tuba Ketenci & Tsion Fitsum
The following is an overview of the projects I've done throughout the internship to build my skills in data analytics and to improve my comprehension of code through courses on Codecademy. These courses included many lessons and quizzes to test my knowledge. After passing each quiz, I moved on to the next step in the course.
Throughout this internship I've built my skills through courses and applied them to projects. While I completed many courses, here are my 4 favorites in terms of building my data analysis skills:
Through this course, I developed a strong foundation in statistical analysis and hypothesis testing, focusing on how sample data can be used to make inferences about a larger population. I learned the importance of distinguishing between sample means and population means and how factors like sample size and selection bias can influence results.
Furthermore, I gained insight into potential pitfalls in data analysis, including Type I errors (detecting a difference or correlation that is not really there) and Type II errors (overlooking a real one). This knowledge has enhanced my ability to critically evaluate data and make informed, data-driven decisions, which is an essential skill in industrial engineering and software development.
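To illustrate why sample size matters, here is a minimal sketch (my own illustration, not code from the course). It assumes a made-up population of 10,000 values with a mean near 100, draws many random samples of two different sizes, and compares how much the sample means spread out. The names population, sample_mean_spread, spread_small, and spread_large are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated values centered near 100.
population = [random.gauss(100, 15) for _ in range(10_000)]
population_mean = statistics.mean(population)


def sample_mean_spread(sample_size, trials=500):
    """Draw `trials` random samples and return the standard deviation
    of their sample means (how far sample means tend to stray)."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


spread_small = sample_mean_spread(10)    # small samples
spread_large = sample_mean_spread(100)   # larger samples

print(f"Population mean: {population_mean:.2f}")
print(f"Spread of sample means (n=10):  {spread_small:.2f}")
print(f"Spread of sample means (n=100): {spread_large:.2f}")
```

The sample means from the larger samples cluster more tightly around the population mean, which is why a bigger sample usually supports a more reliable inference.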
In this course, I learned how to build and read graphs and how to choose effective methods for analyzing data. More specifically, I learned about the statistical background behind data analysis. This includes stratified sampling, data collection methods, misleading and confusing graphs, and bias in data analysis. This was an important course in building my foundational skills in data analysis before using code. It gave me a better understanding of which statistical methods can be used when conducting research.
I learned how to transform raw columns of data into analytics-ready tables using column-based calculations. I learned that you can display a summary table of column information using the .info() method. In its output, # indicates the column index number, Column refers to the column name, Non-Null Count is the number of non-missing values in the column, and Dtype is the column's data type. The sample project used for this course was based on a csv data file on national parks. In this lesson, I learned how to make raw data ready for analysis by renaming columns, using arithmetic operators and pandas methods to calculate new columns, using pandas string methods to modify and manipulate text columns, correcting data types, and identifying missing data.
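The cleaning steps described above can be sketched roughly as follows. This is not the course's actual code or data: the small parks table, its column names, and its values are all made up for illustration, and only standard pandas calls are used.

```python
import pandas as pd

# Hypothetical stand-in for the national parks file (made-up values).
parks = pd.DataFrame({
    "Park Name": ["example park a", "example park b", "example park c"],
    "State": ["CA", "UT", "WA"],
    "Acres": ["1000", "4000", "2500"],          # numbers stored as text on purpose
    "Annual Visitors": [2500.0, None, 7500.0],  # one missing value
})

# Summary table: index (#), Column, Non-Null Count, Dtype.
parks.info()

# 1. Rename columns to consistent, code-friendly names.
parks = parks.rename(columns={
    "Park Name": "park_name",
    "State": "state",
    "Acres": "acres",
    "Annual Visitors": "annual_visitors",
})

# 2. Correct a data type: convert the text column to numbers.
parks["acres"] = pd.to_numeric(parks["acres"])

# 3. Use a pandas string method to tidy a text column.
parks["park_name"] = parks["park_name"].str.title()

# 4. Calculate a new column with an arithmetic operator.
parks["visitors_per_acre"] = parks["annual_visitors"] / parks["acres"]

# 5. Identify missing data: count missing values per column.
missing_counts = parks.isna().sum()
print(missing_counts)
```

Converting acres to a numeric type before dividing matters, because arithmetic on a text column would fail or give meaningless results.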
In this course, I learned how to calculate the variance of a dataset, a measure of spread that is useful when comparing datasets. I used code to analyze NBA player heights in this course. I found that LeBron James was 0.55 standard deviations above the mean of NBA player heights. He's taller than average, but compared to the other NBA players, he's not absurdly tall. However, compared to the OkCupid dating pool, he is extremely rare! He's almost three full standard deviations above the mean. If heights are roughly normally distributed, I'd expect only about 0.15% of people on OkCupid to be more than 3 standard deviations above the mean. This is the power of standard deviation. By taking the square root of the variance, the standard deviation gives you a statistic about spread that is in the same units as the data, so it can be easily interpreted and compared to the mean.
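Here is a minimal sketch of the calculation behind a statement like "0.55 standard deviations above the mean". It is my own illustration rather than the course's code: the heights list is made up (it is not the NBA or OkCupid data), the function name standard_deviations_from_mean is hypothetical, and it assumes the population form of the standard deviation.

```python
import statistics

# Hypothetical heights in centimeters (made-up values for illustration).
heights = [170, 175, 180, 185, 190]

mean_height = statistics.mean(heights)
variance = statistics.pvariance(heights)   # average squared distance from the mean
std_dev = statistics.pstdev(heights)       # square root of the variance


def standard_deviations_from_mean(value, data):
    """Return how many standard deviations `value` is from the mean of `data`."""
    return (value - statistics.mean(data)) / statistics.pstdev(data)


distance = standard_deviations_from_mean(190, heights)

print(f"Mean: {mean_height}, variance: {variance}, std dev: {std_dev:.2f}")
print(f"A height of 190 is {distance:.2f} standard deviations above the mean")
```

Because the result is measured in standard deviations rather than centimeters, the same person can be compared against groups with very different averages and spreads.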
Here are my top 3 favorite projects so far:
In this project I learned how to collect data on the snack preferences of a wide sample of people, analyze the trends, and create visualizations of the data. My group used Google Forms to collect survey data, uploaded the data to a spreadsheet, and downloaded it as a csv file. This file was then used to analyze trends using code. The libraries I used for this project are matplotlib.pyplot, pandas, and numpy. An analysis of snack consumption frequency revealed that those who choose healthier options tend to eat their preferred snacks more often than those who opt for less healthy snacks.
In this project I learned how to import and inspect data by exploring numeric columns. With pandas imported as pd, I first imported the dataset in laptops.csv and assigned it to the variable laptops:
    laptops = pd.read_csv('laptops.csv')
I then used a Series method to display the minimum, maximum, and other summary statistics for the age column:
    laptops['age'].describe()
I did the same for the event_year column to determine the earliest and latest years in the dataset. I used another Series method to display the most common laptop repair problems in the dataset:
    laptops['problem'].value_counts()
Power and battery issues were the most common. To show each problem's share of the total instead of raw counts, I passed normalize=True, which returns proportions between 0 and 1:
    laptops['problem'].value_counts(normalize=True)
Finally, I used the same method to find the proportion of laptops in each repair_status category:
    laptops['repair_status'].value_counts(normalize=True)
One of our final projects was to write a research paper using all the skills learned. My group member and I used the High School Longitudinal Study of 2009 to analyze a dataset of students and see whether variables such as socioeconomic status, first language, gender, and extracurricular activities were related to students' scores on a standardized math test. Key findings revealed that male students scored slightly higher than their peers (β = 0.22, p < 0.001), and students whose first language is the language of instruction performed significantly better (β = 1.03, p < 0.001). Participation in arts and general clubs was positively associated with test scores (β = 1.55 and β = 0.50, respectively, both p < 0.001). However, participation in academic extracurricular activities showed a surprising negative relationship with test scores (β = -1.74, p < 0.001), suggesting a need for further investigation into potential causes.
~ STEAM Day Career Slides ~
I presented what I do as an intern at Georgia Tech to Junior and Elementary academy students. I shared a little about myself and then a few engineering projects I have done over the past few years. Then I presented a project from my internship (the Snack Stats Project) in simple terms for students to understand. Finally, I had the students play a Kahoot game based on my presentation for an opportunity to win candy.
Internship Reflection
Compared to other experiences I've had, this internship has definitely helped me grow as a professional and develop the necessary skills for my major of interest. I learned about the importance of time-management and communication. This internship has also challenged me
My internship has impacted my plans for the future because it has made me feel more prepared: I learned the basics of data analysis using Python and R, and I can now create graphs from csv file data.