On the left is a project I created outside of class to double-check my work in my statistics class. It takes a data set as input and computes the following descriptive statistics:
Central Tendency: Mean, Median, Mode, and a 10% Trimmed Mean (robust to outliers).
Dispersion: Minimum, Maximum, Range, Variance, Standard Deviation, Interquartile Range (IQR), and Coefficient of Variation (CV).
Distribution Shape: Skewness (asymmetry) and Kurtosis (tailedness).
Outlier Detection: Lower Inner Fence (LIF), Upper Inner Fence (UIF), and a list of potential outliers.
Below is an example of the program's output:
I have also built a similar project in Java, which demonstrates ideas such as descriptive statistics and exploratory data analysis (EDA).
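The statistics listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the project's actual code: the function name `describe`, the dictionary keys, the sample (n - 1) variance, the "inclusive" quartile method, and the moment-based skewness and kurtosis formulas are all assumptions, and other conventions would give slightly different numbers.

```python
import statistics as st


def describe(data, trim=0.10):
    """Return a dict of descriptive statistics for a list of numbers.

    Assumes at least two values, a nonzero mean, and nonzero variance.
    """
    xs = sorted(data)
    n = len(xs)
    mean = st.mean(xs)
    variance = st.variance(xs)  # sample variance (n - 1 denominator)
    sd = variance ** 0.5

    # Trimmed mean: drop int(trim * n) values from each end, then average.
    k = int(trim * n)
    trimmed_mean = st.mean(xs[k:n - k])

    # Quartiles, IQR, and inner fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
    q1, _, q3 = st.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Moment-based shape measures (population moments, excess kurtosis).
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n

    return {
        "mean": mean,
        "median": st.median(xs),
        "modes": st.multimode(xs),  # list, since ties are possible
        "trimmed_mean": trimmed_mean,
        "min": xs[0],
        "max": xs[-1],
        "range": xs[-1] - xs[0],
        "variance": variance,
        "std_dev": sd,
        "iqr": iqr,
        "cv": sd / mean,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }


result = describe([2, 4, 4, 5, 6, 7, 8, 9, 10, 45])
print(result["outliers"])  # [45]
```

In this made-up sample the single large value pulls the mean up to 10, while the median (6.5) and the 10% trimmed mean (6.625) stay close to the bulk of the data, which is why the trimmed mean is described as robust to outliers.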
On the left are mock statistics provided by one of my professors to help us practice wrangling data and, by the end, build meaningful data visualizations in Python that reveal trends. The image shows only a small excerpt of the Excel spreadsheet. The full sheet spans thousands of rows, which is why it was chosen: it contains enough data points to produce accurate graphs of many kinds.
The main code block of this project handles several important data visualization tasks using the pandas, matplotlib, and seaborn libraries. Here's a quick summary:
Data Loading: The project starts by importing the dataset into a pandas DataFrame.
Geographical Distribution: It then creates a scatter plot with matplotlib to visualize the geographic spread of housing in California. The plot uses longitude and latitude for the coordinates, with point colors representing the median house value and sizes based on population.
Income Distribution: Next, a histogram is generated using matplotlib to display the distribution of median income across households in the dataset.
Feature Correlation: Lastly, seaborn is used to generate a correlation heat map, showing the linear relationships between all numerical features, which helps in identifying how different variables are related.
Below is the code that was run and the graphs it produced:
Geographical Distribution: Housing values are highest along the coast and in urban centers, often correlating with higher population density.
Median Income Histogram: The distribution of median income is right-skewed, meaning most households have lower to mid-range incomes, with fewer in the high-income brackets.
Correlation Heat Map: There's a strong positive correlation between median_income and median_house_value.
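The workflow summarized above could be sketched roughly as follows. This is not the code shown in the screenshots, only a minimal illustration: the function names, the file name `housing.csv`, the column names `longitude`, `latitude`, and `population`, the marker size scaling, and the color maps are assumptions chosen for the example (`median_income` and `median_house_value` come from the heat map description).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_geography(df):
    """Scatter plot of location, colored by house value, sized by population."""
    fig, ax = plt.subplots(figsize=(8, 6))
    points = ax.scatter(
        df["longitude"],
        df["latitude"],
        c=df["median_house_value"],  # color encodes median house value
        s=df["population"] / 100,    # size encodes population (arbitrary scale)
        cmap="viridis",
        alpha=0.4,
    )
    fig.colorbar(points, ax=ax, label="median_house_value")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    return fig


def plot_income_hist(df, bins=50):
    """Histogram of the median_income column."""
    fig, ax = plt.subplots()
    ax.hist(df["median_income"], bins=bins)
    ax.set_xlabel("median_income")
    ax.set_ylabel("count")
    return fig


def plot_correlation(df):
    """Heat map of pairwise Pearson correlations between numeric columns."""
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
    return fig


def main(csv_path="housing.csv"):  # hypothetical file name
    df = pd.read_csv(csv_path)
    plot_geography(df)
    plot_income_hist(df)
    plot_correlation(df)
    plt.show()
```

Calling `main()` with the path to the data file would display all three figures. Each plotting function returns its figure, so a figure can also be saved with `fig.savefig(...)` instead of being shown.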