The Data Science Approach
A Data Science approach is necessarily exploratory. Whereas traditional empirical approaches tend to use data to either support or reject a scientific hypothesis, in Data Science we start out largely agnostic about what the data will show. Our aim is to explore the data, learn how it is distributed, investigate latent relationships within it, and eventually develop pipelines to optimize data integration and analysis.
This approach does not preclude us from investigating a particular hypothesis, but by the time we do so, we have a much better understanding of the data. Employing a Data Science approach gives us insight into the data's distribution, quality, and internal relationships, which can help us formulate and refine more meaningful hypotheses.
To foster a Data Science approach, it is critical to think about our data as distributions rather than focusing on individual datapoints, thus allowing us a deeper understanding of variability, patterns, and relationships within the data.
Once we view data as distributions, we can summarize and visualize them to get a better grasp of their spread, central tendency, skewness, and missing values, all of which help in identifying anomalies, biases, or inconsistencies in the data. Moreover, several machine learning methods depend on the distribution of the data. Linear discriminant analysis, for example, assumes normally distributed classes. PCA does not strictly require normality, but it summarizes the data only through variances and covariances, so it is most informative when the data are approximately multivariate normal.
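To make the idea of treating a variable as a distribution more concrete, here is a minimal Python sketch using only the standard library. The function name `describe`, the sample values, the use of `None` to mark a missing value, and the 1.5 × IQR rule for flagging possible anomalies are all assumptions chosen for illustration, not part of any specific method described above.

```python
import math
import statistics


def describe(values):
    """Summarize one variable as a distribution.

    Missing entries are assumed to be encoded as None or NaN.
    """
    observed = [v for v in values if v is not None and not math.isnan(v)]
    n = len(observed)

    mean = statistics.fmean(observed)
    median = statistics.median(observed)
    std = statistics.pstdev(observed)  # population standard deviation

    # Skewness as the third central moment divided by std cubed.
    # Positive values indicate a longer right tail.
    if std > 0:
        skewness = sum((v - mean) ** 3 for v in observed) / n / std ** 3
    else:
        skewness = 0.0

    # Quartiles and a simple 1.5 * IQR fence to flag possible anomalies.
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in observed if v < low or v > high]

    return {
        "count": n,
        "missing": len(values) - n,
        "mean": mean,
        "median": median,
        "std": std,
        "skewness": skewness,
        "iqr": iqr,
        "outliers": outliers,
    }


# Hypothetical measurements with one missing value and one extreme value.
sample = [2.0, 3.0, 3.0, 4.0, None, 5.0, 40.0]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sample the mean lies well above the median and the skewness is positive, which is the kind of signal that would be hard to notice by inspecting individual data points. In practice a library such as pandas offers comparable summaries, but the hand-written version shows what each quantity measures.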