NSCC 328 - STATISTICAL SOFTWARE
Lesson 1. Statistical Software and Tools [including freeware]
S T A T I S T I C A L S O F T W A R E [1]
Def.
Statistical software, or statistical analysis software, refers to tools that assist in the statistics-based collection and analysis of data to provide science-based insights into patterns and trends. These tools often use statistical analysis theorems and methodologies, such as regression analysis and time series analysis, to perform data science.
S T A T I S T I C A L A N A L Y S I S [1]
Def.
Statistical analysis is a form of quantitative data science. As the name suggests, it employs statistics, which is “the science that deals with the collection, classification, analysis and interpretation of numerical facts or data…by use of mathematical theories of probability.”
Researchers, data scientists and analysts may use statistical analysis to:
Investigate and present information revealed by datasets
Explore the relationships between data points
Identify underlying trends and patterns in data
Generate and prove or disprove the validity of probability models
Use analytical algorithms to make predictions for the future
Uncover actionable insights
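As an illustration of using an algorithm to make a prediction, the sketch below fits a straight line to a small dataset by ordinary least squares and uses the fitted line to forecast one new value. The data, the variable names and the choice of Python are assumptions made for illustration only; none of them come from the cited source.

```python
from statistics import mean

# Hypothetical data: months of operation (x) and monthly sales (y).
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

# Ordinary least squares: slope = Sxy / Sxx, intercept = y_bar - slope * x_bar.
x_bar, y_bar = mean(months), mean(sales)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales))
s_xx = sum((x - x_bar) ** 2 for x in months)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def predict(x: float) -> float:
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


forecast = predict(6)
print(f"y = {slope:.2f} * x + {intercept:.2f}; forecast for month 6: {forecast:.2f}")
```

A statistical package would also report how well the line fits and how uncertain the forecast is, which this sketch leaves out.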
T Y P E S O F S T A T I S T I C A L M E T H O D S [1]
There are two important statistical methods used in data analysis — descriptive and inferential statistics. Both methods are important and give different insights.
Descriptive statistics is the kind of statistics that generally comes to most people’s minds when they hear “statistics.” It refers to the analysis of data that helps describe or summarize data in a meaningful way. It simplifies large quantities of data for easy interpretation, without drawing conclusions beyond the analysis or testing any hypotheses. Instead of presenting data in its raw form, descriptive statistics allows us to present and interpret data more easily.
In contrast, inferential statistics allows analysts to test a hypothesis based on a sample of data from which they can make inferences and generalizations about the greater whole. Inferential statistics tries to draw conclusions that reach beyond the data available, including conclusions about future outcomes.
For descriptive statistics, we choose a group to study, measure all the subjects in that group and describe the group in exact numbers. Descriptive statistics can be helpful in looking into such things as the spread and center of the data, but because descriptive statistics are stated in exact numbers, they cannot be used to make broader generalizations or conclusions.
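A minimal sketch of this idea, assuming a hypothetical class of eight students whose exam scores have all been recorded. Because the whole group is measured, the numbers below describe that group exactly and say nothing about other classes. The scores and names are invented for illustration.

```python
import statistics

# Hypothetical exam scores for every student in one class (the whole group).
scores = [72, 85, 90, 66, 85, 78, 94, 81]

summary = {
    "count": len(scores),
    "mean": statistics.mean(scores),      # center
    "median": statistics.median(scores),  # center, robust to extreme values
    "mode": statistics.mode(scores),      # most frequent score
    "min": min(scores),
    "max": max(scores),
    # pstdev is the population standard deviation (spread), used here
    # because the whole group was measured rather than a sample.
    "std_dev": statistics.pstdev(scores),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```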
For inferential statistics, we instead start by defining the target population and then plan how to obtain a representative sample. After analyzing the sample and testing hypotheses based on the sample data, the result will be expressed in confidence intervals and margins of error, reflecting the uncertainty of using a sample that cannot perfectly represent the population.
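The sketch below shows one common way to express that uncertainty: a 95% confidence interval for a population mean, computed from a sample. The sample is simulated with a fixed seed, and the interval uses the normal approximation; for small samples a critical value from the t distribution would be more appropriate. All values and names are assumptions for illustration.

```python
import math
import random
import statistics

random.seed(328)  # fixed seed so the sketch is repeatable

# Hypothetical sample: 50 measurements drawn from a much larger population.
sample = [random.gauss(100, 15) for _ in range(50)]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample (n - 1) standard deviation
standard_error = sample_sd / math.sqrt(n)

# Critical value for 95% confidence from the standard normal distribution.
z = statistics.NormalDist().inv_cdf(0.975)
margin_of_error = z * standard_error

ci_low = sample_mean - margin_of_error
ci_high = sample_mean + margin_of_error
print(f"mean = {sample_mean:.2f}, margin of error = {margin_of_error:.2f}")
print(f"95% confidence interval: ({ci_low:.2f}, {ci_high:.2f})")
```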
Both kinds of statistics are at the heart of the statistical analysis that powers statistical software, used hand in hand to solve problems with intelligence.
B E N E F I T S O F S T A T I S T I C A L S O F T W A R E [1]
Statistical software can help everyone in many different ways. As the practice of collecting and analyzing data and transforming it into actionable insights, statistics can add even more value to your data. Statistical analysis can give insight into how to make data-driven decisions effectively, and help you think ahead with predictive analytics models based on historical data.
Statistics can be difficult to perform, but with the right tools, it can be a breeze.
So, what are the benefits of using a statistical analysis tool?
Increases efficiency from streamlined and automated data analysis workflows
Returns more accurate predictions based on machine learning, statistical algorithms and hypothesis testing
Offers easy customization, so you can ensure the software processes your data correctly and produces the results you want
Grants access to larger databases, which reduces sampling error and enables more precise conclusions
Empowers you to make data-driven decisions with confidence
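The link between larger datasets and smaller sampling error in the list above can be made concrete with the standard error of a sample mean, which equals the standard deviation divided by the square root of the sample size. The sketch below assumes a hypothetical standard deviation of 10 and shows that quadrupling the sample size halves the standard error.

```python
import math

population_sd = 10.0  # assumed value, for illustration only


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)


errors = {n: standard_error(population_sd, n) for n in (25, 100, 400)}
for n, se in errors.items():
    print(f"n = {n:>3}: standard error = {se:.2f}")
```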
S T A T I S T I C A L S O F T W A R E T O O L S [2]
S P S S S T A T I S T I C S [2]
SPSS Statistics is a statistical software package from IBM that can quickly crunch large data sets to provide insights for decision-making and research. According to IBM’s website, 81% of reviewers rank SPSS as easy to use, making it a good choice for novice users as well as expert statisticians. It can also estimate and uncover missing values in data sets, allowing for more accurate reports. Scalable and agile, SPSS Statistics is built to work with large volumes of data with as many user licenses as needed, performing anything from descriptive analytics to advanced statistics simulations.
Data Connectivity and Preparation
SPSS Statistics can read and write data from many different file formats and sources, including ASCII text files, spreadsheets and databases like Microsoft Excel and Microsoft Access and those from other statistics packages. It then streamlines and automates the data preparation process to identify missing data or invalid values and clean up large data sets in a single step. SPSS Statistics allows for greater accuracy in data analysis with its data conditioning workflow.
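To show what identifying missing or invalid values involves, here is a language-neutral sketch written in Python. It is not SPSS syntax and does not reproduce how SPSS works internally; the records, field names and the valid age range are all hypothetical.

```python
import math

# Hypothetical survey records; None or NaN marks a missing answer.
records = [
    {"id": 1, "age": 34, "income": 52000.0},
    {"id": 2, "age": None, "income": 48000.0},
    {"id": 3, "age": 29, "income": float("nan")},
    {"id": 4, "age": -5, "income": 61000.0},  # invalid: negative age
]


def is_missing(value) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def problems(record: dict) -> list:
    """Return a list of data-quality issues found in one record."""
    issues = []
    for field in ("age", "income"):
        if is_missing(record[field]):
            issues.append(f"{field} missing")
    age = record["age"]
    if not is_missing(age) and not (0 <= age <= 120):  # assumed valid range
        issues.append("age out of range")
    return issues


report = {r["id"]: problems(r) for r in records}
clean = [r for r in records if not report[r["id"]]]

for record_id, issues in report.items():
    print(record_id, issues or "ok")
print(f"{len(clean)} of {len(records)} records are clean")
```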
Comprehensive Statistical Analysis
SPSS is a robust solution that can perform almost every kind of statistical analysis, including but not limited to linear and non-linear models, simulation modeling, Bayesian statistics, custom tables, complex sampling, advanced and descriptive statistics, regression and more. Users can additionally automate statistical procedures using SPSS syntax, creating customized data analyses. It also can perform geospatial analysis. Users can dig deeper into their data with customized tables through ad-hoc analysis.
Ease of Use
With a user-friendly UI, SPSS features a point-and-click interface that employs drop-down menus and drag-and-drop functionality. It also features natural language processing. Together, these make it possible for users without technical or coding knowledge to perform statistical analysis.
Predictive Analytics
In addition to being able to perform predictive analytics, users can tailor the platform to their needs, allowing for better predictions over time. With multiple machine learning algorithms and simulators, SPSS uses functions like time series analysis, forecasting, temporal causal modeling and neural networks to uncover complex possible relationships between variables. It can account for the uncertainty of the future with probability distributions, and it improves its predictive models with multilayer perceptron and radial basis function neural networks.
Export with Ease
Users can export their data to SPSS’ proprietary file format or a variety of widely accessible formats like text, Microsoft Word, PDF, Excel, HTML, XML, XLS and more. Users can also export visualizations to a variety of graphic image formats.
Watch: Learn SPSS in 15 minutes
S A S / S T A T [2]
SAS/STAT is a cloud-based platform that allows users to harness tools and procedures for statistical analysis and data visualization. Designed to address both specialized and enterprise-wide analytics needs, it is used by business analysts, statisticians, data scientists, researchers and engineers primarily for statistical modeling, observing trends and patterns in data and aiding in decision-making. Its procedures are multithreaded, performing multiple operations at once, increasing the efficiency and stability of the program. Users can create hundreds of built-in, customizable statistical charts and graphs.
SAS has an established reputation in the industry for reliable results and ensures that code produced with SAS/STAT is documented and verified to meet corporate and governmental compliance requirements. SAS itself is proprietary rather than open-source software, but it is designed as an open analytics platform that gives users the freedom to experiment and program in either the interface or the coding language of their choice.
Ready-to-Use Statistical Procedures
SAS/STAT comes with a wide range of more than 100 built-in statistical analysis procedures for both descriptive and inferential statistics. Users can create many different kinds of analytical models, including linear and nonlinear models, Bayesian models, accelerated failure time models, Cox regression models, nested models and finite mixture models. Users can also perform analysis of variance, categorical data analysis, causal inference, distribution analysis, psychometric analysis, regression analysis, spatial analysis and much more.
Predictive Modeling
Predictive analytics such as that found in SAS/STAT helps users to predict the future, providing information that leads to enhanced and better-informed decision-making. SAS/STAT users can calculate the probability and possibility of outcomes using predictive modeling based on data mining. SAS/STAT features a number of predictive modeling procedures that can implement regression analysis, effect selection, logistic regression analysis, linear least squares modeling, partial least squares regression modeling and transformation regression modeling.
Data Size Suitability
SAS/STAT intelligently analyzes data based on its type and size. It analyzes small data sets with exact techniques, large datasets with high-performance statistical modeling and helps fill in missing values with modern analysis methods.
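To illustrate what an exact technique for a small data set looks like, the sketch below implements one well-known example, Fisher's exact test for a 2x2 table, which computes the p-value directly from the hypergeometric distribution instead of relying on a large-sample approximation. It is written in Python, not SAS, and makes no claim about which exact methods SAS/STAT applies; the table counts are hypothetical.

```python
from math import comb


def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Hypergeometric probability of the 2x2 table [[a, b], [c, d]]
    given its row and column totals."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Sum the probabilities of all tables with the same margins that are
    no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    observed = table_probability(a, b, c, d)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p <= observed * (1 + 1e-9):  # small tolerance for float ties
            p_value += p
    return p_value


# Hypothetical small study: rows are treatment groups, columns are outcomes.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"two-sided exact p-value: {p_value:.4f}")
```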
Centralized Repository
Metadata is stored in a centralized repository, allowing for easy integration of SAS/STAT models into other SAS solutions on their platform, including SAS Analytics Pro, SAS University Edition, SAS In-Memory Statistics and SAS Visual Statistics.
Online Documentation and Support
Users can take advantage of SAS’ extensive online resources. In addition to a free e-learning course on statistics and how-to videos, SAS also offers comprehensive online documentation with a rich set of examples to help users get up and running with the solution. Users also have access to technical support and online communities to find the answers to their every statistical question.
S T A T A [2]
Stata is a statistical solution designed for data scientists, used for data manipulation, exploration, visualization and statistical analysis. With both a graphical user interface and command line structure, Stata is accessible to users with or without coding knowledge. Stata is used by researchers in many fields, including behavioral science, education, medical research, economics, political science, public policy, sociology, finance, business and marketing. It features some level of graphics customization, as users can customize the size of the text, markers, margins and other elements in their graphics.
Stata is available in four different packages, which can analyze different numbers of variables and require more or less memory to run:
Stata/MP: the fastest and largest version of Stata
Stata/SE: Stata for large datasets
Stata/IC: Stata for mid-sized data sets (renamed Stata/BE as of Stata 17)
Numerics by Stata: Stata for embedded and web applications
Statistical Functions
Stata can provide users all the tools they need to perform data science. It includes a broad suite of statistical functions, including but not limited to linear models, panel/longitudinal data, time series analysis, survival analysis, Bayesian analysis, selection models, choice models, extended regression models, generalized linear models, finite mixture models, spatial autoregressive models, nonlinear regression and more.
Predictive Analysis
Stata helps users anticipate the future. It has lasso tools that allow users to predict outcomes, characterize groups and patterns and perform inferential statistics on data.
Automated Reporting
Users can automate reports, which can be created in Word, Excel, PDF and HTML files directly from the solution. The look of the reports can be customized using Markdown text-formatting language.
Advanced Programming with Reproducibility
In addition to the Stata programming languages ado and Mata, users can also incorporate C, C++ and Java plug-ins via a native API. Stata also has Python integration, so users can embed and execute Python code directly within the program. Stata features integrated versioning, which allows scripts and programs written years and years ago to continue to work in modern versions of its platform. Created from version 1.0 with reproducible research in mind, scripts written in 1985 will run and produce the same results in 2020 and 2050 and beyond. This frees users from the shackles of keeping and maintaining multiple installations of different versions of Stata, as the most up-to-date version of Stata will always be able to understand older code and datasets, eliminating broken scripts even if users change operating systems or jump to a version of Stata many versions ahead.
Publication-Quality Graphics
Stata enables users to generate uniquely styled, high-quality graphics in many different styles with point-and-click ease. Users can create bar charts, box plots, histograms, spike plots, pie charts, scatterplots, dot charts and more. Users can also write scripts to produce graphs en masse in a reproducible manner. Graphics can be exported to a variety of formats: EPS or TIFF for publication, PNG or SVG for online distribution, or PDF for viewing and sending. With a graph editor, users can customize how their visualizations look, by adding, moving, modifying or removing elements, with the option to record changes and apply those edits to other graphs.
Easy Import and Export of Data
Users can import and export data from a myriad of formats, including XLS, CSV, spreadsheets, SQL sources, ASCII files, text, etc. Stata can also import files from SAS or SPSS, ensuring that it has compatibility with other popular statistical software.
Technical Support and Resources
Stata technical support is free to registered users, allowing for an extra benefit on top of user subscriptions. Stata has a dedicated staff of programmers and statisticians who can answer users’ technical questions, assist in graphics customization and explain the ins and outs of statistical modeling.
Stata also has a YouTube channel full of free video resources, an informative blog, free webinars including a regularly offered “Ready. Set. Go Stata.” webinar on getting started with Stata, as well as a variety of inexpensive online Net Courses that help users maximize the return on their investment.
Watch: Learn Stata in 15 minutes
M I N I T A B [2]
Minitab is a statistics package that delivers statistical analysis, data visualizations and data analytics to help users improve data-driven decision making. It can analyze all kinds of datasets, from small to large, and automates statistical calculations and the creation of graphs, allowing users to focus more on data analysis. Minitab allows users to customize menus, toolbars, preferences and profiles, and it offers powerful macro scripting capabilities.
Minitab is currently available for installation on Windows or Mac operating systems only, with no SaaS or mobile options.
Data Preparation
With a seamless, one-click import process, Minitab takes the hard work out of data prep and allows users to quickly sort through and transpose their data.
Descriptive and Inferential Statistics
Minitab can perform statistical analysis on data sets and identify distributions, correlations, outliers and missing values. With a variety of analyses at their command, including analysis of variance, regression, experiment design, variable control charts and reliability/survival analysis, users can probe their data with any number of statistical tests.
Predictive Analytics
Minitab has advanced predictive analytics and machine learning algorithms at its disposal that allow for an even deeper dive into data. With tools for logistic regression, time series analysis, factor analysis and cluster variables, users can take a peek into future possibilities.
Visualizations
Minitab can generate a wide range of graphics to display users’ findings, including scatterplots, matrix plots, boxplots, histograms, charts, time series plots, probability plots and more. These graphics automatically update as data changes, and users can dig deeper on their visualizations with a brushing feature that zooms into sections of their graphs.
Users can export their graphics to TIF, JPEG, PNG, BMP, GIF, or EMF files, or directly to Microsoft Word or PowerPoint for sharing with others.
Minitab Assistant
One of Minitab’s key offerings is the Minitab Assistant, which guides users through the analytical process and assists them in interpreting and presenting their results. It features an interactive decision tree that helps users pick the correct statistical analysis for their needs. It also provides step-by-step support, including definitions of terms and illustrated examples, to help provide better context and clearer guidelines for effective, accurate analysis.
With simple dialogs and fields that change dynamically based on input, the Assistant streamlines the statistical analysis process and returns a series of easy-to-understand reports that help users interpret their results with confidence.
Technical Support and Documentation
Minitab offers a free Quick Start resource that introduces users to the platform’s basic functions and navigation. They also offer animated lessons and hands-on exercises, sold separately as Quality Trainer e-Learning courses. There is also a host of technical documentation, as well as guides, blogs and webinars, available on the Minitab website.
Registered users also can receive technical support by phone or online from expert service representatives.
Watch: Minitab Tutorial
G R A P H P A D P R I S M [2]
GraphPad Prism is a statistics and data analysis solution specialized for scientific research. It offers a wide range of statistical functions and is used by scientists across a broad range of industries, including life sciences, biotechnology, health care and pharmaceuticals, automotive, technology and telecommunications. Though it is specialized for scientific fields, no coding knowledge is required to create a wide variety of data visualizations. Prism enables users to work smarter, not harder, with features such as work automation and one-click regression analysis, which simplifies curve fitting.
Statistical Analysis
Prism offers a comprehensive library of statistical analyses, including nonlinear regression, survival analysis, regression analysis, t-tests, nonparametric comparisons, and more. Users can avoid statistical jargon with the library of functions presented in clear language, and follow a checklist of requirements to confirm they have chosen the appropriate statistical test.
Customizable Graphics
Users can customize their graphs to tell their data’s story in whatever way they want; they can choose the type of graph, how the data is arranged, the style of the data points, labels, colors, fonts, look and more. With Prism Magic, users can apply a consistent look to a set of graphs with one-click simplicity.
Users can then export their graphs in publication quality and customize the file type, resolution, transparency, dimensions, color space, etc. of their visualizations to meet the requirements of publication. To save time in the future, users can set their default export preferences.
Real-Time Updates
When any changes are made to data sets or analyses, those changes update the results and graphs simultaneously in real time.
Online Documentation and Help
GraphPad reduces the complexity of statistics with extensive online help guides and tutorials for Prism, a graph portfolio that helps users learn how to make a wide range of graph types, sample data sets for hands-on practice and more. GraphPad offers both free and paid online courses, taught by scientists through its Prism Academy, on how to maximize an investment in statistics and data visualization.
Work Automation
Users can reduce the number of tedious steps needed to analyze data by setting up reproducible workflows, saving hours of set-up time.
Collaboration
Prism allows for enhanced collaboration with team members, with all the information in a Prism project contained in one shareable file. Others can follow your work step-by-step, adding insight and strengthening your collective research efforts.
Whether you choose statistical software for your data will depend on your situation and what kind of data you’re looking to analyze. While powerful, many solutions require at least some knowledge of statistics, data science or programming to operate. If you have the technical know-how and the drive to pursue the deepest insights you can glean from your data, statistical analysis software may be right for you.
Watch: Quick tour of Prism
M I C R O S O F T E X C E L [3]
We used Excel to do some basic data analysis tasks to see whether it is a reasonable alternative to using a statistical package for the same tasks. We concluded that Excel is a poor choice for statistical analysis beyond textbook examples, the simplest descriptive statistics, or for more than a very few columns. The problems we encountered that led to this conclusion are in four general areas:
Missing values are handled inconsistently, and sometimes incorrectly.
Data organization differs according to analysis, forcing you to reorganize your data in many ways if you want to do many different analyses.
Many analyses can only be done on one column at a time, making it inconvenient to do the same analysis on many columns.
Output is poorly organized, sometimes inadequately labeled, and there is no record of how an analysis was accomplished.
Excel is convenient for data entry, and for quickly manipulating rows and columns prior to statistical analysis. However, when you are ready to do the statistical analysis, we recommend the use of a statistical package such as SAS, SPSS, Stata, Systat or Minitab.
Excel is probably the most commonly used spreadsheet for PCs. Newly purchased computers often arrive with Excel already loaded. It is easily used to do a variety of calculations, includes a collection of statistical functions, and a Data Analysis ToolPak. As a result, if you suddenly find you need to do some statistical analysis, you may turn to it as the obvious choice. We decided to do some testing to see how well Excel would serve as a Data Analysis application.
Most of Excel's statistical procedures are part of the Data Analysis ToolPak, which is in the Tools menu. It includes a variety of choices including simple descriptive statistics, t-tests, correlations, 1 or 2-way analysis of variance, regression, etc. If you do not have a Data Analysis item on the Tools menu, you need to install the Data Analysis ToolPak. Search in Help for "Data Analysis Tools" for instructions on loading the ToolPak.
Two other Excel features are useful for certain analyses, but the Data Analysis ToolPak is the only one that provides reasonably complete tests of statistical significance. Pivot Table in the Data menu can be used to generate summary tables of means, standard deviations, counts, etc. Also, you could use functions to generate some statistical measures, such as a correlation coefficient. Functions generate a single number, so using functions you will likely have to combine bits and pieces to get what you want. Even so, you may not be able to generate all the parts you need for a complete analysis.
Unless otherwise stated, all statistical tests using Excel were done with the Data Analysis ToolPak. In order to check a variety of statistical tests, we chose the following tasks:
Get means and standard deviations of X and Y for the entire group, and for each treatment group.
Get the correlation between X and Y.
Do a two-sample t-test to test whether the two treatment groups differ on X and Y.
Do a paired t-test to test whether X and Y are statistically different from each other.
Compare the number of subjects with each outcome by treatment group, using a chi-squared test.
All of these tasks are routine for a data set of this nature, and all of them could be easily done using any of the above listed statistical packages.
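For comparison, the five tasks can also be scripted in a general-purpose language. The sketch below uses only the Python standard library on a small hypothetical dataset; it is not the data or the analysis from the cited source, and the function names are invented for illustration. It computes test statistics only: converting them to p-values requires the t and chi-squared distributions, which a statistical package or library would supply. The counts are also too small for the chi-squared approximation to be reliable, so the last result is purely illustrative.

```python
import math
from statistics import mean, stdev

# Hypothetical records: (treatment group, X, Y, outcome).
data = [
    ("A", 5.1, 6.0, "improved"), ("A", 4.8, 5.5, "improved"),
    ("A", 6.2, 6.9, "same"), ("A", 5.5, 6.1, "improved"),
    ("B", 6.8, 7.0, "same"), ("B", 7.1, 7.9, "same"),
    ("B", 6.5, 6.6, "improved"), ("B", 7.4, 8.2, "same"),
]


def column(rows, index):
    return [row[index] for row in rows]


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s_xx = sum((x - mx) ** 2 for x in xs)
    s_yy = sum((y - my) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def welch_t(a, b):
    """Two-sample t statistic that does not assume equal variances."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se


def paired_t(xs, ys):
    """Paired t statistic computed from within-subject differences."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def chi_squared(rows, group_index, outcome_index):
    """Pearson chi-squared statistic for a group-by-outcome table."""
    groups = sorted({r[group_index] for r in rows})
    outcomes = sorted({r[outcome_index] for r in rows})
    stat = 0.0
    for g in groups:
        row_total = sum(1 for r in rows if r[group_index] == g)
        for o in outcomes:
            col_total = sum(1 for r in rows if r[outcome_index] == o)
            observed = sum(
                1 for r in rows if r[group_index] == g and r[outcome_index] == o
            )
            expected = row_total * col_total / len(rows)
            stat += (observed - expected) ** 2 / expected
    return stat


group_a = [r for r in data if r[0] == "A"]
group_b = [r for r in data if r[0] == "B"]

# Task 1: means and standard deviations, overall and by treatment group.
for label, rows in (("all", data), ("A", group_a), ("B", group_b)):
    for name, idx in (("X", 1), ("Y", 2)):
        values = column(rows, idx)
        print(f"{label} {name}: mean={mean(values):.2f} sd={stdev(values):.2f}")

r_xy = pearson_r(column(data, 1), column(data, 2))                # Task 2
t_x = welch_t(column(group_a, 1), column(group_b, 1))             # Task 3 (X)
t_y = welch_t(column(group_a, 2), column(group_b, 2))             # Task 3 (Y)
t_paired = paired_t(column(data, 1), column(data, 2))             # Task 4
chi2 = chi_squared(data, 0, 3)                                    # Task 5
print(f"r={r_xy:.3f} t_x={t_x:.3f} t_y={t_y:.3f} "
      f"t_paired={t_paired:.3f} chi2={chi2:.3f}")
```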
Although Excel is a fine spreadsheet, it is not a statistical data analysis package. In all fairness, it was never intended to be one. Keep in mind that the Data Analysis ToolPak is an "add-in" - an extra feature that enables you to do a few quick calculations. So, it should not be surprising that that is just what it is good for - a few quick calculations. If you attempt to use it for more extensive analyses, you will encounter difficulties due to any or all of the following limitations:
Potential problems with analyses involving missing data. These can be insidious, in that the unwary user is unlikely to realize that anything is wrong.
Lack of flexibility in analyses that can be done due to its expectations regarding the arrangement of data. This results in the need to cut, paste, sort and otherwise rearrange the data sheet in various ways, increasing the likelihood of errors.
Output scattered in many different worksheets, or all over one worksheet, which you must take responsibility for arranging in a sensible way.
Output may be incomplete or may not be properly labeled, increasing the possibility of misidentifying output.
Need to repeat requests for some analyses multiple times in order to run them for multiple variables, or to request multiple options.
Need to do some things by defining your own functions/formulae, with its attendant risk of errors.
No record of what you did to generate your results, making it difficult to document your analysis, or to repeat it at a later time, should that be necessary.
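Two of the limitations above, silent mishandling of missing data and the lack of a record of what was done, are easier to avoid when the analysis is a script. The sketch below, written in Python with hypothetical data, makes the missing-data rule explicit before a paired comparison: only pairs with both values present are kept, and the number of dropped pairs is reported. The script itself then serves as the record of the analysis.

```python
import math
from statistics import mean, stdev

# Hypothetical paired measurements; None marks a missing value.
x = [5.0, 6.0, None, 7.0, 8.0, 6.5]
y = [5.5, None, 6.0, 7.8, 8.4, 7.0]

# Keep only the pairs in which both values are present (listwise deletion).
complete_pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
dropped = len(x) - len(complete_pairs)

# Paired t statistic on the complete pairs only.
diffs = [a - b for a, b in complete_pairs]
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

print(f"used {len(complete_pairs)} pairs, dropped {dropped}, t = {t_stat:.3f}")
```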
Watch: Excel Functions and Formulas