The Popularity of Data Analysis Software
has been moved to http://r4stats.com/articles/popularity
by Robert A. Muenchen
Abstract: This page presents various ways of measuring the popularity or market share of BMDP, JMP, Minitab, R, R-PLUS, Revolution R, S-PLUS, SAS, SPSS, Stata, Statistica, and Systat, as well as two implementations of the SAS Language, Carolina and WPS. I update this paper several times a year at http://r4stats.com to provide an ongoing view of the software. Recent updates include the current number of R add-on packages (4/13/2012), the plots on Google Scholar data in Fig. 7a & 7b (4/12/2012), the numbers of blogs for each package (3/13/2012), Listserv subscriber data (3/9/2012), StackOverflow and Crossvalidated data (3/8/2012), job data (1/4/2012), TIOBE Index values (1/2/2012), and the addition of Kaggle competition data (1/2/2012).
When choosing an analytical tool to use, there are many factors to consider. Does it run natively on your computer? Does the software provide all the methods you use? If not, how extensible is it? Does that extensibility use its own language, or an external one (e.g. Python, R, SQL) that is commonly accessible from many packages? Does it fully support the style (programming vs. point-and-click) that you like? Are its visualization options (e.g. static vs. interactive) adequate for your problems? Does it provide output in the form you prefer (e.g. cut & paste vs. LaTeX integration)? Does it handle large enough data sets? Do your colleagues use it so you can easily share data and programs? Can you afford it?
It can also be helpful to know the size of the software’s market share and whether it is growing or shrinking. Software that is popular and growing probably meets the needs of many people well, but that certainly doesn't mean it will meet yours. That said, let's examine various ways to estimate popularity and/or market share.
Sales & Downloads
Sales figures reported by some commercial vendors include products that have little to do with analysis. Not all vendors release sales figures. Open source software such as R (Ihaka and Gentleman 1996) could count downloads, but one person can download many copies, inflating the total, and many people can install from a single download, deflating it. Download counts for the R-based Bioconductor project are located at http://www.bioconductor.org/packages/stats/. Similar figures for downloads of Stata add-ons (not Stata itself) are available at http://fmwww.bc.edu/fmrc/reports/report.ssc.html. A list of Stata repositories is available at http://stata.com/links/resources2.html. The many sources of downloads, both in repositories and on individuals' web sites, make counting downloads a very difficult task.
Language Popularity Measures
The TIOBE Community Programming Index ranks the popularity of programming languages, but from a programming language perspective rather than as analytical software (http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html). In January 2012, it ranked R in 24th place and SAS in 31st. No other data analysis language covered by this article even made its top 100.
Langpop.com also ranks programming languages (http://langpop.com/) in a variety of interesting ways, but unfortunately their focus excludes statistical software.
There are some stable and objective measures regarding analytic software. Schwartz (2009) suggested estimating relative popularity by plotting the amount of email discussion devoted to each package. The most widely used packages all have discussion lists, or "listservs", devoted to them. The less popular ones either do not have such discussions or, like the list for Minitab, may have only a dozen or so emails per year. Some software packages have multiple discussion lists. For example, there are 21 devoted to using R for various focused areas such as graphics, mapping, ecology, epidemiology, etc. (http://www.r-project.org/mail.html). A broader list, including a version of R-Help in Spanish, lists 49 discussions (https://stat.ethz.ch/mailman/listinfo).
There are other discussion forums besides listservs. Ideally we could combine the data from all discussion lists and forums, but that would be too time consuming. Therefore, Figure 1 shows the level of activity on only the main discussion listserv in a typical month (i.e. forums, news groups and Google groups are excluded). Each point represents the mean of the 12 monthly counts that occurred in that year. This plot contains data through the end of 2011.
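The yearly averaging behind each plotted point can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name `yearly_means` and the sample counts are assumptions made up for the example, not the actual listserv data or the script used to produce Figure 1.

```python
from collections import defaultdict
from statistics import mean


def yearly_means(monthly_counts):
    """Average the monthly message counts within each year.

    monthly_counts maps (year, month) tuples to the number of messages
    posted to a discussion list in that month.
    """
    by_year = defaultdict(list)
    for (year, _month), count in monthly_counts.items():
        by_year[year].append(count)
    # One value per year: the mean of that year's monthly counts.
    return {year: mean(counts) for year, counts in sorted(by_year.items())}


# Made-up counts for illustration only; these are not real listserv figures.
example = {(2010, month): 100 + month for month in range(1, 13)}
example.update({(2011, month): 90 for month in range(1, 13)})

print(yearly_means(example))
```

Averaging over the 12 months smooths out seasonal swings, such as quiet holiday periods, so that each year contributes a single comparable value per package.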
We can see that R is the most discussed software by almost a two-to-one margin, followed by Stata then SAS. Keep in mind that both R and SAS have substantial amounts of discussion in other areas which, if included, would raise both of their lines substantially.
SAS saw growth in its discussion until 2006 when it leveled off and then declined. That decline could be the result of at least three factors: 1) migration to other forums such as those shown in Table 2, or the SAS corporate forum, 2) the introduction of the Enterprise Guide user interface which may generate fewer questions than programming the SAS language and 3) competition from the increased popularity of R and Stata.
R also showed a decline in 2011, perhaps also a result of migration to other forums and the fairly recent appearance of easy-to-use user interfaces such as R Commander, Deducer and Rattle.
Stata has seen substantial growth in the amount of discussion devoted to it, finally surpassing that of SAS in 2010 on its single discussion list.
SPSS has had a relatively low and consistent amount of discussion over the years. SPSS’ traditional user base is in the social sciences where, in my experience, people are less interested in programming and more interested in the product’s easy-to-use graphical user interface. It had that interface for the whole of the period shown.
R and S-PLUS are both implementations of the S language and so are in the most direct competition. From the view of Internet discussion, S-PLUS is experiencing a significant decline. In 2011, half the months showed fewer than ten notices on its list. Many of them were simply conference announcements.
Could the numbers in Figure 1 be the result of a few people doing a lot of talking? When you follow any of these discussion lists, it quickly becomes obvious that a core group of people really keep the lists humming. However, the number of people who subscribe to each list shows a similar pattern, with R-Help dominating the scene (see Table 1).
It would be interesting to see what topics were most discussed on each list. The only such analysis of which I am aware was done by Arthur Tabachnek (2010) for the SAS list. The most popular topic in 2009 turned out to be...R! You can read his full analysis here under slides from the 2010 session.
Another way people help one another is through Internet discussion sites. The site Stack Overflow (http://stackoverflow.com) covers a wide range of topics, while its sister site, Cross Validated (http://stats.stackexchange.com/), focuses on statistical analysis. At both sites users tag their topics, making it particularly easy to focus searches. As you can see in Table 2, there is far more discussion of R than the other software.
Quora.com is another site that provides advice on almost any topic, including data analysis software. However, it is not currently easy to get counts by software so I dropped coverage of it.
On Internet blogs, people write about software that interests them, showing how to solve problems and interpreting events in the field. The more popular a software package is, the more bloggers there are writing about it. Blog consolidators like Tal Galili's R-Bloggers.com and SAS-X.com, and sasCommunity.org Planet combine various blogs into a single location. While any particular blogger may write only an article every week or so, by combining them, the consolidators essentially provide a daily newspaper on various packages. So far only R and SAS are popular enough to have consolidated versions of their blogs (see Table 3).
R's 290 blogs put it way out in front of the pack, with SAS coming in at second place with 39. Stata has 7, which are listed here. Each of the other packages has either none or just a few.
Kaggle.com is a web site that sponsors data analysis contests. People post data analysis problems there along with the amount of money they are willing to pay the person or team who solves their problem the best. As I write this (1/2/2012) there are over 25,000 analysts working on over 72,000 problems. Figure 2 shows the software used by the data analysts working on the problems. R is in the lead by a wide margin. R's dominance is even greater among the contest winners, over 50% of whom used R.
A potential source of bias in these figures is that the licenses of most proprietary software prohibit its use for the benefit of outside organizations (universities can help federal grant-providing agencies such as NSF and NIH, but cannot even solve problems for government agencies in general or nonprofits). However, I manage the research software site licenses at the University of Tennessee, and I can attest to the fact that people are often unaware of this limitation.
Surveys of Use
One way to estimate the relative popularity of data analysis software is through a survey. Rexer Analytics does a survey each year asking about tools used for data mining. The difference between software for classical data analysis and software for data mining seems like more of a marketing concept than one based on any actual difference in analytic need. Figure 3 shows the results of just one "check all that apply" type question about the tools that respondents reported using in 2009 (the survey was taken in 2010).
We see that R comes out on top, followed by SAS and SPSS. The entire report contained over 40 questions on topics such as algorithms used, fields, challenges, data, impact of the economy on the field, and more. More comprehensive results are available here. It's interesting to note that SPSS and SAS are used more often than their more expensive products aimed specifically at data mining, IBM SPSS Modeler (formerly Clementine) and SAS Enterprise Miner.
The results of a similar survey done by the data mining web site KDnuggets in 2011 are shown in Figure 4. This one shows RapidMiner in first place, followed by R and Excel. It's interesting to see that all of those packages showed a decline in use since the 2010 survey, while SAS, SAS Enterprise Miner and IBM SPSS Modeler all showed slight increases. Salford and Revolution Analytics (shown under its previous name Revolution Computing) showed substantial increases, while JMP, Mathematica, Tableau and 11Ants Analytics appeared in the poll for the first time. You can see the full results and read about the survey's details here.
The KDnuggets site conducted a similar poll, this time asking, "What programming languages you used for data mining / data analysis in the past 12 months?" R dominated this poll, as shown in Figure 5.
Figure 5. Languages used in data mining or analysis.
The number of books published on each software package reflects its relative popularity. Amazon.com offers an advanced search method which works well for all the software except R. I configured it with the following parameters:
Title: SAS -excerpt -chapter -changes [using SAS as an example]
Subject: Computers & Internet
Format: All formats
Publication Date: After September, 2001 [i.e. 10 years before the search on 10/13/2011]
Since it's difficult to determine how many books use a particular software in its examples, I searched for books that included the software in the title. SAS has many manuals for sale as individual chapters or excerpts. Luckily, they contain "chapter" or "excerpt" in their title, so I excluded them using the minus sign, e.g. "-excerpt". SAS also has short "changes and enhancements" booklets that the other packages release only in the form of flyers and/or web pages, so I excluded "changes" as well.
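The effect of those minus-sign exclusions can be approximated locally with a simple word filter. The sketch below is an assumption-laden illustration: `title_matches` is a hypothetical helper, the titles are invented, and Amazon's real matching rules may well differ from plain whole-word comparison.

```python
def title_matches(title, keyword, excluded=("excerpt", "chapter", "changes")):
    """Return True if keyword is a word in title and no excluded word is.

    This mimics a search such as: SAS -excerpt -chapter -changes
    using whole-word, case-insensitive comparison (an assumption).
    """
    # Strip common punctuation so "Manual:" and "(hypothetical)" compare cleanly.
    words = {word.strip('.,:;()"\'').lower() for word in title.split()}
    return keyword.lower() in words and not words.intersection(excluded)


# Invented titles for illustration; they are not real search results.
titles = [
    "Hypothetical Guide to SAS Programming",
    "Hypothetical SAS Manual: Chapter 4",
    "SAS 9 Changes and Enhancements (hypothetical)",
    "An Unrelated Book",
]

kept = [title for title in titles if title_matches(title, "SAS")]
print(kept)
```

Only the first title survives: the second and third contain excluded words, and the fourth never mentions the package.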
SAS and SPSS both have many versions of the same book or manual still for sale. For example, Marija Norusis' 3 books on SPSS appear 20 times for various versions of SPSS released in the last 10 years. The SAS and SPSS numbers are both somewhat inflated as a result. Limiting the search to books published in the last 10 years mitigated this problem somewhat, but the SAS and SPSS figures are probably both still somewhat exaggerated.
The count of R books came from http://www.r-project.org/doc/bib/R-books.html. This list does contain seven books on S that are older but still relevant. Version numbers do not appear in any book titles, so R avoids the over-counting problem that plagued my count of SAS and SPSS manuals. The most surprising aspect of the result (Figure 6) was how extremely dominant the top few packages are and that three well-known packages had no books at all written about them (BMDP, Statistica, Systat). Revolution R and R-PLUS have no books with their names in the titles, but of course the books on R apply to them as well.
Impact on Scholarly Activity
While Internet search engines make it very easy to locate information about software, their inclusive nature makes it difficult to narrow the search enough to determine the prevalence of various packages. For example, searching for the term “SAS” quickly locates the main web site for the SAS Institute, but it also ends up including many hits regarding a shoe company, an airline and the British commando group. Even in the realm of scholarly journal articles, S.A.S. stands for over a dozen terms such as Synthetic Aperture Sonar.
The more popular a software package is, the more likely it will appear in scholarly publications as a topic and as a method of analysis. Google Scholar offers a convenient way to measure such activity. No search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The final set of search terms is described at http://librestats.com/2012/04/12/statistical-software-popularity-on-google-scholar/. Figure 7a shows the number of articles for the six most popular statistics packages from 1995 through 2011. SPSS had a surprising advantage over most other packages for much of this time. It seems suspiciously large, but after fairly extensive study of the result it does not seem to be spurious. Last year's graph did, however, have a spurious result. Stata apparently means "was" in Italian, so it appeared to follow a path similar to that of SAS while exceeding both SAS and SPSS in recent years. Changing that search to "Statacorp", which should be included in the citation for the Stata software, yielded what is probably a much more accurate set of data. The Librestats article makes it easy for anyone to try variations on these searches.
Use of SPSS and SAS in scholarly articles peaked in 2005 and 2006, respectively. The decline they have seen since may be due to competition from the other packages. The total of the other packages in 2011 is similar to the amount of decline that SPSS and SAS have seen since their peak. If the trends for SAS and R were to continue on their current trajectories, scholarly use of R could surpass SAS use in 2015. That's a big "IF" of course! In 2011 the downward trend in SPSS use flattened out quite a bit, making a forecast more difficult.
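One simple way to make that kind of "if the trends continue" projection is to fit a straight line to each package's yearly counts and solve for the year where the two lines meet. The sketch below assumes linear trends, which is itself a strong assumption, and uses invented numbers rather than the Google Scholar counts; `linear_fit` and `crossover_year` are hypothetical helper names, and this is not necessarily how the 2015 estimate above was obtained.

```python
def linear_fit(years, counts):
    """Ordinary least-squares slope and intercept for counts as a function of year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossover_year(years, falling, rising):
    """Year at which the two fitted lines intersect, or None if they are parallel."""
    slope_f, intercept_f = linear_fit(years, falling)
    slope_r, intercept_r = linear_fit(years, rising)
    if slope_f == slope_r:
        return None
    # Solve slope_f * x + intercept_f == slope_r * x + intercept_r for x.
    return (intercept_r - intercept_f) / (slope_f - slope_r)


# Invented numbers for illustration only; not the Google Scholar data.
years = [2008, 2009, 2010, 2011]
falling = [1000, 900, 800, 700]  # loses 100 articles per year
rising = [100, 200, 300, 400]    # gains 100 articles per year

print(crossover_year(years, falling, rising))
```

A projection like this is very sensitive to which years are included in the fit, which is one reason the flattening of the SPSS trend in 2011 makes forecasting harder.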
Since SAS and SPSS still dominate scholarly use by such a wide margin, I removed those two packages and added JMP and Statistica as shown in Fig. 7b. That figure shows the rapid rise of all software except Statistica. Note that the symbols and colors used in Fig. 7b do not match those in 7a. From 2008 on, R reaches the #3 spot (after SPSS and SAS) and extends its lead in consecutive years.
Figure 7b. Use of data analysis software in academic publications as measured by hits on Google Scholar, EXCLUDING SAS AND SPSS.
Web Site Popularity
Another measure of software popularity is the number of other web pages that contain links that point to the software’s main web site. Figure 8 provides those numbers, recorded using Google on January 5, 2012.
Now that SPSS is part of IBM, it dominates the results. This reflects the wide range of products that IBM sells, including computer hardware and services that have nothing to do with data analysis. However, the older SPSS.com website no longer shows up early in a web search and the IBM site that it redirects to has a tiny incoming link measure since it is not meant to be a direct link.
R is next in line with a little over half of IBM's measure, followed by SAS with well under R's value. The other software follows in an order that I suspect reflects their respective market shares. Revolution R Enterprise and R-PLUS are commercial versions of R that are relatively new to the market. WPS is an implementation of the SAS Language and Carolina is a SAS-to-Java compiler.
The number of incoming links is an important part of Google’s famous PageRank algorithm (http://en.wikipedia.org/wiki/PageRank). PageRank is made more useful for searching by (among other things) weighting the importance of each link. Links from major sites like Wikipedia would carry far more weight than would a link from a professor’s course syllabus. The practical range of PageRank is from 1 to 10. Figure 9 plots this data. The software appears in tiers, with the two dominant players, SAS and SPSS (IBM), at the highest, and their well-known alternatives one level down. I find it odd that Stata is not in this level. At the very bottom are the World Programming System (WPS) and Carolina, two companies that use the SAS language. There have been quite a few changes in this ranking since last year, with SAS, SPSS and Revolution Analytics moving up one point and R, Stata and Carolina moving down one point. The R-PLUS site maintained its PageRank of 5 this year, which is a bit surprising given that many of its links are broken, and it is in its fourth year of saying, "Be the first to get R-PLUS 3.3."
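To show why a link from a heavily linked page counts for more than a link from an obscure one, here is a minimal sketch of the textbook power-iteration form of PageRank. It is a simplified illustration, not Google's production algorithm: the damping factor of 0.85, the iteration count, and the four-page link graph are all assumptions chosen for the example, and the scores it returns sum to 1 rather than falling on the 1-to-10 scale plotted in Figure 9.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank computed by power iteration.

    links maps each page to the list of pages it links to. Every page
    that appears as a link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "hub", which links to "a".
links = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph, "a" has only one incoming link, yet it scores far above "b" and "c" because that single link comes from the highly ranked "hub". That is the weighting effect described above.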
Growth in Capability
The capability of all the software in this article has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data is hard to obtain. John Fox (2009) acquired it for R’s main distribution site http://cran.r-project.org/ by year. I collected the later years following his same method. Figure 10 displays the data with a smoothed fit. Each point represents the number of packages at CRAN when the major versions of R (e.g. 2.10, 2.11) were released. A package in R is similar to a SAS or SPSS add-on module. Each focuses on a particular topic (e.g. time series) and includes around 20 functions (procedures, commands).
R’s capability is clearly growing at a very rapid rate and is a major factor in the rapid increase in R's popularity. R does have eight other main software repositories, such as the one at http://www.bioconductor.org/ that are not included in this graph. A program run on 4/13/2012 counted 5,300 R packages at all major repositories, 3,648 of which were at CRAN. So the growth curve for the software at all repositories would be roughly 45% higher on the y-axis than the one shown in Figure 10. As with any analysis software, individuals also maintain their own separate collections typically available on their web sites.
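The "roughly 45% higher" figure follows directly from the two package counts quoted above, as this short check shows. The variable names are just illustrative.

```python
total_packages = 5300  # all major repositories, counted on 4/13/2012
cran_packages = 3648   # CRAN only, same date

# How much larger is the all-repository total than the CRAN count?
ratio = total_packages / cran_packages
percent_higher = (ratio - 1) * 100

print(f"All repositories hold about {percent_higher:.0f}% more packages than CRAN alone")
```

This assumes the proportion of packages outside CRAN has been roughly constant over time, which is what scaling the whole curve by a single factor implies.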
If this type of data becomes as easily available for the other software, I will include it in a future edition.
IT Research Firms
IT research firms study software products and corporate strategies and provide their opinions on each in reports they sell to their clients. Two such reports that focus on data mining tools are here:
Gartner Group: http://www.spss.com.hk/PDFs/Gartner_Magic_Quadrant.pdf
Both firms rank SAS and SPSS as the top two and also predict greater than 100% annual growth for open source business intelligence software.
SAS has a very substantial lead in job openings, with SPSS coming in second with fewer than a quarter of the jobs. Minitab had just over half the SPSS total and R had half again as many as that. A data analyst would do well to know SAS unless he or she were training for a field in which one of the other packages is dominant.
The most frequent question I receive about this paper is why I don't collect data on MATLAB, Mathematica, or similar open source software such as Octave, Scilab and Sage. They are, of course, quite capable of doing data analysis. However, I did not collect data on them because their use is more popular in the fields of general science and engineering, not data analysis in the statistical or predictive analytics sense. Graphs from other sources, however, occasionally do include them.
The other thing missing is the discussion I previously included on Google Trends. That site tracks not what's actually on the Internet via searches, but rather the keywords and phrases that people are entering into their Google searches. That ended up being so variable as to be essentially worthless. For an interesting discussion of this topic, see this article by Rick Wicklin.
By most of the measures discussed here, R is competing well with the commercial software vendors. However, I advise against overgeneralizing from this data. SAS and SPSS continue to dominate the corporate world and Stata is doing quite well in the scholarly arena. Each of these packages is dominant in one market or another. I'm interested in other ways to measure software popularity. If you have any ideas on the subject, please contact me at email@example.com.
If you are a SAS or SPSS user interested in learning more about R, you might consider my book, R for SAS and SPSS Users. Stata users might want to consider reading R for Stata Users, written with Stata guru Joe Hilbe.
I am grateful to John Fox (2009) for the data on R package growth and to Marc Schwartz (2009) for the idea of plotting the amount of activity on e-mail discussion lists. Thanks to Duncan Murdoch for clarifying the pitfalls of counting downloads. Thanks to Martin Weiss for pointing out how to query Statalist for its number of subscribers. Thanks to Christopher Baum for information regarding counting Stata downloads. Thanks to John (Jiangtang) HU for suggesting I add more detail from the TIOBE index. Thanks to Andre Wielki for suggesting the addition of SAS Institute's support forums. Thanks to Kjetil Halvorsen for the location of the expanded list of Internet R discussions. Thanks to both Dario Solari and Joris Meys for their suggestions on how to improve Google Insight searches. Thanks to Keo Ormsby for his suggestions regarding Google Scholar. Thanks to Karl Rexer for the use of his data mining survey data. Thanks to Gregory Piatetsky-Shapiro for the use of his KDnuggets data mining poll. Thanks to Tal Galili for advice on blogs and consolidation, as well as Stack Exchange and Stack Overflow. Thanks to Patrick Burns for his advice. Thanks to Nick Cox for advice to clarify the role of Stata's software repositories and of popularity itself. Thanks to Stas Kolenikov for the link of known Stata repositories. Thanks to Rick Wicklin for convincing me to stop trying to get anything useful out of Google Insights. Thanks to Drew Schmidt for automating the collection of the data in Figures 7a and 7b.
J. Fox. Aspects of the Social Organization and Trajectory of the R Project. R Journal, 2009. http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Fox.pdf
R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314, 1996.
R. Muenchen, R for SAS and SPSS Users, Springer, 2009
R. Muenchen, J. Hilbe, R for Stata Users, Springer, 2010
M. Schwartz, 1/7/2009, http://tolstoy.newcastle.edu.au/R/e6/help/09/01/0517.html
BMDP, Carolina, JMP, Minitab, R-PLUS, Revolution R, SAS, SAS Enterprise Miner, IBM SPSS Modeler, IBM SPSS Statistics, Stata, Statistica, Systat and WPS are registered trademarks of their respective companies.
Copyright 2010, 2011, 2012 Robert A. Muenchen