## Software Tutorial Information

Software Carpentry: Helping scientists make better software since 1997

GAMS

http://agecon2.tamu.edu/people/faculty/mccarl-bruce/

R

Quick-R: http://www.statmethods.net/

“At this time, R and Python used together gives the most power and possibilities.”

Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata

gah789 says:

January 2, 2011

I have used all of these programmes – and quite a few more – over the last 30-odd years. What one uses tends to reflect personal history, intellectual communities, cost, etc., but there are various points not highlighted in the discussion.

1. SPSS & SAS were designed as data management packages in the days when memory & CPU were expensive. They have evolved into tools for analysing corporate databases, but they are ferociously expensive for ordinary users and dealing with academic licences is a pain. Increasingly the corporate focus means that they lag behind the state of the art in statistical methods, but there is no other choice when dealing with massive datasets – oh the days when such data sets had to be read from magnetic tapes!

2. Stata’s initial USP was graphics combined with reasonable data management and statistics, but it developed an active user community which has greatly expanded its statistical capabilities if your interests match those of the community. In my view, its scripting language is not as bad as suggested by other comments and there is lots of support for, say, writing your own maximum likelihood routine or Monte Carlo analysis.

3. R (& S-Plus), Matlab (& Octave), Gauss, … are essentially developments of matrix programming languages, but they are useless for any kind of complex data management. R has a horrible learning curve but a very active research community, so it is useful for implementations of new statistical techniques not available in pre-packaged form. For many casual users what matters is the existence of a front-end – Rcmdr, GaussX, etc – that takes away the complexity of the underlying program.

4. Excel should never be used for any kind of serious statistical analysis. It is very useful for organising data or writing/testing simple models, but the key problem is that you cannot document what has been done and it is so easy to make small but vital errors – mis-copying rows for example. Actually, Statistica, JMP, and similar menu-driven programs fall into the same category: they are very good for data exploration but very poor for setting up analyses that can be checked and replicated in a reliable manner.

5. Many of us have used a variety of programming languages for data management and analysis in the past, but that is daft today – unless you are dealing with massive datasets of the SAS type and can’t afford SAS. In such cases their primary use will be the extraction and manipulation of data that is voluminous and frequently updated, but not for data analysis.
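As a rough illustration of point 2 above (an editorial sketch, not part of the original comment), here is what a hand-written maximum likelihood routine wrapped in a small Monte Carlo experiment might look like in Python. The exponential model, the golden-section search, the parameter defaults, and all function names (`exp_loglik`, `maximise`, `fit_rate`, `monte_carlo`) are assumptions chosen for brevity; in practice one would more likely use the optimiser that ships with the chosen package.

```python
import math
import random


def exp_loglik(rate, n, total):
    """Log-likelihood of an exponential(rate) model for n observations summing to total."""
    return n * math.log(rate) - rate * total


def maximise(f, lo, hi, tol=1e-8):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


def fit_rate(data):
    """Numerical maximum likelihood estimate of the exponential rate."""
    n, total = len(data), sum(data)
    # Assumed search bounds; widen them if the true rate could exceed 100.
    return maximise(lambda r: exp_loglik(r, n, total), 1e-6, 100.0)


def monte_carlo(true_rate=2.0, n_obs=200, n_reps=500, seed=12345):
    """Draw n_reps simulated samples and return the estimate from each one."""
    rng = random.Random(seed)  # fixed seed so the experiment can be replicated
    estimates = []
    for _ in range(n_reps):
        sample = [rng.expovariate(true_rate) for _ in range(n_obs)]
        estimates.append(fit_rate(sample))
    return estimates


if __name__ == "__main__":
    est = monte_carlo()
    mean_est = sum(est) / len(est)
    print(f"mean estimate over {len(est)} replications: {mean_est:.3f}")
```

For this model the estimate has a closed form (the number of observations divided by their sum), which gives a convenient check on the numerical routine. Fixing the random seed is what makes the Monte Carlo run repeatable.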

For anyone thinking about what to use, the key questions to consider are:

A. Are you primarily concerned with data management or data analysis? If data management, then steer clear of matrix-oriented languages which assume that your datasets are small(ish) and reasonably well organised. On the other hand, R or Matlab are essential if you want to analyse financial options using data extracted from Bloomberg.

B. Are your statistical needs routine – or, at least, standard within a research community? If so, go for a standard package with a convenient interface and easy learning curve or the one most commonly used in your community. The vast majority of users can rely upon whatever is the standard package within their discipline or work environment – from econometrics to epidemiology – and they will get much better support if they stick with the standard choice.

C. How large an initial commitment of time and money do you expect to make? A researcher developing new statistical tools or someone analysing massive databases must expect to make a substantially larger initial investment in learning and/or developing software than someone who simply wants to deploy data analysis in the course of other work.

D. Are you a student or a professional researcher? Partly this is a matter of cost and partly a matter of the reproducibility of research results. Open source and other low cost programs are great for students, but if you are producing research for publication or repeated replication it is essential to have a chain of evidence. R programs can be checked and reproduced for standard datasets, but even here there is a problem with documenting the ways in which more complex datasets have been manipulated.
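One way to picture the "chain of evidence" in point D (an editorial sketch, not something the commenter describes) is a script that fingerprints the raw data file and logs every manipulation applied to it. The `AuditTrail` class, the `drop_missing_income` step, the `income` column, and the file names are all hypothetical, invented here for illustration.

```python
import csv
import hashlib
import json


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


class AuditTrail:
    """Hypothetical helper that records each step applied to a dataset."""

    def __init__(self, source_path):
        # The hash ties every later step to one exact version of the raw file.
        self.log = [{"step": "load", "file": source_path,
                     "sha256": sha256_of(source_path)}]

    def apply(self, name, func, rows):
        """Run func on rows and record the step name and row counts."""
        result = func(rows)
        self.log.append({"step": name, "rows_in": len(rows),
                         "rows_out": len(result)})
        return result

    def save(self, path):
        """Write the accumulated log as JSON."""
        with open(path, "w") as fh:
            json.dump(self.log, fh, indent=2)


def drop_missing_income(rows):
    """Example step: keep only rows with a non-empty 'income' field."""
    return [r for r in rows if (r.get("income") or "").strip()]


def run(data_path, log_path):
    """Load a CSV, apply one documented step, and save the audit log."""
    trail = AuditTrail(data_path)
    with open(data_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    cleaned = trail.apply("drop_missing_income", drop_missing_income, rows)
    trail.save(log_path)
    return cleaned, trail.log
```

Calling `run("survey.csv", "audit.json")` on a hypothetical survey file would leave behind a JSON log that a reviewer can compare against the raw data's hash and the row counts at each step. The log only covers steps that go through `apply`, so it is a convention rather than a guarantee.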

I am primarily an applied econometrician, but even within this field there is a substantial range of packages with large groups of users – from R, Matlab & Gauss through Stata to RATS & E-Views – according to the interests of the users and types of data examined. Personally, I use Stata much of the time but ultimately the choice of package is less important than good practice in managing and analysing data. That is the one thing about the older packages – they force you to document how your data was constructed and analysed, which is as important as, or more important than, the statistical techniques that are used, unless you are purely interested in statistical methods.

Stefan says:

February 25, 2009

Hi. I think this is a very incomplete comparison. If you want to make a real comparison, it should be more complete than this wiki article. And to give a bit of personal feedback:

I know 2 people using Stata (social science), 2 people using Excel (philosophy and economics), several using LabView (engineers), some using R (statistical science, astronomy), several using S-Lang (astronomy), and several using Python (astronomy). By using Python, I mean that they are using the packages they need, which might be numpy, scipy, matplotlib, mayavi2, pymc, kapteyn, pyfits, pytables and many more. This is the main advantage of using a real language for data analysis: you can choose, among the many solutions, the one that fits you best. I also know several people who use IDL and ROOT (astronomy and physics).

I have used IDL, ROOT, PDL, (Excel if you really want to count that in) and Python and I like Python best 🙂
