Software and Programming

To get started writing code to work with data, you'll need to take care of a few preliminaries before you can dive in:

identify a platform you'll use to do your work
obtain software for developing code to process and analyze data
know and apply best practices

identify a platform

Many people are most comfortable using their local laptop or desktop. If you plan to do this, use the links for open source software in the section below to download the packages, then install them according to the directions. For proprietary software, use the links provided following purchase. If you are performing work in service of university business, including sponsored research, please use your university-issued computer rather than a personal device. If you do not have a university-issued computer, please connect to a remote or virtual university system, such as one of the resources listed below, instead.

Keep in mind that data does not have to be stored on the same computer where it's processed. For most cases, a good solution is to store data in Box and access it from your computer through Box Drive where Box will appear as a drive letter (or location on Mac). For a higher degree of security, do not sync files to your local computer; you can still access them without syncing as long as you have an internet connection.

If you use DIT's Virtual Workspace or OACS' Virtual Desktop, many statistical packages, including both open source and proprietary software, are available for free. Campus virtual environments are an ideal place to get started with programming due to the variety of software easily available. If you are working with very large data sets, processor and RAM capacity may be limited in virtual environments. In addition, be sure to save often in case you are timed out due to inactivity. If you choose the Virtual Workspace, do not store data on it; rather, use Box or a network storage location and set up a drive mapping via TerpDrives/Kumo.

BSOS' BSWIFT cluster and Zaratan, the university's high performance computing cluster, are available for analyses using large data sets and/or computationally or memory intensive processing. Consult their documentation for the software platforms that are available on these systems and how to request other applications.

obtain software

If you are using a local computer, you will need to obtain and install the software on your device. Open source statistical software such as R and Python can be obtained for free. An integrated development environment (IDE) is not required to run them but is strongly recommended. An IDE is a commonly used tool that makes developing code easier by providing additional functionality such as syntax completion and point-and-click menus for important features.

R and RStudio IDE*
Python and Spyder*

Many data analysis and modeling packages, including SAS, Stata, SPSS, MATLAB, JMP, and NVivo, can be obtained for free or at a discount through Terpware, the university's software licensing office. If you need software that is not available through Terpware and does not appear in DIT's catalog, either as approved software or as software not recommended for use, contact your local IT liaison, business officer, or Procurement before acquiring it. Even free software must be vetted if it requires accepting terms and conditions.

*Many other IDEs for R and Python are available on the web.

If you're new to using statistical software and/or need some help getting started on your chosen package, please check out the FAQ on learning to program for resources and suggestions.

best practices

Regardless of which platform or computing system you use, constructing and running code in an organized manner lets both your collaborators and your future self understand the processing steps, how various files relate to one another, and the provenance of the data sets and results that your code produces. Code management and data management are often closely related, and there is usually no one "right" code structure for any given situation. A non-exhaustive list of good coding principles that apply in most situations includes:

Write a little bit of code at a time and test it to make sure that each part works and that you get the expected result.
- Most IDEs will feature a log or other mechanism that shows whether code execution generated errors, warnings, and other messages.
- Check the number of observations before and after when merging, dropping, or adding rows.
- Check distributions (and note missing data) when creating, combining, or recoding variables.
Within a file, write code in the order it has to be run so that the whole program could run on its own once it's done.
Name variables, data sets, and other objects with fairly short but descriptive names (avoid names like x, temp, var1, ugh98, etc.).
Add comments in the code to document what each section does.
Use your software's dedicated commenting character(s) to block out sections that you may not want to run but don't want to delete.
Keep code files (scripts) fairly short so that each has one major focus; name the files accordingly.
Consider adding a "README" text document to the folder where you store code that describes what each program does, the order in which the programs should be run, and how they relate to each other and to the data, results, and other files in the project.
Consider using version control to keep track of changes and versions of code files; it's especially important when writing code with collaborators but is very helpful for individuals too.
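
As one illustration of the "write a little, then test it" principle above, here is a minimal Python sketch that checks the number of observations around a merge and then checks distributions (including missing values) after creating a recoded variable. It uses only the standard library; the data, the column names (student_id, age, major, age_group), and the helper functions merge_majors and summarize are hypothetical and invented for this example, not part of any particular package.

```python
from collections import Counter

# Hypothetical example data: one row per student, plus a lookup of majors.
students = [
    {"student_id": 1, "age": 19},
    {"student_id": 2, "age": None},  # missing age
    {"student_id": 3, "age": 22},
    {"student_id": 4, "age": 19},
]
majors = {1: "SOCY", 2: "ECON", 3: "PSYC"}  # no entry for student 4


def merge_majors(rows, major_lookup):
    """Left-join majors onto student rows, keeping every student row."""
    return [{**row, "major": major_lookup.get(row["student_id"])} for row in rows]


def summarize(rows, column):
    """Count each value of a column and report missing values separately."""
    values = [row[column] for row in rows]
    counts = Counter(v for v in values if v is not None)
    n_missing = sum(v is None for v in values)
    return counts, n_missing


# Step 1: merge, then confirm the number of observations did not change.
n_before = len(students)
merged = merge_majors(students, majors)
n_after = len(merged)
assert n_after == n_before, f"row count changed: {n_before} -> {n_after}"
print(f"rows before merge: {n_before}, after merge: {n_after}")

# Step 2: check the distribution of the merged-in variable and note missing data.
major_counts, major_missing = summarize(merged, "major")
print("major counts:", dict(major_counts), "| missing:", major_missing)

# Step 3: recode age into groups, then check the new variable the same way.
for row in merged:
    age = row["age"]
    if age is None:
        row["age_group"] = None  # keep missing values missing
    elif age < 21:
        row["age_group"] = "under 21"
    else:
        row["age_group"] = "21 and over"

group_counts, group_missing = summarize(merged, "age_group")
print("age_group counts:", dict(group_counts), "| missing:", group_missing)
```

Each step is followed immediately by a check, so a problem such as an unexpected change in row count or an unmatched record surfaces right where it was introduced rather than much later in the analysis. Statistical packages and data libraries generally offer their own equivalents of these checks.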
