Python for Big Data

PHR760 Python for Big Data

This is a 3-credit, hands-on course in Big Data offered in the spring semester. For more information, contact Prof. Robert Lodder at Lodder @ g.uky.edu. Because students come with a wide variety of programming backgrounds, ranging from those new to Python to Python experts, this course is self-paced. By the end of the second week, each student will set individual learning goals for the semester, chosen from the list of potential programming projects and based on their own abilities.

Python is an open-source, general-purpose, multi-paradigm scripting language. It is designed to emphasize code readability, with a clean syntax and high-level data types. It is suited to interactive work and quick prototyping, yet powerful enough for writing large applications. Python has a large number of well-written modules available for everything from abstract syntax trees to ZIP file manipulation. Its ecosystem features an extensive set of tools, including PyPy, an alternative implementation with a JIT compiler, and useful IDEs like Spyder.
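
As a small illustration of the "abstract syntax trees to ZIP file manipulation" range mentioned above, the sketch below uses only two standard-library modules, ast and zipfile. The expression, file name, and file contents are made up for the example; it is a minimal demonstration, not course material.

```python
import ast
import io
import zipfile

# Parse a small (made-up) statement into an abstract syntax tree.
tree = ast.parse("total = price * quantity + 1")

# Walk the tree and record the type of every node it contains.
node_types = [type(node).__name__ for node in ast.walk(tree)]

# Collect the variable names that appear anywhere in the statement.
names = sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})

# Write a text file into an in-memory ZIP archive (hypothetical file name).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("notes.txt", "hello from Python")

# Reopen the same archive for reading and pull the file back out.
with zipfile.ZipFile(buffer) as archive:
    members = archive.namelist()
    contents = archive.read("notes.txt").decode("utf-8")

print(names)
print(members, contents)
```

Both tasks take only a few lines each, which is the sense in which the standard library is often described as "batteries included".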

Python is easy for beginners to learn, and is widely used in many scientific areas for data exploration. This course is an introduction to the Python programming language for students without prior programming experience. We cover data types, control flow, object-oriented programming, and graphical user interface-driven applications. The examples and problems used in this course are drawn from diverse areas such as text processing, simple graphics creation and image manipulation, HTML and web programming, and genomics. Finally, students will be introduced to NumPy, SciPy and Biopython.

NumPy and SciPy are open-source add-on modules to Python that provide common mathematical and numerical routines in pre-compiled, fast functions. These are growing into highly mature packages that provide functionality that meets, and sometimes exceeds, that associated with common commercial software like MATLAB. The NumPy (Numeric Python) package provides basic routines for manipulating large arrays and matrices of numeric data. The SciPy (Scientific Python) package extends the functionality of NumPy with a substantial collection of useful algorithms, like minimization, Fourier transformation, regression, and other applied mathematical techniques.

Biopython is a set of freely available tools for biological computation written in Python by an international team of developers. It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics.
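
To give a feel for the NumPy and SciPy routines named above (array arithmetic, regression, Fourier transformation, minimization), here is a minimal sketch. It assumes NumPy and SciPy are installed; the data, the 4 Hz test signal, and the quadratic being minimized are all invented for the illustration.

```python
import numpy as np
from scipy import optimize

# NumPy arrays: arithmetic applies to every element without an explicit loop.
x = np.linspace(0.0, 1.0, 5)
y = 2.0 * x + 1.0  # made-up data lying exactly on a straight line

# Regression: least-squares fit of a degree-1 polynomial to the data.
slope, intercept = np.polyfit(x, y, 1)

# Fourier transformation: a 4 Hz sine wave sampled at 64 Hz for one second.
sample_rate = 64
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 4 * t)
spectrum = np.abs(np.fft.rfft(signal))
frequencies = np.fft.rfftfreq(sample_rate, d=1.0 / sample_rate)
peak_hz = frequencies[np.argmax(spectrum)]  # frequency with the largest amplitude

# Minimization: find the p that minimizes (p - 3)^2, starting from p = 0.
result = optimize.minimize(lambda p: (p[0] - 3.0) ** 2, x0=[0.0])

print(slope, intercept)
print(peak_hz)
print(result.x[0])
```

The fitted slope and intercept should recover the line used to generate the data, the spectral peak should sit at the frequency of the test signal, and the minimizer should land close to 3.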