Data Science

Distributed Computing Framework

  • Apache Spark : a fast and general engine for large-scale data processing
    • Spark SQL : Spark's module for working with structured data
    • MLlib : Apache Spark's scalable machine learning library
    • GraphX : Apache Spark's API for graphs and graph-parallel computation
  • Apache Hadoop : a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
    • Apache HBase : an open-source, distributed, versioned, non-relational database
    • Apache Hive : data warehouse software that facilitates querying and managing large datasets residing in distributed storage
    • Apache Phoenix : a high-performance relational database layer over HBase for low-latency applications
    • Apache Pig : a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

General-Purpose Library

  • pandas : an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language

Data Mining

  • Weka : a collection of machine learning algorithms for data mining tasks
  • RapidMiner
  • KNIME
  • ELKI : Environment for Developing KDD-Applications Supported by Index-Structures

Data Visualization

  • gnuplot : a portable command-line-driven graphing utility for Linux, OS/2, MS Windows, OS X, VMS, and many other platforms
  • VTK : the Visualization Toolkit, an open-source, freely available software system for 3D computer graphics, image processing and visualization
  • MathGL : a library for making high-quality scientific graphics under Linux and Windows
  • PLplot : a cross-platform software package for creating scientific plots
  • OxyPlot : a cross-platform plotting library for .NET
  • Google Charts
  • JFreeChart : a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications
  • RGraph : an open source HTML5 charts library for interactive charts using JavaScript and the HTML5 canvas tag
  • Raphael : a small JavaScript library that should simplify your work with vector graphics on the web
  • Graphviz : open source graph visualization software
  • Gephi : an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs
  • Cytoscape : an open source software platform for visualizing complex networks and integrating these with any type of attribute data
  • Tulip : an information visualization framework dedicated to the analysis and visualization of relational data