Welcome!

Pythias (the organization) offers open-source software (Damon) for analyzing multidimensional tabular datasets.  It also offers consulting in the application of Damon to problems in psychometrics and data mining.

Damon predicts missing cells and replaces observed values with "most likely" values.  Developed in the field of psychometrics (measurement of mental traits, especially in the field of education), Damon is also applicable to problems in statistics, prediction, and data analysis.  Some examples:
  • Student Ability.  Measure student math ability based on a test that contains a mix of math, language, and science items.
  • Item Anchoring.  Calculate latent item coordinates from one data set, and apply them to another.
  • Rasch Measurement.  Damon supports Rasch analysis, the unidimensional model that Damon generalizes to multiple dimensions.
  • Predict Responses.  Predict how a consumer will respond to a movie, song, or other product.
  • Clean Data.  Fill missing cells and replace highly unlikely values with likely ones.
  • Compress Data.  Store a dataset not as observed values but as row and column coordinates.
  • Extrapolate Respondent Samples.  Predict how members of a representative sample will respond to questions administered to a non-representative sample.
  • Extrapolate Time-Series Functions.  Predict the next value in a series.
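The Item Anchoring example above can be illustrated with a minimal NumPy sketch.  It is not Damon's own code or API.  It assumes a plain linear model in which each cell is the dot product of a row coordinate vector and a column coordinate vector, and the names anchor_rows, bank_cols, and observed are hypothetical.  Column (item) coordinates from an earlier dataset are held fixed, and each new row's coordinates are solved by least squares from that row's observed cells.

```python
import numpy as np

def anchor_rows(data, col_coords):
    """Solve for row coordinates given fixed (anchored) column coordinates.

    data       : 2-D float array (rows x columns), np.nan marks missing cells
    col_coords : 2-D float array (columns x dimensions), e.g. loaded from a bank
    """
    n_rows, n_dims = data.shape[0], col_coords.shape[1]
    row_coords = np.full((n_rows, n_dims), np.nan)
    for i in range(n_rows):
        seen = ~np.isnan(data[i])
        # A row needs at least as many observed cells as there are dimensions
        if seen.sum() >= n_dims:
            solution = np.linalg.lstsq(col_coords[seen], data[i, seen], rcond=None)
            row_coords[i] = solution[0]
    return row_coords

# Hypothetical noiseless 2-dimensional data with two missing cells
rng = np.random.default_rng(0)
true_rows = rng.normal(size=(5, 2))
bank_cols = rng.normal(size=(8, 2))      # stands in for banked item coordinates
observed = true_rows @ bank_cols.T
observed[0, 3] = np.nan
observed[2, 5] = np.nan

rows = anchor_rows(observed, bank_cols)
estimates = rows @ bank_cols.T           # includes estimates for the missing cells
```

Because the column coordinates are fixed, rows from different datasets land in the same coordinate space, which is what lets two datasets be treated as one.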
Mathematical Features
  • Create Data.  Damon offers a method for creating artificial data in a variety of metrics, adding error, and making cells missing.  This makes it easy to figure out how Damon works, do simulations, and test the validity of Damon apps.
  • Dimensionality.  Damon finds the optimal dimensionality for a given dataset, which is key to specifying the correct model.
  • Missing Data.  Handles datasets containing randomly and non-randomly missing values, including sparse datasets.
  • Mixed Metrics.  Handles columns with different kinds of data:  nominal and multiple choice ('a','b','c'), dichotomous (0,1), ordinal (0,1,2,3), interval (-inf to +inf), ratio (0 to +inf).  Damon can parse a column so that each response is assigned its own dichotomous column, or it can standardize it into any of a variety of standardization metrics.  When the main analysis is done, Damon converts estimates back to the original "observed" value metric for each column.
  • Entity Banking.  Each row and column entity (e.g., persons and items) is assigned spatial coordinates which can be stored in a "bank".  By accessing a bank, the coordinates calculated from one dataset can be applied to another that contains some of the same entities, forcing the two datasets to be analyzable as if they were the same dataset.
  • Standard Errors, etc.  Each cell estimate has a corresponding standard error, expected absolute residual, residual, and fit statistic.  Cell estimates can also be expressed as probabilities and logits.
  • Objectivity.  Most statistical methodologies (correlations, regressions, ANOVA, etc.) yield statistics that are highly sample-dependent.  Change the sample and the statistics change.  Hence, the traditional requirement to use "representative samples".  With Damon, representative samples are not a strict necessity.  When the observed data match the estimates produced by Damon at the optimal dimensionality, called "fitting the model", Damon estimates become "sample-free" or "objective".  That means they will be approximately the same regardless of the sample of persons or items (row entities or column entities) used to calculate them, so long as those persons and items also fit the model.  This is a hugely important property.
  • Resists Overfit.  The bane of important modeling algorithms such as neural networks, "overfit" is what happens when the data is modeled by too many parameters.  Model estimates may match observed data closely but perform poorly in predicting missing cells.  Damon finds the optimal number of parameters to avoid this problem.  Plus, it offers an extra method for definitively removing the effects of overfit from cell estimates.
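As an illustration of the Mixed Metrics item above, the following is a minimal sketch of parsing a nominal column so that each response gets its own dichotomous (0,1) column.  It uses only NumPy and is not Damon's parsing routine.  The function name parse_nominal and the sample responses are hypothetical.

```python
import numpy as np

def parse_nominal(column):
    """Expand one nominal column into one 0/1 column per distinct response."""
    column = np.asarray(column)
    categories = np.unique(column)   # sorted distinct responses
    # Compare every cell against every category, giving a rows x categories table
    dichotomous = (column[:, None] == categories[None, :]).astype(int)
    return categories, dichotomous

responses = ['a', 'c', 'b', 'a', 'c']
cats, parsed = parse_nominal(responses)
# cats holds 'a', 'b', 'c'; each row of parsed has a single 1 in the
# column that matches that row's response.
```

Each parsed column can then be analyzed like any other dichotomous column.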
Usability Features
  • Python-based.  Damon is run through a command-line Python interface.  (Python http://www.python.org is a popular high-level programming language.)  This makes it easy to integrate Damon with other Python packages (e.g., NumPy, SciPy, Matplotlib) and to build complex and powerful Damon-based applications.  Python works well with programs written in other languages, web applications, cloud computing, and so on.  To use Damon, you do need to install Python and its numerical package NumPy on your system (both are free).
  • Cross-Platform.  Damon, like Python, runs on Windows, Mac OS X, and Linux platforms.
  • Easy to Use.  Although Damon does not employ a GUI, it is remarkably easy to learn and use.  You don't need to know Python -- just enough to open the Python interpreter and type a few simple commands.  If you have no programming background, Damon (and Python) will turn you into a programmer without you knowing it.  You also don't need to be a statistician or mathematician.  While Damon contains deep math, from a user perspective it is simple.  Load observations.  Get estimates back.
  • Well Documented.  In addition to the tutorials on this website, each Damon method and function has extensive documentation accessed through Python's help() command.
  • Good at Labels.  Row and column labels are important in Damon as they serve many purposes.
  • Input Formats.  Data can be supplied as text files, Damon "DataDicts" (Damon's internal data structure), or HDF5 files (for large datasets).  Data is expected to be in row/column tabular format.
  • Output Formats.  Damon can output any of a large variety of reports as text files, HDF5 files, and Python dictionaries.
  • Large Datasets.  When the PyTables option is used in Damon, it becomes possible, within limits, to analyze fairly large datasets that might otherwise not fit in memory.  What is fairly large?  Say, a million rows by a thousand columns (though even this is becoming small by modern standards).
  • Open-Source.  Damon is available under the Apache License 2.0.  That means you have access to the source code and can modify it, distribute it, post it as part of a larger web application -- anything you like -- for free in perpetuity with minimal requirements.
Limitations
  • Tabular.  Damon currently handles only tabular data designs (row entities x column entities).  This means it only does 2-facet analysis, also called "2-way" and "2-tensor".
  • Dataset Size.  Damon handles pretty large datasets, but not at the industrial scale used by companies like Netflix, Google, or Yahoo.  This is not a limitation of the methodology, which is extremely scalable, but of the program.
  • No Graphical User Interface.  Damon currently supports only command-line and scripting usage.  While this provides more power and flexibility than a GUI ever can, it does pose a steeper learning curve.
Methodology
Damon implements a generalized alternating least squares (ALS) algorithm that decomposes a matrix into arrays of row and column coordinates of a specified rank, or dimensionality, which is determined using Rasch objectivity criteria.  Data arrays may be rectangular, and they may have an arbitrarily large number of missing cells.  Data may be nominal, dichotomous, ordinal, interval, or ratio, and a given dataset may contain a mix of data types and metrics.  The optimal number of dimensions (dimensionality) is found by using Damon to predict pseudo-missing cells (observed cells temporarily treated as missing) for each of a specified range of dimensionalities and selecting the dimensionality that:  a) maximizes the accuracy of the predictions, and b) maximizes the stability of the coordinate structure.  To the degree Damon cell estimates match the original observations under these conditions, they are said to be "objective" -- likely to reproduce across different samples of row and column entities and unlikely to suffer from the problem of "overfit".  Damon supports the editing of datasets to optimize their objectivity.
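The procedure described above can be sketched in plain NumPy.  This is a simplified illustration under stated assumptions, not Damon's implementation: it uses an unregularized linear ALS on interval data, it scores each dimensionality only by prediction accuracy on pseudo-missing cells (criterion a) and ignores coordinate stability (criterion b), and the names als and best_dimensionality are hypothetical.

```python
import numpy as np

def als(data, n_dims, n_iters=30, seed=0):
    """Alternating least squares on a matrix with np.nan for missing cells.

    Returns row coordinates R (rows x n_dims) and column coordinates
    C (columns x n_dims) such that R @ C.T approximates the observed cells.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape
    seen = ~np.isnan(data)
    R = rng.normal(size=(n_rows, n_dims))
    C = rng.normal(size=(n_cols, n_dims))
    for _ in range(n_iters):
        # Hold C fixed; solve each row's coordinates from its observed cells
        for i in range(n_rows):
            m = seen[i]
            if m.any():
                R[i] = np.linalg.lstsq(C[m], data[i, m], rcond=None)[0]
        # Hold R fixed; solve each column's coordinates the same way
        for j in range(n_cols):
            m = seen[:, j]
            if m.any():
                C[j] = np.linalg.lstsq(R[m], data[m, j], rcond=None)[0]
    return R, C

def best_dimensionality(data, dims, frac=0.1, seed=0):
    """Pick the dimensionality that best predicts pseudo-missing cells."""
    rng = np.random.default_rng(seed)
    seen = ~np.isnan(data)
    hide = seen & (rng.random(data.shape) < frac)   # cells made pseudo-missing
    train = np.where(hide, np.nan, data)
    errors = {}
    for d in dims:
        R, C = als(train, d)
        estimates = R @ C.T
        # Root mean squared error on the hidden cells only
        errors[d] = np.sqrt(np.mean((estimates[hide] - data[hide]) ** 2))
    return min(errors, key=errors.get), errors

# Hypothetical simulation: 3-dimensional data plus random error
rng = np.random.default_rng(1)
truth = rng.normal(size=(60, 3)) @ rng.normal(size=(40, 3)).T
noisy = truth + rng.normal(scale=0.2, size=truth.shape)
best, errors = best_dimensionality(noisy, dims=range(1, 6))
print(best, errors)
```

Too few dimensions leave real structure unmodeled, and too many start fitting the error, so the prediction error on the hidden cells is typically lowest near the true dimensionality.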

Damon is classified in the field of psychometrics as a multidimensional generalization of the Rasch model with similar objectivity properties.  In the field of data mining, it is a generalized form of alternating least squares matrix decomposition.  It is similar to decomposition algorithms used successfully in the Netflix Prize competition, which concluded in 2009.