Welcome!

Important Update, October 31, 2023

This website and the version of Damon it describes were last current in 2020. Since then, I have been hard at work building related applications and reworking Damon from scratch. When released, the new Damon will:

- Be a Python platform for running a variety of psychometric software tools
- Run on a new data utility that extends Polars
- Generalize the matrix factorization algorithm described here to include multi-faceted (multi-tensor) and multi-spatial designs

In the meantime, I have removed most of the resources on this website to start the transition to the new Damon, which will probably migrate to a new website.

I don't have a completion date, but progress is steady.

-- Mark Moulton

****************************************************

Description

Pythias Consulting (meaning me, Mark Moulton) offers open-source software (Damon) for analyzing multidimensional tabular datasets. It also offers consulting in the application of Damon to problems in psychometrics. This website is about the legacy version of Damon, written and maintained from 2010 to 2018, with important updates in 2020 and 2023.

Damon predicts missing cells and replaces observed values with "most likely" values. Developed in the field of psychometrics (measurement of mental traits, especially in the field of education), Damon is also applicable to problems in statistics, prediction, and data analysis. Some examples:

- Student Ability. Measure student math ability based on a test that contains a mix of math, language, and science items.
- Item Anchoring. Calculate latent item coordinates from one data set, and apply them to another.
- Rasch Measurement. Damon supports Rasch analysis -- the unidimensional counterpart of Damon.
- Predict Responses. Predict how a consumer will respond to a movie, song, or other product.
- Clean Data. Fill missing cells and replace highly unlikely values with likely ones.
- Compress Data. Store a dataset not as observed values but as row and column coordinates.
- Extrapolate Respondent Samples. Predict how members of a representative sample will respond to questions administered to a non-representative sample.

Mathematical Features

- Create Data. Damon offers a method for creating artificial datasets in a variety of metrics, adding error, and making cells missing. This makes it easy to figure out how Damon works, do simulations, and test the validity of Damon apps.
- Dimensionality. Damon is very good at finding the optimal dimensionality for a given dataset, key to specifying the correct model.
- Missing Data. Handles datasets containing randomly and non-randomly missing values, including sparse datasets.
- Mixed Metrics. Handles columns with different kinds of data: nominal and multiple choice ('a','b','c'), dichotomous (0,1), ordinal (0,1,2,3), interval (-inf to +inf), ratio (0 to +inf). For analysis, Damon converts the data into any of a variety of standardization metrics, then converts estimates back to the original "observed" metric for each column, if desired. It can also return values in a probability or logit metric.
- Entity Banking. Each row and column entity (persons and items) is assigned spatial coordinates which can be stored in a "bank". By accessing a bank, the coordinates calculated from one dataset can be applied to another that contains some of the same entities, so that multiple datasets can be analyzed as if they were a single dataset.
- Standard Errors, etc. Each cell estimate has a corresponding standard error, expected absolute residual, residual, and fit statistic. Cell estimates can also be expressed as probabilities and logits.
- Objectivity. Most statistical methodologies (correlations, regressions, ANOVA, etc.) yield statistics that are highly sample-dependent. Change the sample and the statistics change. Hence, the traditional requirement to use "representative samples". With Damon, representative samples are not a strict necessity. When the observed data match the estimates produced by Damon at the optimal dimensionality, called "fitting the model", Damon estimates become "sample-free" or "objective". That means they will be approximately the same regardless of the sample of persons or items (row entities or column entities) used to calculate them, so long as those persons and items also fit the model. This is a hugely important property.
- Resists Overfit. The bane of important modeling algorithms such as neural networks (though much less so, nowadays), "overfit" is what happens when data is modeled with too many parameters. Model estimates may match observed data closely but perform poorly in predicting missing cells. Damon finds the optimal number of parameters to avoid this problem. Plus, it offers an extra method for definitively removing the effects of overfit from cell estimates.
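
The Entity Banking idea above (and the Item Anchoring example earlier) can be illustrated with a short NumPy sketch. This is not Damon's code or API. The function name anchor_rows, the simulated "bank", and the use of plain least squares are all assumptions made for illustration. The sketch holds previously calculated column (item) coordinates fixed and solves for the row (person) coordinates of a new dataset from whatever cells each person has observed.

```python
import numpy as np

def anchor_rows(data, banked_cols):
    """Estimate row coordinates while holding banked column coordinates fixed.

    data:        2-D array of observations, with NaN marking missing cells.
    banked_cols: (n_columns x rank) coordinates taken from an earlier analysis.
    """
    n_rows, rank = data.shape[0], banked_cols.shape[1]
    rows = np.full((n_rows, rank), np.nan)
    for i in range(n_rows):
        seen = ~np.isnan(data[i])
        if seen.sum() >= rank:  # need at least `rank` observed cells to solve
            rows[i] = np.linalg.lstsq(banked_cols[seen], data[i, seen], rcond=None)[0]
    return rows

# Simulated illustration (all values hypothetical).
rng = np.random.default_rng(0)
bank = rng.normal(size=(12, 2))          # banked 2-D coordinates for 12 items
new_persons = rng.normal(size=(5, 2))    # unknown in practice; simulated here
new_data = new_persons @ bank.T          # noise-free, to keep the example clear
new_data[rng.random(new_data.shape) < 0.25] = np.nan  # make some cells missing

rows = anchor_rows(new_data, bank)
estimates = rows @ bank.T                # estimates for every cell, missing or not
```

Because the item coordinates come from the bank rather than from the new dataset, the new persons are placed in the same space as the persons in the original analysis, which is what makes the two datasets comparable.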

Usability Features

- Python-based. Damon is run through a command-line Python interface. (Python, http://www.python.org, is a popular programming language.) This makes it easy to integrate Damon with other Python packages (NumPy, SciPy, Matplotlib, pandas, etc.) and to build complex and powerful Damon-based applications. Python works well with programs written in other languages, web applications, cloud computing, and so on. To use Damon, you do need to install Python and its numerical package NumPy onto your system (both are free).
- Cross-Platform. Damon, like Python, runs on Windows, Mac OS X, and Linux platforms.
- Easy to Use. Although Damon does not employ a GUI, it is remarkably easy to learn and use. You don't need to know Python -- just enough to open the Python interpreter and type a few simple commands. If you have no programming background, Damon (and Python) will turn you into a programmer without you knowing it. You also don't need to be a statistician or mathematician. While Damon contains deep math, from a user perspective it is simple. Load observations. Get estimates back.
- Well Documented. In addition to the tutorials on this website, each Damon method and function has extensive documentation accessed through Python's help() command.
- Good at Labels. Row and column labels are important in Damon as they serve many purposes. (Note: Damon's data utility was written before Pandas became the accepted way to deal with labeled data.)
- Input Formats. Data is generally read from csv files, plus a few other formats. Data is expected to be in row/column tabular format.
- Output Formats. Damon can output any of a large variety of reports as csv files.
- Open-Source. Damon is available under the Apache License 2.0. That means you have access to the source code and can modify it, distribute it, post it as part of a larger web application -- anything you like -- for free in perpetuity with minimal requirements.

Limitations

- Tabular. Damon currently handles only tabular data designs (row entities x column entities). This means it only does 2-facet analysis, also called "2-way" and "2-tensor".
- Dataset Size. Damon handles pretty large datasets, but not what is called "big data" in the Machine Learning world. This is not a limitation of the methodology, which is extremely scalable, but of available RAM.
- No Graphical User Interface. Damon currently supports only command-line and scripting usage. While this provides more power and flexibility than a GUI ever can, it does pose a steeper learning curve.

Methodology

Damon implements a generalized alternating least squares (ALS) algorithm for decomposing matrices into arrays of row and column coordinates. The rank, or dimensionality, of these arrays is determined using Rasch-like objectivity criteria. Data arrays may be rectangular, and they may have an arbitrarily large number of missing cells. Data may be nominal, dichotomous, ordinal, interval, or ratio, and a given dataset may contain a mix of data types and metrics. The optimal number of dimensions (dimensionality) is found by using Damon to predict pseudo-missing cells for each of a specified range of dimensionalities and selecting the dimensionality that a) maximizes the accuracy of the predictions, and b) maximizes the stability of the coordinate structure. To the degree Damon cell estimates match the original observations under these conditions, they are said to be "objective" -- likely to reproduce across different samples of row and column entities and unlikely to suffer from the problem of "overfit". Damon supports the editing of datasets to optimize their objectivity.
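
The general approach described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Damon's implementation: the function names als_factorize and pick_dimensionality, the small ridge term, the 10% holdout, and the simulated data are all hypothetical. It uses ordinary ALS on interval data only, and it selects dimensionality by prediction accuracy on pseudo-missing cells alone. The coordinate-stability criterion is not included.

```python
import numpy as np

def als_factorize(data, rank, n_iter=50, ridge=1e-6, seed=0):
    """Fit data ~= rows @ cols.T by alternating least squares, skipping NaN cells."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape
    observed = ~np.isnan(data)
    rows = rng.normal(size=(n_rows, rank))
    cols = rng.normal(size=(n_cols, rank))
    penalty = ridge * np.eye(rank)  # small ridge keeps each solve well-posed

    for _ in range(n_iter):
        # Hold column coordinates fixed; solve each row from its observed cells.
        for i in range(n_rows):
            seen = observed[i]
            if seen.any():
                a = cols[seen]
                rows[i] = np.linalg.solve(a.T @ a + penalty, a.T @ data[i, seen])
        # Hold row coordinates fixed; solve each column the same way.
        for j in range(n_cols):
            seen = observed[:, j]
            if seen.any():
                a = rows[seen]
                cols[j] = np.linalg.solve(a.T @ a + penalty, a.T @ data[seen, j])
    return rows, cols

def pick_dimensionality(data, ranks, holdout_frac=0.1, seed=0):
    """Return (best_rank, errors), scoring each rank by RMSE on pseudo-missing cells."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(data))
    n_hold = max(1, int(holdout_frac * len(observed)))
    held = observed[rng.choice(len(observed), size=n_hold, replace=False)]
    train = data.copy()
    train[held[:, 0], held[:, 1]] = np.nan  # make these cells pseudo-missing

    errors = {}
    for rank in ranks:
        rows, cols = als_factorize(train, rank, seed=seed)
        estimates = rows @ cols.T
        resid = estimates[held[:, 0], held[:, 1]] - data[held[:, 0], held[:, 1]]
        errors[rank] = float(np.sqrt(np.mean(resid ** 2)))
    return min(errors, key=errors.get), errors

# Simulated example: 60 x 40 data built from 3 dimensions, plus noise, 30% missing.
rng = np.random.default_rng(42)
true_rows = rng.normal(size=(60, 3))
true_cols = rng.normal(size=(40, 3))
data = true_rows @ true_cols.T + rng.normal(scale=0.3, size=(60, 40))
data[rng.random(data.shape) < 0.3] = np.nan

best_rank, errors = pick_dimensionality(data, ranks=range(1, 7))
```

Too few dimensions leave real structure unmodeled, so prediction error on the pseudo-missing cells is high. Too many dimensions start fitting noise, so the error tends to rise again. The dimensionality with the lowest error is the one this sketch keeps.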

Damon is classified in the field of psychometrics as a multidimensional generalization of the Rasch model with similar objectivity properties. In the field of data mining, it is a generalized form of alternating least squares matrix decomposition. It is similar to decomposition algorithms used successfully in the famous Netflix competition that concluded in 2009.
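
For readers unfamiliar with the Rasch model mentioned above, here is a brief Python sketch of its standard dichotomous form, in which the probability of a correct response depends only on the difference between a person's ability and an item's difficulty, both expressed in logits. The function names and example values are hypothetical, and the sketch shows only the unidimensional model. It does not show how Damon generalizes it.

```python
import numpy as np

def rasch_probability(ability, difficulty):
    """P(success) under the dichotomous Rasch model; inputs are in logits."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def logit(p):
    """Convert a probability to log-odds (the logit metric)."""
    return np.log(p / (1.0 - p))

abilities = np.array([-1.0, 0.0, 2.0])       # three persons (hypothetical values)
difficulties = np.array([-0.5, 0.0, 1.5])    # three items (hypothetical values)

# 3 x 3 table of success probabilities: one row per person, one column per item.
probs = rasch_probability(abilities[:, None], difficulties[None, :])
```

Converting the probabilities back with logit(probs) recovers the ability-minus-difficulty differences, which is the sense in which probabilities and logits are two metrics for the same estimate.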
