First Analysis

Create a New Script
In IDLE, open the damon1.templates.blank module.  Save it to a directory of your choice as my_script4.py.

Accessing Damon Object Methods
To do a Damon analysis, the main thing is to understand how Damon's object-oriented command syntax works.  Say I want to run the coord() method.  coord() is an "attribute" of my_obj -- the Damon object you defined.  Run it by typing:

>>> my_obj.coord(...)

(Enter desired parameters in the parentheses.)

In this syntax, you don't define a variable:

>>> X = my_obj.coord(...)         # This is wrong

X will just come back as None if you do.  When you run my_obj.coord(...), the program grabs the data and parameters accumulated in the my_obj Damon object.  It works on that data using the coord() method, calculates coordinates, and assigns the output to my_obj as an attribute named coord_out.  Access it by typing:

>>> my_obj.coord_out 
 
Since coord_out contains a lot of arrays, you will want to be more specific in accessing outputs (see the coord() documentation under "Returns" for a list of contents).  To get an array of person (Facet 0) coordinates, type:

>>> my_obj.coord_out['fac0coord']['coredata']
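
If you forget what a particular output contains, you can also list its keys directly.  (This assumes the output behaves like a Python dictionary, which the bracket syntax above suggests -- the "Returns" section of the documentation is the authoritative list.)

>>> my_obj.coord_out.keys()
>>> my_obj.coord_out['fac0coord'].keys()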

Note that the object-oriented syntax for accessing an object method is different from the syntax for calling a function.  For example, the residuals() function lives in the tools.py module.  It is not attached to any specific object (other than the tools module itself).  You would type:

model_resid = tools.residuals(...)         # Here we do assign a variable name to the function outputs

Or:
created_data = core.create_data(...)        # Remember this?

But with Damon, for the most part you will be calling Damon object methods:

my_obj.method1(...)
my_obj.method2(...)
my_obj.method3(...)

Another important point is that order matters when running Damon methods.  You run my_obj.standardize() before my_obj.coord(), which you run before my_obj.base_est(), which you run before my_obj.base_resid().  This makes sense, right?  Calculating coordinates depends on standardized data.  Calculating estimates depends on having coordinates.  Calculating residuals requires having estimates.  Damon's template.py contains a nice cheat-sheet giving the usual order of operation:

Methods in Order of Operation

Cheatsheet of Damon Methods
---------------------------
In (approximate) order of application:

d = create_data()['data']       =>  Create artificial Damon objects
d = TopDamon()                  =>  Create a Damon object from an existing dataset
d = Damon(data,'array',...)     =>  More generic low-level way to create a Damon object
d.merge_info()                  =>  Merge row or column info into labels
d.extract_valid()               =>  Extract only valid rows/cols
d.pseudomiss()                  =>  Create index of pseudo-missing cells
d.score_mc()                    =>  Score multiple-choice data
d.subscale()                    =>  Append raw scores for item subscales
d.parse()                       =>  Parse response options to separate columns
d.standardize()                 =>  Convert all columns into a standard metric
d.rasch()                       =>  Rasch-analyze data (in place of coord())
d.coord()                       =>  Calculate row and column coordinates
d.objectify()                   =>  Maximize objectivity of specified columns (in place of coord)
d.base_est()                    =>  Calculate cell estimates
d.base_resid()                  =>  Get residuals (observation - estimate)
d.base_ear()                    =>  Get expected absolute residuals
d.base_se()                     =>  Get standard errors for all cells
d.equate()                      =>  Equate two datasets using a bank
d.base_fit()                    =>  Get cell fit statistics
d.fin_est()                     =>  Get final estimates, original metric
d.est2logit()                   =>  Convert estimates to logits
d.item_diff()                   =>  Get probability-based item difficulties
d.fillmiss()                    =>  Fill missing cells of original dataset
d.fin_resid()                   =>  Get final cell residuals, original metric
d.fin_fit()                     =>  Get final cell fit, original metric
d.restore_invalid()             =>  Restore invalid rows/cols to output arrays
d.summstat()                    =>  Get summary row/column/range statistics
d.bank()                        =>  Save row/column coordinates in "bank" file
d.export()                      =>  Export specified outputs as files


Bear in mind you won't need to run all these methods; one or two will generally suffice.  But they must be run in order.  If your methods are out of order, Damon returns an error message telling you what needs to be run first.
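
For example, a bare-bones analysis usually boils down to a short sequence like the one below.  This is only a sketch of the shape of a typical run, using names from the cheat-sheet; the parameters you actually need inside each set of parentheses depend on your data:

d = TopDamon(...)                       # build a Damon object from an existing dataset
d.standardize(...)                      # put all columns in a standard metric
d.coord(...)                            # calculate row and column coordinates
d.base_est(...)                         # calculate cell estimates
d.export(outputs = ['base_est_out'])    # save the estimates to disk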

A Python Fine-Point:  Indenting
Before getting into the program, I need to mention Python's indenting syntax.  Unlike other programming languages like C, which rely on braces {...} to block sections of code, Python blocks code using indents, as in this if-else statement:

Indents

if x is y:
    print 'Blah'
else:
    print 'Blih'


In Damon, you won't need to use indentation blocks much since all the methods are run at the same level.  However, you will be pasting in function and method calls, and you may worry about making the arguments line up to meet Python's indentation requirements.  Don't.  In Python, as long as data or arguments are contained in parentheses (...), brackets [...], or braces {...}, you can spread things out across multiple lines without worrying about indents:

Indenting not necessary in parentheses

# This...
d = core.create_data(nfac0 = 100,
                     nfac1 = 80,
                     ndim = 3,
                     )

# is the same as this...
d = core.create_data(nfac0 = 100,
         nfac1 = 80,
         ndim = 3,
         )

# Is the same as this...
d = core.create_data(nfac0 = 100,nfac1 = 80,ndim = 3)


Now, let's perform our first real Damon analysis.

Type the following program in my_script4.py
(The fingers remember what the brain forgets.)

To make it easier to read, the program below relies on a lot of default parameters and omits the line-by-line documentation.  But you might want to use Python's help() feature to copy and paste the documented code for the relevant functions and methods (see Documentation).  That way, you can see all the options for each Damon method right in your my_script4.py program, which will demystify things a bit.
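
For instance, to pull up the coord() documentation you could type something like the following in the IDLE shell.  (I am assuming here that the Damon class is exposed as damon1.core.Damon; adjust the path to match your installation.)

>>> import damon1.core as dmn
>>> help(dmn.Damon.coord)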

Some of this program will seem obscure, but it contains the essential elements of a real-world Damon analysis.  Get the hang of this and the rest of Damon is footnotes.

my_script4.py

import os
import sys

import numpy as np
import numpy.random as npr
import numpy.linalg as npla
import numpy.ma as npma

try:
    import matplotlib.pyplot as plt
except ImportError:
    pass    # the plotting commands near the end of this script require matplotlib

import damon1 as damon1
import damon1.core as dmn
import damon1.tools as dmnt

############################################################################
# Use create_data() to create 'data' and 'model' Damon objects
d = dmn.create_data(nfac0 = 100,
                    nfac1 = 80,
                    ndim = 3,
                    seed = 1,
                    noise = 5,
                    p_nan = 0.20,
                    )

# Get data and model objects
data = d['data']
model = d['model']

#############################################################################
# From here, we just tell the "data" Damon object what to do, and it does it.

# Run Damon methods
data.coord(ndim = [range(1,6)])
data.base_est()
data.base_resid()
data.base_ear()
data.base_se()
data.summstat(getstats = ['Mean','SD','SE','Corr','Count','Min','Median','Max'],
              getrows = 'SummWhole',
              getcols = {'Get':'NoneExcept','Labels':'key','Cols':[1,2,3,4,5]}
              )
data.export(outputs = ['base_est_out'])


# Print out results
np.set_printoptions(precision=2,suppress=True)      # Makes printouts easier to read
print '\nEstimates =\n',data.base_est_out['coredata']
print '\nResiduals =\n',data.base_resid_out['coredata']
print '\nExpected Absolute Residuals =\n',data.base_ear_out['coredata']
print '\nStandard Errors =\n',data.base_se_out['coredata']
print '\nSummary statistics over all rows =\n',data.summstat_out['row_ents']['collabels']
print data.summstat_out['row_ents']['coredata']
print '\nSummary statistics for items 1 - 5 =\n',data.summstat_out['col_ents']['collabels']
print data.summstat_out['col_ents']['coredata']


# Compare Damon estimates with "true" model values using dmnt.residuals tool
model_resid = dmnt.residuals(observed = model.coredata,
                             estimates = data.base_est_out['coredata']
                             )
model_rmsr = np.round(np.sqrt(np.mean(model_resid**2)),2)

# Compare Damon estimates with "observed" values
obs_resid = data.base_resid_out['coredata']
valid = np.where(obs_resid != data.base_resid_out['nanval'])
valid_obs_resid = obs_resid[valid]
obs_rmsr = np.round(np.sqrt(np.mean(valid_obs_resid**2)),2)

# Get valid estimates for future reference
valid_est = data.base_est_out['coredata'][valid]

# Print out comparisons
print '\n Estimates vs Model Residuals =\n',model_resid
print '\nEstimates vs Model:  Root Mean Squared Residual =',model_rmsr
print '\nEstimates vs Observed Data:  Root Mean Squared Residual =',obs_rmsr

# Use matplotlib to graph the estimates against the model values
plt.plot(model.coredata,data.base_est_out['coredata'],'b.')
plt.xlabel('Model (true) Values')
plt.ylabel('Damon Estimates')
plt.savefig('my_script4_mod.png')

# Also graph the estimates against the observed values
# (clear the previous figure first so the two scatterplots don't overlap)
plt.clf()
plt.plot(data.coredata[valid],valid_est,'b.')
plt.xlabel('Observed Values')
plt.ylabel('Damon Estimates')
plt.savefig('my_script4_obs.png')
        
print '\nYour first Damon analysis was successful!'



Save and hit F5.  Here are the IDLE outputs:

IDLE Display

>>> 
create_data() is working...

Number of Rows= 100
Number of Columns= 80
Number of Dimensions= 3
Data Min= -8.627
Data Max= 9.356
Proportion made missing= 0.199
Not-a-Number Value (nanval)= -999.0

create_data() is done.
Contains:
['fac0coord', 'model', 'fac1coord', 'data', 'anskey'] 

coord() is working...

Dim   Stab    Acc     Obj     Speed   Err     Degen
1     0.95    0.42    0.631   0.214   2.375   nan
2     0.964   0.692   0.817   0.781   1.911   nan
3     0.969   0.874   0.92    0.997   1.31    nan
4     0.92    0.877   0.898   0.991   1.246   nan
5     0.811   0.874   0.842   0.765   1.26    nan
Best Dimensionality =  3 

Dim   Fac   Iter   Change jolt_
3     0     0      1.0001
3     1     0      1.0001
3     0     1      0.05509
3     1     1      0.05509
3     0     2      0.01515
3     1     2      0.01515
3     0     3      0.00177
3     1     3      0.00177
3     0     4      0.00019
3     1     4      0.00019


coord() is done -- see my_obj.coord_out
Contains:
['ndim', 'fac1coord', 'anchors', 'changelog', 'facs_per_ent', 'fac0coord'] 

base_est() is working...

base_est() is done -- see my_obj.base_est_out
Contains:
['nheaders4cols', 'key4rows', 'nheaders4rows', 'rowlabels', 'validchars', 'rowkeytype', 'coredata', 'colkeytype', 'nanval', 'collabels', 'key4cols', 'ecutmaxpos'] 

base_resid() is working...

base_resid() is done -- see my_obj.base_resid_out
Contains:
['nheaders4cols', 'key4rows', 'nheaders4rows', 'rowlabels', 'validchars', 'rowkeytype', 'coredata', 'colkeytype', 'nanval', 'collabels', 'key4cols'] 

base_ear() is working...

base_ear() is done -- see my_obj.base_ear_out
Contains:
['nheaders4cols', 'ear_coord', 'key4rows', 'nheaders4rows', 'rowlabels', 'validchars', 'rowkeytype', 'coredata', 'colkeytype', 'collabels', 'nanval', 'key4cols', 'ecutmaxpos'] 

base_se() is working...

base_se() is done -- see my_obj.base_se_out
Contains:
['core_row', 'verbose', 'validchars', 'whole_row', 'coredata', 'rl_col', 'colkeytype', 'collabels', 'se_coord', 'key4rows', 'nheaders4rows', 'cl_row', 'whole_col', 'rl_row', 'nanval', 'fileh', 'obspercell_factor', 'nheaders4cols', 'core_col', 'rowlabels', 'cl_col', 'rowkeytype', 'whole', 'key4cols'] 

summstat() is working...

summstat() is done -- see my_obj.summstat_out
Contains:
['getcols', 'getrows', 'row_ents', 'stability', 'col_ents', 'objectivity', 'objperdim'] 

export() is working...

aa_base_est_out.csv has been saved as a text file.

export() is done.


Estimates =
[[-1.75  1.25  2.73 ...,  1.2  -4.36 -3.27]
 [ 2.71  2.09  2.1  ...,  2.22 -2.87 -1.97]
 [ 0.66  0.93  2.13 ..., -0.52 -1.37 -1.13]
 ..., 
 [-0.31 -0.37 -1.27 ...,  0.9   0.35  0.37]
 [ 3.03  1.19  0.49 ...,  1.4  -0.49 -0.21]
 [-2.9   0.69  3.84 ..., -1.74 -3.37 -2.84]]

Residuals =
[[  -0.62    1.01 -999.   ...,   -0.75    1.6     0.96]
 [   1.71    0.26    0.73 ...,    0.05    2.45    1.55]
 [  -0.58   -0.21   -0.   ...,    0.48   -0.13    1.66]
 ..., 
 [   0.59   -0.53    2.02 ..., -999.     -0.28    2.16]
 [   1.73    0.7     1.63 ...,   -1.12 -999.     -1.39]
 [   1.72   -0.24 -999.   ...,    0.83    0.47    0.82]]

Expected Absolute Residuals =
[[ 0.93  0.89  0.96 ...,  0.93  1.04  1.11]
 [ 1.09  1.05  1.13 ...,  1.1   1.22  1.3 ]
 [ 0.92  0.88  0.95 ...,  0.92  1.03  1.09]
 ..., 
 [ 0.95  0.91  0.99 ...,  0.96  1.07  1.14]
 [ 1.14  1.09  1.18 ...,  1.15  1.27  1.36]
 [ 0.92  0.88  0.95 ...,  0.92  1.03  1.09]]

Standard Errors =
[[ 0.43  0.42  0.45 ...,  0.44  0.49  0.52]
 [ 0.5   0.48  0.52 ...,  0.5   0.56  0.6 ]
 [ 0.41  0.39  0.42 ...,  0.41  0.46  0.49]
 ..., 
 [ 0.41  0.39  0.42 ...,  0.41  0.45  0.48]
 [ 0.5   0.48  0.52 ...,  0.5   0.56  0.6 ]
 [ 0.4   0.38  0.42 ...,  0.4   0.45  0.48]]

Summary statistics over all rows =
[['base_est_out Row Ents where Cols are: NoneExcept [1, 2, 3, 4, 5]' 'Mean'
  'SD' 'SE' 'Corr' 'Count' 'Min' 'Median' 'Max']]
[[   0.17    1.88    0.42    0.87  399.     -4.9     0.23    4.79]]

Summary statistics for items 1 - 5 =
[['base_est_out Col Ents where Rows are: AllExcept [None]' 'Mean' 'SD' 'SE'
  'Corr' 'Count' 'Min' 'Median' 'Max']]
[[ -0.25   2.2    0.44   0.9   80.    -3.96  -0.37   3.76]
 [  0.02   1.01   0.42   0.71  83.    -2.29   0.06   2.58]
 [  0.35   1.9    0.45   0.85  77.    -4.65   0.5    4.22]
 [  0.35   1.63   0.39   0.83  79.    -3.48   0.3    3.63]
 [  0.34   2.29   0.39   0.91  80.    -4.9    0.7    4.79]]

 Estimates vs Model Residuals =
[[-0.22 -0.09 -0.31 ...,  0.    0.35 -0.19]
 [-0.19 -0.34 -0.52 ..., -0.35  0.1  -0.2 ]
 [ 0.72 -0.16 -0.6  ...,  0.03  0.22  0.11]
 ..., 
 [-0.51  0.12  0.22 ...,  0.23 -0.13 -0.09]
 [ 0.28 -0.15 -0.17 ..., -0.37 -0.08 -0.09]
 [-0.58 -0.04  0.12 ..., -0.02 -0.43 -0.78]]

Estimates vs Model:  Root Mean Squared Residual = 0.42

Estimates vs Observed Data:  Root Mean Squared Residual = 1.21

Your first Damon analysis was successful!
>>> 


Discussion
Congratulations, you did it!

What have you done?  First, you created some artificial 3-dimensional data, added a bit of noise, and made 20% of the cells missing.  This gave you a Damon object called data, along with a model object holding the noise-free "true" values.

Then you applied the coord() method to data using the command data.coord(...).  You told coord() to assess "objectivity" across a range of possible dimensionalities (1 up to but not including 6).  coord() found that a dimensionality of 3 produced the highest objectivity, which is good, since that is the dimensionality we used when creating the data.  At this optimal dimensionality, coord() calculated a final set of row and column coordinates.
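
If you want to double-check which dimensionality coord() settled on, the information is stored in coord_out.  (I am assuming the 'ndim' entry holds the chosen dimensionality -- it appears in the "Contains" list printed by coord() -- so inspect it to be sure.)

>>> data.coord_out['ndim']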

The data.base_est() command used these coordinates to calculate cell estimates for the entire array.  Each cell estimate is the dot product of the coordinates for that row and column.  If x and y are coordinate dimensions and R and C are the row and column coordinates of the cell, then Est[R,C] = R[x] * C[x] + R[y] * C[y], and so on for each additional dimension (three in this example).
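
As a sanity check, you can reproduce a single estimate yourself with NumPy.  The sketch below assumes the person coordinates sit in coord_out['fac0coord']['coredata'] and the item coordinates in coord_out['fac1coord']['coredata'], one row of coordinates per entity; the result should closely track the corresponding base_est() cell, give or take any standardization applied along the way:

R = data.coord_out['fac0coord']['coredata']     # person (row) coordinates
C = data.coord_out['fac1coord']['coredata']     # item (column) coordinates

est_00 = np.dot(R[0], C[0])                     # dot product of the two coordinate vectors
print 'Hand-computed Estimate[0,0] =', est_00
print 'base_est() Estimate[0,0]    =', data.base_est_out['coredata'][0, 0]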

After getting the estimates, you wanted to calculate standard errors.  The standard error formula, however, requires residuals (Observed - Estimates) and what are known (in Damon) as "expected absolute residuals".  Therefore, we issued the following commands:  data.base_resid(), data.base_ear(), data.base_se().

Having calculated cell-level statistics, we want some summary statistics, in this case for column entities 1-5 as well as across all rows.  For that we run data.summstat() and get the mean, standard deviation, and standard errors for those column entities and for the array as a whole.  For kicks, we also get the correlation between the estimates and observed values and get the minimum, maximum, and median of each column entity.

We ask Damon to print out snippets of each output in the IDLE screen.  To get estimates, for example, we type:
print '\nEstimates =\n',data.base_est_out['coredata']

All these output arrays reside in computer memory.  To save selected arrays to disk, we use data.export() to export the estimates as a comma-delimited file.  It shows up in the current working directory as "aa_base_est_out.csv" -- "aa" being the default prefix.  (To export arrays that are not Damon outputs, we would use NumPy's savetxt() function instead.)
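
For instance, if you wanted to save the valid_est array from the script (a plain NumPy array rather than a Damon output), a minimal sketch would be the following; the file name is just an arbitrary example:

np.savetxt('my_script4_valid_est.csv', valid_est, fmt = '%.3f', delimiter = ',')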

To measure how you did, you use tools.residuals() to find the difference between the "true" Model values (before noise was added -- one of the create_data outputs) and the Damon estimates calculated from noisy data.  We find the root mean squared residual is 0.42.  Not bad, considering that the "true" values run from -9 to +9 and we added a random number between -2.5 and +2.5 to each data value before analyzing it.  We are able to predict the "true" value to within less than a unit.

You also do comparisons between the observations and the true values and find that the root mean squared residual (RMSR) is 1.21, three times larger than that for the model values.  In other words, and this should get your attention, Damon estimates are closer to the "true" values (which it did not know and were obscured with noise) than the observed values are (which it did know).

Finally, to graph these comparisons, we use the wonderful matplotlib package to make some scatterplots.  They are saved to the current working directory as my_script4_mod.png (estimates vs. model values) and my_script4_obs.png (estimates vs. observed values).


So What?
Here is what Damon has allowed you to do:
  • Predict missing values.  Remember that you made 20% of the data missing.  You can now predict those missing cells with reasonable accuracy (see the sketch after this list).  These aren't just statistically plausible values; they are cell-specific predictions.  Actually, what we are predicting is not the "observed" datum but the underlying "true" value -- what you would get if you made the same observation many times and averaged the results.  This is actually better than the observed value, because it is more reproducible.  One other note: in this example, the cells were randomly missing.  However, Damon's mathematical assumptions do not require missing cells to be randomly missing; they can be non-randomly missing as well.  That means Damon is potentially applicable to any problem that can be posed as a missing data problem.
  • Improve existing values.  In addition to predicting missing cells, we have computed estimates for the non-missing cells.  Because these estimates take into account the rest of the data array, they are more accurate and reproducible than the observed values.  This is evidenced by the two scatterplots.  Damon estimates are closer to the true values than to the observed values.  
  • Compute EARs and SEs.  We can quantify, for each cell estimate, its precision in two senses:
    • Expected Absolute Residual.  This is the expected absolute difference between the cell estimate and the observed value.
    • Standard Error.  This relates to the expected difference between the cell estimate and the true value, the value that would be obtained by making repeated observations of that cell and averaging them, if that were possible.
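
Here is the sketch promised above for pulling out the predictions for the cells that were made missing.  It reuses names from my_script4.py, takes the not-a-number value from base_est_out (-999.0 in this run), and assumes data.coredata still holds the observed array with that value in the missing cells:

# Index the cells that were made missing in the observed data
nanval = data.base_est_out['nanval']
missing_cells = np.where(data.coredata == nanval)

# Damon's cell-specific predictions for those missing cells
predicted_missing = data.base_est_out['coredata'][missing_cells]
print 'Predictions for the first few missing cells:', predicted_missing[:5]
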
And that is pretty much how to do a Damon analysis.