Create Data

To do actual Damon analysis, you need data.  The random data created in the previous tutorial, being random, is useless for testing the mathematical properties of Damon.  For that, we need a tool to create numerical arrays according to Damon's mathematical model with options for mucking it up to simulate real-world conditions.  That is what create_data() does.  You will find that this tool is indispensable.  It models (almost) any kind of dataset.  With it you can test the applicability of the model, rigorously examine its strengths and weaknesses, practice with different input and output formats, and write software tests.  In our case, we just need to build a dataset for demo purposes.

Create a New Script
In IDLE, go to File/Open Module and type damon1.templates.blank.  Save it as a new file, my_script2.py, in a directory of your choice.  Hit F5 to execute the import statements in my_script2.py.  Note that damon1.core is imported under the abbreviation dmn.

Get create_data() Parameters
In IDLE, type >>> help(dmn.create_data).  (The create_data() function lives in the damon1.core module, which was imported as dmn.  Note that it is a "function", not a "method" that only works on Damon objects.)  A bunch of documentation will pop up.  When you have time, you will want to read through it carefully.  For now, copy the lines in the Paste Function section and paste them into my_script2.py below the "Start programming here..." line.  Don't worry about the lines wrapping around here; they'll format fine in your script file.  Within each comment line, the "<,>" symbols list parameter options.  Add the dmn prefix and set the function equal to some variable name, say, d, so that you end up with d = dmn.create_data(...).  The dmn prefix tells Python where to find the module containing create_data().

Paste create_data() Code

import os
import sys

import numpy as np
import numpy.random as npr
import numpy.linalg as npla
import numpy.ma as npma

try:
    import matplotlib.pyplot as plt
except ImportError:
    pass

import damon1 as damon1
import damon1.core as dmn
import damon1.tools as dmnt


# Start programming here...
d = dmn.create_data(nfac0,  # [Number of facet 0 elements -- rows/persons]
                    nfac1,  # [Number of facet 1 elements -- columns/items]
                    ndim,   # [Number of dimensions to create]
                    seed = None,  # [<None => randomly pick starter coordinates; int => integer of "seed" random coordinates>]
                    facmetric = [4,-2],  # [[m,b] => rand() * m + b, to set range of facet coordinate values]
                    noise = None, # [<None, noise, {'Rows':<noise,{1:noise1,4:noise4}>,'Cols':<noise,{2:noise2,5:noise5}>} => add error to rows/cols]
                    validchars = None,   # [<None, ['All',[valid chars]]; or ['Cols', {1:['a','b','c','d'],2:['All'],3:['1.2 -- 3.5'],4:['0 -- '],...}]> ]
                    mean_sd = None, # [<None, ['All',[Mean,SD]], or ['Cols', {1:[Mean1,SD1],2:[Mean2,SD2],3:'Refer2VC',...}]> ]
                    p_nan = 0.0,  # [Proportion of cells to make missing at random]
                    nanval = -999.,  # [Numeric code for designating missing values]
                    condcoord_ = None,  # [< None, 'Orthonormal'>]
                    nheaders4rows = 1,  # [Number of header column labels to put before each row]
                    nheaders4cols = 1,  # [Number of header row labels to put before each column]
                    extra_headers = 0,  # [If headers > 1, range of integer values for header labels, applies to both row and col]
                    input_array = None,   # [<None, name of data array, {'fac0coord':EntxDim row coords,'fac1coord':EntxDim col coords}>]
                    output_as = 'Damon',  # [<'Damon','datadict','array','textfile','Damon_textfile','datadict_textfile','array_textfile','hd5'>]
                    outfile = None,    # [<None, name of the output file/path prefix when output_as includes 'textfile'>]
                    delimiter = None,    # [<None, delimiter character used to separate fields of output file, e.g., ',' or '   '>]
                    bankf0 = None,  # [<None => no bank,[<'All', list of F0 (Row) entities>]> ]
                    bankf1 = None,  # [<None => no bank,[<'All', list of F1 (Col) entities>]> ]
                    verbose = True, # [<None, True> => print useful information and messages]
                    )



Fill Out the Arguments
As you can see, there are some 20 arguments ("parameters") in the function.  Of these, only the first three are essential.  The rest default to the values already assigned to them.  For many purposes, these defaults work fine.

Let's say we want to build a data array with 10 rows, 8 columns, constructed using 3 mathematical dimensions.  Here is how:
  • set nfac0 = 10           # the number of facet 0 (row) entities you want (not including labels)
  • set nfac1 = 8             # the number of facet 1 (column) entities you want
  • set ndim = 3              # the number of mathematical dimensions.  Make sure ndim is smaller than the smallest of nfac0 and nfac1.
Set seed = 1, so that every time we run the script it creates the same set of random numbers.  (seed = 2 would produce a different set of random numbers.)  To simulate the fact that real-world data always has noise (error), set noise = 2.0.  This adds random numbers ranging from -1.0 to +1.0 to the model values.  To make 10% of the cells missing at random, set p_nan = 0.10 (p_nan stands for "proportion not-a-number").  We'll go with the defaults on everything else.  Here's what the script looks like, in three versions, ranging from most verbose to most concise.  All three do the same thing.

We also print two outputs of create_data():  'model' and 'data'.  To clean up the array we use Numpy's set_printoptions() function.

create_data() Arguments

# Start programming here...

# Create data
d = dmn.create_data(nfac0 = 10,  # [Number of facet 0 elements -- rows/persons]
                    nfac1 = 8,  # [Number of facet 1 elements -- columns/items]
                    ndim = 3,   # [Number of dimensions to create]
                    seed = 1,  # [<None => randomly pick starter coordinates; int => integer of "seed" random coordinates>]
                    facmetric = [4,-2],  # [[m,b] => rand() * m + b, to set range of facet coordinate values]
                    noise = 2, # [<None, noise, {'Rows':<noise,{1:noise1,4:noise4}>,'Cols':<noise,{2:noise2,5:noise5}>} => add error to rows/cols]
                    validchars = None,   # [<None, ['All',[valid chars]]; or ['Cols', {1:['a','b','c','d'],2:['All'],3:['1.2 -- 3.5'],4:['0 -- '],...}]> ]
                    mean_sd = None, # [<None, ['All',[Mean,SD]], or ['Cols', {1:[Mean1,SD1],2:[Mean2,SD2],3:'Refer2VC',...}]> ]
                    p_nan = 0.10,  # [Proportion of cells to make missing at random]
                    nanval = -999.,  # [Numeric code for designating missing values]
                    condcoord_ = None,  # [< None, 'Orthonormal'>]
                    nheaders4rows = 1,  # [Number of header column labels to put before each row]
                    nheaders4cols = 1,  # [Number of header row labels to put before each column]
                    extra_headers = 0,  # [If headers > 1, range of integer values for header labels, applies to both row and col]
                    input_array = None,   # [<None, name of data array, {'fac0coord':EntxDim row coords,'fac1coord':EntxDim col coords}>]
                    output_as = 'Damon',  # [<'Damon','datadict','array','textfile','Damon_textfile','datadict_textfile','array_textfile','hd5'>]
                    outfile = None,    # [<None, name of the output file/path prefix when output_as includes 'textfile'>]
                    delimiter = None,    # [<None, delimiter character used to separate fields of output file, e.g., ',' or '   '>]
                    bankf0 = None,  # [<None => no bank,[<'All', list of F0 (Row) entities>]> ]
                    bankf1 = None,  # [<None => no bank,[<'All', list of F1 (Col) entities>]> ]
                    verbose = True, # [<None, True> => print useful information and messages]
                    )

# A more concise way to create data
d = dmn.create_data(nfac0 = 10,
                    nfac1 = 8,
                    ndim = 3,
                    seed = 1,
                    noise = 2,
                    p_nan = 0.10
                    )

# The most concise way
d = dmn.create_data(10,8,3,1,noise=2,p_nan=0.10)
                    
# Set the precision of the printed arrays.  Force very small numbers to zero.
np.set_printoptions(precision=2,suppress=True)
print "d['model'] =\n",d['model']
print "d['data'] =\n",d['data']


Save and hit F5 to run my_script2.py.

IDLE Display

create_data() is working...

Number of Rows= 10
Number of Columns= 8
Number of Dimensions= 3
Data Min= -5.224
Data Max= 5.131
Proportion made missing= 0.125
Not-a-Number Value (nanval)= -999.0

create_data() is done.
Contains:
['fac0coord', 'model', 'fac1coord', 'data', 'anskey'] 

d['model'] =
Damon object (coredata)
 [[-1.98  1.16  2.42  0.49  3.08  0.93 -2.02  3.52]
 [ 2.51  1.75  1.58 -0.15  3.2  -2.83  1.2   4.36]
 [ 1.38  0.78  1.53  0.8   2.35 -2.3  -0.72  2.98]
 [ 0.71 -0.44 -0.94 -0.21 -1.2  -0.29  0.79 -1.37]
 [-2.93  1.09  3.65  1.58  4.23  0.82 -4.26  4.63]
 [ 0.5  -0.21 -1.16 -0.76 -1.31  0.32  1.52 -1.45]
 [ 2.86 -0.08  0.19  0.89  0.55 -3.41  0.06  0.9 ]
 [ 1.12 -0.73 -3.22 -1.99 -3.78  1.09  3.91 -4.24]
 [-3.66  0.26  0.27 -0.83 -0.05  4.03 -0.61 -0.38]
 [ 3.23 -0.16  0.37  1.24  0.74 -4.04 -0.3   1.12]]
d['data'] =
Damon object (coredata)
 [[  -2.14    1.6  -999.      0.09    2.38 -999.     -2.64    3.22]
 [   2.31    1.83    1.42    0.23    2.6    -2.07 -999.      4.7 ]
 [   1.22    0.9     0.81    0.2     2.95   -1.36   -1.1     3.36]
 [   1.47    0.34 -999.   -999.     -1.86    0.47 -999.     -1.53]
 [  -2.01    1.15    4.03    1.22    4.61    1.48 -999.      5.13]
 [   1.48    0.29   -1.6    -0.18   -2.11    0.22    2.34   -1.87]
 [   2.44   -0.82 -999.      1.25   -0.03   -3.87    0.04 -999.  ]
 [   1.26   -1.43   -3.04   -1.59   -4.58    0.91    4.29   -4.42]
 [-999.      0.34    0.59   -0.81    0.83    4.21    0.19   -1.1 ]
 [   2.51    0.46    0.17    0.58    1.6    -4.34    0.2     1.58]]
>>> 


IDLE tells us some essentials, in particular the contents of the create_data() outputs.  (There is no d.create_data_out, because create_data() is not a Damon object method.)  These contents are assigned to the object name d in our script.  They are: ['fac0coord', 'model', 'fac1coord', 'data', 'anskey'].  'model' is the perfect, theoretical data array.  'data' is 'model' plus noise and missing cells.

In our script, we specified (by default) dmn.create_data(output_as = 'Damon') .  That means 'fac0coord', 'model', etc. have been output as pre-initialized Damon objects, which you can tell from the IDLE printout.  We could have output them as datadicts, arrays, or text files.  The upshot is that we won't need to initialize a new Damon object when it comes to analysis time.

Notice that we printed out 'data' and 'model' using Python square-bracket [...] dictionary notation:  print d['data']  and print d['model'] .  Since these are Damon objects (see Damon Objects), we can access their particulars using the my_obj.attribute notation or the my_obj.data_out['key'] notation.  Type the following in my_script2.py.

Access 'data' Attributes

# Access attributes of "data" object
print "d['data'].collabels =\n",d['data'].collabels
print "d['data'].rowlabels =\n",d['data'].rowlabels
print "d['data'].coredata =\n",d['data'].coredata


Save and hit F5.  Here is the IDLE output.

IDLE Display

d['data'].collabels =
[['id' '1' '2' '3' '4' '5' '6' '7' '8']]
d['data'].rowlabels =
[['id']
 ['1']
 ['2']
 ['3']
 ['4']
 ['5']
 ['6']
 ['7']
 ['8']
 ['9']
 ['10']]
d['data'].coredata =
[[  -2.14    1.6  -999.      0.09    2.38 -999.     -2.64    3.22]
 [   2.31    1.83    1.42    0.23    2.6    -2.07 -999.      4.7 ]
 [   1.22    0.9     0.81    0.2     2.95   -1.36   -1.1     3.36]
 [   1.47    0.34 -999.   -999.     -1.86    0.47 -999.     -1.53]
 [  -2.01    1.15    4.03    1.22    4.61    1.48 -999.      5.13]
 [   1.48    0.29   -1.6    -0.18   -2.11    0.22    2.34   -1.87]
 [   2.44   -0.82 -999.      1.25   -0.03   -3.87    0.04 -999.  ]
 [   1.26   -1.43   -3.04   -1.59   -4.58    0.91    4.29   -4.42]
 [-999.      0.34    0.59   -0.81    0.83    4.21    0.19   -1.1 ]
 [   2.51    0.46    0.17    0.58    1.6    -4.34    0.2     1.58]]
>>> 


That's how you access attributes of the freshly created Damon data object.  (Note the quotation marks in the print commands, by the way.  Ordinarily single quotes are fine, but because these strings themselves contain single quotes, as in d['data'], the outer quotes must be double quotes.)  So much for syntax.  To get more information about create_data() and its parameters and features, type:

>>> help(dmn.create_data)

What Does It Mean?
Look back up to the model and data arrays.  Where do these numbers come from?  The model array is the dot product of a row coordinates array (fac0coord) and a column coordinates array (fac1coord), where the coordinates are generated randomly.  The rank (width, dimensionality) of these coordinates arrays is controlled using the ndim parameter.

Example:

Dot Product of Coordinates

>>> fac0coord = npr.randint(1,5,(4,2))
>>> fac1coord = npr.randint(1,5,(2,3))
>>> print fac0coord
[[3 4]
 [2 1]
 [2 4]
 [4 3]]
>>> print fac1coord
[[1 1 1]
 [1 3 3]]
>>> print np.dot(fac0coord,fac1coord)
[[ 7 15 15]
 [ 3  5  5]
 [ 6 14 14]
 [ 7 13 13]]
>>> 


Each dot product is the sum of the products of the coordinates for that row and column.  For example, in the first cell of the dot product array, 7 = 3*1 + 4*1.  For the next cell to the right:  15 = 3*1 + 4*3.  And so on.
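You can check this arithmetic directly with a few lines of plain numpy (independent of Damon), using the same coordinate arrays as above:

```python
import numpy as np

fac0coord = np.array([[3, 4],
                      [2, 1],
                      [2, 4],
                      [4, 3]])
fac1coord = np.array([[1, 1, 1],
                      [1, 3, 3]])

model = np.dot(fac0coord, fac1coord)

# Recompute each cell as an explicit sum of coordinate products
manual = np.zeros_like(model)
for i in range(fac0coord.shape[0]):
    for j in range(fac1coord.shape[1]):
        manual[i, j] = sum(fac0coord[i, k] * fac1coord[k, j]
                           for k in range(fac0coord.shape[1]))

print(model[0, 0])                    # 7 = 3*1 + 4*1
print(np.array_equal(model, manual))  # True
```

The explicit double loop does exactly what np.dot() does, one cell at a time.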

That's how we build the model array.  To build the data array, we just add some random noise to model and make some cells missing.  Then we analyze the data array using Damon and measure how well it approximates the original model array.  The mathematical theory behind Damon says that, assuming the errors around the model values are normally distributed, the Damon estimates should approach the original model values as the number of rows and columns increases relative to the number of dimensions (the rank, or thickness, of the coordinate arrays).
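As an illustration only (not Damon's actual implementation), here is a minimal numpy sketch of that recipe.  It assumes uniform noise of total range noise centered on zero, the facmetric = [4,-2] coordinate scaling, and the nanval missing-value convention described above:

```python
import numpy as np

rng = np.random.RandomState(1)        # analogous to seed = 1
nfac0, nfac1, ndim = 10, 8, 3
noise, p_nan, nanval = 2.0, 0.10, -999.

# Random coordinates, scaled into [-2, 2) as facmetric = [4, -2] would do
fac0coord = rng.rand(nfac0, ndim) * 4 - 2
fac1coord = rng.rand(ndim, nfac1) * 4 - 2

# "model": the perfect, theoretical array
model = np.dot(fac0coord, fac1coord)

# "data": model plus uniform noise in [-noise/2, +noise/2) ...
data = model + (rng.rand(nfac0, nfac1) - 0.5) * noise

# ... with roughly a proportion p_nan of cells made missing at random
missing = rng.rand(nfac0, nfac1) < p_nan
data[missing] = nanval

np.set_printoptions(precision=2, suppress=True)
print(data)
```

The analysis question is then: given only data, how well can the original model be recovered?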

The question naturally arises:  How valid is it to assume that real world data approximates a dot product of coordinates?  The answer is, Who knows?  A lot of real-world data does in fact behave like a dot product of entity coordinates; a lot does not.  What Damon does is impose this model on the world (for a range of dimensionalities) and report large observed - estimate residuals when the world does not conform.  It offers guidelines for editing the dataset (removing rows, columns, cells) until it does conform.  What Damon promises is that when the dimensionality is correctly specified and the data is so collected and so analyzed that the residuals are small, then the cell estimates for missing and non-missing cells will probably be close to the "true" values and likely to generalize across different samples.

Incidentally, you can use the create_data(input_array = ...) option to build data arrays and coordinate arrays outside of create_data() and import them into the function.  That way you can build datasets that are not based on coordinate dot products and see how Damon performs with them.
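For example, here is a small numpy sketch of a deliberately non-bilinear array.  The construction is plain numpy; actually handing it to create_data(input_array=...) is the hypothetical step, since it requires the damon1 package:

```python
import numpy as np

rng = np.random.RandomState(1)
rows = rng.rand(10, 1)
cols = rng.rand(1, 8)

# A rank-1 dot product plus a sinusoidal interaction term, so the
# result is NOT the dot product of any small set of coordinates
my_array = np.dot(rows, cols) + np.sin(10 * rows * cols)

# Hypothetical usage (requires damon1):
# d = dmn.create_data(10, 8, 3, input_array=my_array)
```

Analyzing such an array lets you see how Damon's residuals behave when the dot-product assumption is violated.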

As of version 1.1.13, create_data() has a new argument called apply_zeros.  This is useful for creating data with mathematically distinct subspaces, which you need for testing the sub_coord() method.