Introducing pirty (Python for Item Response TheorY)

posted Dec 13, 2018, 7:29 PM by Mark Moulton   [ updated Dec 13, 2018, 8:25 PM ]

Suppose you have 50 data files with weird formatting, and they each require complicated psychometric analysis, and the results are to be displayed nicely in 50 separate reports.  With most psychometric software, this is painful and time-consuming to execute.  Here's what it looks like.

First, you write a program that reformats each data file specifically for your psychometric application.  That gives you a bunch of new data files.  Then, one by one, you open each file in your psychometric application, choose options from a menu using point-and-click, hit run.  Maybe you have to do multiple runs for a given data file, as when doing analysis of fit.  The software outputs a bunch of results files, but they aren't formatted the way you need.  So you write another program to reformat the result files.  Then you copy and paste each table and chart into a Word document to build a report.  Oh, except the charts don't have the right labels and formatting, so you write another program that creates charts the way you want them.  Assessment companies spend months on this stuff, just to ship one set of reports.

Welcome to my life.

Fortunately, I do most of my analysis these days in Damon, and here's why that's great.  Damon is in Python.  That means I can write the data reformatting program in Python, then run Damon in Python, and run it again in Python with a loop statement, then reformat the outputs using Pandas in Python, then create gorgeous charts using matplotlib in Python.  I can do all this in one script.  I hit the magic button, and voila, 50 beautiful reports ready to be shipped.
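That loop-and-report workflow can be sketched in a few lines.  None of the names below are Damon's actual API -- analyze() and write_report() are stand-ins for the psychometric run and the report builder, and the two "files" live in memory just to keep the sketch self-contained:

```python
import io
import pandas as pd

def analyze(df):
    # Stand-in for the psychometric step (a Damon or Facets run in practice)
    return df.describe()

def write_report(name, stats):
    # Stand-in for building one formatted report per data file
    return "Report for {}\n{}".format(name, stats.to_string())

# Two "data files" held in memory for the sketch
raw_files = {'school_a': "score\n1\n2\n3\n",
             'school_b': "score\n4\n5\n6\n"}

# Reformat, analyze, and report every file in one loop
reports = {name: write_report(name, analyze(pd.read_csv(io.StringIO(text))))
           for name, text in raw_files.items()}
```

The point is not the stand-in functions but the shape: because every step is Python, the whole pipeline is one dictionary comprehension away from 50 reports.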

But what do you do when your psychometric software is not in Python, when it's stand-alone?  That's when the pain starts.  I think this is one of the reasons R has become so popular over the years.  Statisticians and psychometricians realized they could automate much of their analytical work so long as all their algorithms are written in R, and that's true up to a point.  But R is a specialty language -- designed for statistics.  You don't use R to populate Word documents or build websites or run other software.  That's where Python has the decisive advantage.

Except, sadly, most psychometric software is not written in Python.  That's the gap my new project, pirty (Python for Item Response TheorY), is intended to fill.  pirty uses Python to write control files for, and run, non-Python psychometric software packages -- and it does all that from within a Python environment.

For example, I'm a big fan of Michael Linacre's Facets and Winsteps software for implementing the Rasch model.  Facets, in particular, is an enormously powerful (and underused) way to model the world.  To use it, I write a control file (or click menu options), reformat the data to be a long line of integer tuples, run the program, and open the output files in Excel or a text editor.  And writing those control files definitely takes some getting used to -- very old-school.  Now, in pirty, I just write a brief Python script.  The following script calculates and prints diver abilities for a boys diving competition.

# Path to pirty, path to data file (the 'to_data' value is a placeholder)
paths = {'to_pirty': 'F:/',
         'to_data': 'F:/diving_data.txt'}

# Put pirty on the search path and import its Facets interface
import sys
sys.path.append(paths['to_pirty'])
import pirty.facets as fac

# Initialize a facets analysis, format data, write a specs (control) file
a = fac.Analysis(data = paths['to_data'],
                 title = '1988 Illinois Boys Diving Competition',
                 row_facets = ['diver', 'dive', 'round'],
                 col_facet = 'judge',
                 na = '.',
                 data_format = None,
                 verbose = True)

# Print diver ability measures and stats

Just a little Python, and I've done a full-blown Facets analysis.  Of course, I'm using a lot of default parameter values here, but it is straightforward to tweak the parameters to perform any possible Facets analysis.  I just need to keep my copy of the Facets Manual nearby.  And now that I can run Facets in Python, I can incorporate it into my larger Python-based workflow to do all that fancy data formatting and create beautiful reports.

Under the hood, the strategy is ridiculously simple.  I use Python to write a control file, then run Facets on it via the command line using Python's subprocess module.  The strategy will work on any software package, so long as:
  1. Its specifications can be spelled out in a separate text file that is read by the program
  2. The program can be run from the command line
Actually, this is harder in practice than I have made it sound.  A lot of programs don't know how to get their parameters from a text file, and some don't even run properly from the command line.  The process should be easy for creating and running an R script (which would open up all those lovely psychometric libraries to Python), but R on the command line has unfortunately been a little buggy and inconsistent for me.  Nonetheless, I have hopes.
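The two requirements above can be sketched in a few generic lines.  The wrapper below illustrates the strategy, not pirty's actual code -- the function name and arguments are made up:

```python
import subprocess
from pathlib import Path

def run_with_control_file(exe_args, spec_text, spec_path="demo.spc"):
    """Write a control file, then run an external program on it.

    exe_args is the command to run (e.g. ['facets.exe'] -- hypothetical);
    spec_text is the full text of the control/specification file.
    """
    # 1. Spell out the specifications in a separate text file
    Path(spec_path).write_text(spec_text)
    # 2. Run the program from the command line on that file
    result = subprocess.run(exe_args + [spec_path], check=True,
                            capture_output=True, text=True)
    return result.stdout
```

Everything else -- parsing the outputs, building reports -- is ordinary Python layered on top of this one call.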

For the moment, I only have pirty up and running for Facets.  I plan to expand it to Winsteps, Damon, the TAM R package, and a slew of psychometric specialty programs I've written over the years.  I also plan to make it reasonably extensible, so that others can easily write their own Python interfaces.  You can now download pirty from the Download page.  Let me know if it helps.

Damon 1.2.05

posted Jun 8, 2017, 10:53 AM by Mark Moulton

This version skips from 1.2.02 (the intermediate versions were private).  There were changes in estimation of standard errors (I'm still in the middle of a revamp), rescaling, and irt_tools.

Damon1 version 1.2.02

posted Apr 5, 2017, 12:55 PM by Mark Moulton

In rescale(), summstat(), and equate(), fixed bug that prevented
specifying a target mean and standard deviation for subscales.

Fixed index error bug in tools.flag_item_drift().  Added flexibility
in labeling points.

In standardize(), allowed assignment of nanvals to columns that
are not in an item bank.    

Damon1, version 1.2.01 released

posted Mar 21, 2017, 4:30 AM by Mark Moulton   [ updated Mar 21, 2017, 4:30 AM ]

This version included a lot of new work with standard error estimation.

Fixed bug in tools.median().

In Damon.__init__(), used new tool functions to avoid numpy deprecation warnings.

In Damon.__init__(), improved handling of floats in validchars.

In tools.flag_invalid() extended formula to allow counts as proportions.

In Damon.coord(), expanded the condcoord_ parameter to allow specification
of the 'first' facet for which to calculate coordinates.

In Damon.equate(), fixed bugs that messed up rescaling in anchored analysis.

In Damon.__init__(), forced index to be integer to get around a deprecation
warning raised by the latest version of Numpy.

In Damon.extract_valid(), added capacity to check row and column entities
against a bank.

In Damon.fin_est(), changed conversion back to ratio scale to remove
requirement to have same original mean, sd.

In Damon.base_est(), added functionality to refit parameter to allow 
control of the degree of the fit. Removed functionality for standardizing
base estimates.

In Damon.base_ear(), changed how Damon is used to calculate standard
errors, opting for a 2-dimensional log solution.

In Damon.equate(), changed the SE and EAR aggregation formula to sum across
log coordinates.  This causes a slight negative error estimation bias 
in equate() relative to the RMSE type of aggregation used in summstat(). 
Added a "group_se" option. 

In Damon.summstat(), added a "group_se" option to control the aggregation
of standard errors for measures.  

Damon In-the-Cloud Discontinued

posted Feb 28, 2017, 8:30 PM by Mark Moulton

Damon In-the-Cloud made it possible to run Damon on the cloud without installing Python or Damon on your personal machine.  This nifty little trick was made possible by Continuum Analytics' Wakari service.  Unfortunately, Wakari has been discontinued, so until I can find a replacement, Damon In-the-Cloud must follow suit.

Damon1, version 1.2.00, major new update

posted Nov 2, 2016, 11:37 PM by Mark Moulton

Damon1 has been in an odd limbo for the past two years.  On the one hand, damon2 is what I really want to be working on, but that's a really big project and my work duties at EDS don't give me much time to work on it.  On the other hand, I need to stick new functionality somewhere.  The result is that a steady stream of new functionality has been added to damon1, most of it traditional Item Response Theory stuff for EDS, but a lot of it refining ideas in Damon.  I made a lot of progress refining the standard error statistics for dichotomous data.  I also got the equate() function to work the way I want -- preserving the meaning of a scale or subscale even as items move in and out over time.  But those are just the tip of the iceberg.  The change log below gives a sense of it.

This is a major release with lots of added functionality.

Ongoing Issue:  Due to changes in Numpy, some comparisons are causing
an exception, even though the message implies only a warning, e.g.:

    a[:, :] == 'x'   (where a's dtype is numerical)

    "DeprecationWarning: elementwise == comparison failed; 
     this will raise an error in the future."

As of this version, most of these warnings have been eliminated.  Discussion 
with the Numpy devs has led me to conclude that the warnings actually
won't lead to exceptions in the future.  They're just an occasional nuisance 
for now.
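One pattern I'd use to sidestep the warning is simply to check the dtype before comparing -- this is an illustration, not Damon's actual fix:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

# Comparing a numeric array elementwise to a string can emit the
# DeprecationWarning quoted above.  If the dtype is numeric, a string
# like 'x' can never match, so skip the comparison entirely:
if a.dtype.kind in 'fiu':          # float / signed int / unsigned int
    mask = np.zeros(a.shape, dtype=bool)
else:
    mask = (a == 'x')
```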

To Damon.__init__(), added the ability to convert Pandas dataframes
to Damon objects.

Added a Damon.to_dataframe() method to convert a specified Damon
datadict to a Pandas dataframe.

Added 'PtBis' as an optional statistic in summstat().

Changed create_data() to return keys in 'S60' format rather than int.  This is
a hack to get around a situation in equate() where the automatic insertion of
new string ids upcasts ints to string. If identifiers are specified as int
in validchars, they will be converted to string.
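The upcasting described here is standard numpy behavior when string and integer keys are mixed; a minimal illustration (the ids are made up):

```python
import numpy as np

# Integer row/column keys...
keys = np.array([101, 102, 103])

# ...get upcast to a string dtype as soon as a string id is appended,
# which is why create_data() now returns string keys from the start:
keys2 = np.append(keys, 'New_Item_1')
```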

For the same reason, row and column keys are now forced to be string,
regardless of how you specify rowkeytype and colkeytype.  Supporting
multiple key types ultimately proved too difficult.

Reengineered equate() to save the coordinates of constructs and subscales as
new entities in the item bank for future reference.  This ensures construct
and subscale measures will remain comparable even as  items used to define
them are switched out.  

In equate(), added a logit option, ability to clip scores during 
rescaling, ability to set performance levels from cut-scores, and
a matrix of useful statistics per construct, such as reliability.  Standard
errors for dichotomous and polytomous data have been corrected, with and
without the logit option.  Added an option for adjusting the standard error
based on whether the component items of a construct are positively correlated.

In equate() added an option not only to specify cutpoints for defining 
performance levels, but to generate cutpoints given ratings in the data.
This has the potential to take the place of standard setting.

Extensive reworking of tools.obspercell() and the associated standard error formula.

Modified base_resid() to adjust the size of residuals to compensate for
distortions caused by dichotomous and polytomous data.

Changed coord() anchor['Freshen'] to anchor['Refresh_All'] and clarified
how individual item coordinates should be refreshed.  This also laid the
groundwork for analysis of item drift.

Clarified how banks should automatically load new items and refresh old
ones to make the long-term equating work-flow as simple as possible.  Fixed
a bug in coord() that prevented the refreshing and updating of item
coordinates in bank().

Moved dif_stats() to damon1.irt_tools.

Fixed bug in rasch() that prevented prior use of extract_valid().

Added a parameter in rasch() to provide user control of extreme scores.

Added tools.check_equating() to assess the degree to which two test
forms can be called "equated".

Added some statistical functions to tools module for convenience in 
dealing with nanvals.

Added a Damon.flag() method to identify row and column entities that meet
some criterion or set of criteria.

Modified Damon.extract() defaults and behavior to work seamlessly with

Modified Damon.extract_valid() to accept lists of row or column entities
to remove from the current Damon object.

Wrote new modules for performing unit tests, both for Damon and in general.
The original tools.test_damon() has been replaced by tools.test_damon_old().
Added tools.test() to access the new test() function.

Added fields to be stored in person/item banks, impacting bank() and related methods.

In extract_valid(), made it possible to check for variation in string
arrays when minsd is specified.

In base_fit(), made it possible to divide residuals by a single expected
absolute residual aggregation to make it easier to identify misfitting cells.

IOMW 2016

posted Apr 21, 2016, 11:39 AM by Mark Moulton

We just held a very lovely and exciting IOMW (International Objective Measurement Workshop) conference on April 4-6 in Washington, DC.  IOMW focuses on "objective measurement" with the Rasch model as the exemplar.  The range of papers was, as always, breathtaking.  The conference was put on by Brent Duckor and myself with the help of the IOMW Conference Committee and sundry volunteers.  During the pre-conference, I had an opportunity to pitch the idea of a Python-based objective measurement library containing not only measurement and various IRT functions but also hooks to run existing measurement software such as packages in R, Winsteps, ConQuest, jMetrik, and others.  I'm calling it MOMS for now (Multidisciplinary -- or Multiplatform? -- Objective Measurement Software).

MOMS is -- naturally -- beyond my current capacities, and it will rely on SPOTS, which I have not had leisure to work on since last summer.  But my work at Educational Data Systems requires just such a library and the functions have been accumulating over the past few years.


posted Nov 18, 2015, 12:11 PM by Mark Moulton   [ updated Apr 21, 2016, 11:40 AM ]

I've updated the Road Map (after a hiatus) to include plans to rebuild Damon on a new database I've been working on called SPOTS, which stands for Subject Object Predicate Triples.  To get a feel for SPOTS, look up "triple store" and "semantic web" and "graph database".  SPOTS is intended to integrate these ideas into the Blaze ecosystem.  A parallel project, established and well-financed, is Dato, which used to be called GraphLab.  

I came up with SPOTS when thinking about how to analyze multifaceted (multi-tensor) data -- data where each cell value is associated with not just a row entity and a column entity, but a set of entities all interacting to produce a phenomenon.  (The Many Facets Rasch Model is designed for just this situation, restricted to one-dimensional spaces.)  Traditional tabular data formats, 2-dimensional arrays, don't cut it here.  One can store the interactions in multidimensional arrays or Pandas multi-index dataframes, but these sort of depend on the idea that each datapoint is the result of a constant number of entity types, e.g. 

(Person_1, Item_20, Rater_3) => a rating x
(Person_1, Item_21, Rater_4) => y

Here, each data value (x, y) is modeled as the interaction of three entity types--a person, an item, a rater.  But what if the number of entities varies?  What if the entities don't even have types:

(Person_1, Item_20) => x
(Person_1, Person_2, Task_3) => y
(Person_1, Person_2, Task_3, Rater_4) => z

Some interactions don't make sense.  Nonetheless, it is undeniable that the real world generates phenomena that are the result of an indeterminate number and type of entities interacting.  And I want to model that!

Interactions like this can be stored in a multidimensional array or multi-indexed dataframe, but you have to be ready to leave a lot of cells blank.  Which means allocating a lot of memory or disk space to hold, basically, nada.  Relational databases run into the same problem.  Also, they require you to commit to a database schema before you really understand what you're storing.  I wanted a more basic way to store data.  

That is when I stumbled on the Semantic Web, or more particularly the idea that you can convert the internet into a giant database, accessible by a standard query language (SPARQL), by storing data as "triples" with a common naming convention.  Triples can be viewed as short sentences, each with a subject, predicate, and object.

(Mary, had, lamb)
(lamb, had, fleece)
(fleece, was, white)
(white, as, snow)
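
A minimal sketch of the idea in Python -- triples as plain tuples, with a wildcard query in the spirit of (but much simpler than) SPARQL:

```python
def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

triples = [('Mary', 'had', 'lamb'),
           ('lamb', 'had', 'fleece'),
           ('fleece', 'was', 'white'),
           ('white', 'as', 'snow')]

match(triples, s='Mary')   # everything Mary is the subject of
match(triples, p='had')    # every 'had' relationship
```

Because every fact has the same three-part shape, this one query function covers the whole store -- that uniformity is exactly what makes a standard query language possible.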

A triple can also be thought of as two nodes connected by an edge, which takes us into graph theory and graph databases.  I fell in love.  Triples are magic.  Consider:  All facts in this huge, zillion-dimensional universe can be stored as 0/1 bits in a 3-dimensional cube.  Which is another way of saying that each fact can be stated as a (subject, predicate, object) combination; the three parts correspond to the (x, y, z) axes in a coordinate system.

Can all facts be represented this way?  Honestly, I have no idea.  But that's the claim, and certainly triples are more than general enough for my purposes and they are more general than tables and relational databases.  In other words, every relational database can be converted into a single array of triples, but not every array of triples can be converted into a set of database tables.

The other cool thing is that, unlike key-value databases and other schema-less designs, triples have just enough structure to support a uniform query language.  That's what makes triples ideal as a way to store data on the web.  The vision is that each website (that wants to) converts its data into a set of triples using RDF naming conventions ("Mary" is replaced by an identifier that looks like a URL).  Then anyone can query that data from anywhere as part of a general internet search using the SPARQL query language.  A number of data repositories like Wikipedia already support this kind of query.  (The semantic web idea has been around since the late 1990's.)

There are plenty of database experts who will point out that relational databases do just fine and you only need triple stores, or graph databases, for a few corner cases.  I'm not competent to argue the point.  All I know is that triples fit the kinds of models I want to explore, and database tables don't.

So how does this relate to Damon?  Aside from giving Damon a very flexible way to call up data for its internal operations, there is a deeper relation between triples and Damon's matrix decomposition model.

Let's go back to the triples data cube.  I said that every fact in the universe can be stored in a cube -- a really big cube.  The way it works is that every increment along the x-axis of the cube corresponds to a "subject"; every increment of the y-axis is a "predicate"; every increment of the z-axis is an "object".  So (Mary, had, lamb) means that the point in the cube that has (x = "Mary", y = "had", z = "lamb") is assigned a "1".  If the fact does not exist, it is assigned a "0".  If we don't know whether a fact exists, the point is left blank.

Now assemble all the facts of the universe, stated as triples, and populate the cube.  We end up with a big 3-dimensional cloud of pixels where the pixels indicate facts.  More to the point, we end up with a 3-dimensional dichotomous data array.
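
The cube can be mocked up directly with a toy vocabulary -- missing cells are left as NaN for "don't know whether this fact exists":

```python
import numpy as np

subjects   = ['Mary', 'lamb', 'fleece', 'white']
predicates = ['had', 'was', 'as']
objects    = ['lamb', 'fleece', 'white', 'snow']

# NaN everywhere: we start out not knowing any facts
cube = np.full((len(subjects), len(predicates), len(objects)), np.nan)

def assert_fact(s, p, o):
    """Mark the (subject, predicate, object) point in the cube as true."""
    cube[subjects.index(s), predicates.index(p), objects.index(o)] = 1.0

assert_fact('Mary', 'had', 'lamb')
assert_fact('lamb', 'had', 'fleece')
assert_fact('fleece', 'was', 'white')
assert_fact('white', 'as', 'snow')
```

The result is exactly the kind of sparse, mostly-missing dichotomous array Damon is built to analyze.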

Now run the array through Damon.  Damon will replace each cell in the array with a probability and populate the missing cells.  In other words, Damon will in principle have calculated the truth value of every possible fact in the universe.  Suck on that for a while!

Now, I'm just stating a vision here; reality is extremely uncooperative.  For one thing, Damon requires its entities all to fit into a common multidimensional space (the 3-D subject, predicate, object space doesn't count), and it would be hard to meet that condition.  But the vision is plain.  Damon, or a more powerful algorithm like it, should be able to quantify the truth value of every fact within a specified universe of facts.

And that's just the beginning.

Damon 1.1.18

posted Aug 8, 2014, 4:59 PM by Mark Moulton

Corrected the DIF flag formula in tools.dif_stats().


Damon 1.1.17

posted Aug 7, 2014, 1:01 PM by Mark Moulton

Changes in Version 1.1.17

Reversed the sign of the MH_d-dif statistic to be consistent with
comparable statistics.

Added a get_rows option to select person subsets.

Added a backup stratum option to handle cases where the primary
stratum leads to insufficient data.
