For a project, I wanted to arrange data in a table-like object, with one core requirement: cells can depend on other cells, with the dependency being any function (to be provided), and when I update a cell, its children update as well.
This was my starting requirement. The motivation was a project where I often had bad or missing data. I was running a model on this data, which could tolerate some imperfections, but my idea was that as improved data arrived (new pulls, better sources, different scraping rules, etc.), I could have the improvements propagate. As well, I built the table up historically, so I wanted to be able to add a new data point and quickly, easily calculate the derived fields.
More specifically, the cell dependencies that I wanted were the two cases:
- A formula that builds off of cells in the same row, storing to a new column in that row.
- A formula that SUMIFs a column from earlier dates, conditional on the value of another column.
Additional requirements and practical concerns that arose include:
- Saving dynamically - I want to be able to store potentially large amounts of data, and holding it all in memory at once can slow the computer. So I wanted to be able to open only part of the table at a time, without having to worry about this.
- File reliability - When I work with the files, I sometimes cause them to become corrupt. This came up as I worked, and I haven't solved this, yet. However, through backups and limiting exposure, I've mitigated damage.
- Partial refresh - I want to limit the amount of recalculation I do, by being specific about what gets recalculated.
- Date protection - Because this is used in a modeling context, I want to be extra careful that I'm not bleeding future data into the past. My solution here needs improvement.
- Key merging - For the use case, I was pulling data from different sources, and matching these up was challenging. In my initial implementation, I tried to make the merge logic live in the table, looking up the best match every time a value was added to the table. This proved to be complex and expensive, so I made the table agnostic of matching logic, and made it easier to get all the keys available at a date, so that the matching can be done elsewhere.
In this section, I give the basics for how the table itself works.
This table allows you to add Columns, then set and get values from entries in those columns via the set_cell_value and get_cell_value functions. Both of these address values by their date and column (together called a CellAddr) and a key.

The table also allows you to call refresh(), which updates the columns in order of their dependencies, so that if any column X depends on a column Y, then Y gets updated before X. The refresh is performed by calling the refresh function on each column (passing a copy of this table), which contains the logic for updating values given the current state of the table. The columns are also responsible for knowing their dependencies.

Before using the table, call open() to load all the relevant data. When done, call close() to save the data. (Some of the data saves as it goes, but not everything.) All data gets saved by prefix.
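A minimal sketch of this interface, as a toy in-memory version (the real table also handles paging, saving, and column objects; the exact signatures and the column-construction details here are assumptions based on the description above):

```python
from collections import namedtuple

# CellAddr bundles date and column; a value is stored per (CellAddr, key).
CellAddr = namedtuple("CellAddr", ["date", "column"])

class ToyTable:
    """In-memory stand-in for the table described above (no paging/saving)."""

    def __init__(self):
        self._cells = {}    # (CellAddr, key) -> value
        self._columns = {}  # column name -> refresh function (or None)
        self._order = []    # refresh order; here, the order columns were added

    def add_column(self, name, refresh_fn=None):
        self._columns[name] = refresh_fn
        self._order.append(name)  # assumes columns are added parents-first

    def set_cell_value(self, addr, key, value):
        self._cells[(addr, key)] = value

    def get_cell_value(self, addr, key):
        return self._cells.get((addr, key))

    def refresh(self):
        # Refresh each column in dependency order, passing it the table.
        for name in self._order:
            refresh_fn = self._columns[name]
            if refresh_fn is not None:
                refresh_fn(self)

# A derived column "double" that doubles the same row's "base" value:
table = ToyTable()
table.add_column("base")

def refresh_double(t):
    for (addr, key), value in list(t._cells.items()):
        if addr.column == "base":
            t.set_cell_value(CellAddr(addr.date, "double"), key, value * 2)

table.add_column("double", refresh_double)

table.set_cell_value(CellAddr("2023-01-15", "base"), "playerX", 21)
table.refresh()
result = table.get_cell_value(CellAddr("2023-01-15", "double"), "playerX")  # 42
```

The real implementation delegates storage and dependency ordering to the components described next, rather than holding everything in flat dicts.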
The table passes the set and get functions along to a CellMaster (defined below). This is in charge of sharding the data and dynamically saving and loading.
When adding new columns or new dependencies, the table passes these requests to the ColumnManager (defined below). This is in charge of holding copies of the columns; saving and loading; and deciding the refresh order, which will update as columns or dependencies are added.
The table also periodically needs to know the complete set of dates. For some uses it is helpful to have this as a sorted list; in other circumstances it's better to have it hashed, mapping each date to the set of keys present on that date. This is handled via a DateSet object (defined below).
In this section, I give the basics for the columns that are used in the table.
This column holds, as its primary data, a function and a list of required columns. The function takes a dict, where the keys are the required column names and the values are those columns' values for a row (cell address/key), and it returns a value to be stored in this column. The refresh function loops through the rows that need refreshing, looks up the required columns, stores their values in a dict, and passes that dict to the function; the resulting value is stored in this column for that row.
The dill library is used to save this, because pickle can't save / load functions.
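A sketch of the contract for such a formula (column names here are illustrative, not from the project): the function receives a dict keyed by the required column names and returns the value to store.

```python
# Illustrative required columns and formula for a SimpleFormula-style column.
required_columns = ["price", "quantity"]

def revenue_formula(row_values):
    """row_values maps each required column name to its value for this row."""
    price = row_values["price"]
    quantity = row_values["quantity"]
    if price is None or quantity is None:
        return None  # tolerate missing inputs rather than raising
    return price * quantity

# During refresh, the column builds the dict from the table for each row
# and stores the returned value back into this column:
value = revenue_formula({"price": 3.0, "quantity": 4})    # 12.0
missing = revenue_formula({"price": None, "quantity": 4})  # None
```

Because the formula is an arbitrary function, dill (rather than pickle) is what makes the column serializable.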
This column adds data from earlier dates in a given column, conditional on the value of another column. To specify how to do this we need a few pieces of information, saved to the column:
- How many recent years of history to use in the calculation (called tail_length).
- Which column determines the condition (called output_key), and which column(s) we look in for that condition (called input_maps, with attributes key_column and value_column).
This is best explained with an example. Suppose we had a table with two input_maps, WaterfallMap("Player1", "Points1") and WaterfallMap("Player2", "Points2"), along with the output_key OutKey, like this:
| Player1 | Player2 | Points1 | Points2 | OutKey | WF_Col |
|---------|---------|---------|---------|--------|--------|
| X       | Y       | 1       | 2       | X      | None   |
| Y       | Z       | 30      | 40      | X      | 1      |
| Z       | X       | 500     | 600     | Z      | 40     |
| None    | None    | None    | None    | X      | 601    |
The WF_Col (this column) does the following calculations:
- For the first row, there are no previous rows, so it returns None.
- The second row looks for Xs in the previous rows and finds the one with 1 point, so the total is 1.
- The third row looks for Z in the previous rows, and finds it only on the second row, so the total is 40.
- The fourth row looks for X in the previous rows (the fact that the Player and Points columns are blank on this row doesn't affect the calculation, because it only uses previous rows). It finds two Xs with a total of 601.
The way these values are calculated in practice is by moving down a column in order, keeping track of a running total for each "key". For each new date we encounter: We first reduce, from the running total, those cells that fall out of the range we care about; then we record a value for each output_key; then we increment by each of the input_maps for the new date.
This is obviously much more efficient than resumming previous rows with each new value we want to compute. Additionally, we create "Snapshots" periodically. This saves off the running total. When we need to refresh only partial data, we pick up from the latest snapshot that is before the first refresh date, and go from there. (We don't stop early, even though we could; the primary use case is either to update everything at once or to add newer rows.) As we compute the running total, we need to handle the saving of these snapshots.
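The running-total pass can be sketched as follows, using the example rows from above (one row per date; the tail_length subtraction is omitted since nothing expires in this short example):

```python
from collections import defaultdict

# Rows from the example table, in date order.
rows = [
    {"Player1": "X",  "Player2": "Y",  "Points1": 1,    "Points2": 2,    "OutKey": "X"},
    {"Player1": "Y",  "Player2": "Z",  "Points1": 30,   "Points2": 40,   "OutKey": "X"},
    {"Player1": "Z",  "Player2": "X",  "Points1": 500,  "Points2": 600,  "OutKey": "Z"},
    {"Player1": None, "Player2": None, "Points1": None, "Points2": None, "OutKey": "X"},
]
input_maps = [("Player1", "Points1"), ("Player2", "Points2")]

running = defaultdict(int)  # key -> running total over *previous* rows
wf_col = []
for i, row in enumerate(rows):
    # 1) (omitted here) subtract totals that fell outside tail_length
    # 2) record the value for this row's output key, from previous rows only
    wf_col.append(running[row["OutKey"]] if i > 0 else None)
    # 3) increment the running totals with this row's input_maps
    for key_column, value_column in input_maps:
        k, v = row[key_column], row[value_column]
        if k is not None and v is not None:
            running[k] += v

# wf_col == [None, 1, 40, 601], matching the WF_Col column above
```

A Snapshot, in these terms, is just a saved copy of `running` as of some date, so a partial refresh can resume from there instead of from the top.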
For the set of dates with Snapshots saved, we use the DateSet helper (defined below), and for the Snapshot saving and loading itself, we use the Page Master (defined below). These components were primarily designed for the main table, and we re-used them here. Both are non-trivially opened and closed, so we handle them in this column's open and close functions.
When subtracting an expiring date, we re-lookup the values at that date. In an earlier implementation, we cached this, but this ends up replicating the work of a well-designed CellMaster, while making the implementation more complicated and over-sizing the Snapshot saves.
This section highlights some extra components that I made to implement this.
I found that I needed to store dates both as a set and as an ordered list. I made a simple wrapper for these two structures, with a push method to add more dates. It also handles saving and loading, storing and recalling to a location that is based on a passed prefix. I've also found that it's convenient to have all the cell keys listed for a date, so I've piggybacked the set here, making it a dict with date keys and the corresponding cell keys as values.
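A sketch of this wrapper (saving/loading by prefix is omitted; the names are assumptions based on the prose):

```python
import bisect

class DateSet:
    """Dates kept both as a sorted list and as a dict mapping each date
    to the set of cell keys present on that date."""

    def __init__(self):
        self.sorted_dates = []  # ordered, for scans in date order
        self.keys_by_date = {}  # hashed, with the cell keys piggybacked on

    def push(self, date, key=None):
        if date not in self.keys_by_date:
            self.keys_by_date[date] = set()
            bisect.insort(self.sorted_dates, date)  # keep the list sorted
        if key is not None:
            self.keys_by_date[date].add(key)

    def keys_on(self, date):
        return self.keys_by_date.get(date, set())

ds = DateSet()
ds.push("2023-01-02", "playerY")
ds.push("2023-01-01", "playerX")
ds.push("2023-01-01", "playerZ")
# ds.sorted_dates == ["2023-01-01", "2023-01-02"]
# ds.keys_on("2023-01-01") == {"playerX", "playerZ"}
```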
Cell Master / Page Master
The values of cells are stored in a big map keyed by cell address and key. This map could potentially get too large to hold in memory all at once (efficiently). I wrote a class called CellMaster that maintains a bunch of smaller maps. When trying to get or set a value, a user sends the key to the CellMaster, which knows which of the smaller maps to send it to. CellMaster is also in charge of reading and writing these small maps to files. It does not keep all small maps open at once, but rather keeps a fixed number open at a time, deciding when to save and close a file based on a least-recently-used rule.
I decided to keep the pages small, but to keep many open at a time. The small pages allowed more flexibility. [As well, if data ever became corrupted, this limited the exposure, but I'd like to find a better fix to that problem.] Pages are addressed by month and column name. Originally I had addressed them only by date, but I found that I was accessing far more data than I needed, and opening and closing files was a bottleneck. Keeping more pages open at a time saved on overhead by reducing thrashing, and I found that I wasn't close to using enough memory to cause performance issues.
This Master class was abstracted (the base class is called Page Master), so that I could reuse it for the Snapshots I used in the Waterfall class.
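The LRU behavior can be sketched like this (a toy: disk reads/writes are stubbed with a dict, where the real Page Master pickles pages to files; names are illustrative):

```python
from collections import OrderedDict

class ToyPageMaster:
    """Many small pages; a fixed number open at once, evicted least-recently-used."""

    def __init__(self, max_open=2):
        self.max_open = max_open
        self.open_pages = OrderedDict()  # page_id -> {key: value}, in LRU order
        self.disk = {}                   # stand-in for pickled files on disk

    def _page(self, page_id):
        if page_id in self.open_pages:
            self.open_pages.move_to_end(page_id)  # mark as recently used
            return self.open_pages[page_id]
        # Load from "disk" (or start empty); evict the LRU page if over the limit.
        page = self.disk.get(page_id, {})
        self.open_pages[page_id] = page
        if len(self.open_pages) > self.max_open:
            old_id, old_page = self.open_pages.popitem(last=False)
            self.disk[old_id] = old_page  # save before closing
        return page

    def set(self, page_id, key, value):
        self._page(page_id)[key] = value

    def get(self, page_id, key):
        return self._page(page_id).get(key)

pm = ToyPageMaster(max_open=2)
pm.set("2023-01/Points", "playerX", 1)
pm.set("2023-02/Points", "playerX", 30)
pm.set("2023-03/Points", "playerX", 500)     # evicts and saves 2023-01/Points
value = pm.get("2023-01/Points", "playerX")  # reloaded from "disk": 1
```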
For a table, we need to keep track of all the columns. In doing so, we need to know how to store (locally), load, and save columns. We also need to maintain the refresh order, which tells us which order to refresh the columns in, so that every dependent column gets refreshed after the columns it depends on.
The columns are stored locally, here in the column manager, keyed by their name. This is the single local copy of a column in a given table. Saving and loading is handled similarly to the DateSet, with the exception that we don't save all columns. Rather, we keep track of which columns have been accessed (and therefore potentially modified), and only write those, because saving and loading columns can be expensive. An improvement on this design would be to also load columns lazily.
We keep a copy of all the dependencies of each column here in the column manager, by copying these from the column. This serves as a graph, which we then run a topological sort on to maintain the refresh order.
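The refresh-order maintenance can be sketched as a topological sort of the dependency graph (Kahn's algorithm here; column names are illustrative):

```python
from collections import defaultdict, deque

def refresh_order(dependencies):
    """dependencies maps each column name to the columns it depends on.
    Returns an order in which every column comes after its dependencies."""
    dependents = defaultdict(list)
    in_degree = {column: 0 for column in dependencies}
    for column, parents in dependencies.items():
        for parent in parents:
            dependents[parent].append(column)
            in_degree[column] += 1
    ready = deque(c for c, d in in_degree.items() if d == 0)
    order = []
    while ready:
        column = ready.popleft()
        order.append(column)
        for child in dependents[column]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                ready.append(child)
    if len(order) != len(dependencies):
        raise ValueError("dependency cycle among columns")
    return order

deps = {"raw": [], "formula": ["raw"], "waterfall": ["raw", "formula"]}
order = refresh_order(deps)
# "raw" comes before "formula", which comes before "waterfall"
```

Re-running this as columns or dependencies are added keeps the stored refresh order current.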
Piles of data
Throughout my design, I've chosen to save all of the data in a data folder in small bits. The name of the table identifies the files, and loading a table mostly involves looking through folders for the name of the table, or prefix.
An alternative approach would be to save a single large file per table. This would give the flexibility of specifying and moving the file, but it makes the implementation more complex, and we couldn't dynamically save / load pages.
Open and Close
The helper components each have to load and save things upon being opened and closed. So I made open and close methods that should be called when these components are created and destroyed. Similarly, I added these to the columns and the table class; the table class is the only thing that gets used directly by the user, and it's up to them to call open and close on it. [Technically, columns are also created directly by the user. But on the original creation, it's okay that they don't call open (since open is used to load data); after that, the table's column manager closes (saves) them and opens (loads) them the next time.]
An alternative design would be to trigger these automatically as the variables are created and destroyed. I avoided this for now, because 1) I was having trouble implementing it, and 2) it was conceptually clearer to trigger these manually.
One difficulty I had was that operations were not failing gracefully. Files storing data were becoming corrupt if the program stopped unexpectedly during a write.
I attempted to account for this in the various components: For DateSet, I save two copies of each file. Similarly, I save two copies of all my column files, with a function recover_from_backup that can be called manually to copy the backups into the main files. For ColumnManager, I chose to back up all my column files upon saving. For KeyManager, I chose to use small pages, allow pages to become corrupted, and return None on reads of corrupted cells; this is tolerable, because my model generally handles missing data well.
These solutions are scattered; they were implemented ad hoc in response to problems as I encountered them. An improvement for the project would be to handle this more systematically, making writes atomic so that files can't become corrupt.
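For the atomic-write improvement, one standard pattern (a sketch, not the project's current code) is to write to a temporary file and swap it into place with os.replace, which is atomic on the same filesystem:

```python
import os
import pickle
import tempfile

def atomic_pickle_dump(obj, path):
    """Write obj to path so a crash mid-write leaves any old file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())  # ensure the bytes hit the disk first
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

Readers then only ever see the old complete file or the new complete file, which would make the double-copy backups unnecessary for the crash-during-write case.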
Table / column relationship
The relationship between table and column in the final implementation is a little unusual. The table keeps pointers to all the columns, indexed by name, and the columns keep pointers back to the table. The data stored in the table are managed by the Table class, but the logic (and potentially state) for the columns is kept in the Column class.
Within the column, we use the table pointer mostly in the refresh logic, and an earlier implementation took the table as an argument of the refresh function. But we also use it to append to the column name when saving, so we take that pointer at initialization.
Earlier implementations tried a few ways to strike this balance, but these all ended up being difficult to use.
An alternative would be to have the refresh function return an "operation," which the table will then handle; this would make the column mostly unaware of the table, except through the prefix name (which is how the column knows to load with the table to begin with). This idea could also open the door to more advanced scheduling of the operations on the table. These are all potential improvements.
Refresh entire column at a time
A refresh is when a signal is sent that upstream columns have been updated, so that the column now needs to be updated.
In my final implementation, I refresh an entire column at a time. There is logic in the column for how to update that column. The column may look up, from the table, which cells in that column need to be updated, then determine how to update only that part. One advantage of this (vs. the cell-by-cell update discussed below) is that dependencies are column-level, which keeps the dependency graph small.
In an earlier implementation, I refreshed cell-by-cell, passing the cell to the column. This had a few major drawbacks. Firstly, it created an awkward division of labor: the update logic for the Waterfall column needed to know not just what cell it was updating at any moment, but also about the other cells it updated in a round of updates; this created a lot of extra state-tracking. Secondly, the dependency graph was computationally complex. My final implementation maintains a column update order that only needs to be updated as I add new columns. Thirdly, my implementation of the WaterfallColumn didn't work if the cells in a column were getting updated (by other dependencies) between waterfall updates.
My final implementation is slightly less general than a cell-by-cell refresh: a cell-level dependency graph may be acyclic even when reducing cells to columns produces a cycle (which is not allowed); but this scenario is hard to imagine. Additionally, my final implementation is restrictive because our cell_dependencies function will only trigger dependent columns of the same date; however, this can be relaxed if need be, without changing the decision to refresh an entire column at a time.
Column names and lookup
In my final implementation, I chose to use column names as references, along with a column dict in the Table class. This lets me refer to dependencies and cell addresses by just the column name.
In an earlier implementation, I tried to use the columns themselves in place of these. However some of my derived columns store state, and it became messy, difficult, and inefficient to try to make copies in some cases, but not in others. There were a lot of unintended consequences.
Cell address / keys
I read and write values based on their cell address / key combinations. This is awkward for a few reasons:
- I need to pass both address and key to a number of calls.
- When reading or writing to a page, I need to concatenate these into a single key.
- Looping through all entries requires looping through the addresses, looking up the keys for each address, and looping through those.
Though this is awkward, we chose the addressing scheme so that an address represents a unit that we update: two values share an address if and only if they always get updated at the same time (that is, same date and same column). This makes refreshing a little clearer, because it's easy to say exactly what I want to refresh. As well, I want my paging logic to never subdivide a cell address.
As well, I made the decision to pass an entire cell address to functions that need either the date or the column. This cleaned up and simplified the code. Cell address and key were very often used in tandem, so an alternative design could bundle those together too.
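As a sketch, the addressing scheme and the triple loop from the list above look like this (the data is illustrative):

```python
from collections import namedtuple

# An address is (date, column); a full lookup additionally needs a key.
CellAddr = namedtuple("CellAddr", ["date", "column"])

cells = {
    (CellAddr("2023-01-01", "Points"), "playerX"): 1,
    (CellAddr("2023-01-01", "Points"), "playerY"): 2,
    (CellAddr("2023-01-02", "Points"), "playerX"): 30,
}

# Looping over all entries: addresses first, then the keys under each address.
keys_by_addr = {}
for addr, key in cells:
    keys_by_addr.setdefault(addr, set()).add(key)

total = 0
for addr, keys in keys_by_addr.items():
    for key in keys:
        total += cells[(addr, key)]
# total == 33
```

Everything sharing a CellAddr refreshes together and lands on the same page, which is the property the scheme was chosen for.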
Because of the dependency arrangement I made, I was worried that cells might have unexpected dependencies. That is, the user of the library could set up their formulas incorrectly: a cell could depend on a cell, which depends on a cell, and so on, eventually depending on a cell that contains data the user shouldn't know about on that date. To guard against this, I decided to add an additional check.
In my final implementation, I added a function called available_on_date to the Column class. This takes a cell address (used for its date) and returns a date. By default, a cell is available on the date where it's stored, but there's also a derived Column class, called ProtectedColumn, that isn't available until the next day. This is intended for target variables, where we wouldn't want to have knowledge of the values until a later date. Our two implemented derived columns are SimpleFormula and Waterfall; SimpleFormula requires that all its required (parent) columns are available that day, while Waterfall only requires that its required columns are available by the next day.
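A sketch of this availability rule (the class and method names follow the prose; the cell-address plumbing is simplified to a plain date):

```python
import datetime

class Column:
    """Base rule: a cell is available on the date where it's stored."""
    def available_on_date(self, date):
        return date

class ProtectedColumn(Column):
    """For target variables: values are not available until the next day."""
    def available_on_date(self, date):
        return date + datetime.timedelta(days=1)

d = datetime.date(2023, 1, 15)
plain_available = Column().available_on_date(d)               # 2023-01-15
protected_available = ProtectedColumn().available_on_date(d)  # 2023-01-16

# A SimpleFormula refreshing on date d would then check that each required
# column's available_on_date(d) <= d, while a Waterfall would allow one extra day.
```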
This logic could be improved. It has the weaknesses that 1) it's too complicated for what it accomplishes; 2) it's still bug-prone, because I'm manually stating when columns are available and when I want them to be available, so I could miss some complex dependency; and 3) it's not flexible enough, because I can't (for example) make a formula on protected columns that then isn't "available" until the next date. This all needs a little more thought. For now, I've put a flag in the header that can be toggled to disable the current logic.
The final implementation is agnostic about our specific Column types, which makes the library extensible. An alternative approach would be to write the check specifically for the SimpleFormula and Waterfall columns that we ended up using: that is, we could simply assert that SimpleFormula depends only on unprotected columns. But beyond not being extensible, I don't like this alternative because this is intended to be a safety check, and I don't want it to rely on the same column logic that could cause a bug.
Another alternative would be to attach an available_on date to every piece of data. As I combine data, I would take the maximum of the dates, then store this back into derived column cells. This is closer to my original implementation, which turned out to be difficult and messy to implement. As well, this approach doesn't block any kind of dependency: it would be up to the user of the library to make sure they aren't misusing data by building a model where the target is available before the predictors, and I would need to build a way to communicate this to the user. This is probably the best way to implement it, though.
All of our helper components make read and write calls to disk, via the pickle or dill libraries. To test this without creating files, we relegated saves and loads to thin methods. Then, in tests, I created a mock version of each class that overrode these methods but was otherwise the same. In addition to saving / loading in memory, the mocks log the reads and writes that happen, which can be helpful in tests too.
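A sketch of that pattern (names are illustrative): disk I/O lives in thin _save/_load methods, and the mock subclass replaces them with an in-memory dict plus a call log.

```python
import pickle

class DiskStore:
    """Real component: all disk access goes through these thin methods."""

    def _save(self, path, obj):
        with open(path, "wb") as f:
            pickle.dump(obj, f)

    def _load(self, path):
        with open(path, "rb") as f:
            return pickle.load(f)

class MockStore(DiskStore):
    """Test double: the same class otherwise, but no real files are touched."""

    def __init__(self):
        self.files = {}  # pretend filesystem: path -> serialized bytes
        self.log = []    # record of reads/writes, handy for assertions

    def _save(self, path, obj):
        self.log.append(("save", path))
        self.files[path] = pickle.dumps(obj)

    def _load(self, path):
        self.log.append(("load", path))
        return pickle.loads(self.files[path])

store = MockStore()
store._save("table_prefix/dates", [1, 2, 3])
restored = store._load("table_prefix/dates")  # [1, 2, 3], with no file created
```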
An alternative implementation could fake the filesystem on a lower level. Or, I could try to abstract the BaseClass -> MockBaseClass derivation to reduce boilerplate.