An Analytical Study of Financial Vitality
In this Capstone Project, my goal is to develop a model to predict whether a church will be able to keep its doors open in the long run based on its financial vitality, defined as the ability of a parish to maintain its financial operations and membership. I will be looking at publicly available parochial data from Episcopal parishes, coupled with Tapestry demographic datasets derived in part from U.S. census data. While the current model is built on data from The Episcopal Church, my hope is that it can be applied to other types of religious organizations with some adjustments on a case-by-case basis.
Prior to this project, the prediction of church survival was often done on an actuarial basis, without necessarily taking into consideration the multifaceted dynamics of church financial viability, such as membership and neighborhood shifts. My model will take into consideration both financial and demographic data to predict the most likely outcome for a given church. In other words, will a church close or stay open? Even if a church predicted to close does not end up closing, the prediction is still helpful as a warning sign that can inform policies and planning.
A business problem that this project can help solve is the setting of property and liability insurance rates for a given parish: attendance and income trends become an additional factor in determining those rates.
The Episcopal Church is the U.S. member of the worldwide Anglican Communion. Episcopal parishes derive their tax-exempt status from their local diocese, which is created by acts of the triennial General Convention. Dioceses typically pay an apportionment from their budgets to the central Episcopal Church, while their parishes pay assessments to support their dioceses. In addition to these expenses to support the governing structures (typically 12 to 15% of their budgets), parishes also need to budget for staff salaries and benefits, maintenance and upkeep of their buildings and grounds, and any outreach and missions that benefit the wider community, such as food banks and energy assistance.
Traditionally, most U.S. churches rely on annual pledges from their members to keep their doors open. Over time, members with deep ties to their parishes often designate a bequest, hoping to benefit their church communities in perpetuity. Those bequests become part of a parish's endowment. Some parishes rely more on the income from their endowment for their survival than on pledges from their existing members. But very few parishes can actually rely on their endowment perpetually. A parish must strike a delicate balance between membership and the rate of giving in order to remain financially healthy.
Each year, every Episcopal parish in good standing is required to submit a parochial report to the Domestic and Foreign Missionary Society of The Episcopal Church. The parochial dataset consists of 8600 rows and 130 columns, where each row represents a parish. The variables represent data about the location, attendance, financial indicators, and affiliations with an Episcopal diocese and province.
According to Matt Dancho from Business Science University, who has analyzed a number of top-performing entries in Kaggle forecasting competitions, "a best-in-class forecasting system" includes five key elements, namely:
Feature engineering
Experimentation and ensembling
Knowledge of key events
Deep learning
Boosting Errors
In short, this approach employs a number of submodels, assembled into stacking algorithms, with hyperparameter tuning to improve final performance. Essentially, we can calibrate the proper ratio of each model in the ensemble to minimize the overall errors.
The blog post by Gouthaman Tharmathasan from Feb 2021 describes this method in depth: https://towardsdatascience.com/multiple-time-series-forecast-demand-pattern-classification-using-r-part-2-13e284768f4
The average Sunday attendance variable for each year is found in a separate column. In order to build a time series, it is necessary to convert the existing data from wide format into long format. In addition, the annual average Sunday attendance masks the seasonality of the attendance patterns. Fortunately, Easter Sunday attendance, generally the highest of any Sunday in the year, is available as well. Thus, the average Sunday attendance becomes one data point for a year, and the Easter attendance is added to create the seasonality trends. In addition, an annual minimum is calculated as 50% of the average Sunday attendance. That extra data point is also added, creating three data points per year for the time series. Timetk is the R package that helps make this process smoother. For further information, please refer to the documentation of timetk: https://cran.r-project.org/web/packages/timetk/timetk.pdf.
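The reshaping described above can be sketched as follows. This is a minimal illustration in plain Python rather than the R and timetk workflow used in the project. The column names (asa_2017, easter_2017), the sample values, and the dates at which the three yearly points are placed are all assumptions made for the example, not details taken from the parochial dataset.

```python
from datetime import date

# Hypothetical wide-format record for one parish: one column per year.
# Column names and values are invented for illustration.
wide_row = {
    "parish_id": "P001",
    "asa_2017": 120, "asa_2018": 114, "asa_2019": 108,
    "easter_2017": 260, "easter_2018": 245, "easter_2019": 231,
}


def wide_to_long(row, years):
    """Return three (date, attendance, kind) observations per year."""
    long_rows = []
    for year in years:
        average = row[f"asa_{year}"]
        easter = row[f"easter_{year}"]
        # Where each point sits within the year is an assumption:
        # Easter in spring, the minimum in late summer, the average at year end.
        long_rows.append((date(year, 4, 1), easter, "easter"))
        # Annual minimum defined in the text as 50% of average Sunday attendance.
        long_rows.append((date(year, 8, 1), 0.5 * average, "minimum"))
        long_rows.append((date(year, 12, 1), average, "average"))
    return sorted(long_rows)


series = wide_to_long(wide_row, [2017, 2018, 2019])
for when, attendance, kind in series:
    print(when.isoformat(), attendance, kind)
```

Each year contributes a peak, a trough, and a typical value, which is what gives the otherwise annual series a repeating seasonal shape.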
The following represents the time series of attendance of one parish after feature engineering.
Data were cleaned and processed in RStudio, including splitting the dataset into training and test sets, followed by Random Forest classification within R. Afterwards, both the training and test sets were exported to Domo for further processing, clustering, and classification using Random Forest.
The initial submodels were built and compared on their accuracy measures. Mean Absolute Error (MAE) is used as the yardstick for judging the models because it is expressed in the same units as the attendance data, a count of people. In general, an MAE of less than 10 is considered good. However, we do not want the MAE to be too close to zero, as that can indicate overfitting. As we can see, at this stage the best model is the Cubist Spline Model, with an MAE of 3.01.
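For reference, MAE is the average of the absolute differences between the actual and predicted values. The following is a minimal sketch in plain Python, not the R code used in the project, and the attendance figures are made up for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted attendance."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical attendance counts and forecasts for three periods.
actual_attendance = [100, 110, 90]
forecast = [97, 114, 92]
print(mean_absolute_error(actual_attendance, forecast))
```

Because the errors are not squared, a single badly missed Sunday does not dominate the score, and the result reads directly as "people off per prediction."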
There are two ways to conduct cross-validation for time series. For sequential models such as ARIMA or linear regressions, a rolling plan can be used as illustrated on the left.
For non-sequential models (those that are based on supervised learning such as random forest), k-fold cross-validation can be implemented as illustrated on the left.
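The difference between the two resampling plans can be sketched with index splits alone. This is an illustrative outline in plain Python, not the resampling code used in the project. The function names, the expanding-window design of the rolling plan, and the interleaved fold assignment are assumptions chosen to keep the example short.

```python
def rolling_origin_splits(n_obs, initial_train, horizon):
    """Expanding-window plan: every test block comes after its training data."""
    splits = []
    train_end = initial_train
    while train_end + horizon <= n_obs:
        train = list(range(0, train_end))
        test = list(range(train_end, train_end + horizon))
        splits.append((train, test))
        train_end += horizon  # move the forecast origin forward
    return splits


def k_fold_splits(n_obs, k):
    """K-fold plan: each observation is held out exactly once, order ignored."""
    indices = list(range(n_obs))
    folds = [indices[i::k] for i in range(k)]  # interleaved fold assignment
    splits = []
    for i in range(k):
        test = folds[i]
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        splits.append((train, test))
    return splits


# Nine observations, e.g. three years at three points per year.
for train, test in rolling_origin_splits(9, initial_train=3, horizon=3):
    print("rolling", train, "->", test)
for train, test in k_fold_splits(9, k=3):
    print("k-fold ", train, "->", test)
```

In the rolling plan the model never sees observations later than the ones it is asked to predict, which is why it suits models that depend on the order of the data. K-fold reuses every observation for both training and testing, which suits models that treat each row as an independent example built from engineered features.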
After performing cross-validation, adding weights, and stacking the submodels, the three best models were selected to form the ensemble: Cubist, Random Forest, and GLMNET. The MAE of the new stacked and weighted ensemble model, 2.84, is lower than that of any of the initial submodels. That means the prediction is generally off by only about 3 persons when predicting the attendance trends.
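One simple way to picture the weighting step is a grid search over blend ratios that keeps the combination with the lowest MAE. The sketch below is plain Python and is not the stacking procedure used in the project, which relied on R tooling. The three prediction lists are invented numbers that merely stand in for the Cubist, Random Forest, and GLMNET submodels, and the 0.1 step size is an assumption.

```python
from itertools import product


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def blend(predictions, weights):
    """Weighted average of several models' predictions, point by point."""
    n_points = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions))
        for i in range(n_points)
    ]


def best_weights(actual, predictions, step=0.1):
    """Try every weight combination on a grid that sums to 1; keep the best."""
    n_steps = round(1 / step)
    best = None
    for combo in product(range(n_steps + 1), repeat=len(predictions)):
        if sum(combo) != n_steps:
            continue  # weights must sum to 1
        weights = [c / n_steps for c in combo]
        score = mae(actual, blend(predictions, weights))
        if best is None or score < best[0]:
            best = (score, weights)
    return best


# Hypothetical hold-out data; none of these numbers come from the project.
actual = [100, 104, 98, 110]
cubist_like = [103, 101, 99, 108]
forest_like = [96, 107, 95, 113]
glmnet_like = [105, 100, 102, 104]

score, weights = best_weights(actual, [cubist_like, forest_like, glmnet_like])
print("weights:", weights, "ensemble MAE:", score)
```

Because the grid includes the cases where one model gets all the weight, the blended MAE on the data used for the search can never be worse than the best single submodel. Weights should be chosen on validation data rather than the final test set so that the reported ensemble error stays honest.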
Plot of Attendance (y) over Time (x). The black series is the actual data, while the red part indicates the predictions made by the ensemble model.
The stacked and weighted ensemble time-series modeling was shown to be effective in reducing the MAE for this project.
This reduced MAE is a value add to the business problem of projecting the attendance trends of a given parish.
I have demonstrated how this model can work for one parish. The future step would be to implement this model at scale (for loop or panel data processing).
To take full advantage of ensembling and stacking, it is important to make sure we have ample patterns of seasonality in our input data.
Of equal importance is the knowledge needed to experiment with both sequential and non-sequential models and their respective validation processes.