Medical cost predictions

This project looks for the best way to predict patients' medical costs from their historical data: age, gender, number of children, smoking habits, and region of residence. Because the target is a continuous cost, this is a regression problem, so linear, non-linear, and ensemble methods are compared to choose the model that predicts with the lowest variance and the highest accuracy possible.

Data quality

Before applying any modelling algorithm or making any changes to the data, examine the basic underlying assumptions with a 4-plot (run sequence plot, lag plot, histogram, and normal probability plot). The assumptions are:

  • Fixed Location: If the fixed location assumption holds, then the run sequence plot will be flat and non-drifting.

  • Fixed Variation: If the fixed variation assumption holds, then the vertical spread in the run sequence plot will be approximately the same over the entire horizontal axis.

  • Randomness: If the randomness assumption holds, then the lag plot will be structureless and random.

  • Fixed Distribution: If the fixed distribution assumption holds, in particular, if the fixed normal distribution holds, then the histogram will be bell-shaped, and the normal probability plot will be linear.
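The four checks above can be sketched as a single 4-plot figure. This is a minimal sketch, assuming matplotlib and SciPy are available; the function name `four_plot` and the synthetic `charges` sample are hypothetical stand-ins, not the repository's code or data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats


def four_plot(y):
    """Draw the four diagnostic panels for a 1-D sample and return the figure."""
    y = np.asarray(y, dtype=float)
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Run sequence plot: checks fixed location and fixed variation.
    axes[0, 0].plot(np.arange(len(y)), y, linewidth=0.8)
    axes[0, 0].set(title="Run sequence plot", xlabel="index", ylabel="y[i]")

    # Lag plot: y[i] against y[i-1]; structure here means non-randomness.
    axes[0, 1].scatter(y[:-1], y[1:], s=8)
    axes[0, 1].set(title="Lag plot", xlabel="y[i-1]", ylabel="y[i]")

    # Histogram: bell-shaped if the data is roughly normal.
    axes[1, 0].hist(y, bins=30)
    axes[1, 0].set(title="Histogram", xlabel="y", ylabel="count")

    # Normal probability plot: points fall on a line if the data is normal.
    stats.probplot(y, dist="norm", plot=axes[1, 1])
    axes[1, 1].set_title("Normal probability plot")

    fig.tight_layout()
    return fig


# Hypothetical right-skewed sample standing in for a cost column.
rng = np.random.default_rng(0)
charges = rng.lognormal(mean=9.0, sigma=0.9, size=500)
fig = four_plot(charges)
fig.savefig("four_plot.png")
```

With a real dataset, the column of interest would be passed to `four_plot` instead of the synthetic sample.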

Central tendency

Once the data is shown to satisfy the underlying assumptions, exploratory data analysis can begin. First, location and variability statistics are summarized, then distribution and correlation statistics are calculated. Four estimators of location are used: the mean (not robust), the trimmed mean, the winsorized mean, and the median. The winsorized mean has the lowest standard error among the three mean estimators. The median is lower than the mean in every feature, which hints at right-skewed distributions.
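The four location estimators can be illustrated with a short NumPy sketch. The helper `location_estimates`, the 10% tail proportion, and the toy sample are assumptions made for illustration; the repository may compute these differently.

```python
import numpy as np


def location_estimates(x, proportion=0.1):
    """Return mean, trimmed mean, winsorized mean and median of a 1-D sample."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = int(proportion * n)  # number of values affected in each tail

    # Trimmed mean: drop the k smallest and k largest values.
    trimmed = x[k:n - k]

    # Winsorized mean: replace the tails with the nearest retained value.
    winsorized = np.clip(x, x[k], x[n - k - 1])

    return {
        "mean": x.mean(),
        "trimmed_mean": trimmed.mean(),
        "winsorized_mean": winsorized.mean(),
        "median": np.median(x),
    }


# Toy sample with one large outlier in the right tail.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
est = location_estimates(sample)
print(est)
```

In this toy sample the outlier pulls the mean up to 14.5, while the trimmed mean, winsorized mean, and median all stay at 5.5. A median below the mean is the same right-skew signal described above.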

Distribution analysis

A chi-squared goodness-of-fit test is run on every discrete and continuous feature to estimate which distribution best matches the data.
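As a minimal sketch of a chi-squared goodness-of-fit test, the example below checks a discrete feature against a uniform distribution with `scipy.stats.chisquare`. The category counts are hypothetical, and the uniform candidate is an assumption; for a continuous feature the values would first be binned and the expected counts taken from the candidate distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for a four-category feature (e.g. region).
observed = np.array([324, 325, 364, 325])

# Expected counts under the candidate distribution (here: uniform).
expected = np.full(observed.shape, observed.sum() / observed.size)

# Chi-squared statistic: sum((observed - expected)^2 / expected).
statistic, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

# A small p-value means the candidate distribution fits poorly.
reject_h0 = p_value < 0.05
print(f"chi2={statistic:.3f}, p={p_value:.3f}, reject H0: {reject_h0}")
```

Repeating the test for several candidate distributions and comparing the statistics is one way to pick the closest match.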

Normality test

Three methods are used to test the normality of the distributions: Q-Q plots, the Anderson-Darling test, and the Shapiro-Wilk test.

  • H0 = The null hypothesis assumes no difference between the observed and theoretical distribution

  • Ha = The alternative hypothesis assumes there is a difference between the observed and theoretical distribution
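The three checks can be sketched with SciPy as follows. The lognormal sample is a synthetic stand-in for a right-skewed feature, and the 5% significance level is an assumption; neither comes from the repository.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (hypothetical stand-in for a real feature).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Shapiro-Wilk: reject H0 (normality) when the p-value is small.
shapiro_stat, shapiro_p = stats.shapiro(sample)

# Anderson-Darling: reject H0 when the statistic exceeds the critical value.
anderson_result = stats.anderson(sample, dist="norm")
# For dist="norm" the levels are 15%, 10%, 5%, 2.5%, 1%; index 2 is 5%.
critical_5pct = anderson_result.critical_values[2]
anderson_rejects = anderson_result.statistic > critical_5pct

# Q-Q plot data: r close to 1 means the points lie close to a straight line.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    sample, dist="norm"
)

print(f"Shapiro-Wilk p={shapiro_p:.3g}")
print(f"Anderson-Darling A2={anderson_result.statistic:.2f} vs {critical_5pct:.3f}")
print(f"Q-Q correlation r={r:.3f}")
```

Passing `plot=` with a matplotlib axes object to `probplot` draws the Q-Q plot itself instead of only returning its data.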

Feature standardization, transformation and selection

Three methods are used to standardize the features:

  • quantile transformation

  • Box-Cox transformation

  • Yeo-Johnson transformation

Scaling is done with the robust scaling method because there are outliers in the data.
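A minimal sketch of the three transformations and of robust scaling, assuming scikit-learn's `QuantileTransformer`, `PowerTransformer`, and `RobustScaler`. The synthetic feature and the parameter choices (`n_quantiles=100`, normal output distribution) are illustrative assumptions, not the repository's settings.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer, QuantileTransformer, RobustScaler

# Synthetic, strictly positive, right-skewed feature (hypothetical data).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=0.8, size=(500, 1))

transformers = {
    "quantile": QuantileTransformer(
        n_quantiles=100, output_distribution="normal", random_state=0
    ),
    # Box-Cox requires strictly positive input.
    "box-cox": PowerTransformer(method="box-cox"),
    # Yeo-Johnson also accepts zero and negative values.
    "yeo-johnson": PowerTransformer(method="yeo-johnson"),
}

# Compare skewness before and after each transformation.
skewness = {"raw": float(stats.skew(X[:, 0]))}
for name, transformer in transformers.items():
    transformed = transformer.fit_transform(X)
    skewness[name] = float(stats.skew(transformed[:, 0]))

# Robust scaling: subtract the median and divide by the interquartile range,
# so a few extreme values do not dominate the scale.
X_scaled = RobustScaler().fit_transform(X)

for name, value in skewness.items():
    print(f"{name:12s} skew = {value:+.3f}")
```

Each transformation should bring the skewness much closer to zero than the raw feature, and the robustly scaled feature has median 0 and interquartile range 1.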

Models

The coefficient of determination, or R2, measures the proportion of the variance in the target that is explained by the model's predictions. R2 values close to 1 mean an almost-perfect regression, while values close to 0 (or negative) imply a bad model.
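R2 is computed as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the target from its mean. The sketch below implements that formula directly in NumPy; the helper name `r2_score` and the toy values are illustrative assumptions.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score(y, y)                          # predictions match exactly
baseline = r2_score(y, np.full_like(y, y.mean())) # always predict the mean
worse = r2_score(y, y[::-1])                      # predictions in reverse order

print(perfect, baseline, worse)
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model worse than predicting the mean scores below 0 (here -3).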

Linear models

Non-linear models

Ensemble models

Links

models and analysis repository - Medical Cost prediction