Regression tree for pricing in non-life insurance: A study of survival trees, random forests and gradient boosting

Thesis defence by Johanne Toftdahl Christensen

Title: Regression tree for pricing in non-life insurance:
A study of survival trees, random forests and gradient boosting

Abstract: The thesis discusses regression trees for pricing in non-life insurance, more specifically the modelling of claim sizes. To do this we test single trees, random forests and gradient boosting, all of which are machine learning models. This is done with some theoretical considerations and by applying the methods to a dataset from the shipping industry. We introduce the exponential tree and the survival tree, which are used in the modelling both as single trees and expanded to random forests. In the gradient boosting we test three different distributions: exponential, Pareto and lognormal. The gradient boosting model is a black box model, but a measure is introduced to assess the relative importance of the covariates in the random forest, making the forest less of a black box. The hypothesis is that the machine learning models are a simpler way of reaching estimates that are as good as or better than those achieved by more standard models such as the Pareto regression or a standard GLM. Model evaluation is done both by using RMSE in a 3-fold cross-validation and by comparing the ability to estimate the total claim size. In addition, 10 test policies are introduced to give examples of estimated prices under the different models. The tree-based models are compared not only to one another but also to the more common Pareto regression. Finally, the models are compared to a simple average and the overall performance is discussed. Tree-based models can be used for pricing. With this dataset the estimates are as good as what can be achieved with a Pareto regression, but the full potential is not reached. As the dataset used is small both in number of observations and in number of covariates, not much is gained from using the machine learning models. For a dataset with hundreds of covariates the benefits from using the machine learning models may be significant. A dataset with a smaller variance in claim sizes may also give a better fit.

Supervisor:        Jostein Paulsen
External examiner: Mette M. Havning