Machine Learning for Counting Data - GLM, GAM and Poisson Regression Trees

Master's thesis defense by Louise Thor

Title: Machine Learning for Counting Data - GLM, GAM and Poisson Regression Trees

Abstract: This thesis compares machine learning approaches (specifically regression trees, random forests and gradient boosting) for estimating claim frequency in non-life insurance with more standard approaches, namely GLM and GAM. The theory is presented and applied to data collected in the shipping industry. In all models we assume that the claims follow a Poisson distribution, and we begin by introducing the GLM and the GAM. We then introduce the single regression tree, using the negative log-likelihood of the Poisson distribution as loss function, and extend it to the random forest. Finally we introduce Poisson gradient boosting. The aim of the thesis is to find out whether machine learning, where less of the decision making is left to the person doing the modelling, can achieve results that are better than, or at least as good as, those of more standard methods such as GLM and GAM. To compare the models, 10-fold cross-validation is performed for both the Mean Absolute Error (MAE) and the log-likelihood. In addition, the dataset is split into a training and a testing dataset so that estimated claim frequencies and suggested premiums can be calculated and presented for each of the models. In the conclusion the results are compared to using a simple average for estimation. We find that it is definitely possible to use machine learning for estimating claim frequency, and that with gradient boosting there are substantial gains to be achieved. The same cannot be said for the random forest, which gives somewhat inconclusive results.

Supervisor: Jostein Paulsen
External examiner: Mette Magdalene Havning
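
The evaluation described in the abstract (10-fold cross-validation of the MAE and the Poisson log-likelihood, with a simple average as the reference model) can be sketched roughly as follows. This is a minimal illustration and not the thesis code: the function names, the exposure handling, the interleaved fold assignment and the toy claim counts are all assumptions made for the example.

```python
import math

def poisson_nll(counts, rates, exposures):
    """Mean negative Poisson log-likelihood, with mean = rate * exposure."""
    total = 0.0
    for n, lam, e in zip(counts, rates, exposures):
        mu = lam * e
        # -log P(N = n) = mu - n*log(mu) + log(n!)
        total += mu - n * math.log(mu) + math.lgamma(n + 1)
    return total / len(counts)

def mae(counts, rates, exposures):
    """Mean absolute error between observed and expected claim counts."""
    return sum(abs(n - lam * e)
               for n, lam, e in zip(counts, rates, exposures)) / len(counts)

def k_fold_indices(n_obs, k):
    """Hypothetical helper: deterministic interleaved split into k folds."""
    return [list(range(i, n_obs, k)) for i in range(k)]

def cross_validate_average(counts, exposures, k=10):
    """k-fold CV of the 'simple average' baseline: one common claim frequency."""
    nll_scores, mae_scores = [], []
    for test_idx in k_fold_indices(len(counts), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(counts)) if i not in held_out]
        # Baseline frequency: total training claims / total training exposure.
        rate = (sum(counts[i] for i in train_idx)
                / sum(exposures[i] for i in train_idx))
        c = [counts[i] for i in test_idx]
        e = [exposures[i] for i in test_idx]
        r = [rate] * len(test_idx)
        nll_scores.append(poisson_nll(c, r, e))
        mae_scores.append(mae(c, r, e))
    return sum(nll_scores) / k, sum(mae_scores) / k

# Toy data (made up for illustration): 20 policies, one exposure year each.
toy_counts = [0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 3, 0, 0, 1, 0, 0, 0]
toy_exposures = [1.0] * len(toy_counts)
cv_nll, cv_mae = cross_validate_average(toy_counts, toy_exposures, k=10)
print(f"baseline CV negative log-likelihood: {cv_nll:.4f}, CV MAE: {cv_mae:.4f}")
```

Any of the compared models could be scored in the same loop by replacing the constant `rate` with that model's predicted frequency for each held-out policy; lower values of both scores indicate a better fit.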