I began using R for data visualisation, and, over time, it has become integral to my workflow. It’s just so flexible, and the more you get immersed in the R world, the more you realise what you can do with it.
At ERSM Re (ERSM), I’m doing a lot of reinsurance consulting work in R. Here you can see some examples of the projects developed.
POPULATION PYRAMID ANIMATED PLOT:
Video: data manipulation with tydiverse packages to get data ready for plotting …
Original data: head(data)
Index | Region | Gender | Date | De_0_a_4 | De_05_a_9 | De_10_a_14 | De_15_a_19 | …. |
---|---|---|---|---|---|---|---|---|
256 | Africa | Female | 1950 | 19129 | 14824 | 12918 | 11518 | |
257 | Africa | Female | 1955 | 21773 | 16776 | 14196 | 12489 | |
258 | Africa | Female | 1960 | 24786 | 19431 | 16075 | 13654 | |
259 | Africa | Female | 1965 | 28271 | 22415 | 18794 | 15555 | |
260 | Africa | Female | 1970 | 32033 | 25841 | 21732 | 18208 | |
261 | Africa | Female | 1975 | 36719 | 29543 | 25136 | 21172 |
Using pivot_longer() from tidyr to get in the same column all the values from different ages and computing the frequencies for each date and region.
data_1 <- data %>%
dplyr::filter(Region %in% c("Africa","Asia","Europe")) %>%
tidyr::pivot_longer(cols=-c(Index,Region,Country_code,Gender,Date), names_to="Age", values_to="Population") %>%
dplyr::group_by(Region, Gender, Date, Age) %>%
dplyr::summarise(n = sum(Population)) %>%
dplyr::ungroup(Gender) %>%
dplyr::mutate(freq = n / sum(n))
Now, the data is clean and ready for plotting…
Region | Gender | Date | Age | n | freq |
---|---|---|---|---|---|
Africa | Female | 1950 | De_0_a_4 | 19129 | 0.0840 |
Africa | Female | 1950 | De_05_a_9 | 14824 | 0.0651 |
Africa | Female | 1950 | De_10_a_14 | 12918 | 0.0567 |
Africa | Female | 1950 | De_15_a_19 | 11518 | 0.0506 |
Africa | Female | 1950 | De_20_a_24 | 10091 | 0.0443 |
Africa | Female | 1950 | De_25_a_29 | 8668 | 0.0381 |
Population pyramid plot:
- facet_wrap() to make a plot for each region.
- transition.states() from gganimate package to move along eacy year.
p <- ggplot(data_1, aes(x = Age, fill = Gender,
y = ifelse(test = Gender == "Male", yes = -freq, no = freq))) +
geom_bar(stat="identity", alpha=.65) +
scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
facet_wrap(~Region) +
scale_fill_viridis_d(option="D") +
coord_flip() +
theme_minimal() +
theme(panel.spacing=unit(2,"lines"), axis.title=element_text(size=18), plot.title=element_text(size=22),
strip.text=element_text(size=18), axis.text=element_text(size=11),
panel.border=element_rect(color="gray85", fill=NA)) +
labs(title = "Population Pyramid", subtitle="From 1950 to 2020", x = "Age", y = "\nPercent of population") +
geom_text(aes(x=max(Age), y=max(freq), label=as.factor(Date)), alpha=0.3, hjust=1, vjust=0.75, col="gray", size=8) +
transition_states(as.factor(Date), state_length=50)
Video of the animated plot:
STOP LOSS Reinsurance:
Pricing Computing risk premium…
Parametric Model using a LOG NORMAL distribution to fit the empiric data. Their two parameters are the mean and the sd.
Function to compute the reinsurance pure premium:
E <- function (yinf,ysup,par1,par2,premium) {as.numeric(integrate(function(x) (x-yinf) * premium * dlnorm(x,par1,par2), lower=yinf,upper=ysup)$value + (1-plnorm(ysup,par1,par2)) * (ysup-yinf) * premium)}
Claims Reserving:
Using the ChainLadder Package in R…
The MackChainLadder model uses the chain ladder approach for predicting ultimate and IBNR values for each row (in this case accident year) for a cumulative loss triangle. The default method of the model predicts the ultimate values using chain ladder ratios with the assumption of no tail factor, and the standard error of the ultimates are approximated using a log linear model. The model also has the option to use two other ratios: the simple average and the weighted average of the development ratios.
The BootChainLadder is a model that provides a predicted distribution for the IBNR values for a claims triangle. First, the development factors are calculated and then they are used in a backwards recursion to predict values for the past loss triangle. Then the predicted values and the actual values are used to calculate Pearson residuals. Using the adjusted residuals and the predicted losses from before, the model solves for the actual losses in the Pearson formula and forms a new loss triangle. The steps for predicting past losses and residuals are then repeated for this new triangle. After that, the model uses chain ladder ratios to predict the future losses, calculates the ultimate and IBNR values like in the previous Mack model. This cycle is performed N times. The IBNR for each origin period is calculated from each triangle (N times) and used to form a predictive distribution, from which summary statistics are obtained such as mean, prediction error and quantiles.
ReIns Package:
An R new package with powerful tools for Reinsurance data analysis…
The ReIns package contains implementations of:
- Basic extreme value theory (EVT) estimators and graphical.
- EVT estimators and graphical methods adapted for censored and/or truncated data.
- Splicing of mixed Erlang distributions with EVT distributions (Pareto, GPD).
- Value-at-Risk (VaR), Conditional Tail Expectation (CTE) and excess-loss premium estimates.
It’s very useful for fitting claims distributions. One usually wants a fit for the whole distribution. ReIns package proposes the splicing of a Mixed Erlang (ME) distribution for the body and an extreme value distribution, i.e. Pareto or GPD, for the tail. Also, it provides some tools to see how well the spliced distribution fits the data: