We compare the exposures created by actxps (an R package) and ExperienceAnalysis.jl (a Julia package).
library(actxps)
library(readr)
library(magrittr)
library(dplyr)
library(lubridate)
census_dat <- read_csv("census_dat.csv")
r_df <- expose_py(
  census_dat,
  start_date = "2006-06-15",
  end_date = "2020-02-29",
  target_status = "Surrender"
) %>%
  select(pol_num, pol_date_yr, term_date, exposure, status)
jl_df <- read_csv("df_jl.csv") # from create_csv.jl
ExperienceAnalysis.jl creates 1887 more rows of exposures. We want to understand why.
print(paste("row count Julia", nrow(jl_df)))
## [1] "row count Julia 143166"
print(paste("row count R", nrow(r_df)))
## [1] "row count R 141279"
print(paste("difference", nrow(jl_df)-nrow(r_df)))
## [1] "difference 1887"
We can use left joins to find rows from R that have no match in Julia.
r_julia <- r_df %>% left_join(jl_df, c("pol_num", "pol_date_yr" = "from"))
## Warning in left_join(., jl_df, c("pol_num", pol_date_yr = "from")): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 61311 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
The warning above tells us that some rows match more than once,
indicating duplicate (pol_num, from) combinations in the
ExperienceAnalysis.jl output. We see that ExperienceAnalysis.jl
generates three rows with no exposure_fraction.
jl_df %>%
  group_by(pol_num, from) %>%
  mutate(pol_from_count = n()) %>%
  filter(pol_from_count > 1)
Rows from actxps that have no match in ExperienceAnalysis.jl follow these patterns:
term_date is defined, and term_date == pol_date_yr
term_date is not defined, and pol_date_yr falls on a leap day (xxxx-02-29)
term_date is not defined, and pol_date_yr falls on 2020-03-01
r_julia %>% filter(is.na(exposure_fraction))
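The two checks used so far (rows without a partner, and duplicate join keys) can be sketched on toy data. The data frames `r_toy` and `jl_toy` and all of their values are invented for illustration and are not taken from the study. The sketch uses dplyr's `anti_join()` as an alternative to `left_join()` followed by an `is.na()` filter.

```r
library(dplyr)

# Toy stand-ins for r_df and jl_df (all values are made up)
r_toy <- data.frame(
  pol_num = c(1, 1, 2),
  pol_date_yr = as.Date(c("2010-01-01", "2011-01-01", "2010-03-01"))
)
jl_toy <- data.frame(
  pol_num = c(1, 1, 2),
  from = as.Date(c("2010-01-01", "2010-01-01", "2010-03-01")),
  exposure_fraction = c(1, 0.5, 1)
)

# Rows in r_toy with no (pol_num, from) partner in jl_toy
unmatched <- anti_join(r_toy, jl_toy, by = c("pol_num", "pol_date_yr" = "from"))

# Key combinations that appear more than once in jl_toy
dupes <- jl_toy %>%
  count(pol_num, from) %>%
  filter(n > 1)
```

Because `anti_join()` looks only at the key columns, its result does not change if a matched row happens to have a missing exposure_fraction. A `left_join()` followed by `is.na(exposure_fraction)` would flag such a row as unmatched.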
term_date defined, term_date == pol_date_yr
ExperienceAnalysis.jl appears to treat date intervals as having a non-inclusive right boundary, [issue_date, termination_date). actxps appears to have an inclusive right boundary.
r_julia %>% filter(pol_num %in% c(640, 1523))
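To make the two boundary conventions concrete, here is a small sketch. The helper names and the dates are hypothetical, chosen only so that a termination falls exactly on a policy anniversary. They are not the actual records for pol_num 640 or 1523.

```r
# Hypothetical helpers: is date d inside the interval from..to?
in_half_open <- function(d, from, to) d >= from & d < to   # [from, to)
in_closed    <- function(d, from, to) d >= from & d <= to  # [from, to]

from      <- as.Date("2013-11-02")  # start of a policy year (made up)
to        <- as.Date("2014-11-02")  # next anniversary
term_date <- as.Date("2014-11-02")  # termination on the anniversary

in_half_open(term_date, from, to)  # FALSE: the termination day is outside this interval
in_closed(term_date, from, to)     # TRUE: the termination day is counted in this interval
```

Under the half-open convention, the termination day is covered only if a further interval starting on term_date is also generated.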
According to section 4.3 of the Society of Actuaries (SOA) experience study document, both of these approaches are wrong some of the time.
For a lapse on a policy anniversary, using 11:59 pm on the day before the anniversary assures that the lapse is allocated to the proper policy year. The date assumption may need to be adjusted for certain events under study. For example, a death on the policy anniversary would be incorrectly assigned to the prior policy year by using 11:59 on the day before. Deaths should therefore be assumed to occur at 11:59 pm on the date of death, not the prior day.
ExperienceAnalysis.jl is not correct on pol_num 640
because it does not create an exposure interval containing the day
2014-11-02. actxps is not correct on pol_num 1523 because
it assigns the lapse to the day 2019-09-30 instead of 2019-09-29.
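The rule quoted from the SOA document can be sketched as follows. This is an illustration under stated assumptions, not the method of either package: the helper names are hypothetical, the status label "Death" and the issue date are made up, and leap-day issue dates are ignored. Non-death terminations are treated as occurring at the end of the previous day, which changes the assigned policy year only when the termination falls on an anniversary.

```r
# Hypothetical helper: 1-based policy year containing `date`.
# Ignores the special handling needed for leap-day issue dates.
policy_year <- function(issue, date) {
  i <- as.POSIXlt(issue)
  d <- as.POSIXlt(date)
  years <- d$year - i$year
  # TRUE if this calendar year's anniversary has not been reached yet
  before_anniv <- (d$mon < i$mon) | (d$mon == i$mon & d$mday < i$mday)
  years - before_anniv + 1
}

# Deaths count on the day itself; other terminations are moved to
# 11:59 pm on the prior day, i.e. the previous calendar date.
assign_policy_year <- function(issue, term_date, status) {
  effective <- if (status == "Death") term_date else term_date - 1
  policy_year(issue, effective)
}

issue <- as.Date("2010-11-02")  # made-up issue date
anniv <- as.Date("2014-11-02")  # termination on the fourth anniversary

assign_policy_year(issue, anniv, "Surrender")  # 4: stays in the prior policy year
assign_policy_year(issue, anniv, "Death")      # 5: falls in the new policy year
```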
term_date not defined, pol_date_yr falls on a leap day (xxxx-02-29)
actxps does not create exposures properly for policies issued on a leap day.
r_df %>% filter(pol_num == 10465)
ExperienceAnalysis.jl appears to not assign some dates to the correct interval. The fifth row should start on 2012-02-29.
jl_df %>% filter(pol_num == 10465)
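One way an anniversary can end up on xxxx-02-28 in a leap year is by stepping from the previous anniversary instead of from the issue date. The sketch below uses lubridate's `%m+%` operator to show the difference. The issue date 2008-02-29 is an assumption made for illustration, and this is a possible mechanism, not a confirmed diagnosis of either package.

```r
library(lubridate)

issue <- as.Date("2008-02-29")  # assumed leap-day issue date (not from the data)

# Offsets taken from the issue date: %m+% rolls back to the last
# valid day of the month only when the target date does not exist
direct <- issue %m+% years(0:4)

# Offsets taken from the previous anniversary: once the date lands
# on the 28th it stays there, even in the next leap year
stepwise <- issue
for (k in 1:4) {
  stepwise <- c(stepwise, tail(stepwise, 1) %m+% years(1))
}

direct[5]    # 2012-02-29
stepwise[5]  # 2012-02-28
```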
term_date not defined, pol_date_yr falls on 2020-03-01
The end date of the study is 2020-02-29, so this should not happen. I am unsure if this is related to having an end date that falls on a leap day.
r_julia %>% filter(pol_num %in% c(2830,2877,1397,4621))
We now do the same inspection in the other direction: rows from ExperienceAnalysis.jl that have no match in actxps.
julia_r <- jl_df %>% left_join(r_df, c("pol_num", "from" = "pol_date_yr"))
julia_r %>% filter(is.na(exposure))
The rows in Julia that are not in R all have from as
2006-06-15 or xxxx-02-28.
julia_r %>%
  filter(is.na(exposure)) %>%
  group_by(from) %>%
  summarise(count = n())
Rows of the form xxxx-02-28 are explained in the previous section on leap days.
from is 2006-06-15
Policy 4120 was issued on 2005-05-27. The start date of the study truncates the interval [2006-05-27, 2007-05-27) to [2006-06-15, 2007-05-27). This appears to work as expected in ExperienceAnalysis.jl.
jl_df %>% filter(pol_num == 4120)
actxps appears to not create partial exposure intervals that begin at the start date of the study.
r_df %>% filter(pol_num == 4120)
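The truncation described for policy 4120 can be sketched in base R. The dates come from the text above and the variable names are illustrative. The sketch stops at a day count, because how each package converts days into an exposure fraction is not shown here.

```r
anniv_start <- as.Date("2006-05-27")  # policy-year start for policy 4120
anniv_end   <- as.Date("2007-05-27")  # next anniversary
study_start <- as.Date("2006-06-15")  # start date of the study

# Left-truncate the policy year at the study start
from <- max(c(anniv_start, study_start))

# Days covered by the truncated interval [from, anniv_end)
exposed_days <- as.numeric(anniv_end - from)
```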
The following issues will be filed on GitHub: