Mortality Analysis – The Next Level

 

I just returned from Seattle, where I was attending the AAIM Triennial Conference. For the 3rd time, I taught the Basic Mortality Methodology Course. It went very well and was well received by the group. Further, there were a couple of excellent sessions on analytics and big data applied to life insurance.

During the conference, several attendees and colleagues asked me how they might further their study and understanding of mortality analysis. Some even broadened the question to data analysis in general, because it is becoming a more prominent skill in the world of life insurance.

So, in this post I will cite some of my favorite resources. Some are aimed at delivering a high-level understanding of analytics, while others are more technical, teaching you how to actually perform the analyses.

A lot of the books I mention are available as a free pdf or as a ‘pay what you want’ download. This is a testament to the open nature of much of the community that uses R and other types of free software. If you can, please consider paying something so that these fantastic resources continue to do their excellent work.

First, a bit of self-promotion. This is a link to 3 videos which I made for a seminar called “Mortality Analysis with Modern Tools”. The series contains a refresher on basic, ‘classical’ mortality analysis, a tutorial on the use of SEER*Stat to gather cancer survival data, and a brief introduction to R, the free statistical programming language, and RStudio, a software tool which makes using R much easier.

The password is AAIM2016.
Books:
Life Expectancy in Court: This book is a very clear, concise and easy-to-understand treatment of the use of actuarial analysis in determining life expectancy in a legal setting. It relies only on pencil-and-paper methods which are easily translated to spreadsheet tools.
Intro to Statistical Learning (this site has the pdf as well as links to slides and videos): This is a methodological textbook, and a toned-down version of the very academic text “Elements of Statistical Learning”. It contains tons of great examples with R code and detailed vignettes. There is a set of free videos by the authors and a robust community of folks who have attempted and completed the exercises at the back of each chapter.
Reckoning with Risk: This one is more for a lay audience – great for underwriters but also very clarifying for anyone else. The author really brings home the difference between common measures of test performance, like sensitivity and specificity, and the more important measures of real-world use, like positive predictive value. Definitely a must-read if you break out in hives when someone shows you a 2×2 table.
R for Data Science: A bit more technical here – this one is not about statistics, but rather the manipulation, cleaning and display of data which is so integral to any analytic endeavor. I highly recommend the ‘tidy’ approach outlined in this text.
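To make the point from Reckoning with Risk concrete, here is a minimal Python sketch of how positive predictive value follows from sensitivity, specificity and prevalence via Bayes' rule. The function name and the 90%/90% test characteristics are my own illustrative assumptions, not figures from the book.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Return P(disease | positive test) using Bayes' rule.

    All three inputs are proportions between 0 and 1.
    """
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    total_positives = true_positives + false_positives
    if total_positives == 0.0:
        raise ValueError("no positive tests are possible with these inputs")
    return true_positives / total_positives


if __name__ == "__main__":
    # Hypothetical test, 90% sensitive and 90% specific (illustrative only).
    for prevalence in (0.001, 0.01, 0.10, 0.50):
        ppv = positive_predictive_value(0.90, 0.90, prevalence)
        print(f"prevalence {prevalence:6.1%} -> PPV {ppv:6.1%}")
```

With these assumed numbers, a positive result in a population with 1% prevalence is a true positive only about 8% of the time, even though the test itself looks good on paper.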
Courses:
Coursera Data Science: This is a series of 10 online courses which are available for a nominal fee. They provide an excellent introduction to the use of R as well as a fairly detailed look at the statistics underlying typical analytic methods like linear and logistic regression. I would say that 7 or 8 out of the 10 are fairly easy – though some have time-consuming homework. The others are pretty challenging. I leave it up to you to decide which ones. The instructors include several biostatisticians from the Johns Hopkins School of Public Health.
Chromebook Data Science: I have not taken these courses, but I plan on checking out at least a couple of them. They are given by Jeff Leek – one of the aforementioned JHU biostatisticians – and focus on using a simple web-enabled laptop to perform serious data analysis by making the most of web-based resources like AWS.
Regression Modeling Strategies: This is a book, but also a short (one-week) course offered at Vanderbilt University by the eminent statistician Frank Harrell. This is the one to take if you feel like you might actually know something, and are ready to find out otherwise. I took it a few years ago – it was excellent, but also very challenging. Dr. Harrell also has a great blog, and his book is an excellent resource.
There are so many more of these it would be folly for me to attempt to review them all.
Websites:
R-bloggers: A repository of web log posts from around the internet which deal with R and various other statistical issues. Their top-ten list is very useful, but you can also discover other nuggets of gold on any given visit.
KDnuggets: Speaking of nuggets, this site offers a wide array of articles relevant to data analysis. Its scope is broader than R-bloggers, so you may need to dig a little deeper to find what you want. But FYI, they have much more comprehensive reviews of courses and books than I am offering here.
Personal Activity:
I can’t emphasize this enough. The best way to gain and retain mortality analysis skills is to actually practice those skills on real-world data. So go ahead – analyze your department’s workflow, analyze the mortality risks you find in a relevant article, assign yourself the task of updating your company’s underwriting manual on a topic that interests you, or write a mortality abstract for publication. But find something that takes the task out of the theoretical realm and into the real world. You will find that your knowledge of the topic does the same thing.

Survival Analysis Intro

I follow R-bloggers regularly. This post about survival analysis popped up recently. It is a very concise and readable intro to the topic – dealing mostly with theory, and then just touching on the R commands used to fit survival models. There are several references as well if you want to chase that particular rabbit down that particular hole.

Breast Cancer Survival in the SEER Data

SEER, a program of the National Cancer Institute, records and tracks cancer cases in 18 mostly urban areas across the country, covering roughly 28% of the U.S. population. They have been doing this since 1973, and the data are available for the asking.

Since my prior publication in the Journal of Insurance Medicine on the impact of micrometastases in breast cancer survival, I have been waiting for the SEER data to age enough to determine if immunohistochemically detected tumor cells in the lymph nodes (so-called “isolated tumor cells” or ITCs) actually impact prognosis.

Recently I looked into the data and was pleased to find that there were 1,379 cases with ITCs among all stage I or II cases (AJCC 6th edition) with no nodal ‘macro’ metastases (N0) and no distant metastases (M0). There were another 22,731 who had been tested and were negative for ITCs. Additionally, from 2004 forward there were 36,530 who had not had the testing done.

I used Cox models to evaluate the possible risks of these ITCs. In each of them I used restricted cubic splines for age, and included sex and T stage as covariates. The findings were pretty surprising. When the women who had not had testing were included, both positive and negative ITC tests appeared protective relative to no testing (HR 0.67 and 0.72, respectively).

Since this could have been due to ‘informative missingness’ – meaning the test was not done because of good prognosis or some other beneficial factor not captured by the other covariates – I tried another fit with only women who had the test done. This did not change the picture – the group with a positive test had an HR of 0.94 compared to the group with negative testing, a statistically non-significant difference (p=0.68).
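For readers who prefer a confidence interval to a p-value, the two quoted numbers can be turned into an approximate one. The Python sketch below assumes the p-value came from a two-sided Wald test on the log hazard ratio, which is an assumption on my part rather than something read from the model output, and the function name is hypothetical. It is a back-of-the-envelope check, not a substitute for the interval reported by the fitted Cox model.

```python
from math import exp, log
from statistics import NormalDist


def approx_ci_from_hr_and_p(hr: float, p_value: float,
                            level: float = 0.95) -> tuple[float, float]:
    """Approximate a confidence interval for a hazard ratio from its p-value.

    Assumes a two-sided Wald test on log(HR); inputs are typically rounded,
    so the result is only approximate.
    """
    if hr <= 0.0 or hr == 1.0:
        raise ValueError("hr must be positive and different from 1")
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must be strictly between 0 and 1")

    std_normal = NormalDist()
    # z statistic implied by the two-sided p-value
    z_observed = std_normal.inv_cdf(1.0 - p_value / 2.0)
    # standard error of log(HR) implied by that z statistic
    se_log_hr = abs(log(hr)) / z_observed
    # critical value for the requested confidence level (about 1.96 for 95%)
    z_critical = std_normal.inv_cdf(0.5 + level / 2.0)

    lower = exp(log(hr) - z_critical * se_log_hr)
    upper = exp(log(hr) + z_critical * se_log_hr)
    return lower, upper


if __name__ == "__main__":
    # HR and p-value as quoted (rounded) in the paragraph above
    lower, upper = approx_ci_from_hr_and_p(0.94, 0.68)
    print(f"approximate 95% CI for HR: {lower:.2f} to {upper:.2f}")
```

With the rounded figures quoted above, this gives an interval of roughly 0.70 to 1.26. It comfortably spans 1, which is another way of saying the data cannot distinguish ITC-positive from ITC-negative women here.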

One obvious factor missing from this analysis is treatment. It is quite likely that the women with ITCs were treated more aggressively than their counterparts who had no testing. Nonetheless, the results here imply that, within the current milieu of testing and treatment, women with ITCs do just as well as women without them, and better than those who were never tested.

You can view my R code here and my SEER*Stat query here (on my Google Drive site – you may not be able to navigate there if you are behind a firewall). If I can expand this out a bit more, it may be the basis for a future JIM submission.

New Publications

The latest issue of the Journal of Insurance Medicine was posted today. It contains 2 articles that I authored: one as a contributor along with a great group of friends and colleagues on MIB’s Mortality Research and Analysis Committee (about breast cancer mortality), and another as the lone author, about the Random Forest algorithm for survival data.

I’ll spoil the conclusion on that last one – when I fit a Cox model and a Random Survival Forest (RSF) model to colon cancer survival data from SEER, they had very similar concordance error rates. That is kind of a vote for Cox in that circumstance, since the hazard ratio output offers a readier quantification of the relative importance of the predictors.
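For anyone unfamiliar with the metric, here is a minimal Python sketch of a concordance calculation on toy data. It assumes the "concordance error rate" is 1 minus Harrell's C-index, which is a common convention but not something spelled out above. The function name and the five-patient dataset are made up for illustration, and tied survival times are simply skipped to keep the code short.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked correctly.

    times       -- observed follow-up times
    events      -- 1 if death was observed, 0 if censored
    risk_scores -- model output where higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore pairs with tied times
        # `first` is the subject with the shorter observed time
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # shorter time is censored, so true order is unknown
        comparable += 1
        if risk_scores[first] > risk_scores[second]:
            concordant += 1.0  # earlier death had the higher predicted risk
        elif risk_scores[first] == risk_scores[second]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data, entirely hypothetical
    times = [2, 4, 6, 8, 10]
    events = [1, 1, 0, 1, 0]
    risk_scores = [0.9, 0.4, 0.6, 0.3, 0.1]

    c_index = concordance_index(times, events, risk_scores)
    print(f"C-index: {c_index:.3f}, error rate: {1 - c_index:.3f}")
```

Because the calculation only needs a risk ranking from each model, it can compare a Cox model and a forest on equal footing, which is what makes it a convenient yardstick for this kind of head-to-head comparison.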

I got the idea to do this while taking courses in the Coursera Data Analysis Signature track. We had to do a project with our own data and create a Shiny app to go with it. (A Shiny app is an interactive web page that can be created using R and RStudio.) I chose to create a colon cancer survival calculator based on SEER data and using a Random Forest approach. You can try out my app here, but be patient: it takes a while to load the first time.