Social Complexity Lab edited this page Dec 22, 2023 · 19 revisions

This paper has gone viral, and much of the coverage is inaccurate. In this FAQ, we try to explain what the paper actually says.

Predicting death

Q: Is your algorithm really able to predict a person's day of death, their age at death, or anything like that?

A: No! Let us explain.

First, let's explain what the widely reported 78.8% accuracy figure actually means.

  • We look at a subset of individuals aged between 35 and 65. This is because it is particularly challenging to make survival predictions in this cohort. The vast majority of individuals who pass away are older, and young people have an extremely low probability of dying.
  • We split that dataset into two parts:
    • Training data: Used to teach the model which correlations are in the data. The training data is the vast majority of the data.
    • Test data: Used to check how well the model is doing.
  • We now train the model on the training data.
  • During training, the model learns from information from the years 2008-2015 to tell apart the actual life/death outcomes of the people in the training data during 2016-2020.
  • The trained model is then run on the test data (100,000 individuals). Here the model sees the 2008-2015 data and makes a prediction. We then check each prediction against the actual outcome.
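
The procedure above can be sketched in a few lines of Python. This is only an illustration of the general train/test protocol, not our actual pipeline: the function names `split_people` and `evaluate`, the toy population size, and the record layout are all assumptions made up for this example.

```python
import random

# Time windows taken from the description above.
INPUT_YEARS = range(2008, 2016)    # 2008-2015: what the model sees
OUTCOME_YEARS = range(2016, 2021)  # 2016-2020: where outcomes are checked


def split_people(person_ids, test_size, seed=0):
    """Hypothetical helper: shuffle people and hold out `test_size` of them."""
    rng = random.Random(seed)
    shuffled = list(person_ids)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]  # (train, test)


def evaluate(predict, test_records):
    """Hypothetical helper: fraction of test predictions matching the outcome.

    `test_records` is a list of (input_data, actual_outcome) pairs, where
    input_data covers INPUT_YEARS and actual_outcome covers OUTCOME_YEARS.
    """
    correct = sum(predict(data) == outcome for data, outcome in test_records)
    return correct / len(test_records)


# Toy population of 1,000 people (an assumption; the real test set is 100,000).
train_ids, test_ids = split_people(range(1000), test_size=100)
```

The key point is that the people in `test_ids` never appear in `train_ids`, so the model is checked on individuals it has not seen during training.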

So far so good.

  • There is one final wrinkle. Accuracy is defined as (correct guesses)/(total guesses). Because our cohort is very young, almost everyone survives (more than 95%).
    • This means that if we created an algorithm that always predicted “survive”, it would get a very high accuracy (over 95%).
    • To address the issue, we balance the dataset, the equivalent of 50,000 individuals with a survive outcome and 50,000 with a death outcome.
    • (In this balanced dataset a random guess would get 50% accuracy.)
    • When we run our algorithm on that balanced dataset, we get 78.8% accuracy.
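
To see why balancing matters, here is a minimal sketch using the accuracy definition above. The outcome lists are synthetic: the 96% survival rate is an assumed value chosen only to match "more than 95%", and `always_survive` is a hypothetical baseline, not our model.

```python
def accuracy(predictions, outcomes):
    """Accuracy = (correct guesses) / (total guesses)."""
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    return correct / len(outcomes)


def always_survive(n):
    """Hypothetical baseline that predicts "survive" for everyone."""
    return ["survive"] * n


# Synthetic outcomes for illustration only (96% survival is an assumption).
imbalanced = ["survive"] * 96_000 + ["death"] * 4_000
# Balanced set: 50,000 of each outcome, as described above.
balanced = ["survive"] * 50_000 + ["death"] * 50_000

acc_imbalanced = accuracy(always_survive(len(imbalanced)), imbalanced)
acc_balanced = accuracy(always_survive(len(balanced)), balanced)

print(f"always-survive, imbalanced data: {acc_imbalanced:.1%}")
print(f"always-survive, balanced data:   {acc_balanced:.1%}")
```

On the imbalanced data the trivial baseline looks excellent without having learned anything, while on the balanced data it drops to 50%. That 50% is the reference point against which the 78.8% should be read.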

Some important consequences.

  • We don’t make predictions for everyone, only the test data.
  • We’re not predicting how long people will live. Rather, we test mortality over the next 4 years for a young cohort of individuals.

Access to the algorithm

Q: Can I download the software and try this out?

A: No! The dataset and model contain sensitive data and both are safely stored at Statistics Denmark.

Some follow-ups:

  • We have heard that there are websites that claim to implement life2vec. Those are fraudulent, so be careful.
  • We are working on ways to share the model with the wider research communities, but as LLMs are known to potentially leak data, we have to do further research before we can do this.
  • We have not yet studied how our results generalize to other countries/contexts, but are actively investigating this topic.
