Winsorising

Imagine you and your friends are playing a game where you each roll a die a few times and add up the numbers, and the person with the highest total wins. One friend keeps rolling really low numbers, and they're not having any fun. You decide to make the game fairer by changing the rules a little.

Instead of throwing out the low rolls, you agree that any roll lower than 2 counts as a 2. Every roll still counts, but a single terrible roll can't drag a total down as far as before. This is kind of like what we do when we "winsorize" a set of data.

Winsorizing is a fancy word that means we replace the most extreme numbers in a set of data with less extreme ones, so that they don't have as big an impact on things like the average. We do this because data sometimes contains a few really high or really low values, and those values can pull our calculations away from where most of the data sits and make it harder to see the bigger picture.

For example, let's say we have a list of 10 numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 100. If we're trying to calculate the average (or mean) of these numbers, we get:

(1+2+3+4+5+6+7+8+9+100) / 10 = 14.5

But this average is kind of deceptive, because one really big number (100) is throwing everything off: nine of the ten numbers are below 10. Only the top end has an extreme value here, so we can "winsorize" that end by replacing the highest value with the next highest value. We get:

1, 2, 3, 4, 5, 6, 7, 8, 9, 9

Now, when we calculate the average, we get:

(1+2+3+4+5+6+7+8+9+9) / 10 = 5.4

This is a better picture of the typical value in the data: 5.4 sits close to the median of 5.5. Notice that we didn't remove the really high value. We replaced 100 with 9, so the list still has 10 numbers.
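
If you'd like to check the arithmetic yourself, here is a small Python sketch of the example above. The helper name winsorize_top is made up for this illustration (it is not a library function), and it only caps the top end, which is what the list above does.

```python
from statistics import mean

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

def winsorize_top(values, k=1):
    """Replace the k largest values with the largest value that is left.

    Hypothetical helper for this example; assumes 0 < k < len(values).
    """
    ordered = sorted(values)
    cap = ordered[-k - 1]  # the biggest value we keep as-is
    # Anything above the cap is pulled down to the cap; nothing is deleted.
    return [min(v, cap) for v in values]

capped = winsorize_top(data)
print(capped)        # [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(mean(data))    # 14.5
print(mean(capped))  # 5.4
```

The list keeps all 10 entries; only the 100 changes.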

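So, winsorizing is just a way of capping the extreme values in a data set instead of deleting them, so that a few unusual numbers don't throw everything off. In practice it is usually applied to both ends, for example by pulling everything below the 5th percentile up to the 5th percentile and everything above the 95th percentile down to the 95th.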
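
Winsorizing is often done at both ends of the data at once. Here is a rough sketch of that version, assuming we simply cap a fixed number of values (k) at each end rather than working out percentiles. The function name winsorize and the parameter k are our own choices for illustration, not taken from any particular library.

```python
from statistics import mean, median

def winsorize(values, k=1):
    """Clamp the k smallest and k largest values to their nearest kept neighbours."""
    if not 0 <= 2 * k < len(values):
        raise ValueError("k must leave at least one value untouched")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]  # the smallest and largest values we keep
    # Values below `low` are raised to it, values above `high` are lowered to it.
    return [min(max(v, low), high) for v in values]

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
both_ends = winsorize(scores, k=1)
print(both_ends)                          # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(mean(both_ends))                    # 5.5
print(median(scores), median(both_ends))  # 5.5 5.5
```

The median comes out the same before and after, because it was never affected by the 100 in the first place. It is the mean that winsorizing rescues.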

Related topics others have asked about:

Huber loss, Robust regression, Trimmed estimator