I recently came up against a methodology in a project that I found unsound. The data was steadily drifting upwards (it was price data, so no surprises there). The model we had to implement required us to clean the data by keeping only the points where the |Z-score| was < 2: very standard outlier filtering. Two things occurred to me, though. Firstly, price data is not normally distributed: it tends to stay above a hard lower limit and vary roughly normally above that, looking like one half of a normal distribution. Secondly, the Z-scoring removed only the early low values and the late high values. The outliers in the middle, although visually evident, weren't cleaned by this method.
I tried to explain this to some people who are significantly more skilled in this area than I am, and I failed, most likely because I like pictures. So here is a revised method for Z-scoring using linear regression (and nothing is stopping you from using more advanced linearisation methods; cubic splines spring to mind, but that is probably for another post).
Firstly, we need a data set. The data needs to be normally distributed for the demo (so the Z-scoring works nicely) and steadily increasing. I selected a sample of 10,000 points, normally distributed around a mean that drifts linearly from -1 to 1, crossing zero at the midpoint.
df <- data.frame(obs = 1:10000, obs1 = 1:10000/5000 - 1 + rnorm(10000))
Immediately you should see where I'm coming from. Let's colour the points based on their Z-score using the traditional method:
df$z1 <- (df$obs1 - mean(df$obs1)) / sd(df$obs1)
df$c1 <- abs(df$z1) > 2
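For the pictures, a minimal base-R sketch (the colour and point-style choices here are mine, not from the original) that marks the flagged points:

```r
# Plot the series, colouring points flagged as outliers (|z1| > 2) in red
plot(df$obs, df$obs1,
     col = ifelse(df$c1, "red", "black"),
     pch = 20, cex = 0.4,
     xlab = "observation", ylab = "value")
```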
There you go; now you see it, plain as day? The longer my sample, the more non-outlier data gets dropped at the edges, and the less in the middle. So let's linearise the data and re-score it. This is where some fun things can happen, and you could build really advanced models. One thing that comes out of this is the ability to pick up clusters that are not evident in the raw data.
Using the linear model function I can calculate the slope and intercept of my data:
spread.model <- lm(obs1 ~ obs, data = df)
This gives me an intercept of roughly -1 and a slope of roughly 2 in 10,000 (good thing, since that is exactly my input formula). Let's create another column with the detrended data (the residuals from the fit) and recalculate the Z-score.
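The detrending step can be sketched in a couple of lines; the column names `obs2`, `z2`, and `c2` are my own, following the `z1`/`c1` pattern from earlier:

```r
# Detrend: the residuals are each point's distance from the fitted line
df$obs2 <- residuals(spread.model)

# Re-score against the residuals rather than the raw values
df$z2 <- (df$obs2 - mean(df$obs2)) / sd(df$obs2)
df$c2 <- abs(df$z2) > 2
```

Because the fit includes an intercept, the residuals already have mean zero, but keeping the full formula mirrors the earlier calculation exactly.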
Now if we view this against the linearised data, we get exactly what we expect:
However, I'm going to apply the colour derived from the detrended Z-score back to the original data, and you get this:
Now you are able to drop the outliers of a linearly varying sample. Next, let's take only the absolute value of the normal noise, which flips the negative points over (most shops sell at or above the recommended price, right?), so it looks like the price data we'd expect in the wild.
df <- data.frame(obs = 1:10000, obs1 = 1:10000/5000 - 1 + abs(rnorm(10000)))
Applying the traditional Z-score to this, we see that we are dropping all the high prices near the end and the well-behaved shops right at the beginning.
So, running through the previous steps, we get (the code from here on out is identical to the code above):
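Since those steps repeat verbatim, they fold naturally into a helper. A sketch; the function name `detrended_outliers` is my own, and it assumes the same `obs`/`obs1` column names as above:

```r
# Z-score the residuals of a linear fit instead of the raw values,
# and flag everything beyond the threshold as an outlier.
detrended_outliers <- function(df, threshold = 2) {
  fit <- lm(obs1 ~ obs, data = df)
  res <- residuals(fit)
  df$z2 <- (res - mean(res)) / sd(res)
  df$c2 <- abs(df$z2) > threshold
  df
}
```

Then both the trending and half-normal examples reduce to a single call, e.g. `df <- detrended_outliers(df)`.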
And this drops only the genuinely high outlier prices, not the inliers at the beginning, while keeping the high inliers near the end.
In Alteryx the flow is really simple too: