Finding outliers in oscillating data

Posted on Wednesday, 17 May 2017
How do you detect outliers in oscillating data sets?

I would like to describe a method to clean up oscillating data and find the outliers using Fast Fourier Transforms. FFTs are really cool when trying to find the periodicity of data. I elected to use FFTs rather than Moving averages (but the principle would be similar, linearize -> find outliers) because they lend themselves nicely to seasonal data.

So let’s get started with some data:

df=data.frame(obs=c(1:10000),obs1=sin(c(1:10000)/1000) - 1+abs(rnorm(10000)))
qplot(obs,obs1,data=df)

 

Z Scoring this data, we will get what we expect (the tops chopped) but not what we want:

df$z=(df$obs1-mean(df$obs1)/sd(df$obs1))
df$c=abs(df$z)<2
qplot(obs,obs1,data=df,colour=c)

The first thing we need to do is move from time domain to frequency domain. Fortunately R has a really nice library to take the hard math out of this. We then want to calculate the magnitude (the phase is not important at this stage, because we want to know how big the waves are and not their angle.)

library(seewave)
fd=fft(df$obs1)
mods=Mod(fd)
qplot(c(1:length(mods)),mods)

As we’d expect from the data we have a few very low frequencies in the data, and lots of noise in the higher frequencies.

Now we do something you can do in numbers and not really filters (and some of the artifacts of this will be visible later). We filter top half of the data out (set it zero, and not delete it otherwise the reverse FFT breaks) and we filter the top 10 frequencies which should give us a good match to the base waveform in the data. Enter the perfect low pass filter (frequency domain only). This is a little trick I’ve seen in industry where you mess with the frequency domain directly rather than trying to reverse the filter design into time domain.

lpfilt=c(1:length(mods))
lpfilt[1:length(mods)]=0
lpfilt[1:floor(length(mods)/2)]=1
qplot(c(1:length(mods)),lpfilt)

This give us a nice split on the waveform.

mods=mods*lpfilt
qplot(c(1:length(mods)),mods)

Now to filter the top 10 harmonics (play with this number in your use case):

filter=((mods>sort(mods,decreasing=TRUE)[10])*1)
qplot(c(1:length(mods)),mods*filter)

Hmmm, there is an annoying harmonic in the 3000 range, lets try 5 instead of 10.

Nice, right now we have a clean low frequency waveform we can apply to the data, so let’s first bring it back into time domain so the we have data we can work with:

df$fobs=Re(fft(fd*filter,inverse=TRUE))/(length(mods)/2)
qplot(obs,fobs,data=df)

Now we can linearize our data using the waveform:

df$linear=df$obs1-df$fobs
qplot(obs,linear,data=df)

And that curve down at the end is an artifact of our transform and we seem to have missed a frequency at around the 5th harmonic? So, play and try 7 after we’re done.

Now we can just Z score the data using |z|<2

df$zfft=(df$linear-mean(df$linear)/sd(df$linear))
df$cfft=abs(df$zfft)<2
qplot(obs,linear,data=df,colour=cfft)

And our resulting scoring looks as follows:

And on the original data, we get:

qplot(obs,obs1,data=df,colour=cfft)

As further work you can try to combine both the linear and the fft version for trending data. There is also nothing stopping you from using Laplace transforms and z-transforms to filter out noisy streaming data to find outliers (in fact, if I were deploying this in a streaming environment that is most likely what I’d use)

Hope you found this informative.

Stay in touch
Share

Sign up for our Slipstream newsletter