There are no robots (traffic lights) between Sangster and Sloane on Main Road, but it is one of the main arterial roads from the Northern Suburbs into the Bryanston and Sandton area. I was of the opinion that a five-minute difference in reaching the Sangster intersection would have a serious impact on the total travel time. Once you get to Sloane, the trip becomes much more bearable.
Around 2010 I purchased a Little Log Book (http://www.littlelogbook.co.za) for my car. It sat happily logging where I was every time the car turned on. I worked at the Dimension Data Campus until July 2014, and my current and intervening jobs all required me to drive across the N1 highway at Main Road. So I did what any good techie would do and checked the data. I had records from January 2013 onwards, yay!
I exported the data from the Little Log Book in CSV format, which gave me a decent idea of what I was looking for: I could get speed and travelled distance for any trip. The data contained whether it was a work trip or not, start and end points, duration, distance, etc. It was cool, but not cool enough; I was looking for next-level cool.
The application shows a really nice graph of your speed and location (sorry, Heisenberg) at any point in the journey, and it supports exporting to KML (Keyhole Markup Language). Now, Google Earth's KML extensions are not quite standard, and the standard itself is loose enough for this to be an issue. So when I threw this into Google Earth I got each track perfectly, but alas, no speed. Below is the same data from the KML.
OK, so the data did exist. Now to bring out the power tools. I opened the KML in Alteryx, which understands most of what KML can do, and I got this:
This was about the least helpful result I could get. I tried the XML parser as well, but it was also not exactly what I wanted. Alteryx also supports plain text files, so what I did was convince it that the file was not XML but ordinary line-based text. I knew the times were in order, but in a separate section from the point data.
So now, to filter out the data, I needed to tag the information I wanted. Firstly, I set up a directory read (so all the KML files in a single directory get read), which allowed me to use a single flow with no controls on 80 months' worth of KML data. Just to be on the safe side I numbered all the rows, because order is important and I would need to ensure it was preserved even after I did funny things with the data. Alteryx also supports regex (and for those of you who dislike regex, or whose brain feels like it's melting, bear with me). What I did was tag the route description lines, the point info lines and the time info lines. I then added a trip identifier (the trip ID would increment every time the route changed) and copied the route descriptor to each record in the same trip.
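The tagging step can be sketched in Python. The line shapes below (a `<name>` element per route, separate `<when>` and `<gx:coord>` lines for times and points) are assumptions about what the exported KML looks like, not the confirmed Little Log Book format:

```python
import re

# Assumed line shapes for the exported KML; the real file may differ.
ROUTE_RE = re.compile(r"<name>(?P<route>[^<]+)</name>")
POINT_RE = re.compile(r"<gx:coord>(?P<lon>\S+) (?P<lat>\S+) \S+</gx:coord>")
TIME_RE  = re.compile(r"<when>(?P<ts>[\dT:+-]+)</when>")

def tag_lines(lines):
    """Tag each line as route/point/time, number the rows to preserve
    order, and carry a trip ID that increments whenever a new route
    description appears. All other lines are dropped."""
    trip_id, route, tagged = 0, None, []
    for row, line in enumerate(lines):
        m = ROUTE_RE.search(line)
        if m:
            trip_id += 1
            route = m.group("route")
            tagged.append({"row": row, "trip": trip_id, "kind": "route",
                           "route": route})
            continue
        m = POINT_RE.search(line)
        if m:
            tagged.append({"row": row, "trip": trip_id, "kind": "point",
                           "route": route, "lon": float(m.group("lon")),
                           "lat": float(m.group("lat"))})
            continue
        m = TIME_RE.search(line)
        if m:
            tagged.append({"row": row, "trip": trip_id, "kind": "time",
                           "route": route, "ts": m.group("ts")})
    return tagged
```

Every record that survives carries the row number and the trip ID, which is what makes the later position-based joins and the "On Path" propagation possible.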
Then I split the flow so I could get all the points and all the times (each set had the same number of records), so I filtered the point info onto one path and the time info onto the other, and dropped all the other lines. I then sorted the data (to ensure all the lines were still in order) and rejoined based on record position. I also parsed the date and time on the time flow into a datetime stamp. I now had a time for each point.
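The split-sort-rejoin step amounts to a positional zip of the two streams. A minimal sketch, assuming the tagged records from the previous step carry ISO timestamps:

```python
from datetime import datetime

def join_points_times(tagged):
    """tagged: dicts with 'row', 'trip', 'kind' and either lon/lat
    (kind 'point') or an ISO timestamp 'ts' (kind 'time'). Splits the
    stream into the two sets, re-sorts each by row number, and joins
    them by position, so the Nth time belongs to the Nth point."""
    points = sorted((r for r in tagged if r["kind"] == "point"),
                    key=lambda r: r["row"])
    times = sorted((r for r in tagged if r["kind"] == "time"),
                   key=lambda r: r["row"])
    assert len(points) == len(times), "each set should have the same count"
    return [
        {"trip": p["trip"], "lon": p["lon"], "lat": p["lat"],
         "ts": datetime.fromisoformat(t["ts"])}
        for p, t in zip(points, times)
    ]
```

The join is only valid because both sets are in the same document order, which is exactly why the rows were numbered up front.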
I then took the route and split it using regex (I do love regexes; I hate writing them, but I do love them). This gave me a start name and an end name (which might be useful another time). I then re-sorted just in case and dropped the record ID, because I wanted the record IDs to be sequential and we had just dropped a huge number of rows that had no information pertinent to my project (like view position, etc.).
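Something like this is all the route split needs; the "Start to End" shape of the descriptor is my assumption about the export, not a documented format:

```python
import re

# Assumed "Start to End" shape for the route descriptor; the real
# separator in the Little Log Book export may differ.
ROUTE_SPLIT = re.compile(r"^(?P<start>.+?)\s+to\s+(?P<end>.+)$")

def split_route(route):
    """Split a route description into (start name, end name)."""
    m = ROUTE_SPLIT.match(route)
    return (m.group("start"), m.group("end")) if m else (route, None)
```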
I also converted each latitude/longitude pair into a spatial point, and if the previous record was in the same trip, I calculated the distance from the previous point and the time difference (usually a second, but the logger didn't log if you didn't move, so sometimes more). I calculated speed in km/h as distance (m) / time (s) * 3.6, and wrote it to an Alteryx database and my local staging server. I now had my full dataset.
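The per-point calculation is simple to sketch. Haversine below is a stand-in for whatever distance Alteryx's spatial tools compute internally:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points
    (a stand-in for the Alteryx spatial distance tool)."""
    r = 6371000.0  # mean Earth radius in metres
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2))
         * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * asin(sqrt(a))

def speed_kmh(dist_m, time_s):
    """Speed in km/h: distance (m) / time (s) * 3.6."""
    return dist_m / time_s * 3.6
```

For example, a point 25 m further on after one second works out to 90 km/h.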
When people discuss data projects we say data prep takes the longest. We are now halfway through the data prep.
Now, because each trip didn't start at the one intersection and end at the next, I needed to filter down to the points in the target area. I used the map input tool and the trade area tool. It took a bit of trial and error to find the perfect point that would isolate just the data from just after the Sangster Road intersection to the Sloane Road intersection.
Well, that left me with 138 206 data points falling within a 400-metre radius of the bridge over the highway.
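The trade-area filter is essentially a point-in-radius test. A sketch, where the bridge coordinates are placeholders rather than the real centre point found by trial and error:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000.0
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2))
         * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * asin(sqrt(a))

# Placeholder centre for the Main Road bridge over the N1; the real
# point came from trial and error with the map input tool.
BRIDGE_LAT, BRIDGE_LON = -26.0445, 28.0150
RADIUS_M = 400.0

def in_target_area(lat, lon):
    """True when the point falls within the 400 m trade area."""
    return haversine_m(BRIDGE_LAT, BRIDGE_LON, lat, lon) <= RADIUS_M
```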
So now the problem becomes evident. I mentioned that this was my primary route to and from work, but I also had clients on the East Rand, the West Rand, Sandton and Pretoria. The N1 is a very important road; I travelled it a lot, and it goes directly under (through) my data of interest. I thought I'd just match all trips that went across the bridge, and that would drop all the ones on the highway, because how often do you drive across that really busy bridge only to turn around and take a completely different route on the highway, right? Wrong! It turns out that I did that a lot. It is faster to cross the Main Road bridge and turn back to get onto the highway than it is to get onto the highway directly from the other side. Maybe that'll be a project for later.
What I did next was, I thought, quite clever. I matched all the points in a little 200 m circle on the south side of the data set. This only matched points that were:
In the target set
On the road I wanted
One of the other tricks I did earlier is also worth mentioning: I cut the data set just after the first traffic light and just before the next. This meant that if I graphed the points, there would be a jump in point order wherever data had been cut out.
What I did is I matched all the points that went across the bridge and left-joined the data so that matching points got a new "On Path" value equal to 1. So now I knew which points in my data set were on the route. Then, with the ordered data set, I set each subsequent point to be "On Path" if its number was one unit away and it was in the same trip. This meant that each point had to be sequentially connected to one over the bridge to be considered. This dropped the highway traffic and the Sangster hairpin.
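The sequential propagation can be sketched like this, where the points seeded with `on_path = 1` are the ones matched in the small circle over the bridge:

```python
def propagate_on_path(points):
    """points: dicts with 'trip', 'row' and 'on_path' (1 for seed
    points matched over the bridge, else 0). Marks a point on-path
    when the previous point in the same trip is on-path and exactly
    one row away, so only chains connected to a seed survive; gaps
    in row order (where data was cut) break the chain."""
    points = sorted(points, key=lambda p: p["row"])
    for prev, cur in zip(points, points[1:]):
        if (cur["trip"] == prev["trip"]
                and cur["row"] == prev["row"] + 1
                and prev["on_path"] == 1):
            cur["on_path"] = 1
    return [p for p in points if p["on_path"] == 1]
```

Because the row numbering jumps at the cuts made earlier, highway points that were never sequentially connected to a bridge point fall out automatically.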
I started using caching tools, so the flow does show an error, but then I don't have to repeat the previous steps and the runtimes are a lot better. I now wanted to work out the distance, which should be roughly the same for each trip, give or take a few metres; there is a higher drift in distance the faster you go (oh, there you are, Heisenberg). I got distance and direction by matching the data points against the minimum and maximum timestamps per trip, tagging the minimum-timestamp point as 1 and the maximum as 2. From that I could also work out the speed.
Now to finish the calcs, I rejoined the speed and distance to the data points. I need to be careful in Tableau to use an aggregation of speed and distance to ensure I don't duplicate the info.
So let's have a peek at the data now. Firstly, let's look at the start and end points:
Now we are getting somewhere. Aside from GPS drift, this is a really good indication of the left and the wrong side of the road. So now I want morning traffic only, which would be southbound, before 9AM.
Now, the spikes: in 2015 I started taking my kids to school down that road, so instead of the leisurely 8:30 drive I was now on the 7AM run. However, if I got to that section just before 7AM (which I finally got right in April 2015), I could get through it at pre-July-2014 times.
Let’s have a look by year and sort the durations based on start time.
If I hit that section of road at 6:55AM, I can get through quickly and predictably. If I hit it after 7, I am around 26 seconds slower (over 400 m). If I hit it at 8 it is quite quick, and after 9 it slows down again.
Please excuse me, I need to go set my alarm clock.