I was recently asked to do what turned into a fairly fiddly statistics project. A university library has moved to new premises, and the new building has a mixture of furniture types: armchairs, small café-style tables, large desks and so on. They want to monitor usage of the different furniture, to find out whether they've got the right mix for their students, or whether anything should be changed. This means tracking the number of students in each category of furniture at each counting time.
There are a few layers of complication.
- There are several counting times each day, so there's nested/duplicate data. This basically introduces a three-axis analysis: tracking the number of students in each type of seating at each time on each day. It also means that analyses over time will need to collate the information for each day, as will analyses by time of day.
- Counts can't always be taken, depending on staff availability, so sampling will be erratic.
- Furniture has been broken down heavily as a starting point, but may need regrouping later.
- A graduate study room is to be tracked as well, but separately from the main library.
I set up the usual table structure: ever-increasing Time and Date rows as the rows, and the furniture types as fixed columns, with a Total column at the end to sum the students in the library at any time. There's also a secondary column to track the Graduate room. I named this Readercount, and defined it as follows:
=OFFSET( 'Raw data'!$A$2, 0, 0, COUNTA('Raw data'!$A:$A), COUNTA('Raw data'!$2:$2) )
This creates a dynamic named range, which expands as new rows are added to the bottom. What it's doing is:
Select this stuff(Location of first cell I want, move down 0 rows, move across 0 cols, select a number of rows equal to (the number of rows in column A containing data), select a number of cols equal to (the number of cols in row 2 containing data )
In other words: get the first cell in the block, then select all the rows and all the columns that are in use.
It starts at Row 2, because row 1 contains values for the number of seats in each section, as I'll explain in a moment.
Using names rather than just blocks of cells has two advantages. One is that I can use dynamic ranges, which is really useful for calculations like sum and mean values. The other is that you don't have to worry about formulae breaking in unexpected way if you move, delete or add cells - as long as the name still selects the right cells, all formulae based on it should still work. It also helps pinpoint mistakes in formulae, because names are more transparent and simpler than cell references. You can avoid repeatedly defining complex cell ranges and risking mistakes.
A heatmap seemed like a sensible and intuitive way to map the average usage of each seating type, and show how it changes over time. For the 'heat', I decided to use the percentage of the seating type that was in use. I added in a row giving the number of seats of each type, and named it Seats - this is row 1.
On a new sheet, I did some test analyses to make sure things were working. For any complex work, I tend to go through step-by-step progression, which is a habit from coding, but also a useful way to check I actually know what I'm doing. This makes it much easier to pick up on typos, data entry errors and other confounding factors, before getting into complex formulae. Since I didn't really know what I was trying to do, I faffed about with several ideas and several new functions before getting something really useful; I won't bother discussing the mistakes and dead ends. I ended up with three main blocks of calculation.
The first is a simple block of potential usage. This calculates how many seats were available in each seating type, in each timeslot - adjusted to account for the number of counts taken. It shows the maximum possible students who could have been recorded.
This is a simple COUNTIF. I gave a timeslot in column B, and each successive column had a version of the following code:
=COUNTIF( Readercount, $Bthis row) * INDEX(Seats,0,column)
The row always matches the current row; this selects the timeslot value in column B as the timeslot you want to filter for with COUNTIF. Using $B allows easy drag-and-fill of the other columns. The column value picks the column from range Seats that matches the seating type.
In other words:
=COUNTIF( Readercount, this timeslot) * INDEX(Seats,0,this seating type )
Count up instances in the Readercount block of this timeslot, and multiply them by the number of seats listed for this seating type in the Seats block.
Note that with INDEX, if you enter 0 for the row or column, it will pass through every row or column in the chosen range. This is crucial to the COUNT* and SUM* formulae. If you enter a specific row and a specific column, it'll only look up one specific cell.
Next I calculated the actual seat usage: the total number of people who used each type of seating during each timeslot. I gave a timeslot in column B, and each successive column (representing a seating type) had a version of the following code:
=SUMIF( INDEX(Readercount,0,2), $Brow, INDEX(Readercount,0,column) )
Go through all rows in Readercount. Whenever the value in the Time column matches our timeslot value, find the column in that row that corresponds to the current seating type, and add its value to a running total.
In practical terms: total number of readers counted in that slot
Thirdly, I calculated the mean seat usage, by taking the total usage and dividing by the number of counts for the timeslot in question. Luckily, there was no variation in counts between the seating types, only between timeslots, so that's one less thing to worry about. As always, I gave a timeslot in column B, and each successive column (representing a seating type) had a version of the following code:
=SUMIF( INDEX(Readercount,0,2), $Brow, INDEX(Readercount,0,column) ) / COUNTIF(Readercount, $Brow)
=SUMIF( INDEX(Readercount,0,timeslot), this timeslot, INDEX(Readercount,0,this seating type) ) / COUNTIF(Readercount, timeslot)
Here's a timeslot and a seating type. For every timeslot value in Readercount that matches our chosen timeslot value, get that row's entry for our chosen seating type, and add it to a running total. Finally, divide that total by the number of counts made for this timeslot.
In practical terms: total number of readers counted in that slot / number of counts in that slot
Finally, I worked up to the percentage usage. This was simply the mean seat usage multiplied by 100.
=SUMIF( INDEX(Readercount,0,2), $Brow, INDEX(Readercount,0,column) ) / ( COUNTIF(Readercount, $Brow) * INDEX(Seats,0,column)) * 100
=SUMIF( INDEX(Readercount,0,timeslot), this timeslot, INDEX(Readercount,0,this seating type) ) / ( COUNTIF(Readercount, timeslot) * number of seats available ) * 100
For every timeslot value in Readercount that matches our chosen timeslot value, get that row's entry for our chosen seating type, and add it to a running total. Divide that total by the number of counts made for this timeslot multiplied by the number of seats in the room. Finally, multiply by 100 to give a percentage figure.
In practical terms: total number of students counted in that slot / hypothetical maximum number that could have been counted * 100
For the actual heatmaps, I selected relevant cells from the percentage usage block, and created a surface chart. There were several changes necessary.
- The default interval for the 'heat' (percentage usage) scale was 20. I had to tweak this by going into the Depth Axis, and setting the Major Unit to 10.
- The default colours are revolting and useless for a heatmap. Astonishingly, there doesn't seem to be any simple way to specify a gradient for the colours used, or any preset colour ranges. You can, however, add a gradient fill to the legend, and you can add a gradient as the fill for a single value: I can't think of a single reason why you'd ever want to do either of these things. Baffling. I had to manually go through each 'band' on the legend and set its fill colours.
- The vertical time axis automatically places the earlier values at the bottom of the scale. I found this unintuitive, and flipped it (Format Axis > Options > Series in Reverse Order).
- A graduate study room is to be tracked as well, but separately from the main library.
The heatmap indicates that the “South Desk” and “West Desk” are by far the most popular, that the “Comfy Chairs” and “Oval Table” are the least popular, and shows up the generally higher occupation rates between 11.30-5.30pm, which correspond to more students present.
On request, I also created a line graph of the seating usage, with each timeslot as a separate line. Compared to the heatmap, this emphasised the pattern of use for each seating type. It demonstrated that this was more or less constant throughout the day, except for a slightly increased tendency to use armchairs at the end of the day, based on a small sample of counts at this time.
Analysing the graduate data proved considerably fiddlier, even though it's a single column of data. I'll discuss that in another post. EDIT: Other post is now up.
If anyone is interested, you can get a (slightly cleaned-up, but hopefully otherwise unchanged) copy of the spreadsheet here.