Wednesday 16 January 2013

Excel: calculating composite actual and proportional use with COUNTIFS and SUMIFS

This is the second part of a discussion of my recent capacity-usage stats project tracking where and when people sit in a library. It might be helpful to any project involving nested data, comparing incomplete datasets, or datasets which increase in size (as in, number of rows of data) during the project. You can find the first part here.

This part of the job involved COUNTIFS, SUMIFS, INDEX, COUNTA and named ranges to track composite average usage while compensating for patchy data.

This half of the project concerned tracking usage of the library building's graduate study space, which is not part of the library, and therefore was monitored separately. People were interestd in the patterns of use over time, and in the proportion of capacity that was actually used, to help establish how efficiently the space is exploited. This could, for example, show that there was insufficient study space, or that the space was underused and could be better put to some other purpose. The main complication of this task, compared to the last, was that counts in the study space were much patchier than in the library. All counts included each area of the library, but sometimes the staff didn't get round to counting in the graduate space, since this involved going to a different area of the building.

Because the analyses include potential versus actual usage, the number of counts taken influences the figures. This means that it's important to only include data from counts that actually included the graduate space, otherwise any analyses will behave as though the count was zero, undermining accuracy and leading to under-reporting.

The potential-use-tracking field, which checks the number of counts taken and the amount of seating available, needs to include only those counts which actually included the graduate space (those with a non-empty value in that column). The actual-use-tracking field needs to only include rows with a count entry for the graduate space (only sum values from rows that include a non-null value in the "graduate space" column).

Luckily, at this point I discovered SUMIFS and COUNTIFS.

To begin with, I set up Time entries as rows. For columns, I tracked Number of Counts, Total Users (in that timeslot), Potential Use (at full capacity), Mean Use, and Percentage Use.

As I mentioned last time, the range of cells with recorded values in is called Readercount, and the set of capacity values for seating types is called Seats. To count cells based on multiple criteria, I turned to COUNTIFS. I calculated the number of relevant counts as follows:

=COUNTIFS( INDEX(Readercount,0,time column), $Bchosen timeslot, INDEX(Readercount,0,grad space column), "<>" )

=count up( all entries in the Time column within the range Readercount, if they match the chosen timeslot, and if the value in the grad space column is not null )

For summing the number of readers at various points, I had to use SUMIFS instead. This handy value sums up cells that meet multiple criteria.

=SUMIFS( INDEX(Readercount,0,grad space column), INDEX(Readercount,0,time column), $Bchosen timeslot, INDEX(Readercount,0,grad space column), "<>" )

=sum up( all entries in the grad space column within the range Readercount for which the time column in that row matches the chosen timeslot and the value in the grad space column is not null )

Potential use was:

=COUNTIFS( INDEX(Readercount,0,time column), $Bchosen timeslot, INDEX(Readercount,0,grad space column), "<>" ) * INDEX(Seats,0,grad space capacity )

Average use was just total users divided by number of counts taken for the chosen time slot. Rather than build the formulae again, I simply referenced the existing cells.

=D4/C4

In other words:

=Sum of users / Number of counts

In full, this is:

=SUMIFS( INDEX(Readercount,0,grad space column), INDEX(Readercount,0,time column), $Bchosen timeslot , INDEX(Readercount,0,grad space column), "<>" ) / COUNTIFS( INDEX(Readercount,0,time column), $Bchosen timeslot, INDEX(Readercount,0,grad space column), "<>" )

You can probably see why I went for the cell references...

Similarly, percentage use was:

=D4 / ( C4 * INDEX(Seats,0,16) ) * 100

or

=total users / ( counts taken * grad space capacity ) *100

In case anyone's curious, there's a couple of reasons why I used range references for the seating capacities in this project, rather than just plain numbers.

  1. It makes the formulae more transparent, even though it also makes them more complex. Anyone reading through it can quickly work out, and confirm by finding the range in question, that INDEX(Seats,0,16) refers to a fixed value that corresponds to a seating capacity. With plain numbers, it's much harder to work out where the numbers are coming from and why they're there. This is important for debugging, and also makes it easier for someone other than me to maintain the spreadsheets in future.
  2. It allows for changes in measurement type. Some of the capacities are somewhat hazy. Students sometimes use window bays as improvised seating. The graduate space has a few armchairs as well as the work desks; at present, students are counted regardless of where they're sitting, but only the work desks are treated as "capacity", because the aim is to determine if there are enough useful workspaces for everyone.
  3. It allows for future-proofing. If the capacity in a room is changed, someone can amend the value in a single cell and start a new set of statistics.
  4. It makes the project more transferrable. Individual formulae, or the whole spreadsheet, can be copied over to a new document and tweaked to track similar data elsewhere.
  5. Because I do a lot of filling with my formulae ('fill down' and the like), actual numbers are a potential liability, as they can end up changing across the block of cells, rather than remaining static. Not that hard to overcome, but annoying nevertheless.

I used exactly the same structures to calculate for date, so I won't bother repeating that. I simply changed which column I referenced in INDEX.

Building charts

The previous charts, for the library itself, were simply tracking average values that were constantly updated. So they could simply track a fixed block of cells. The same is true of the use-by-timeslot chart for the graduate space, which simply took average values from the chart.

For the graduate space, however, I also had to generate a chart showing patterns of usage over time, which needed updating alongside the data. While it's a bit fiddly in the short term, the lowest-maintenance and most elegant way to do this was to use named ranges in the chart's dataset.

I defined two new dynamic named ranges for this sheet: Graddate and Graduse.

Graddate was defined as

=OFFSET( 'Graduate charts'!$B$11, 0,0, COUNTA( 'Graduate charts'!$B:$B) -7 )

=OFFSET( this worksheet!$date column$first row in date column, 0,0, COUNTA( this worksheet!$date column:$date column ) -7 )

In the Graduate Charts worksheet, go to the first cell in the Date column, and adjust this starting position by 0 rows and 0 columns. From here, select a number of cells equal to ( number of non-empty cells in the date column - number of irrelevant non-empty cells in this column )

The reason for the -7 is simply that this worksheet contains all the grad space analysis, both timeslot-based and date-based. The same column (B) contains seven cells'-worth of data relating to the timeslot-based calculations, which is irrelevant here, and I compensated by ignoring 7 rows. This wasn't really necessary - all that would happen otherwise is seven blank rows at the end of the date-based section would be included in the range. Excel should ignore these when generating charts, so it wouldn't make much difference, but it's tidier and better practice to make the adjustment as I go along. After all, in future I might use some of those cells for something else, and break the formulae. In theory I could have used another COUNTA rather than the fixed 7 value, which would have been better practice, but... I didn't.

To get the charts, I used these names as data sources. I started simply by manually selecting some appropriate cells from the correct columns, then went into Select Data Source.

I set the Legend Entries (Series), in the Series values (not Series name!) to:

='name of file' !Graduse

I set the Horizontal Category Axis Labels to:

='name of file' !Graddate

This gave me a chart based on a dynamic range of cells. As soon as further entries are added to the worksheet, the chart will adapt to include their data. All I have to do is keep filling down the dates and formulae as new data is added.

Annoyingly, I have to manually delete dates when no counts were taken in the graduate space - including weekends and holidays - as these produce #DIV/0 errors, and are treated as counts with zero students on the graph. Actual zero counts are left in, of course.

I could, in theory, write some code to do this for me. Every time a new row was added to the raw data, it would check to see if the date was already present on the grad calculations sheet, and if not, add a new valid line so long as a count was present for the graduate space. So far it hasn't been worth the time.

Pseudocode for adding new rows

This is pure pseudocode, dashed off hastily, but it gives the outline of how I'd do this if necessary.

onEvent(rawData.gradCount.newRow)
{
  if(gradData.indexOf(newRow.date)==0)&&(newRow.gradCount!='null')
    {
    get(gradData.lastRow)
    setDate(gradData.lastRow)==rawData.newRow.date
    fillDown(gradData.lastRow.formulae)
    }
}

Finally, here is a link to a (slightly tidied up) version of the Excel file.

No comments:

Post a Comment