Useful Student Evaluation Data

Posted on 2016-02-06 by nbloomf
Tags: unix, cli, awk, latex

I’ve had a few people ask about a script I wrote to scrape student evaluation data from the reports generated by Blackboard. That script, and instructions for using it, are available at the bottom of this page. But I decided it is also appropriate to write up how the script was written, because (1) it is very likely that it will need to be changed in the future and (2) it’s a nice case study for the power of the unix way.

I recommend looking at the script itself while reading this post.

The Problem

My university, like many others, uses student evaluations to help measure the effectiveness of its faculty. The results of these evaluations are used to improve teaching practice and make decisions about retention and promotion, and so are an important part of a faculty member’s professional portfolio. Many universities, mine included, have moved recently from fill-in-the-bubble paper-based evaluations to online surveys. For NSU, and I suspect many others, this means evaluations are given inside Blackboard, an education-focused CMS.

The problem is this. There is exactly one way to get the results of my student evaluations: by generating a report inside Blackboard complete with colored graphs. Unfortunately this report is typically between 8 and 12 pages long, and the data is presented in a nearly useless way, with lots of cruft and distracting, unnecessary junk. If I could get the raw evaluation data out, this alone would be no problem. But there is no “Export Data” function available.

So it appears we have a choice to make: we can put up with ugly, unreadable student evaluation results, or we can painstakingly transcribe the data from the reports to (say) a spreadsheet. Both of these options are unacceptable to me, because I am a stubborn aesthete and profoundly lazy. But as you may expect, there is a third option! The data is clearly all there in the report, obfuscated. There is hope that we can get it out programmatically.

The Solution

Let’s take a closer look at the report itself. There are two ways to generate a student eval report: as a webpage, with the numerical data presented with charts, and as a PDF which also includes all the student comments. The webpage is a plain HTML document, which is just structured text. A priori, it is much easier to extract structured data from text than from a PDF, so let’s start there.

The eval data in the HTML report is presented as a graphic chart. This is bad, because it is much much harder to extract information from an image than from text. Now I’ll save a local copy of the HTML report, called report.html, and open it in a text editor. Poking around, I can see it was clearly not intended for human eyes; the HTML elements are given classes with names like “style_32” and most of the element IDs are unhelpful. Properly semantic HTML should use meaningful class names and IDs. But OK.

This is a huge file, so let’s search for information we care about and know is in there: one of the questions. The first question starts with “My instructor contributed” and that string is probably unique enough. Searching for that string brings up two results. The first is the text of the question as presented in the report. The second instance is this line, which I’ve broken up here for readability:

<table id="test-responses-nonmatrix-My instructor 
contributed to my learning in this class."


This table has everything we want: the text of the question in the table ID, and the raw response data given by “series” and “value”. (That “display:none;” style also explains why this useful information is not shown in the report.) The data here is (0,5.0), (1,5.0), (2,2.0), (3,3.0), and (4,1.0). Comparing these numbers to the original report, it looks like response 0 corresponds to “Strongly Agree”, 1 to “Agree”, and so on, and the decimal numbers are the number of responses, because of course it makes sense to count things with decimal numbers.

So clearly the data is there; now we just have to get it out and clean it up.

Get it out

We’re going to use some command line unix tools. Unix (by which I mean also unix-like systems like linux and osx) has a large ecosystem of small programs which each perform tasks only within a focused scope, but which are easily composed with one another to do more complex tasks.

Since we’re working with an HTML file, which is inherently more structured than plain text, we will start by thinking of tools which can work with HTML. The W3C, which maintains the HTML standard, also maintains a collection of unix tools for working with HTML called the html-xml-utils. While not installed by default on most systems, these live at W3C and can be installed using apt-get, brew, or yum. Three of these tools in particular are handy for us: hxnormalize, which makes an HTML file XML-conformant (required for the other hx-tools); hxclean, which “applies heuristics to correct an HTML file”; and hxselect, which extracts from an HTML file those elements matching a given CSS selector. hxselect is the real magic, as with a little more understanding of the structure of an HTML file we can use it to pull out precisely what we want.

Going back to the report HTML, notice that (1) the data we want is in a table whose ID starts with test-responses-nonmatrix, and (2) the data table is inside a td of class style_58. Searching for the next instance of test-responses-nonmatrix shows that the next data table is also inside a style_58 element. Of course style_58 is meaningless, but a reasonable start is to try pulling out only the table elements inside elements of class style_58 using hxselect; that can be done using the CSS selector .style_58 table. After a hxnormalize and hxclean, indeed this extracts exactly the response data and nothing else.
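
As a rough sketch, the extraction step just described might look like this. It assumes the html-xml-utils are installed and the report is saved as report.html; the function name extract_tables and the output file tables.html are made up for illustration, and the real script may arrange the commands differently.

```shell
#!/bin/sh
# Sketch only: extract_tables is a made-up helper name, and the exact
# commands in the real script may differ.
extract_tables() {
    # hxnormalize -x : rewrite the input as well-formed XML
    # hxclean        : apply heuristics to correct the HTML
    # hxselect       : keep only the elements matching the CSS selector
    hxnormalize -x "$1" | hxclean | hxselect '.style_58 table'
}

# Run only when the tools and a saved report are actually present.
if command -v hxselect >/dev/null 2>&1 && [ -f report.html ]; then
    extract_tables report.html > tables.html   # tables.html is a hypothetical name
fi
```

Because the selector depends on an auto-generated class name, this is the part most likely to need editing if the report format changes.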

Clean it up

With hxnormalize, hxclean, and hxselect, we can get all the evaluation data out of the HTML report; now it’d be nice to get it into CSV format. CSV is a good format for tabular data because it is easy to work with. Using sed we can split the data to one question per line, change all of the <td>0</td> to SA (for “strongly agree”), get rid of the .0 in the results, and remove the HTML cruft. This turns each question into one line like this:

"My instructor contributed to my learning in this class." SA 5 A 5 U 2 D 3 SD 1

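Here is a minimal sketch of that sed cleanup, for one question already on a single line. The row markup in the sample (one tr per response, holding a series cell and a value cell) is my assumption about what the report looks like, and label_responses is a made-up name; the real script may use different expressions.

```shell
#!/bin/sh
# Sketch only: the row markup in the sample is an assumption about the
# report, and label_responses is a made-up helper name.
label_responses() {
    # 1. Turn the series cells 0..4 into response labels.
    # 2. Unwrap the value cells and drop the trailing .0 from each count.
    # 3. Replace the opening table tag with the quoted question text.
    # 4. Strip whatever tags are left, then tidy the spacing.
    sed -e 's|<td>0</td>| SA |g' \
        -e 's|<td>1</td>| A |g' \
        -e 's|<td>2</td>| U |g' \
        -e 's|<td>3</td>| D |g' \
        -e 's|<td>4</td>| SD |g' \
        -e 's|<td>\([0-9][0-9]*\)\.0</td>|\1|g' \
        -e 's|<table id="test-responses-nonmatrix-\([^"]*\)"[^>]*>|"\1"|' \
        -e 's|<[^>]*>||g' \
        -e 's|  *| |g' -e 's| $||'
}

# Hypothetical one-line version of a question table.
sample='<table id="test-responses-nonmatrix-My instructor contributed to my learning in this class." style="display:none;"><tr><td>0</td><td>5.0</td></tr><tr><td>1</td><td>5.0</td></tr><tr><td>2</td><td>2.0</td></tr><tr><td>3</td><td>3.0</td></tr><tr><td>4</td><td>1.0</td></tr></table>'
printf '%s\n' "$sample" | label_responses
```

On this sample it should print the same labeled line as the example above.
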
Almost there. There is a problem, though; in the original data, if any question has a response type with no responses, rather than report a zero, that response type is omitted. So for instance if no students had responded Uncertain to a question the results would be

"question text" SA 1 A 3 D 1 SD 2

rather than

"question text" SA 1 A 3 U 0 D 1 SD 2

To get a good CSV file we need zeros for any missing responses. I did this using sed regular expressions with backreferences, one for each possible combination of present and missing responses. There’s probably a more elegant way to do this, but with only 30 cases to handle brute force is fine.
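
For what it’s worth, here is one possible "more elegant way", sketched in awk rather than sed. This is not what the script does; it is an alternative that reads whichever label/count pairs are present and prints all five in order, with 0 for any that are missing. The name fill_zeros is made up, and the sketch assumes the question text is wrapped in double quotes as in the lines above.

```shell
#!/bin/sh
# Sketch only: an awk alternative to the sed approach, not the original code.
fill_zeros() {
    awk '
        BEGIN { split("SA A U D SD", label, " ") }
        {
            # Everything up to the last double quote is the question text.
            question = $0
            sub(/"[^"]*$/, "\"", question)
            rest = substr($0, length(question) + 1)

            # Record each label/count pair that is present on this line.
            split("", count)                  # portable way to empty an array
            n = split(rest, tok, " ")
            for (i = 1; i < n; i += 2) count[tok[i]] = tok[i + 1]

            # Print all five labels in order; a missing count formats as 0.
            printf "%s", question
            for (j = 1; j <= 5; j++) printf " %s %d", label[j], count[label[j]]
            printf "\n"
        }
    '
}

printf '%s\n' '"question text" SA 1 A 3 D 1 SD 2' | fill_zeros
```

The trade-off is one more tool in the pipeline in exchange for not having to enumerate every combination by hand.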

To finish up, I replaced all the response type labels with commas and included a header line. The result is a clean CSV file, ready for further processing with spreadsheet software or other tools.
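
That last step can be sketched like so, assuming every line already has all five labels in the order SA, A, U, D, SD. The function name to_csv is made up, and the header text is the one the finished CSV uses.

```shell
#!/bin/sh
# Sketch only: to_csv is a made-up helper name.
to_csv() {
    # Header line, using the column names from the finished CSV.
    printf '%s\n' 'Prompt,Strongly Agree,Agree,Undecided,Disagree,Strongly Disagree'
    # Swap each response label (and the spaces around it) for a comma.
    sed -e 's/" SA \([0-9]*\) A \([0-9]*\) U \([0-9]*\) D \([0-9]*\) SD \([0-9]*\)$/",\1,\2,\3,\4,\5/'
}

printf '%s\n' '"question text" SA 1 A 3 U 0 D 1 SD 2' | to_csv
```

Anchoring the expression on the closing quote of the question keeps it from touching a stray "A" or "D" inside the question text itself.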

The Results

If you’re feeling impatient, you can download the final script. To use it,

  1. Download the script to the directory where you wish to use it and cd to that directory in a terminal.
  2. Make sure you have the html-xml-utils. (To quickly find out, try man hxselect.) How you get these depends on your system; Debian-based unices can say apt-get install html-xml-utils.
  3. Make the script executable with chmod +x followed by the script’s filename.
  4. Download the HTML version of the student evaluations you’d like to scrape; save this file to the directory which contains the script.
  5. Say ./ followed immediately by the script’s filename, then file.html, where file.html is the name of the report to scrape.
  6. The results are in eval.csv.
  7. Important! Spot-check the results to make sure they agree with the original report. This script has been tested on the Fall 2015 evaluations, but the report may change from semester to semester. Fortunately, it “should” be “straightforward” to edit the script for other semesters.
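
One cheap way to help with the spot-check in step 7 is to total the responses for each question and compare those totals against the report. This is a sketch under the assumption that eval.csv has a header line followed by one line per question ending in five counts; row_totals is a made-up name.

```shell
#!/bin/sh
# Sketch only: row_totals is a made-up helper name.
row_totals() {
    awk -F, '
        NR > 1 && NF >= 6 {
            total = 0
            # The five counts are always the last five fields, even if the
            # prompt itself contains commas.
            for (i = NF - 4; i <= NF; i++) total += $i
            printf "Q%d total=%d\n", NR - 1, total
        }
    ' "$@"
}

# Run only when the scraper has already produced its output file.
if [ -f eval.csv ]; then
    row_totals eval.csv
fi
```

A total that is wildly off from the number of students who responded is a hint that the report format has shifted.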

One unsolved problem is that this script only gets the numerical eval data, not student comments. This can also be done using a similar strategy with a tool called pdftotext. But that’s a story for another day. :)

Update: Formatting with LaTeX and AWK

Another semester, another set of evaluations! I am pleased to find that, half a year later, the original spiff script still extracts eval results from the BB generated reports with no changes. (Remember: our script is at the mercy of whoever or whatever generates the HTML reports!) The original investment of effort starts paying off. There is room for improvement, though. In my portfolio, I don’t just include a printout of a CSV file with raw evaluation data. Instead, the data is presented in a pretty LaTeX table. Last semester, I converted my CSV to LaTeX tables by hand. It was tedious (though not as tedious as it could have been!), but okay given my constraints at the time. This semester I’ve got some room to think about a better way.

The spiff script takes one class’ worth of eval data and turns it into lines of text like so:

Prompt,Strongly Agree,Agree,Undecided,Disagree,Strongly Disagree
"Question 1",0,1,0,0,0
"Question 2",0,0,0,1,0
... et cetera

But I want to display my results in a table, which is written in LaTeX like this:

 & Prompt & SD & D & U & A & SA & Mean \\
1. & Question 1 & 0 & 0 & 0 & 1 & 0 & 75\% \\
2. & Question 2 & 0 & 1 & 0 & 0 & 0 & 25\% \\
... et cetera

Note the change in response order, the question numbers, and the calculated mean. This is a job for AWK! In one sentence, AWK is a small but expressive language which picks its teeth using line-oriented text manipulation problems like this. A basic AWK program takes as its input a text file, and looks something like this.

BEGIN { (stuff to do before we see any text) }
/pattern-1/ { (stuff to do with lines that match regex pattern-1) }
/pattern-2/ { (stuff to do with lines that match regex pattern-2) }
/pattern-n/ { (stuff to do with lines that match regex pattern-n) }
END { (stuff to do before we exit) }

AWK sits in between sed and bigger tools like Perl and Python on the power/complexity continuum.
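
To make that skeleton concrete, here is a tiny (and not very useful) AWK program wrapped in a shell function. It assumes input shaped like the CSV above; the name count_rows is made up.

```shell
#!/bin/sh
# Sketch only: count_rows is a made-up helper name.
count_rows() {
    awk '
        BEGIN      { rows = 0 }                  # runs once, before any input
        /^Prompt,/ { next }                      # header line: skip it
        /^"/       { rows++ }                    # data line: starts with a quoted prompt
        END        { print rows " data rows" }   # runs once, after the last line
    '
}

printf '%s\n' \
    'Prompt,Strongly Agree,Agree,Undecided,Disagree,Strongly Disagree' \
    '"Question 1",0,1,0,0,0' \
    '"Question 2",0,0,0,1,0' | count_rows
```

Each input line is tested against the patterns in order, and the matching blocks run; that is more or less the whole execution model.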

The AWK script cleanup takes the nice CSV file produced by spiff and converts it to the body of a LaTeX table, ready to be inserted in a template file I keep with my other portfolio documents. To use it on the file data.csv simply put cleanup.awk and data.csv in the same directory, cd there, and say

awk -f cleanup.awk data.csv > out.txt

In all, it took longer to write this explanation than to write the AWK script (including time reading the AWK documentation).
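
To give a flavor of the conversion without reproducing cleanup.awk itself, here is a sketch of the same idea. The mean formula is my inference from the two sample rows (SD counts as 0%, D as 25%, U as 50%, A as 75%, SA as 100%, weighted by the number of responses), and csv_to_latex is a made-up name. The percent sign is written as \% because a bare % starts a comment in LaTeX.

```shell
#!/bin/sh
# Sketch only: not the actual cleanup.awk; the mean formula is an assumption.
csv_to_latex() {
    awk -F, '
        NR == 1 { print " & Prompt & SD & D & U & A & SA & Mean \\\\"; next }
        NF >= 6 {
            # The last five fields are the counts, in the order SA,A,U,D,SD.
            sa = $(NF-4); a = $(NF-3); u = $(NF-2); d = $(NF-1); sd = $NF

            # Everything before them is the prompt, which may contain commas.
            prompt = $1
            for (i = 2; i <= NF - 5; i++) prompt = prompt "," $i
            gsub(/^"|"$/, "", prompt)

            # Weighted mean on a 0 to 100 scale (assumed formula).
            total = sa + a + u + d + sd
            mean = (total > 0) ? (100*sa + 75*a + 50*u + 25*d) / total : 0

            printf "%d. & %s & %d & %d & %d & %d & %d & %.0f\\%% \\\\\n",
                NR - 1, prompt, sd, d, u, a, sa, mean
        }
    '
}

printf '%s\n' \
    'Prompt,Strongly Agree,Agree,Undecided,Disagree,Strongly Disagree' \
    '"Question 1",0,1,0,0,0' \
    '"Question 2",0,0,0,1,0' | csv_to_latex
```

Note that this sketch does not escape LaTeX special characters inside the prompt text, so a question containing & or % would still need attention.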

A note on modularity

The purpose of cleanup.awk is to turn our eval data into a nicely marked up LaTeX table. This feature could have been added directly to spiff, so that next semester we only have to run one command to go from HTML to LaTeX. But I argue that this is the Wrong Thing, and it is actually better to have two tools than one. There are two different functions being performed here: “extract data” and “format data”. By making the implementations of these two functions separate, we are free to change the “format data” step, or replace it altogether, or even to have multiple different formatting tools living side-by-side. And if, say, the HTML report format changes wildly and spiff is broken beyond repair, we can focus on replacing only the “extract data” function, safe in the knowledge that as long as we respect the input expected by cleanup we don’t have to worry about LaTeX formatting.