"You just won't believe how vastly, hugely, mind-bogglingly big it is."

2018-03-26 Practical Pyplot with

I've got some data I've been collecting for the past year or so about my personal finances, my weight, blood pressure, and so on. I thought it would be fun to graph it all, and perhaps to set up a cron job to graph it automatically as I update it. The data is kept, like the rest of my life that isn't actually executing on the grey matter inside my skull, in org-mode files. I wanted to write a script to take one of these files, munge it however necessary, produce a nice line chart that shows the changes in the different columns over time, and upload it to a webserver so I have easy access to it.

Duct Tape

The library I reached for, pyplot, is part of matplotlib, itself part of the constellation of scientist-y Python modules with much more colorful names like NumPy and Pandas. This corner of the Python world is a Great Abstruse Mystery to me, so I thought this was a good opportunity to get a better handle on it. Unsurprisingly, Pandas, the portion of this ecosystem that is used for data import, export, and manipulation, does not have a built-in way to extract information from an org-mode table. Fortunately, it can consume CSV files, so my first task is to get the data into CSV. I've yet to find a reasonable Python library to reliably parse org-mode files in this way (which is why I am in the process of writing and perfecting such a parser for another project, org-blog, which I hope eventually to spin out into a standalone library), but Emacs itself can do it quite reliably with the org-table-export function, bundled with Org, and therefore with all recent Emacs versions.
This leaves me with a quandary: Do I embed a call to the emacs executable from within my Python file? Write my script in Emacs Lisp and call out to Python? I chose a third way, which is just to write the script in bash, and call out to the Emacs binary and the Python library.


Exporting data from an org-mode file that contains only a single table starting on the first line (as is the case with all the data I am interested in at the moment) turns out to be surprisingly easy:
emacs -q \
  --batch \
  --file "$src" \
  --eval "(org-table-export \"$dest\" \"orgtbl-to-csv\")"
$src here is the org-mode file you're exporting and $dest is the CSV file you're exporting to.
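For comparison, the first option from the quandary above (calling Emacs from inside Python) might look something like this minimal sketch. The helper names build_export_command and org2csv_py are hypothetical, and it assumes an emacs binary on the PATH. The argument list simply mirrors the shell command above, with the destination path escaped so it survives being embedded in an Emacs Lisp string.

```python
import subprocess


def build_export_command(src, dest):
    """Build the argv list for a batch-mode org-table-export call."""
    # Escape backslashes and double quotes for the Emacs Lisp string literal.
    quoted = dest.replace("\\", "\\\\").replace('"', '\\"')
    return [
        "emacs", "-q", "--batch",
        "--file", src,
        "--eval", f'(org-table-export "{quoted}" "orgtbl-to-csv")',
    ]


def org2csv_py(src, dest):
    """Run the export; raises CalledProcessError if Emacs exits non-zero."""
    subprocess.run(build_export_command(src, dest), check=True)
```

Passing a list to subprocess.run sidesteps shell quoting entirely, which is the main thing this variant buys over the bash version.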
Next problem: the data I'm interested in graphing all has a leftmost column named "date" that holds an org-mode inactive timestamp. This also isn't a format that is easily slurped by Pandas. Since I'm already in a shellscript anyway, I reached for sed to do a quick transform into a more manageable date format:
sed -i 's/^\[//' "$dest"
sed -i 's/\ [A-Z][a-z][a-z]\]//' "$dest"
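To see what those two substitutions do to a row, here is a rough Python equivalent, offered as a sketch only. The function name clean_org_date and the sample rows are made up for illustration. Like sed without the g flag, each pattern is applied at most once per line.

```python
import re


def clean_org_date(line):
    """Turn a leading '[2018-03-26 Mon]' into '2018-03-26'."""
    # First pass: drop the opening bracket at the start of the line.
    line = re.sub(r"^\[", "", line, count=1)
    # Second pass: drop the space, three-letter day name, and closing bracket.
    line = re.sub(r" [A-Z][a-z][a-z]\]", "", line, count=1)
    return line


# Hypothetical rows; the header has no brackets, so it passes through untouched.
for row in ["date,weight", "[2018-03-26 Mon],185.2"]:
    print(clean_org_date(row))
```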
Put it all together, and we have a nice little bash function:
function org2csv {
  src="$1"
  dest="$2"
  emacs -q \
    --batch \
    --file "$src" \
    --eval "(org-table-export \"$dest\" \"orgtbl-to-csv\")"
  sed -i 's/^\[//' "$dest"
  sed -i 's/\ [A-Z][a-z][a-z]\]//' "$dest"
}


Now that I have a neat CSV with nicely-formatted dates, it should be pretty easy to output a little graph. It should be and, now that I've spent most of a couple hours digging into the Matplotlib/Pyplot documentation, it is! First of all, I figured out that I need to use a non-interactive backend. The Matplotlib "agg" backend seemed to fit the bill, but the gotcha is that, when importing both matplotlib and pyplot, you need to set the backend after the first but before the latter. We also want pandas and sys, for reasons that will become clear later, leaving us with this at the top of our script:
import sys
import pandas as pd
import matplotlib
# use a noninteractive backend (this must happen before we import pyplot)
matplotlib.use('agg')
import matplotlib.pyplot as plt
Next, we read in from the script arguments where the input CSV file and output PNG file will live:
src = sys.argv[1]
dest = sys.argv[2]
Then we start importing the data, specifying that the top row of the CSV is the header (and therefore contains the names of the columns instead of data per se). We also have to change the "date" column to be a real datetime column instead of just a string (it was in anticipation of this that we did the sed funny business earlier). It's also important that we set the "date" column as the index. This turns the data object into one with a DatetimeIndex instead of a RangeIndex, an important distinction that is necessary before Pandas will let us take the next step, which is to resample at a monthly interval:
data = pd.read_csv(src, header=0)
data['date'] = pd.to_datetime(data['date'])
data = data.set_index('date')
# resample() only returns a Resampler, so aggregate and keep the result
# (a monthly mean is assumed here; newer pandas spells the alias 'ME')
data = data.resample('M').mean()
This resampling is crucial because it smooths out all the parts of our data that are missing (I only weigh myself a couple times a week, for example, not every day).
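As a small illustration of what monthly resampling does to sparse readings, here is a sketch on made-up data. The weight column and its values are hypothetical, and taking the mean of each month is an assumption about the aggregation. It passes a MonthEnd offset object rather than a string alias, since the spelling of the month-end alias differs between pandas versions.

```python
import io

import pandas as pd

# Hypothetical readings: two in January, one in February.
csv_text = """date,weight
2018-01-02,180.0
2018-01-20,178.0
2018-02-10,177.0
"""

data = pd.read_csv(io.StringIO(csv_text), header=0)
data["date"] = pd.to_datetime(data["date"])
data = data.set_index("date")  # the index is now a DatetimeIndex

# One row per calendar month, labelled with the month's last day.
monthly = data.resample(pd.offsets.MonthEnd()).mean()

# January averages to 179.0; February has a single reading, 177.0.
print(monthly)
```

Note that resample() on its own only returns a lazy Resampler object; nothing changes until an aggregation such as mean() is called and the result is kept.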
The heart of the script is where we do the graphing itself (scientist types apparently call this "plotting"), a simple iteration over all the columns in our data object:
for col in data.columns:
    plt.plot(data.index, data[col], '-', label=col)
Since we've properly set labels for each of the individual plots on our graph, generating a nice legend is quite easy. As the cherry on top, we turn on grid lines so that the graph is a little easier to read, and save the resulting figure out as a PNG:
plt.legend()
plt.grid(True)
plt.savefig(dest, bbox_inches='tight', pad_inches=2)
I could have put this code in a separate .py file and called it from my shellscript, but in a utility script like this, I find it's nice to keep it all in one file if possible, so I wrapped it up in a bash function, employing the -c flag to the Python binary:
function csv2png {
#usage: csv2png SRC DEST
python3 -c "
import sys
import pandas as pd
import matplotlib
# use a noninteractive backend (this must happen before we import pyplot)
matplotlib.use('agg')
import matplotlib.pyplot as plt

src = sys.argv[1]
dest = sys.argv[2]

data = pd.read_csv(src, header=0)
data['date'] = pd.to_datetime(data['date'])
data = data.set_index('date')
# resample() only returns a Resampler, so aggregate and keep the result
# (a monthly mean is assumed here; newer pandas spells the alias 'ME')
data = data.resample('M').mean()

for col in data.columns:
    plt.plot(data.index, data[col], '-', label=col)

plt.legend()
plt.grid(True)
plt.savefig(dest, bbox_inches='tight', pad_inches=2)
" "$@"
}
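The reason sys.argv[1] and sys.argv[2] still line up when the code arrives via -c is that Python puts the literal string '-c' in slot zero and the trailing arguments after it. A quick way to check this, as a sketch with placeholder file names:

```python
import subprocess
import sys

# Run a one-line program with -c and two trailing arguments (placeholder names).
snippet = "import sys; print(sys.argv)"
result = subprocess.run(
    [sys.executable, "-c", snippet, "in.csv", "out.png"],
    capture_output=True,
    text=True,
    check=True,
)

# prints ['-c', 'in.csv', 'out.png']
print(result.stdout.strip())
```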


I talked in a previous post about how useful it is to be able to use TRAMP for a quick and dirty data transfer in cases where you need to jump through an intermediate box, and I thought I'd use the same trick here, since that's exactly the case I found myself in. I came up with this function, which uses copy-file rather than copy-directory, since I'm only moving one file, not a whole tree:
function trampcp {
  src="$1"
  dest="$2"
  emacs -q \
    --batch \
    --eval "(copy-file \"$src\" \"$dest\" t)"
}
Once I've got my PNG, this will allow me to upload it to my webserver through my jumpbox.


With all this scaffolding in place, it becomes easy to write a wrapper for all this to marshal the necessary temporary files:
function org2graph {
  orgfile="$1"
  csvfile=$(mktemp).csv
  pngfile=$(mktemp).png
  destfile="$2"
  org2csv "$orgfile" "$csvfile"
  csv2png "$csvfile" "$pngfile"
  trampcp "$pngfile" "$destfile"
  rm "$csvfile" "$pngfile"
}
Now I can make just one call per graph, specifying the local org-mode source file and the remote server path:
org2graph ~/data/ "/ssh:jumpbox|ssh:cam@webserver:/var/www/graphs/finances.png"
org2graph ~/data/ "/ssh:jumpbox|ssh:cam@webserver:/var/www/graphs/health.png"
You can find the full source of the script here.