Using Gnuplot and cron to find out what hard disk to buy next time

Whatever hard disk you buy seems to fill up in a shorter and shorter time. On the other hand, spending a fortune on a large disk is not necessary for many people. It's not always a case of 'the bigger, the better'. Is there a way to predict what capacity would best suit you, so that you don't pay too much now for something you might need only five years from now and, at the same time, make sure you don't run out of space in two months?

Well, I'm sure there are sophisticated answers, or maybe even tools, for that, but if the way you devoured megabytes in the past is anything to go by, you might want to try this: plot a graph of your daily usage of the file space and then, lacking anything more imaginative, fit the data with a straight line and try to guess what your usage might be in the future.

Sounds like a lot of work, but luckily Linux (or Unix) has a few utilities that are excellent at automating all this, so that you don't have to lift a finger (apart from typing and setting up the initial scripts described here).

First you need to collect daily data for long enough to make the graph worth plotting. You might need to watch a few different directories, so it pays to use a script that records the current usage of the directory you're monitoring in a data file. The script could look like this:

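#!/bin/sh
# used_myfiles
#
# Records the calendar date and the total size of files
# in a directory, in megabytes.
#
# Usage: used_myfiles DIRECTORY LOGFILE

# get the date and time; format: 10/02/2003 22:33:44
# (day/month/year; Americans might dislike this...)
date_string=$(date +"%d/%m/%Y %H:%M:%S")

filesystem=$1
logfile=$2

# du -sm prints the size in megabytes, a tab, and the directory name
total_megabytes=$(du -sm "$filesystem")

# the variables are left unquoted on purpose, so that the tab
# from du is written to the log as a single space
echo $date_string $total_megabytes >> "$logfile"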

After installing this script, you could run it manually like this:

used_myfiles /home/user33 /home/user33/LOG/used_myfiles.home_user33

Here user33 is your user name, /home/user33 is the directory which you want to monitor, and /home/user33/LOG/used_myfiles.home_user33 is the file into which data is to be collected. Running the script once appends an entry like this to the data file:

20/07/2003 03:05:00 20537 /home/user33

This says that on 20/07/2003, at 03:05:00, the /home/user33 directory contained 20,537 MB, i.e. about 20 GB.
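If you want to read the latest entry back without opening the file, something along these lines might do. It is only a sketch: the function name last_usage is made up for illustration, and it assumes the four-field layout shown above.

```shell
#!/bin/sh
# last_usage (hypothetical helper): print the date and size of the most
# recent entry in a log file written by used_myfiles.
# Assumes lines like: 20/07/2003 03:05:00 20537 /home/user33
last_usage() {
    # du -m counts in units of 1,048,576 bytes, so dividing by 1024
    # gives binary gigabytes (GiB)
    tail -n 1 "$1" | awk '{ printf "%s %.1f GiB\n", $1, $3 / 1024 }'
}
```

You would call it as: last_usage /home/user33/LOG/used_myfiles.home_user33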

But since we're using Linux (or Unix) here we might let the computer do the work. We can do that using cron (a utility for periodic/automated execution of tasks). The cron entry would look like this:

5 3 * * * /home/user33/bin/used_myfiles /home/user33 /home/user33/LOG/used_myfiles.home_user33

This translates as: at minute 5, hour 3, i.e. at 03:05, on every day of the month (first *), every month of the year (second *), on any day of the week (last *), execute whatever command follows the asterisks. Remember that, unlike your login shell, cron prefers full paths: your search path and aliases are not taken into account by cron, so you have to be explicit.
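To see how the fields of such an entry line up, here is a small sketch that simply labels them. The function name explain_cron_line is hypothetical; it does not validate anything, it only splits the line the way the paragraph above describes.

```shell
#!/bin/sh
# explain_cron_line (hypothetical helper): label the five schedule
# fields of a crontab entry and show the command that follows them.
explain_cron_line() {
    printf '%s\n' "$1" | {
        # read splits on whitespace; the last variable gets the rest
        # of the line, i.e. the command with its arguments
        read -r minute hour dom month dow command
        printf '%-14s%s\n' "minute:" "$minute"
        printf '%-14s%s\n' "hour:" "$hour"
        printf '%-14s%s\n' "day of month:" "$dom"
        printf '%-14s%s\n' "month:" "$month"
        printf '%-14s%s\n' "day of week:" "$dow"
        printf '%-14s%s\n' "command:" "$command"
    }
}

explain_cron_line '5 3 * * * /home/user33/bin/used_myfiles /home/user33 /home/user33/LOG/used_myfiles.home_user33'
```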

If you haven't used cron before, copy the line above into a file called, say, mycrontab and then run crontab mycrontab. Watch your local mail: cron likes sending complaining mail to the user if anything goes wrong (and sometimes even if nothing goes wrong).
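If you do already have a crontab, note that running crontab with a file name installs that file as your whole crontab, replacing the entries you had. A possible way around this is sketched below. The helper name append_entry is made up, and the sketch assumes that your crontab accepts - to read the new table from standard input, which is the case on most systems but worth checking in your man page.

```shell
#!/bin/sh
# append_entry (hypothetical helper): copy a crontab from standard input
# to standard output, adding the given entry at the end unless an
# identical line is already there.
append_entry() {
    # the entry is passed through the environment so that awk does not
    # interpret any backslashes in it
    entry="$1" awk '
        { print }
        $0 == ENVIRON["entry"] { found = 1 }
        END { if (!found) print ENVIRON["entry"] }
    '
}

# Intended use (left as a comment so that nothing is installed by accident):
# crontab -l 2>/dev/null | append_entry '5 3 * * * /home/user33/bin/used_myfiles /home/user33 /home/user33/LOG/used_myfiles.home_user33' | crontab -
```

Running the pipeline twice leaves a single copy of the entry, since a line that is already present is not added again.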

Everything's fine up to now: the data is slowly accumulating in the data file, with each new addition being made some time in the night (or any other time you're not glued to your computer).

You can also get the plot of usage values up to the present by using a Gnuplot script, like this:

#!/usr/bin/gnuplot

set terminal postscript landscape color
set output '/home/user33/LOG/used_myfiles.ps'


set title "My Files over Time"
set timestamp "Last updated: %d/%m/%Y, %H:%M" top

set xlabel "Date (mm/yy)"
set timefmt "%d/%m/%Y"
set xdata time
set xrange [ "1/7/2003":"1/9/2004" ]
set format x "%m/%y"

set ylabel "MB"
set yrange [ 0 : ]

set key left
set grid

total_fit(x) = a1*x + b1
docs_fit(x) = a2*x + b2
fit total_fit(x) '/home/user33/LOG/used_myfiles.home_user33'     using 1:3 via a1, b1
fit docs_fit(x)  '/home/user33/LOG/used_myfiles.home_user33_docs' using 1:3 via a2, b2
show variables


plot '/home/user33/LOG/used_myfiles.home_user33'     using 1:3 t 'Total' with lines, \
     '/home/user33/LOG/used_myfiles.home_user33_docs' using 1:3 t 'Docs'   with lines, \
     total_fit(x), docs_fit(x)

Call this script 'plot_myfiles.gnu' and save it in the LOG directory.

The script reads the data files with the daily values, plots all the data contained therein, finds a best fit and produces an output file with the plot and the fit. The second data file (the one ending in _docs) is assumed to come from another cron entry that runs used_myfiles on a documents directory; drop those lines if you monitor only one directory. The fit is extended into the future. There is, of course, no guarantee that the growth is linear. Indeed, some arguments suggest an exponential, at least for longer times. If you wish, Gnuplot could easily accommodate that, but for 'short' times and relatively stable space usage, the linear approximation is good enough.
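If you only want the numbers rather than the picture, the same straight-line idea can be sketched with awk. This is not what Gnuplot's fit command does internally, just an ordinary least-squares line through the (day, megabytes) pairs. The function name project_usage is made up, the time of day is ignored, and the log is assumed to be in chronological order with one entry per day.

```shell
#!/bin/sh
# project_usage (hypothetical helper): fit a least-squares line through
# the (day, MB) pairs of a used_myfiles log and extrapolate it.
# Usage:  project_usage LOGFILE DAYS_AHEAD
# Prints: <growth in MB per day> <projected MB, DAYS_AHEAD days after the last entry>
project_usage() {
    awk -v ahead="$2" '
        # Julian day number of a Gregorian calendar date
        function daynum(d, m, y,    a, yy, mm) {
            a  = int((14 - m) / 12)
            yy = y + 4800 - a
            mm = m + 12 * a - 3
            return d + int((153 * mm + 2) / 5) + 365 * yy \
                   + int(yy / 4) - int(yy / 100) + int(yy / 400) - 32045
        }
        NF >= 3 {
            split($1, p, "/")                      # dd/mm/yyyy
            x = daynum(p[1] + 0, p[2] + 0, p[3] + 0)
            if (n == 0) x0 = x
            x -= x0                                # days since the first entry
            n++; sx += x; sy += $3; sxx += x * x; sxy += x * $3
            last = x
        }
        END {
            den = n * sxx - sx * sx
            if (n < 2 || den == 0) {
                print "need entries from at least two different days" > "/dev/stderr"
                exit 1
            }
            slope = (n * sxy - sx * sy) / den      # MB per day
            icept = (sy - slope * sx) / n          # MB on the first day
            printf "%.1f %.0f\n", slope, icept + slope * (last + ahead)
        }
    ' "$1"
}
```

For example, project_usage /home/user33/LOG/used_myfiles.home_user33 365 would print the average daily growth and a guess at the usage one year after the last entry.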

You could run the Gnuplot script by hand whenever you need it, but again, you could use cron to do this for you daily, at some time after the other cron task (the one actually producing the latest data) has finished. Use an entry like this:

# generate the current plot of MB as a function of time
30 3 * * * /usr/bin/gnuplot /home/user33/LOG/plot_myfiles.gnu > /dev/null 2>&1

The output from Gnuplot, while potentially useful, is not needed in this simplified context, so we discard it by redirecting both standard output and standard error to /dev/null.
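For reference, cron normally runs its commands with /bin/sh, where the portable way to discard both output streams is > /dev/null 2>&1. The sketch below uses a made-up function, noisy, as a stand-in for a chatty program such as Gnuplot, to show what each redirection does.

```shell
#!/bin/sh
# noisy (hypothetical stand-in for a chatty program): writes one line
# to standard output and one line to standard error.
noisy() {
    echo "fit results"
    echo "warning: something minor" >&2
}

# Order matters: point standard output at /dev/null first, then send
# standard error to wherever standard output now points.
# Nothing is printed.
noisy > /dev/null 2>&1

# Only standard output is discarded here; the warning still appears on
# standard error, and cron would mail it to you.
noisy > /dev/null
```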

If you got here, there's nothing left to do. Now, each time you want to take a look at how much space you need, just go to the LOG directory and look at the file used_myfiles.ps, which has been updated overnight. The Gnuplot home page is now at http://www.gnuplot.info.


Fanel Donea