hist


Download the source code for hist

hist’s History

hist is a command-line program I wrote that prints a histogram for a file. It reads a file or standard input and prints a horizontal histogram showing the frequency of each field (a whole line by default). The fields can be separated by any delimiting string.

I wrote hist to test another program I was writing—a program to print a random line from a file. The hard part about testing random programs is making sure they are actually random. To make sure the program was picking each line with the same probability, I ran the program a few thousand times, sending the output to hist to see if each line showed up approximately the same number of times.

I originally wrote hist as an AWK script, but later decided to rewrite it in C++ for fun.

As part of writing the C++ version, I made an AWK-like record-parsing class. It allows the line to be split based on a field-separating string or at fixed character locations. The original version (what’s used in hist) did not support regular expressions because I didn’t need them at the time. I only really needed to count whole lines. The Record class has come in handy several times since I wrote it, so I’m glad I took the time to write hist and the Record class.
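To make the two splitting modes concrete, here is a minimal sketch of splitting on a separator string and at fixed character offsets. It is not the actual Record class: the function names split_fields and split_at, their signatures, and the handling of edge cases (empty fields are kept, unusable offsets are skipped) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not hist's Record class.
// Splits `line` on every occurrence of the separator string `sep`.
// Empty fields between adjacent separators are kept (an assumption).
std::vector<std::string> split_fields(const std::string& line,
                                      const std::string& sep) {
    assert(!sep.empty());  // an empty separator would never advance
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = line.find(sep, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last (or only) field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + sep.size();
    }
    return fields;
}

// Hypothetical helper: cuts `line` at fixed character offsets.
// `cuts` is assumed to be in ascending order; offsets that do not fall
// strictly inside the remaining text are skipped.
std::vector<std::string> split_at(const std::string& line,
                                  const std::vector<std::size_t>& cuts) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (const std::size_t cut : cuts) {
        if (cut <= start || cut >= line.size()) continue;
        fields.push_back(line.substr(start, cut - start));
        start = cut;
    }
    fields.push_back(line.substr(start));  // whatever is left after the last cut
    return fields;
}
```

With these definitions, split_fields("a::b::c", "::") yields the three fields a, b and c, and split_at("abcdef", {2, 4}) yields ab, cd and ef.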

Usage

The basic syntax for running hist is: hist [OPTIONS] [FILE]...

If one of the files is - or no files are listed, standard input is read (the keyboard by default, or the output from another program). None of the options are required, so if you run just hist, it will read from standard input using the default settings.

The output format is:

field 1:****************  
field 2:*****  
field 3:*********  
field 4:***********  
...
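As a rough illustration of how output in this format can be produced, the sketch below counts whole lines from a stream and prints one bar per distinct line. It is not hist's source code: the names count_lines and render_histogram are hypothetical, and the alphabetical ordering (a side effect of std::map) and the printable-character check are assumptions of the sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: counts how often each whole line occurs in `in`.
std::map<std::string, std::size_t> count_lines(std::istream& in) {
    std::map<std::string, std::size_t> counts;
    std::string line;
    while (std::getline(in, line)) {
        ++counts[line];
    }
    return counts;
}

// Hypothetical helper: renders one "field:bar" row per distinct field.
// Rows come out in std::map's key order, which is this sketch's choice.
std::string render_histogram(const std::map<std::string, std::size_t>& counts,
                             char bar = '*') {
    // Guard chosen for this sketch: a non-printable bar character
    // (such as a newline) would break the one-row-per-field layout.
    assert(std::isprint(static_cast<unsigned char>(bar)));
    std::ostringstream out;
    for (const auto& entry : counts) {
        out << entry.first << ':' << std::string(entry.second, bar) << '\n';
    }
    return out.str();
}
```

Feeding the three input lines b, a, b through count_lines and then render_histogram gives a row "a:*" followed by a row "b:**".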

Options

-s scale
Set the histogram’s scale. scale can be any floating-point number. Each bar’s length is multiplied by scale, so you can shorten long bars by choosing a scale less than 1, or lengthen them with a scale greater than 1.
If scale is zero, it has no effect. Also, avoid using “Inf”; hist will try to make the bars infinitely long.
-f n
Set the field to count in the histogram. Like AWK, field numbers start at 1, and 0 is the whole line (the default if -f is not used). To count from the end, use negative numbers. For example, -f -3 would be the third field from the end of the line.
-c character
Specify the character to use for the histogram bars. The default is an asterisk (*).
-d string
Set the field separator. Make sure you escape any whitespace or other “special” shell characters. If you don’t know what that means, you probably don’t use a shell much, and I’m not sure why you are still reading this…
-n
Print the exact count of each field in parentheses before the bars. This option is useful when the output is scaled, or for piping the output to another program for further analysis.
-S
Sort the bars by frequency, with the smallest bars first. Not to be confused with -s (scale).
-r
Sort in reverse order so that the longest bars come first. If -S and -r are both specified, the sort order will be reversed (as if -S weren’t there). Multiple -r options do not reverse each other.
-v
Show version information and exit.
-h
Show a basic help screen and exit.
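The field-numbering rule for -f and the scaling rule for -s can be sketched as two small functions. This is only an illustration under stated assumptions, not hist's implementation: the names select_field and bar_length are hypothetical, and the out-of-range result (an empty string), the rounding to the nearest whole character, the clamping of negative results to zero, and the assertion that rejects non-finite scales are choices made for the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: picks field `n` from an already-split line.
//   n == 0 -> the whole line
//   n  > 0 -> 1-based from the start (AWK style)
//   n  < 0 -> counted from the end (-1 is the last field)
// Out-of-range numbers return an empty string (an assumption).
std::string select_field(const std::string& line,
                         const std::vector<std::string>& fields,
                         int n) {
    if (n == 0) return line;
    const long count = static_cast<long>(fields.size());
    const long index = (n > 0) ? n - 1 : count + n;
    if (index < 0 || index >= count) return "";
    return fields[static_cast<std::size_t>(index)];
}

// Hypothetical helper: bar length is the count multiplied by the scale.
// A scale of zero has no effect, so it is treated as 1 here.
std::size_t bar_length(std::size_t count, double scale) {
    // Sketch-only guard: "Inf" would ask for an infinitely long bar.
    assert(std::isfinite(scale));
    if (scale == 0.0) scale = 1.0;
    const double scaled = static_cast<double>(count) * scale;
    // Rounding and clamping are assumptions of this sketch.
    return scaled <= 0.0 ? 0 : static_cast<std::size_t>(std::llround(scaled));
}
```

For a line "a b c" split into the fields a, b and c, select_field with n = 1 and with n = -3 both return a, and n = 0 returns the whole line. bar_length(10, 0.5) gives 5, and bar_length(7, 0) gives 7.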

Notes

Running hist with no arguments is equivalent to running hist -s 1 -f 0.

The only operating system I have compiled hist on is Linux, but it should work on any other platform that has a standard C++ compiler. To compile it, you will need make and gcc (both should come with or be available in any Linux/*BSD/UNIX distribution). Just extract the files to a directory and run “make” in that directory. If all goes well, you should have an executable file called “hist”. Move it to a directory that is in your $PATH to install it.

hist is in the public domain; do whatever you want with it.