Good bye statgraphics™
New posts will be at
I will give further details on their strange behavior later ...
(Some software companies seem to invest more money into lawyers than into good software developers)
Here are the 100 most frequently used words in Graphics of Large Datasets. Obviously, we are talking about how to "plot large data in figures".
(Pause the mouse over a keyword to see the number of occurrences, or click the link to see the context where the word is found)
Chris Volinsky has a very nice page that lets you look for subgraphs connecting authors by papers in computer science journals (based on DBLP) or actors connected by movies (based on IMDB).
Here is the proximity graph that connects me (not being a computer scientist) with Donald E. Knuth: 
Visit Chris' page here.
Histograms are often mistaken for bar charts. The fundamental distinction between the two is

Much has been written about the optimal bin width for histograms - almost nothing about the choice of the anchor point. Changing the latter often changes the shape of the histogram more dramatically than choosing between 8, 9 or 10 bins.
Setting the anchor point from 0 to -2.4 yields:
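To make the anchor point effect concrete, here is a minimal sketch in plain Python. The data values and the bin width of 5 are made up for illustration (they are not the election data); only the two anchor points, 0 and -2.4, come from the text. The helper `bin_counts` is a hypothetical name, not part of any plotting package.

```python
import math
from collections import Counter

def bin_counts(values, anchor, width):
    """Count values in bins [anchor + k*width, anchor + (k+1)*width).

    Returns a list of (left_edge, count) pairs, sorted by left edge.
    Empty bins are not listed.
    """
    counts = Counter(math.floor((v - anchor) / width) for v in values)
    return [(round(anchor + k * width, 10), n) for k, n in sorted(counts.items())]

# Made-up sample, NOT the voting data from the post.
data = [1, 2, 4, 6, 9, 11, 12, 14, 16, 19]

at_zero = bin_counts(data, anchor=0, width=5)
shifted = bin_counts(data, anchor=-2.4, width=5)

print([n for _, n in at_zero])   # counts per bin with anchor 0
print([n for _, n in shifted])   # counts per bin with anchor -2.4
```

Same data, same bin width, but the counts per bin (and even the number of occupied bins) differ once the anchor moves.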

In most applications, there are sensible breaks which can be chosen. Since we look at 3,111 voting districts, we can use far more bins and start at 0 with a bin width of 5.

If we link a second attribute into the histogram, the whole thing gets more exciting!

We can't really see what is going on here (although we might guess that there is a slightly higher proportion of highlighting towards the higher votes for Kerry).
When we use the same normalization trick as for Spineplots (see previous post), we get the clearer picture of the Spinogram:
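The normalization behind a spinogram can be sketched in a few lines: every bar gets the same height, its width is proportional to the number of cases in the bin, and the highlighted part is drawn as a share of that bin. The function name `spinogram_bins` and the toy data are assumptions for illustration, not how any particular software implements it.

```python
import math

def spinogram_bins(values, highlighted, anchor, width):
    """Return (left_edge, relative_bar_width, highlighted_share) per bin.

    relative_bar_width: the bin's share of all cases (bar width in the plot).
    highlighted_share:  fraction of the bin's cases that are highlighted
                        (filled height of the bar, between 0 and 1).
    """
    totals, selected = {}, {}
    for v, h in zip(values, highlighted):
        k = math.floor((v - anchor) / width)
        totals[k] = totals.get(k, 0) + 1
        selected[k] = selected.get(k, 0) + (1 if h else 0)
    n = len(values)
    return [
        (anchor + k * width, totals[k] / n, selected[k] / totals[k])
        for k in sorted(totals)
    ]

# Toy data: six cases, three of them highlighted.
vals = [10, 20, 30, 40, 60, 70]
sel = [True, False, False, False, True, True]

for left, bar_width, share in spinogram_bins(vals, sel, anchor=0, width=50):
    print(left, round(bar_width, 3), share)
```

Because the bar heights are all equal, the highlighted shares can be compared directly across bins, which is exactly what the plain histogram makes hard.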

Well, that's what we would have expected, probably except for the increase at the left end of the scale.
The problem with the histograms used so far is that we looked at voting districts, and not at voters! This will distort the impression if the districts are not of equal size. Weighting the histogram above will move the mode further to the right.
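A weighted histogram simply adds up a weight per case instead of counting cases. The sketch below assumes the weight is the number of voters in a district; the district values, the voter numbers and the bin width of 20 are invented for illustration, and `weighted_bin_totals` is a hypothetical helper name.

```python
import math

def weighted_bin_totals(values, weights, anchor, width):
    """Sum the weights falling into each bin [anchor + k*width, anchor + (k+1)*width).

    Returns a list of (left_edge, total_weight) pairs, sorted by left edge.
    """
    totals = {}
    for v, w in zip(values, weights):
        k = math.floor((v - anchor) / width)
        totals[k] = totals.get(k, 0) + w
    return [(anchor + k * width, totals[k]) for k in sorted(totals)]

# Invented example: percentage for one candidate in five districts,
# and the (assumed) number of voters in each district.
pct = [22, 28, 41, 47, 63]
voters = [1000, 2000, 5000, 4000, 30000]

unweighted = weighted_bin_totals(pct, [1] * len(pct), anchor=0, width=20)
weighted = weighted_bin_totals(pct, voters, anchor=0, width=20)

print(unweighted)  # every district counts once
print(weighted)    # every voter counts once
```

In this toy case one large district dominates after weighting, so the tallest bar moves to the right, the same kind of shift described above.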

Finally we get the weighted spinogram, which probably lends more support to the hypothesis ... of the selected group.

(Sorry for the lengthy post ... but concepts get a bit more complex)

Well, as we look at the percentage changes, we do not have any clue about the underlying group sizes. As whites are by far the larger group in this example, the absolute increase for whites is certainly much bigger. Any better display at hand?
So-called "Skyline Plots" - as implemented in RENOIR - take the absolute size of groups into account by adjusting the bin width, such that the plot covers both aspects: absolute and relative change.
(This is certainly a different example than the one from BBC News, but without the absolute figures it is impossible to recreate the skyline plot for the prison example.)
Looking at the colors, we find the odd choice of coding an increase of prisoners in green and the decrease in red. (Does not make much sense, unless the graph comes from the company which runs the prison ...)
| Stage Results | Cumulative Time | Ranks |
|---|---|---|
| ![]() | ![]() | ![]() |

(click on the images to enlarge)
Prolog: Did anyone have Thor HUSHOVD on his list?
Stage 01: Except for 5, all arrived in the peloton
Stage 02: MC EWEN already on 3.
Stage 03: First drop outs
Stage 04: BOONEN still keeps the yellow jersey
Stage 05: Only O'GRADY can't keep the pace of the top 10 from the Prolog
Stage 06: First 37 still in a window of one minute
Stage 07: This was the day of team T-Mobile!
Stage 08: No stage to remember ...
Stage 09: Still waiting for the mountains, so we look at Sebastian JOLY ...
Stage 10: MERCADO and DESSEL out of the blue?
Stage 11: LANDIS now in yellow
Stage 12: POPOVYCH's day, now on 10. 5 withdrawals after the mountains.
Stage 13: PEREIRO SIO's second place awards him the yellow jersey.
Stage 14: No changes within the top 17.
Stage 15: LANDIS back in yellow; two more mountain stages; down to 152
Stage 16: LANDIS passes yellow to PEREIRO, KLÖDEN the real winner in the end?
Stage 17: LANDIS back after great ride; top 3 within 30''
Stage 18: No changes, we are all waiting for the show down tomorrow
Stage 19: Everything as expected, LANDIS too strong for PEREIRO
Stage 20: Profile of a winner ...
To play with the data yourself, get the Mondrian software and the dataset. Thanks go to Sergej Potapov, who wrote the script to manage the data!

A simple histogram, maybe with an added density estimator, and/or a simple standard boxplot for group comparison does the job here. No need to "invent" a new plot that introduces more problems than it solves.
Actually this plot makes a good case against the use of graphics ...

(These statistics also tell me why my quota is used up, so I moved most of the recent stuff to another server!)
Interesting to see that almost half of the cyclists categorized as "sprinter" don't make it to the Champs-Elysees.
Looking at the ranks is quite funny as well. Here's what happens when you start as first and last ... 
(Certainly, David Zabriskie would probably have looked better without the crash in the team time trial)
