October 28, 2006

Concordance Visualization

Books featuring the "Search Inside" option on Amazon also have a concordance (here is a definition) added to the book's info.

Here are the 100 most frequently used words in Graphics of Large Datasets. Obviously, we are talking about how to "plot large data in figures".

(Pause the mouse over a keyword to see the number of occurrences, or click the link to see the context where the word is found)

al  algorithm  analysis  approach  area  barchart  bars  between  bins  cases  categorical  categories  cells  chapter  class  companies  coordinate  data  datasets  density  diagram  different  displays  distribution  et  even  example  few  fig  figure  first  give  graph  graphics  group  highlighted  histogram  important  information  ing  interactive  large  layout  left  level  lines  may  means  methods  million  mosaic  new  nodes  number  parallel  pixel  plot  points  possible  problem  range  responses  results  right  sales  sample  sampling  scale  scatterplot  section  see  selection  set  several  should  shown  shows  size  small  software  space  split  state  statistical  structure  three  time  tion  tree  two  use  used  users  values  variables  view  visualization  weight  years  zooming 
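For readers who want to build such a list themselves, here is a minimal sketch in Python. It is not how Amazon computes its concordance; the function name top_words, the stop word set and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

def top_words(text, n=100, stopwords=frozenset()):
    """Return the n most frequent words as (word, count) pairs.

    The tokenisation (lowercase runs of letters) and the stop word
    handling are assumptions, not Amazon's actual procedure.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Made-up sample text, only to show the call.
sample = "Large data, large plots: plot the data and see the data."
top = top_words(sample, n=3, stopwords={"the", "and"})
print(top)

# The list above is shown alphabetically, which is just a sort by word.
print(sorted(word for word, _ in top))
```

Splitting on letters only also explains fragments such as "ing" and "tion" in a list like this: hyphenated line breaks in the scanned text end up as separate tokens.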

Posted by Martin at 11:15:51 | Permanent Link | Comments (0) |

October 17, 2006

Communities of Interest ...

... is the name for subgraphs of a large network (e.g. telephone calls) which have certain target properties.

Chris Volinsky has a very nice page that lets you look for subgraphs connecting authors by papers in computer science journals (based on DBLP) or actors by movies (based on IMDB).

Here is the proximity graph that connects me (not being a computer scientist) with Donald E. Knuth:

Visit Chris' page here.

Posted by Martin at 21:48:13 | Permanent Link | Comments (0) |

statistical graphics 101: Histograms

It's been too long since the last posting (on barcharts) in the teaching corner. This one will be on histograms.

Histograms are often mistaken for barcharts. The fundamental distinction between the two is

  • Barcharts show counts (or weights) for the discrete axis of a categorical variable
  • Histograms show an approximation of the density function (if scaled accordingly) of a continuous variable.

As a consequence, the only thing that can be quantified in a barchart is the bar height (or better, the length, which makes it independent of the bars' orientation). In a histogram, on the other hand, the area of each box is proportional to the density approximation. If all bars in a barchart have the same width, or gaps are drawn in a histogram (which is complete nonsense), the two plots can get mixed up.
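To make the "area, not height" point concrete, here is a small sketch with NumPy. The data values and the unequal bin edges are invented; density_heights is a hypothetical helper, not a function from any plotting package.

```python
import numpy as np

def density_heights(values, edges):
    """Bar heights such that bar AREA equals the share of cases in each bin."""
    counts, edges = np.histogram(values, bins=edges)
    widths = np.diff(edges)
    heights = counts / (counts.sum() * widths)
    return heights, widths

# Invented data with unequal bin widths: [0, 5), [5, 10), [10, 20]
values = np.array([1, 2, 2, 3, 3, 3, 7, 8, 15, 18])
edges = np.array([0, 5, 10, 20])

heights, widths = density_heights(values, edges)
areas = heights * widths

print(heights)  # the last two bins hold 2 cases each, but differ in height
print(areas)    # the areas are the bin shares and sum to 1
```

The last two bins contain the same number of cases, yet the wider one is only half as tall. Reading heights as counts, as one would in a barchart, gives the wrong impression here.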

(% votes for Kerry in the 2004 presidential election)

Much has been written about the optimal bin width for histograms, but almost nothing about the choice of the anchor point. Changing the latter often changes the shape of the histogram more dramatically than choosing between 8, 9 or 10 bins.
Moving the anchor point from 0 to -2.4 yields:


(changing the anchor point to -2.4)
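The effect of the anchor point is easy to reproduce on toy data. The sketch below is not the election data; the six values, the bin width of 10 and the helper bin_edges are assumptions chosen so that the change in shape is obvious.

```python
import numpy as np

def bin_edges(lo, hi, anchor, width):
    """Edges of the form anchor + k * width that cover the range [lo, hi]."""
    first = anchor + np.floor((lo - anchor) / width) * width
    n_bins = max(int(np.ceil((hi - first) / width)), 1)
    return first + width * np.arange(n_bins + 1)

# Toy values, same bin width, two different anchor points.
values = np.array([4, 6, 14, 16, 24, 26])
width = 10

edges_at_0 = bin_edges(values.min(), values.max(), anchor=0, width=width)
edges_at_m5 = bin_edges(values.min(), values.max(), anchor=-5, width=width)

counts_at_0, _ = np.histogram(values, bins=edges_at_0)
counts_at_m5, _ = np.histogram(values, bins=edges_at_m5)

print(edges_at_0, counts_at_0)    # flat histogram
print(edges_at_m5, counts_at_m5)  # peaked histogram from the same data
```

Same data, same bin width: anchored at 0 the histogram looks uniform, anchored at -5 it looks unimodal.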

In most applications, there are sensible breaks which can be chosen. Since we look at 3,111 voting districts, we can use far more bins and start at 0 with bin width of 5.


(using meaningful parameters (0, 5) - density estimate added)

If we link a second attribute into the histogram, the whole thing gets more exciting!


(all districts where more than 15% have a college degree selected)

We can't really see what is going on here (although we might guess that there is a slightly higher proportion of highlighting towards the higher votes for Kerry).

When we use the same normalization trick as for Spineplots (see previous post), we get the clearer picture of the Spinogram:


(highlighting in a Spinogram)
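A sketch of the normalization, assuming the usual spinogram construction: bar width is proportional to the number of cases in the bin, all bars have the same height, and the highlighted part shows the selected share within the bin. The data and the helper spinogram_bars are invented for illustration.

```python
import numpy as np

def spinogram_bars(values, selected, edges):
    """Return (widths, share): relative bin sizes and highlighted share per bin."""
    values = np.asarray(values)
    selected = np.asarray(selected, dtype=bool)
    totals, _ = np.histogram(values, bins=edges)
    hits, _ = np.histogram(values[selected], bins=edges)
    widths = totals / totals.sum()
    # Empty bins get a share of 0 instead of a division by zero.
    share = np.divide(hits, totals, out=np.zeros(len(totals)), where=totals > 0)
    return widths, share

# Invented data: ten cases, five of them selected.
values = [1, 2, 3, 4, 11, 12, 21, 22, 23, 24]
selected = [True, False, False, False, True, False, True, True, True, False]
edges = [0, 10, 20, 30]

widths, share = spinogram_bars(values, selected, edges)
print(widths)  # bar widths carry the counts
print(share)   # highlighted heights are directly comparable across bins
```

Because every bar is scaled to the same height, the highlighted proportions can be compared along a common scale, which is exactly what is hard to do in the plain histogram.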

Well, that's what we would have expected, probably except for the increase at the left end of the scale.

The problem with the histograms used so far is that we looked at voting districts, and not at voters! This distorts the impression if the districts are not of equal size. Weighting the histogram above moves the mode further to the right.


(the weighted histogram represents voters not districts)
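In code, weighting just means that each district contributes its number of voters to its bin instead of contributing 1. The figures below are made up (they are not the 2004 results) and are chosen so that the mode moves to the right, as in the plot above.

```python
import numpy as np

# Hypothetical districts: % for one candidate, and number of voters.
pct = np.array([20, 30, 40, 45, 60, 70])
voters = np.array([1000, 1500, 2000, 1500, 20000, 15000])

edges = np.arange(0, 101, 25)  # bins [0, 25), [25, 50), [50, 75), [75, 100]

by_district, _ = np.histogram(pct, bins=edges)
by_voters, _ = np.histogram(pct, bins=edges, weights=voters)

print(by_district)  # every district counts once
print(by_voters)    # every district counts with its number of voters
```

In this toy example the few large districts sit at the higher percentages, so the modal bin changes from 25-50 to 50-75 once voters are counted.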

Finally we get the weighted spinogram, which probably lends more support to the hypothesis ... of the selected group.


(the weighted spinogram)

(Sorry for the lengthy post ... but concepts get a bit more complex)

Posted by Martin at 19:59:55 | Permanent Link | Comments (3) |

October 14, 2006

The Good & the Bad [10/2006]

Antony pointed me to this nice example found on BBC News.

So what is the message here? "Chinese and other foreigners (not being British nationals) more and more fill British jails ..."

Well, as we look at the percentage changes, we do not have any clue about the underlying group sizes. As whites are by far the larger group in this example, the absolute increase for whites is certainly much bigger. Any better display at hand?
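A tiny worked example shows how the two views can disagree. The group sizes are purely hypothetical, since the BBC graphic does not give the absolute figures.

```python
def changes(before, after):
    """Return (absolute change, percentage change) between two counts."""
    absolute = after - before
    percent = 100.0 * absolute / before
    return absolute, percent

# Hypothetical counts, NOT the prison figures.
groups = {
    "large group": (50_000, 55_000),
    "small group": (200, 300),
}

for name, (before, after) in groups.items():
    absolute, percent = changes(before, after)
    print(f"{name}: {absolute:+d} cases, {percent:+.0f}%")
```

The small group grows by 50% and the large one by only 10%, yet the large group adds fifty times as many cases. A chart of percentages alone shows only the first half of that story.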

So-called "Skyline Plots" - as implemented in RENOIR - take the absolute size of groups into account by adjusting the bin width, such that the plot covers both aspects: absolute and relative change.


(This is certainly a different example from the one on BBC News, but without the absolute figures it is impossible to recreate the skyline plot for the prison example.)
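The geometry behind such a plot can be sketched in a few lines. I am assuming here that bar width is proportional to the group's size before the change and bar height to the relative change, so that the area reflects the absolute change; this is one plausible reading and not necessarily what RENOIR does internally. The helper skyline_rects and the numbers are invented.

```python
def skyline_rects(groups):
    """groups: list of (name, before, after) tuples.

    Returns one rectangle per group with width = share of the 'before'
    total (assumption) and height = relative change.
    """
    total = sum(before for _, before, _ in groups)
    rects, x = [], 0.0
    for name, before, after in groups:
        width = before / total
        height = (after - before) / before
        rects.append({"name": name, "x": x, "width": width, "height": height})
        x += width
    return rects

# Invented group sizes before and after.
rects = skyline_rects([("A", 800, 880), ("B", 200, 300)])
for r in rects:
    area = r["width"] * r["height"]
    print(r["name"], r["x"], r["width"], r["height"], area)
```

Group B is the taller bar (relative change), but the two areas are of similar size, because group A is four times as large to begin with.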

Looking at the colors, we find the odd choice of coding an increase of prisoners in green and a decrease in red. (This does not make much sense, unless the graph comes from the company which runs the prison ...)

Posted by Martin at 11:16:56 | Permanent Link | Comments (0) |