Tutorial for boxplot
Box-plot and violin plot
The descriptions below are adapted from Wikipedia as a reminder of the concepts. Feel free to skip this part.
In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.
Box plots display differences between populations without making any assumptions of the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers. In addition to the points themselves, they allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Boxplots can be drawn either horizontally or vertically. (Source: Wikipedia)
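To make the quantities named above concrete, here is a minimal Python sketch that computes them for a small list of numbers. It is only an illustration and is not part of boxplot.sh; the function name box_stats and the choice of the "inclusive" quantile method are assumptions, and other quantile conventions give slightly different quartiles.

```python
from statistics import quantiles


def box_stats(values):
    """Return the summary statistics a box plot is built from."""
    data = sorted(values)
    # "inclusive" interpolates between data points and treats the
    # smallest and largest values as the 0th and 100th percentiles.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "iqr": q3 - q1,                          # interquartile range
        "midhinge": (q1 + q3) / 2,               # mean of the two quartiles
        "mid_range": (data[0] + data[-1]) / 2,   # mean of min and max
        "trimean": (q1 + 2 * median + q3) / 4,   # weighted mean of quartiles
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9])
for name, value in stats.items():
    print(name, value)
```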
Violin plots are a method of plotting numeric data. A violin plot is a combination of a box plot and a kernel density plot. Specifically, it starts with a box plot. It then adds a rotated kernel density plot to each side of the box plot.
Violin plots are similar to box plots, except that they also show the probability density of the data at different values (in the simplest case this could be a histogram). Typically violin plots will include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots. Overlaid on this box plot is a kernel density estimation. (Source: Wikipedia)
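The kernel density estimate that gives a violin its outline can be sketched in a few lines of Python. This is a hand-rolled Gaussian kernel with a fixed bandwidth, chosen only for illustration; the sample values, the bandwidth of 0.3 and the name gaussian_kde are assumptions, and plotting libraries normally pick the bandwidth automatically.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


samples = [1.0, 1.2, 1.9, 2.0, 2.1, 2.2, 3.5]
density = gaussian_kde(samples, bandwidth=0.3)

# Evaluate on a grid; a violin plot mirrors these values on both sides
# of the central axis, so each value is the half-width at that height.
grid = [i * 0.1 for i in range(51)]  # 0.0, 0.1, ..., 5.0
half_widths = [density(x) for x in grid]
```

The violin is widest where the samples cluster (around 2.0 here) and narrows toward the sparse tails.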
How to do the plot
Here I introduce the script boxplot.sh, which plots various box plots from the given data with a one-line command.
Input file format
Two types of input files are supported. The first type is a table file with the first column as the ID and the other columns as data values, which is the layout most people expect. It is suitable when all boxes share the same IDs. If each box contains a different set of IDs, you may use the second format.
Matrix data table like
diamonds as described in Basic things you should know to use s-plot.
The first column is the ID variable; normally the values in this column should be unique. The other columns are data columns, and you can have any number of them.
Molten format as described in Basic things you should know to use s-plot.
As a practical example, if one wants to compare the expression levels of all genes across multiple samples, the table format is fine. If instead one only wants to compare the expression distribution of the top 100 genes in each sample, the molten format should be used, since each sample may have a different set of top 100 genes.
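The relationship between the two formats can be sketched in Python as follows. This is an illustration only, not the conversion used by s-plot: the function name melt, the tab delimiter and the tiny gene table are assumptions, while the column names variable and value match the ones boxplot.sh refers to for molten input.

```python
import csv
import io


def melt(matrix_text, delimiter="\t"):
    """Convert a matrix table (ID column + data columns) to molten rows."""
    reader = csv.reader(io.StringIO(matrix_text), delimiter=delimiter)
    header = next(reader)
    id_name, data_names = header[0], header[1:]
    molten = [(id_name, "variable", "value")]  # header of the molten table
    for record in reader:
        row_id, values = record[0], record[1:]
        # One output row per (ID, data column) pair.
        for name, value in zip(data_names, values):
            molten.append((row_id, name, value))
    return molten


# Hypothetical expression table: two genes measured in two samples.
matrix = "gene\tsampleA\tsampleB\ng1\t5.0\t7.5\ng2\t0.0\t2.1\n"
molten = melt(matrix)
for row in molten:
    print("\t".join(row))
```

In the molten form each row stands alone, so each sample can list its own set of genes without leaving empty cells in a matrix.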
boxplot.sh -f diamond.extract.matrix or
boxplot.sh -f diamond.extract.matrix.melt -m TRUE will get the following boxplot.
To get a violin plot as well, use
boxplot.sh -f diamond.extract.matrix -V TRUE. If you do not want to plot the inner-boxplot, please give
Plot the distribution of
price in each given category.
boxplot.sh -f diamond.extract.matrix -a cut -I "'color'";
boxplot.sh -f diamond.extract.matrix -a color -I "'cut'".
Remember to exclude other columns, if there are any, by giving their names to
-I in format
"'col'". Pay attention to the double quotation marks.
If the input file is in molten format,
boxplot.sh -f diamond.extract.matrix.melt -m TRUE -d value -F variable -a color -I "'cut'" -x color or
boxplot.sh -f diamond.extract.matrix.melt -m TRUE -a color -I "'cut'" -x color would be suitable since the default value for
variable. For the practical example,
rpkm should be given to
gene given to
Manually set color for boxes in each category and exclude outliers to display clearly.
boxplot.sh -f diamond.extract.matrix -a cut -I "'color'" -x 'cut' -c TRUE -C "'red','blue'" -o TRUE.
For manual color setting, give
-c and color list to
TRUE, outliers will be excluded. By default, points outside [minimum/1.05, maximum*1.05] will be treated as outliers. One can give other numbers to
TRUE to set other ranges of outliers.
Excluding outliers is useful for displaying data with very large ranges. Another way to do this is to scale the Y-axis using
boxplot.sh -f diamond.extract.matrix -a cut -I "'color'" -x 'cut' -c TRUE -C "'red','blue'" -s TRUE -v "scale_y_log10()".
-s enables scaling of the y-axis. The string given to
-v specifies how to transform the y-axis, including
coord_trans(y="log2"). You may also want to add a value like
1 given to
-S to avoid taking the log value of
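The effect of adding a small value before a log transform can be sketched in Python. This assumes the purpose of the added value is to keep zeros plottable, since the logarithm of 0 is undefined; the function name log10_with_offset and the sample numbers are hypothetical and do not come from boxplot.sh.

```python
import math


def log10_with_offset(values, offset=1):
    """Apply log10 after shifting every value by `offset`."""
    # With offset=1 a raw value of 0 maps to log10(1) = 0 instead of
    # raising an error, while large values are barely affected.
    return [math.log10(v + offset) for v in values]


raw = [0, 9, 99, 999]
scaled = log10_with_offset(raw)
print(scaled)
```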
Set the order of boxes in one category or set the order of categories (default alphabetical order).
boxplot.sh -f diamond.extract.matrix -a cut -I "'color'" -l "'price','carat'" -L "'Ideal','Premium','Very Good','Good','Fair'" -x 'cut'.
For legend variable, the order should be given to
-l. For category variable, the order should be given to
-L. Sometimes you may also want to rotate the x-axis tick labels so that they display vertically; to do so, please give
Plot only the distribution of one column
price in each category.
boxplot.sh -f diamond.extract.matrix -r 70 -a cut -I "'color','carat'";
boxplot.sh -f diamond.extract.matrix -r 70 -a color -I "'cut','carat'".
Remember to exclude other columns, if there are any, by giving their names to
-I in format
"'col'". If you want to do this with molten files, please remove unneeded numerical columns before the melting process using shell commands like
grep -v and give new file to
Plot the distribution of
price in different
boxplot.sh -f diamond.extract.matrix -a carat -I "'cut','color'" -B 4 -x 'carat' -y 'price'
boxplot.sh -f diamond.extract.matrix -a carat -I "'cut','color'" -B "c(0.1,0.4,0.7,1,6)" -x 'carat' -y 'price'
The first command gives 4 to -B, which splits carat into 4 categories.
The second command gives a numerical vector to -B to split carat into the given ranges.
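How a vector of break points turns a continuous variable such as carat into categories can be sketched in Python. This is an illustration only, not the code boxplot.sh runs: the function name assign_bin is hypothetical, and treating each interval as open on the left and closed on the right is an assumption that mirrors the default of R's cut().

```python
from bisect import bisect_left


def assign_bin(value, breaks):
    """Return the label of the interval (lo, hi] that contains `value`.

    `breaks` must be sorted in increasing order. Values outside the
    outermost breaks get None.
    """
    if value <= breaks[0] or value > breaks[-1]:
        return None
    i = bisect_left(breaks, value)  # first index with breaks[i] >= value
    return f"({breaks[i - 1]},{breaks[i]}]"


breaks = [0.1, 0.4, 0.7, 1, 6]
for carat in [0.23, 0.4, 0.9, 3.0]:
    print(carat, assign_bin(carat, breaks))
```

Each resulting label then acts as one category on the x-axis, with one box per category.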
Please see Basic things you should know to use s-plot for how to modify other settings such as the legend position and the width, height, resolution and output type of pictures, how to install required modules, or how to generate R scripts only.