Tutorial for boxplot

Box-plot and violin plot

The definitions below are adapted from Wikipedia and are here mainly to remind me of the concepts. Feel free to skip this part.

In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.

Box plots display differences between populations without making any assumptions about the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers. In addition to the points themselves, they allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically. (Source: Wikipedia)
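To make these quantities concrete, here is a minimal Python sketch that computes the numbers a box plot is drawn from. The function name box_stats is my own, the "inclusive" quartile method is only one of several conventions (R and ggplot2 may use a slightly different one), and the carat values are the six rows of the example file shown later.

```python
import statistics

def box_stats(values):
    """Return the summary statistics that a box plot visualizes."""
    data = sorted(values)
    # Quartiles by linear interpolation over the data ("inclusive" method).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "iqr": q3 - q1,                          # interquartile range
        "midhinge": (q1 + q3) / 2,               # average of the two hinges
        "range": data[-1] - data[0],
        "mid_range": (data[0] + data[-1]) / 2,
        "trimean": (q1 + 2 * median + q3) / 4,   # weighted average of quartiles
    }

carat = [0.23, 0.21, 0.23, 0.29, 0.31, 0.24]
stats = box_stats(carat)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

The box spans q1 to q3, the line inside it is the median, and the other entries are the L-estimators mentioned above.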

Violin plots are a method of plotting numeric data. A violin plot is a combination of a box plot and a kernel density plot. Specifically, it starts with a box plot. It then adds a rotated kernel density plot to each side of the box plot.

Violin plots are similar to box plots, except that they also show the probability density of the data at different values (in the simplest case this could be a histogram). Typically violin plots include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots. Overlaid on this box plot is a kernel density estimation. (Source: Wikipedia)
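The kernel density part can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a Gaussian kernel and a hand-picked bandwidth of 0.003; plotting libraries choose the bandwidth automatically and the names gaussian_kde, grid and outline are my own.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density of `data` at a point x."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

prices = [0.326, 0.326, 0.327, 0.334, 0.335, 0.336]
density = gaussian_kde(prices, bandwidth=0.003)

# Evaluate the density on a grid of y values; each density value becomes
# the half-width of the violin at that height.
grid = [0.320 + i * 0.002 for i in range(11)]
half_widths = [density(y) for y in grid]

# Mirror the curve around the centre line to get the two sides of the violin.
outline = [(-w, y) for w, y in zip(half_widths, grid)] + \
          [(w, y) for w, y in zip(reversed(half_widths), reversed(grid))]
```

Drawing the outline as a polygon and putting the box from a box plot in the middle gives the violin shape.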

[Figures: box plot compared with a normal distribution (Wikipedia); violin plot (Wikipedia)]

How to make the plot

Here I introduce a script, boxplot.sh, that draws various box plots from the given data with a one-line command.

Input file format

Two types of input files are supported. The first type is a table file with the first column as the ID and the other columns as data values, just as you would expect. It is suitable when all boxes share the same IDs. If each box contains a different set of IDs, use the second format.

Matrix data table like diamonds as described in Basic things you should know to use s-plot.

The first column is the ID variable; normally the values in this column should be unique. The other columns are data columns, and you can have as many of them as you want.

  #filename diamond.extract.matrix 
  ID  carat       cut color price
  1  0.23     Ideal     E 0.326
  2  0.21   Premium     E 0.326
  3  0.23      Good     E 0.327
  4  0.29   Premium     I 0.334
  5  0.31      Good     J 0.335
  6  0.24 Very Good     J 0.336
  

Molten format as described in Basic things you should know to use s-plot.

       cut color variable value
     Ideal     E    carat  0.23
   Premium     E    carat  0.21
      Good     E    carat  0.23
   Premium     I    carat  0.29
      Good     J    carat  0.31
 Very Good     J    carat  0.24
   Premium     D    price 2.757
     Ideal     D    price 2.757
      Good     D    price 2.757
 Very Good     D    price 2.757
   Premium     H    price 2.757
     Ideal     D    price 2.757
  

As a practical example, if you want to compare the expression levels of all genes in multiple samples, the table format is fine. If you only want to compare the expression distribution of the top 100 genes in each sample, use the molten format, since each sample may have a different set of top 100 genes.

  #table file (faked value)
  Gene	ES_rpkm	iPS_rpkm Type
  Pou5f1  1.23     1.22  TF 
  Tet1    0.21     0.20  Enzyme
  Tet2    0.23     0.24  Enzyme
  Tet3    0.01     0.01  Enzyme
  Nanog   1.31     1.30  TF
  #molten format
  gene	rpkm	sample	type
  Pou5f1	1.23	ES  TF
  Pou5f1	1.22	iPS  TF
  Tet1	0.21	ES  Enzyme
  Tet1	0.20	iPS Enzyme
  Tet2	0.23	ES  Enzyme
  Tet2	0.24	iPS Enzyme
  Tet3	0.01	ES  Enzyme
  Tet3	0.01	iPS Enzyme
  Nanog	1.31	ES  TF
  Nanog	1.30	iPS  TF
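
The relation between the two formats can be sketched in Python as follows. This is only an illustration of the melting step, not how boxplot.sh does it internally; the function name melt and its arguments are my own, and the values are the faked ones from the table above.

```python
def melt(rows, id_columns, value_columns,
         variable_name="variable", value_name="value"):
    """Turn a wide table (one column per sample) into molten (long) format."""
    molten = []
    for row in rows:
        for column in value_columns:
            # Keep the identifying columns, then add one record per data column.
            record = {key: row[key] for key in id_columns}
            record[variable_name] = column
            record[value_name] = row[column]
            molten.append(record)
    return molten

table = [
    {"Gene": "Pou5f1", "ES_rpkm": 1.23, "iPS_rpkm": 1.22, "Type": "TF"},
    {"Gene": "Tet1", "ES_rpkm": 0.21, "iPS_rpkm": 0.20, "Type": "Enzyme"},
]

molten = melt(table, id_columns=["Gene", "Type"],
              value_columns=["ES_rpkm", "iPS_rpkm"],
              variable_name="sample", value_name="rpkm")

for record in molten:
    print(record["Gene"], record["rpkm"], record["sample"], record["Type"], sep="\t")
```

Each row of the wide table becomes one molten row per data column, which is why rows can be dropped per sample in the molten format without affecting the other samples.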
  

Begin plotting

Simply running boxplot.sh -f diamond.extract.matrix or boxplot.sh -f diamond.extract.matrix.melt -m TRUE will produce the following boxplot.

diamond.extract.matrix.boxplot-simple1

To get a violin plot as well, run boxplot.sh -f diamond.extract.matrix -V TRUE. If you do not want the inner boxplot, give FALSE to -W.

diamond.extract.matrix.boxplot.violin1

Plot the distribution of carat and price in each given category.

In cut category, boxplot.sh -f diamond.extract.matrix -a cut -I "'color'";

In color category boxplot.sh -f diamond.extract.matrix -a color -I "'cut'".

Remember to exclude the other columns, if there are any, by giving their names to -I in the format "'col1','col2'" or "'col'". Pay attention to the double quotation marks.

If the input file is in molten format, use boxplot.sh -f diamond.extract.matrix.melt -m TRUE -d value -F variable -a color -I "'cut'" -x color. Because the default value for -d is value and for -F is variable, the shorter boxplot.sh -f diamond.extract.matrix.melt -m TRUE -a color -I "'cut'" -x color does the same. For the practical example above, give rpkm to -d and gene to -F.

diamond.extract.matrix.boxplot.price_carat_color_set

Manually set color for boxes in each category and exclude outliers to display clearly.

boxplot.sh -f diamond.extract.matrix -a cut -I "'color'" -x 'cut' -c TRUE -C "'red','blue'" -o TRUE.

For manual color setting, give TRUE to -c and the list of colors to -C.

When -o is TRUE, outliers are excluded. By default, points outside [minimum/1.05, maximum*1.05] are treated as outliers. To use a different range, give another number to -O together with -o TRUE.
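For comparison, the most common convention for flagging outliers in box plots is Tukey's rule, which marks points more than 1.5 times the interquartile range beyond the quartiles. The Python sketch below illustrates that general rule only; it is not the rule boxplot.sh applies with -o and -O, and the function name and the price list (six values from the example file plus one large value) are my own.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Split values into (inside, outside) using Tukey's k * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if low <= v <= high]
    outside = [v for v in values if v < low or v > high]
    return inside, outside

prices = [0.326, 0.326, 0.327, 0.334, 0.335, 0.336, 2.757]
inside, outside = tukey_outliers(prices)
print("kept:", inside)
print("outliers:", outside)
```

Dropping the flagged points before plotting keeps one extreme value from squeezing all the boxes into a thin line.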

Excluding outliers is useful for displaying data with a very large range. Another way to handle such data is to scale the y-axis using log10 or log2.

For example, boxplot.sh -f diamond.extract.matrix -a cut -I "'color'" -x 'cut' -c TRUE -C "'red','blue'" -s TRUE -v "scale_y_log10()".

Giving TRUE to -s turns on y-axis scaling. The string given to -v selects how the y-axis is transformed: scale_y_log10(), coord_trans(y="log10"), scale_y_continuous(trans=log2_trans()) or coord_trans(y="log2"). You may also want to give a value such as 1 to -S to avoid taking the log of 0.
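Why a value such as 1 helps can be seen in a small Python sketch. It assumes the value is simply added to every data point before the log is taken, which is my reading of -S; the function name log_transform is hypothetical.

```python
import math

def log_transform(values, base=10, offset=1):
    """Add `offset` to each value, then take the log in the given base."""
    log = {10: math.log10, 2: math.log2}[base]
    return [log(v + offset) for v in values]

values = [0, 9, 99, 999]

# With an offset of 1, zero maps to log10(1) = 0 instead of failing.
shifted = log_transform(values, base=10, offset=1)

# Without an offset, the zero value cannot be log-transformed.
try:
    log_transform(values, base=10, offset=0)
    zero_failed = False
except ValueError:
    zero_failed = True
```

In R the log of 0 is -Inf rather than an error, but the effect on the plot is similar: such values cannot be drawn on a log axis.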

diamond.extract.matrix.boxplot.manual_color_noOutlier

Set the order of boxes in one category or set the order of categories (default alphabetical order).

boxplot.sh -f diamond.extract.matrix -a cut -I "'color'" -l "'price','carat'" -L "'Ideal','Premium','Very Good','Good','Fair'" -x 'cut'.

The order of the legend variable is given to -l, and the order of the category variable is given to -L. If you want to rotate the x-axis tick labels so they display vertically, give -90 to -b.

diamond.extract.matrix.boxplot.legend_xvariable_order

Plot only the distribution of one column price in each category.

In cut category, boxplot.sh -f diamond.extract.matrix -r 70 -a cut -I "'color','carat'";

In color category boxplot.sh -f diamond.extract.matrix -r 70 -a color -I "'cut','carat'".

Remember to exclude the other columns, if there are any, by giving their names to -I in the format "'col1','col2'" or "'col'". To do the same with a molten file, remove the unneeded numerical columns before the melting process using shell commands such as grep -v, and give the new file to boxplot.sh.

diamond.extract.matrix.boxplot.price_color

Plot the distribution of price in different carat categories.

boxplot.sh -f diamond.extract.matrix -a carat -I "'cut','color'" -B 4 -x 'carat' -y 'price' or boxplot.sh -f diamond.extract.matrix -a carat -I "'cut','color'" -B "c(0.1,0.4,0.7,1,6)" -x 'carat' -y 'price'

The first command gives 4 to -B, which splits carat into 4 categories.

The second command gives the numerical vector c(0.1,0.4,0.7,1,6) to -B to split carat into the given ranges.
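The two ways of binning can be sketched in Python as follows. I assume here that the intervals are closed on the right, as in the default behaviour of R's cut(), and that a single number means equal-width intervals; I have not checked how boxplot.sh implements -B, and both function names are my own.

```python
from bisect import bisect_left

def cut_by_breaks(values, breaks):
    """Return the 0-based index of the interval (lo, hi] each value falls in.

    Values outside (breaks[0], breaks[-1]] get None.
    """
    bins = []
    for v in values:
        i = bisect_left(breaks, v)
        bins.append(None if i == 0 or i == len(breaks) else i - 1)
    return bins

def equal_width_breaks(values, n_bins):
    """Build n_bins equal-width intervals covering the range of the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    pad = (hi - lo) * 0.001  # widen slightly so the minimum is not excluded
    inner = [lo + i * width for i in range(1, n_bins)]
    return [lo - pad] + inner + [hi + pad]

carat = [0.23, 0.21, 0.23, 0.29, 0.31, 0.24]

# Like -B "c(0.1,0.4,0.7,1,6)": explicit breaks.
by_breaks = cut_by_breaks(carat, [0.1, 0.4, 0.7, 1, 6])

# Like -B 4: four equal-width categories.
by_number = cut_by_breaks(carat, equal_width_breaks(carat, 4))
```

Each box in the plot then shows the price distribution of the diamonds whose carat falls into one interval.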

diamond.extract.matrix.boxplot.price_carat_num diamond.extract.matrix.boxplot.price_carat_interval

Please see Basic things you should know to use s-plot for other settings, such as the position of the legend, the width, height, resolution and output type of the picture, installing the required modules, or generating the R script only.

Ref

  • ggplot2
  • http://www.statmethods.net/graphs/boxplot.html
  • http://stackoverflow.com/questions/2492947/boxplot-in-r-showing-the-mean
  • http://stackoverflow.com/questions/10441214/r-boxplot-with-multiple-factor-labels
  • http://stackoverflow.com/questions/10805643/ggplot2-add-color-to-boxplot-continuous-value-supplied-to-discrete-scale-er
CHENTONG
Copyright notice: This is an original article by the author. Please cite the source when reposting.