R-tutorial: Difference between revisions
imported>Weigang (Created page with "# Install R & RStudio on your home computer ## R from this website: https://mirrors.nics.utk.edu/cran/ ## R studio from this website: https://www.rstudio.com/ # Create a new p...") |
imported>Weigang mNo edit summary |
Revision as of 17:14, 1 May 2016
- Install R & RStudio on your home computer
  - R from this website: https://mirrors.nics.utk.edu/cran/
  - RStudio from this website: https://www.rstudio.com/
- Create a new project by navigating: File | New Project | New Directory. Name the project "genes".
- Import the human genes data set: Tools | Import Dataset | From Web URL, then copy and paste this address: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv2
- Select "Yes" for the column heading. Rename the data set if you wish, using a short but informative name (e.g., human.genes). Do not use spaces; use a dot or an underscore as the name delimiter (e.g., "human.genes" or "human_genes", but never "human genes"). The same rule applies to column and row names.
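The import and naming steps above can also be sketched in code. This is a minimal illustration, not the tutorial's own method: it reads a small made-up table from a string instead of the real URL, and it assumes the data are tab-separated with a header row (suggested by the file extension, but not confirmed here). The names `toy.tsv` and `toy.genes` are hypothetical.

```r
# Toy stand-in for the real file: three made-up genes, tab-separated, with a header row.
# The header deliberately contains spaces to show what R does with them.
toy.tsv <- "Gene Start\tGene End\tChromosome\n100\t1100\t1\n5000\t5400\t2\n200\t90200\tX\n"

# read.delim() expects tab-separated text with a header line.
# By default (check.names = TRUE) it converts "Gene Start" into the valid name "Gene.Start".
toy.genes <- read.delim(text = toy.tsv)

names(toy.genes)  # column names after R has replaced the spaces with dots
str(toy.genes)    # column types and the first few values
```

If the real file does follow the same layout, passing its URL as the first argument of `read.delim()` should give a data frame comparable to the one created through the RStudio menu.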
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length; access variables by pressing "Tab" (auto-completion)
hist(hg.len, breaks = 200) # plot gene-length distribution. Not normal: most genes are short, a few are very long
mean(hg.len) # not representative: a few very long genes pull the average upward
median(hg.len) # more representative. Use the median for a variable that is not normally distributed
summary(hg.len) # Show all quartiles
log.len <- log10(hg.len); hist(log.len, breaks = 200) # log of gene length is closer to a normal distribution
mean(log.len); median(log.len) # they should be similar, since log.len is approximately normal
boxplot(hg$Gene.End-hg$Gene.Start+1 ~ hg$Chromosome) # gene length by chromosome, referring to columns with hg$
write.csv(hg, "hg.csv", row.names = FALSE) # save into a file
hg <- read.csv("hg.csv") # read back into R
boxplot(Gene.End - Gene.Start + 1 ~ Chromosome, data = hg) # same plot of gene length by chromosome, using the data = argument
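To see why the median and the log transform are recommended in the comments above, here is a small simulation. It is only a sketch on made-up numbers, not the hg data: the values `mean = 4` and `sd = 0.6` are arbitrary assumptions chosen to produce a right-skewed variable, and `sim.len` and `sim.log` are hypothetical names.

```r
set.seed(1)  # make the simulation reproducible

# Simulated "lengths": the log10 values are drawn from a normal distribution,
# so the raw values are strongly right-skewed (hypothetical data, not hg).
sim.len <- 10^rnorm(5000, mean = 4, sd = 0.6)

mean(sim.len)    # pulled upward by the few very large values
median(sim.len)  # stays near the bulk of the data

# On the log scale the distribution is symmetric, so the two summaries agree closely.
sim.log <- log10(sim.len)
mean(sim.log); median(sim.log)
```

The gap between the mean and the median on the raw scale, and its near disappearance on the log scale, is the same pattern the tutorial points out for gene lengths.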
- Export a PDF or image
- Open a new R script and name it "hg.R"
- Select commands and save to script
- Retrieve and edit a command by pressing "up" or "down" arrows
- Retrieve commands by using the search box on the "History" tab
- Type q() to quit
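Exporting a plot can also be done from a script rather than through the RStudio menu. The sketch below is one possible approach using the base `pdf()` graphics device; the data frame `toy`, its values, and the output file name are made up for illustration and stand in for the real `hg` data.

```r
# Hypothetical stand-in for the hg data frame (values are made up).
toy <- data.frame(
  Gene.Start = c(100, 5000, 200, 700),
  Gene.End   = c(1100, 5400, 90200, 2700),
  Chromosome = c("1", "1", "X", "X")
)

# Assumed output location: a file in the session's temporary directory.
out.file <- file.path(tempdir(), "gene-length-boxplot.pdf")

pdf(out.file, width = 7, height = 5)  # open a PDF graphics device
boxplot(log10(Gene.End - Gene.Start + 1) ~ Chromosome, data = toy,
        ylab = "log10(gene length)")
dev.off()                             # close the device so the file is written

file.exists(out.file)  # check that the PDF was created
```

Saving these lines in a script such as "hg.R" lets the same figure be regenerated later without repeating the menu steps.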