Initial Packages to Install
For our training sessions, we will use several R packages to streamline our data analysis workflow. The following code installs them:
install.packages(c(
"tidyverse",
"lubridate",
"broom",
"readxl",
"writexl",
"janitor",
"gtsummary",
"flextable"
))

Package Overview and Usage
Data Manipulation and Analysis:
- tidyverse: A collection of R packages designed for data science, including tools for data manipulation (dplyr), visualization (ggplot2), and more. We'll use tidyverse to ensure a consistent and efficient approach to data analysis.
- lubridate: Working with dates and times in R can be challenging. lubridate simplifies this by providing functions to parse, manipulate, and perform arithmetic on date-time objects. We'll use it to handle date-time data effectively.
- broom: After performing statistical analyses, results are often returned as complex list objects. broom converts these objects into tidy tibbles (data frames), making it easier to report and visualize results. We'll use broom to tidy up model outputs for interpretation and presentation.
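As a brief sketch of lubridate and broom together (the dates and model below are arbitrary examples):

```r
library(lubridate)
library(broom)

# Parse a date string and do date arithmetic with lubridate
visit <- ymd("2024-03-15")
followup <- visit + weeks(6)   # six weeks later
month(followup)                # 4 (April)

# Tidy a fitted model's output into a data frame with broom
fit <- lm(mpg ~ wt, data = mtcars)
tidy(fit)                      # coefficients as a tibble
```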
Data Import and Export:
- readxl: Importing data from Excel files is a common task. readxl reads both .xls and .xlsx files directly into R without requiring external dependencies. We'll use it to import Excel data into our R environment.
- writexl: Exporting data frames to Excel is made simple with writexl. It writes data to .xlsx files without needing Java or Excel installed. We'll use it to save our processed data frames as Excel files.
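A minimal round-trip sketch using a temporary file path:

```r
library(readxl)
library(writexl)

# Write a data frame to .xlsx, then read it back
path <- tempfile(fileext = ".xlsx")
write_xlsx(mtcars, path)
dat <- read_excel(path)
dim(dat)       # 32 rows, 11 columns
unlink(path)   # clean up the temporary file
```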
Data Cleaning:
- janitor: Data cleaning is a crucial step in analysis. janitor provides functions for examining and cleaning dirty data, such as standardizing data frame column names and identifying duplicate records. We'll use it to streamline our data cleaning process.
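A small sketch of janitor's clean_names() on a made-up data frame with inconsistent column names:

```r
library(janitor)

# Standardize messy column names to snake_case
messy <- data.frame("First Name" = c("A", "B"),
                    "Last.Name"  = c("X", "Y"),
                    check.names = FALSE)
clean <- clean_names(messy)
names(clean)   # "first_name" "last_name"
```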
Table Creation and Formatting:
- gtsummary: Creating publication-ready analytical and summary tables can be time-consuming. gtsummary offers an elegant and flexible way to generate such tables, summarizing data sets, regression models, and more with sensible defaults and customizable options. We'll use it to present our statistical results clearly and professionally.
- flextable: When producing reports, formatting tables for different outputs (HTML, PDF, Word, PowerPoint) is essential. flextable provides a framework to create and customize tables, with support for headers, footers, and cell content formatted as text or images. We'll use it to ensure our tables are well formatted across various document types.
- broom: In addition to tidying model outputs, broom integrates with table-formatting packages to produce clean, structured summaries of statistical analyses, making it easier to create tables that are both informative and publication-ready.
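For instance, a minimal tbl_summary() sketch on a built-in data set (the variable choices here are arbitrary):

```r
library(gtsummary)

# Summary table of selected variables, split by transmission (am)
mtcars[, c("mpg", "wt", "am")] |>
  tbl_summary(by = am)
```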
By integrating these packages into our workflow, we aim to enhance the efficiency and clarity of our data analysis tasks.
The sections that follow give a quick run-through of common usage of the tidyverse packages, along with the base and stats packages.
Brief Details on the tidyverse package
The tidyverse is a cohesive collection of R packages designed to facilitate data science tasks by sharing a common design philosophy, grammar, and data structures. Loading the tidyverse provides access to its core packages, each tailored for specific aspects of data manipulation and visualization.
Core Tidyverse Packages
When you install tidyverse, the following core packages are bundled with it.
ggplot2: Implements the Grammar of Graphics, enabling the creation of complex and multi-layered visualizations. It’s widely used for generating a variety of plots and charts.
dplyr: Provides a consistent set of functions for data manipulation tasks such as filtering, selecting, arranging, and summarizing data. It’s essential for efficient data wrangling.
tidyr: Offers functions to reshape and tidy data, ensuring that datasets are structured optimally for analysis. It helps in converting data into a tidy format where each variable is a column, and each observation is a row.
readr: Facilitates the reading of rectangular data formats like CSV and TSV into R. It’s designed for fast and friendly data import.
purrr: Enhances R’s functional programming capabilities by providing tools for working with functions and vectors. It simplifies the process of applying functions to data structures.
tibble: Introduces a modern take on data frames, offering a more user-friendly and consistent experience when working with tabular data.
stringr: Simplifies string manipulation by providing a cohesive set of functions designed to make working with strings as straightforward as possible.
forcats: Offers tools for handling categorical variables (factors) in R, making it easier to work with and analyze categorical data.
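A brief sketch of several of these packages working together (the filtering and grouping choices are arbitrary):

```r
library(tidyverse)

# A typical pipeline: filter, group, summarise with dplyr
mtcars |>
  filter(cyl %in% c(4, 6)) |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg))

# Then visualize with ggplot2
ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_point()
```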
By integrating these packages into your R environment, you can streamline your data science workflows, ensuring consistency and efficiency across various tasks.
Brief Details on the stats package
- Probability Distributions
  - Functions for densities, probabilities, quantiles, and random generation (d*, p*, q*, r*).
  - Examples: dnorm, pnorm, qnorm, rnorm for the normal distribution; dbinom, pbinom, qbinom, rbinom for the binomial.
  - Includes distribution-family objects (Beta, Binomial, Poisson, Gamma, etc.).
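For example, the four normal-distribution functions in action:

```r
# Normal distribution: density, CDF, quantile, and random draws
dnorm(0)       # density at 0, about 0.3989
pnorm(1.96)    # P(Z <= 1.96), about 0.975
qnorm(0.975)   # about 1.96
set.seed(1)
rnorm(3)       # three random normal draws
```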
- Statistical Tests
  - Tests of location, variance, proportions, and rank-based tests.
  - Examples: t.test, wilcox.test, bartlett.test, chisq.test, fisher.test, prop.test.
  - Multiple-comparison adjustments (p.adjust).
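A quick t.test sketch on a built-in data set:

```r
# Two-sample t-test: does mpg differ by transmission type?
result <- t.test(mpg ~ am, data = mtcars)
result$p.value   # a small p-value suggests the group means differ
```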
- Linear and Generalized Linear Models
  - Main fitting functions: lm (linear models), glm (generalized linear models).
  - Family objects: binomial, poisson, Gamma, quasi.
  - Methods: anova, summary, predict, update, and step for model selection.
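A short sketch fitting one of each (predictor choices are arbitrary):

```r
# Linear model and logistic regression
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

summary(fit_lm)
# Predicted probability of manual transmission at wt = 3
predict(fit_glm, newdata = data.frame(wt = 3), type = "response")
```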
- Model Diagnostics and Influence
  - Evaluate model fit and investigate outliers or influential points.
  - Examples: cooks.distance, dfbeta, dfbetas, dffits, hatvalues, influence.
  - Tools for checking deviance and log-likelihood (deviance, logLik).
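For instance, inspecting influence measures for a simple fit:

```r
fit <- lm(mpg ~ wt, data = mtcars)

# One influence value per observation
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)   # most influential points

logLik(fit)   # log-likelihood of the fitted model
```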
- ANOVA and Model Summaries
  - Functions to compare nested models and produce summaries.
  - Examples: aov, anova, summary.aov, summary.glm, summary.lm.
  - Tools for investigating factor-level differences (TukeyHSD).
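A one-way ANOVA sketch with a Tukey follow-up:

```r
# Does mpg vary across cylinder counts?
fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit)
TukeyHSD(fit)   # pairwise factor-level differences
```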
- Multivariate Analyses
  - Techniques for factor analysis, principal components, canonical correlations, and clustering.
  - Examples: factanal, prcomp, princomp, cancor, hclust, kmeans.
  - Supporting functions like biplot, screeplot, and rotation methods (varimax, promax).
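A sketch of principal components and k-means on the same data (the choice of three clusters is arbitrary):

```r
# PCA on standardized data
pc <- prcomp(mtcars, scale. = TRUE)
summary(pc)   # variance explained by each component

# k-means clustering on the scaled data
set.seed(42)
km <- kmeans(scale(mtcars), centers = 3)
table(km$cluster)   # cluster sizes
```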
- Time Series
  - Methods for modeling, forecasting, and decomposing time-series data.
  - Examples: ts, acf, pacf, ar, arima, HoltWinters, filter, stl.
  - Diagnostic and visualization tools: ts.plot, tsdiag, cpgram, monthplot.
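A sketch using the built-in AirPassengers monthly series (the ARIMA orders below are illustrative, not tuned):

```r
# Decompose the series into trend, seasonal, and remainder
plot(stl(AirPassengers, s.window = "periodic"))

# Fit a seasonal ARIMA model and forecast one year ahead
fit <- arima(AirPassengers, order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
predict(fit, n.ahead = 12)$pred
```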
- Smoothing, Interpolation, and Nonlinear Fits
  - Examples: loess, lowess, spline, splinefun, approx, density, smooth.spline.
  - Nonlinear least squares fitting with nls and self-starting models (SSasymp, SSlogis, etc.).
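For example, local regression smoothing and kernel density estimation:

```r
# Smooth mpg as a function of wt with loess
sm <- loess(mpg ~ wt, data = mtcars)
head(predict(sm))   # fitted (smoothed) values

# Kernel density estimate of mpg
d <- density(mtcars$mpg)
d$bw   # automatically chosen bandwidth
```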
- Formulas, Terms, and Model Frames
  - Tools for building and manipulating model formulas.
  - Examples: formula, terms, update, reformulate, model.frame, model.matrix.
  - Functions for adding or dropping terms (add1, drop1) and managing contrasts.
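A short sketch of building a formula programmatically:

```r
# Construct mpg ~ wt + hp from character vectors
f <- reformulate(c("wt", "hp"), response = "mpg")
f   # mpg ~ wt + hp

# Inspect the design matrix implied by the formula
head(model.matrix(f, data = mtcars))   # includes an intercept column
```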
- Data Summaries and Utilities
  - Summaries by group, reshaping, and weighting.
  - Examples: aggregate, ave, weighted.mean, reshape.
  - Handling missing data: na.omit, na.exclude, na.pass, complete.cases.
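For example, group-wise summaries and missing-data handling:

```r
# Mean mpg by number of cylinders
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Drop rows containing NA values
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
na.omit(df)   # keeps only the complete first row
```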
These clusters capture the main themes of the stats package. Each group helps streamline data analysis workflows.
Brief Details on the base package
- Basic Data Types and Coercion
  - Creating atomic vectors: logical, numeric, integer, complex, raw; type checks and coercion via the is.* and as.* families.
  - Converting between types: as.character, as.numeric, as.logical, as.complex, as.factor.
  - Checking object attributes: mode, typeof, storage.mode, attributes, attr.
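A few coercion and type-checking calls in action:

```r
x <- c(1, 2, 3)
typeof(x)              # "double"
as.character(x)        # "1" "2" "3"
is.numeric(x)          # TRUE
as.integer("7") + 1L   # 8
```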
- Data Structures
  - Vectors: c, rep, unique, duplicated, rev, match, %in%.
  - Lists: list, lapply, sapply, vapply, unlist, as.list.
  - Matrices and arrays: matrix, array, dim, dimnames, aperm, as.matrix, t, rbind, cbind, colSums, rowSums.
  - Data frames: data.frame, as.data.frame, merge, [.data.frame, $<-.data.frame.
  - Factors: factor, as.factor, ordered, levels, droplevels, is.factor.
  - Tables: table, xtabs, prop.table, margin.table.
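A quick tour of a few of these structures:

```r
# Matrices: column sums and transpose
m <- matrix(1:6, nrow = 2)
colSums(m)   # 3 7 11
t(m)         # transpose: 3 rows, 2 columns

# Data frames and factor tabulation
df <- data.frame(id = 1:3, grp = factor(c("a", "b", "a")))
table(df$grp)   # counts per factor level
```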
- Subsetting and Replacement
  - Operators: [, [[, $, and their replacement forms ([<-, [[<-, $<-).
  - Removing duplicates or dropping dimensions: unique, drop.
  - Splitting and recombining: split, unsplit, cut, tapply.
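For example, name-based and logical subsetting, plus a group-wise summary:

```r
x <- c(a = 1, b = 2, c = 3)
x["b"]      # select by name
x[x > 1]    # logical subsetting: b and c

# Group-wise means with tapply
tapply(mtcars$mpg, mtcars$cyl, mean)
```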
- String Handling
  - Simple manipulations: paste, paste0, nchar, substr, strsplit, tolower, toupper, trimws.
  - Case folding and translation: chartr, casefold.
  - Matching and substitution: grep, grepl, regexpr, gregexpr, regmatches, gsub, sub, pmatch, agrep.
  - Encoding-related: iconv, Encoding, enc2utf8, enc2native.
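A few common string operations:

```r
s <- "  Hello, R world  "
trimws(s)                     # strip surrounding whitespace
toupper(trimws(s))            # "HELLO, R WORLD"
gsub("world", "users", s)     # pattern substitution
strsplit("a,b,c", ",")[[1]]   # "a" "b" "c"
```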
- File and Directory Operations
  - File paths: file.path, path.expand, basename, dirname, normalizePath.
  - File manipulation: file.create, file.copy, file.remove, file.rename, file.append.
  - Directories: dir, dir.create, unlink.
  - File info: file.info, file.exists, file.size, Sys.readlink.
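For example, building a platform-independent path and creating/removing a file:

```r
# Create and remove a file under the session's temp directory
p <- file.path(tempdir(), "demo.txt")
file.create(p)
file.exists(p)   # TRUE
file.remove(p)
file.exists(p)   # FALSE
```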
- Input/Output and Connections
  - Reading/writing text: readLines, writeLines, cat, print, message, warning.
  - Binary data: readBin, writeBin, readChar, writeChar.
  - Connections (files, URLs, pipes): file, gzfile, bzfile, xzfile, url, pipe, socketConnection.
  - Special connections: textConnection, rawConnection.
  - Managing connections: open, close, flush, seek, isOpen, truncate.
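A minimal text I/O round trip:

```r
# Write lines to a temporary file and read them back
p <- tempfile()
writeLines(c("first", "second"), p)
readLines(p)   # "first" "second"
unlink(p)      # remove the temporary file
```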
- Math and Special Functions
  - Arithmetic: +, -, *, /, ^, %/%, %%, %*% (matrix multiplication).
  - Rounding: round, ceiling, floor, trunc, signif.
  - Trigonometric and hyperbolic: sin, cos, tan, asin, acos, atan, sinh, cosh, tanh.
  - Exponentials and logarithms: exp, log, expm1, log1p.
  - Cumulative functions: cumsum, cumprod, cummax, cummin.
  - Special functions: gamma, lgamma, beta, choose, factorial, digamma, psigamma.
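A handful of these in action:

```r
17 %/% 5        # integer division: 3
17 %% 5         # remainder: 2
signif(pi, 3)   # 3.14
cumsum(1:5)     # 1 3 6 10 15
choose(5, 2)    # 10
```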
- Dates and Times
  - Date classes and conversion: Date, POSIXct, POSIXlt, as.Date, as.POSIXct, as.POSIXlt.
  - Time intervals: difftime, as.difftime.
  - Extracting components: weekdays, months, quarters, julian.
  - Rounding/truncation: round.POSIXt, trunc.POSIXt.
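For example, date arithmetic and component extraction in base R:

```r
d1 <- as.Date("2024-01-01")
d2 <- as.Date("2024-03-15")
d2 - d1        # difftime of 74 days (2024 is a leap year)
weekdays(d1)   # day of week (locale-dependent)
quarters(d2)   # "Q1"
```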
- Control Flow
  - Conditionals: if, ifelse.
  - Loops: for, while, repeat.
  - Exiting loops: break, next.
  - Exiting functions: return.
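A small loop combining several of these:

```r
# Sum the even numbers from 1 to 10, skipping odds with `next`
total <- 0
for (i in 1:10) {
  if (i %% 2 != 0) next
  total <- total + i
}
total   # 30
```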
- Environment and Scope
  - Environment access: environment, globalenv, parent.env, new.env.
  - Assigning objects: assign, <-, <<-, env$var, rm, exists.
  - Searching and attaching: search, attach, detach.
  - Namespace handling: loadNamespace, unloadNamespace, isNamespaceLoaded.
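For example, creating an environment and managing objects inside it:

```r
e <- new.env()
assign("x", 42, envir = e)
exists("x", envir = e)   # TRUE
get("x", envir = e)      # 42
rm("x", envir = e)
exists("x", envir = e, inherits = FALSE)   # FALSE
```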
- Evaluations and Expressions
  - Expression parsing and evaluation: parse, eval, evalq, expression, quote, bquote, substitute.
  - Calls and arguments: call, match.call, args, formals, do.call.
  - Delayed evaluation: delayedAssign (which creates promise objects).
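A short sketch of capturing and evaluating expressions:

```r
# Capture an unevaluated expression, then evaluate it
expr <- quote(1 + 2 * 3)
eval(expr)   # 7

# Call a function by name with a list of arguments
do.call(paste, list("a", "b", sep = "-"))   # "a-b"
```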
- Condition Handling and Recovery
  - Generating conditions: stop, warning, message, signalCondition.
  - Handling conditions: try, tryCatch, withCallingHandlers, conditionMessage, invokeRestart.
  - Checking equality: all.equal, identical.
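For example, converting errors and warnings into a recoverable value with tryCatch:

```r
# Return NA instead of failing when log() misbehaves
safe_log <- function(x) {
  tryCatch(log(x),
           warning = function(w) NA_real_,
           error   = function(e) NA_real_)
}
safe_log(10)   # 2.302585
safe_log(-1)   # NA (log of a negative number raises a warning)
```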
- Miscellaneous
  - System information and environment variables: Sys.getenv, Sys.setenv, getwd, setwd.
  - Memory and garbage collection: gc, memory.profile, object.size.
  - Object summaries: summary, str, print.
  - Searching for objects: ls, objects, apropos.
  - Higher-order functions: Filter, Find, Map, Position, Reduce, Negate.
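The higher-order functions deserve a quick illustration:

```r
# Functional-style helpers built into base R
Filter(function(x) x %% 2 == 0, 1:10)   # 2 4 6 8 10
Reduce(`+`, 1:5)                        # 15
Map(`*`, 1:3, 4:6)                      # list(4, 10, 18)
```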
These clusters highlight the main functionalities in the base package.
What’s Next?
Go back to Starting Page
Go back to Installation Guide
Go back to Installing, Using, and Removing R Packages.
Proceed to Installing Git Version Control