Initial Packages to Install


Prof. J Babiera and Prof. R Cuenca

Center for Computational Analytics and Modelling (CCAM), PRISM, MSU-IIT

For our training sessions, we will utilize several R packages to streamline our data analysis workflow. The following code installs these packages:

# Install the packages used in these sessions (run once per machine)
install.packages(c(
    "tidyverse", 
    "lubridate", 
    "broom", 
    "readxl",
    "writexl",
    "janitor",
    "gtsummary",
    "flextable"
))
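Installation is a one-time step; at the start of each analysis session you load the packages you need with library(). A minimal sketch:

```r
# Load packages for the current session (no need to reinstall each time)
library(tidyverse)   # attaches dplyr, ggplot2, tidyr, readr, purrr, tibble, stringr, forcats
library(lubridate)
library(broom)
library(readxl)
library(writexl)
library(janitor)
library(gtsummary)
library(flextable)
```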

Package Overview and Usage

Data Manipulation and Analysis:

  • tidyverse: This is a collection of R packages designed for data science. It includes tools for data manipulation (dplyr), visualization (ggplot2), and more. We’ll use tidyverse to ensure a consistent and efficient approach to data analysis.

  • lubridate: Working with dates and times in R can be challenging. lubridate simplifies this process by providing functions to parse, manipulate, and perform arithmetic on date-time objects. We’ll use it to handle date-time data effectively.

  • broom: After performing statistical analyses, results are often returned in complex lists. broom converts these objects into tidy tibbles (data frames), making it easier to report and visualize results. We’ll use broom to tidy up model outputs for interpretation and presentation.
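To illustrate how these three packages fit together, here is a small sketch on made-up visit records (the data values are illustrative only): lubridate parses the dates, dplyr derives a predictor, and broom tidies the model output.

```r
library(tidyverse)
library(lubridate)
library(broom)

# Hypothetical visit counts recorded on four dates
visits <- tibble(
  date  = c("2024-01-15", "2024-02-03", "2024-03-21", "2024-04-10"),
  count = c(12, 18, 25, 31)
)

visits <- visits |>
  mutate(
    date = ymd(date),                     # lubridate: parse "YYYY-MM-DD" strings
    day  = as.numeric(date - min(date))   # days elapsed since first visit
  )

fit <- lm(count ~ day, data = visits)
tidy(fit)     # broom: coefficients as a tidy tibble
glance(fit)   # broom: one-row model summary (R-squared, AIC, ...)
```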

Data Import and Export:

  • readxl: Importing data from Excel files is a common task. readxl facilitates reading both .xls and .xlsx files directly into R without requiring external dependencies. We’ll use it to import Excel data into our R environment.

  • writexl: Exporting data frames to Excel is made simple with writexl. It allows writing data to .xlsx files without needing Java or Excel installations. We’ll use it to save our processed data frames as Excel files.
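A quick round-trip sketch of both packages, writing a small data frame to a temporary .xlsx file and reading it back (the file name and data are illustrative):

```r
library(readxl)
library(writexl)

df <- data.frame(id = 1:3, score = c(85, 92, 78))

path <- file.path(tempdir(), "scores.xlsx")
write_xlsx(df, path)      # writexl: no Java or Excel installation required

df2 <- read_excel(path)   # readxl: returns a tibble
df2

# read_excel() also accepts sheet and range arguments, e.g.
# read_excel(path, sheet = 1, range = "A1:B4")
```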

Data Cleaning:

  • janitor: Data cleaning is a crucial step in analysis. janitor provides functions for examining and cleaning dirty data, such as standardizing data frame column names (clean_names()) and identifying duplicate records (get_dupes()). We’ll use it to streamline our data cleaning process.
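For example, janitor can convert the messy column names typical of spreadsheets into consistent lower snake case, and flag duplicate records on chosen key columns (the data here is made up):

```r
library(janitor)

# Messy column names of the kind often found in spreadsheets
raw <- data.frame(
  `Patient ID`            = 1:3,
  `Blood Pressure (mmHg)` = c(120, 135, 118),
  check.names = FALSE
)

clean <- clean_names(raw)
names(clean)   # lower_snake_case names, e.g. "patient_id"

# get_dupes() returns the rows that share a value of the key column
get_dupes(data.frame(id = c(1, 2, 2)), id)
```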

Table Creation and Formatting:

  • gtsummary: Creating publication-ready analytical and summary tables can be time-consuming. gtsummary offers an elegant and flexible way to generate such tables, summarizing data sets, regression models, and more with sensible defaults and customizable capabilities. We’ll use it to present our statistical results clearly and professionally.

  • flextable: When producing reports, formatting tables for different outputs (HTML, PDF, Word, PowerPoint) is essential. flextable provides a framework to create and customize tables, allowing for the addition of headers, footers, and formatting of cell content with text or images. We’ll use it to ensure our tables are well-formatted across various document types.

  • broom: In addition to tidying model outputs, broom can be integrated with table formatting packages to produce clean and structured summaries of statistical analyses. This facilitates the creation of tables that are both informative and publication-ready.
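A short sketch using gtsummary's built-in trial data set to produce a summary table and a regression table, then passing broom output to flextable for report-ready formatting:

```r
library(gtsummary)
library(flextable)
library(broom)

# Summary table of the demo data shipped with gtsummary, split by treatment
tbl_summary(trial, by = trt, include = c(age, grade))

# Publication-ready regression table (odds ratios with confidence intervals)
fit <- glm(response ~ age + grade, data = trial, family = binomial)
tbl_regression(fit, exponentiate = TRUE)

# broom's tidy output feeds flextable directly for Word/PowerPoint reports
flextable(tidy(fit))
```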

By integrating these packages into our workflow, we aim to enhance the efficiency and clarity of our data analysis tasks.

The following sections provide a quick run-through of common usage of the tidyverse packages, along with the base and stats packages.


Brief Details on the tidyverse package

The tidyverse is a cohesive collection of R packages designed to facilitate data science tasks by sharing a common design philosophy, grammar, and data structures. Loading the tidyverse provides access to its core packages, each tailored for specific aspects of data manipulation and visualization.

Core Tidyverse Packages

Installing tidyverse also installs the following core packages, all of which are attached together by a single call to library(tidyverse).

  1. ggplot2: Implements the Grammar of Graphics, enabling the creation of complex and multi-layered visualizations. It’s widely used for generating a variety of plots and charts.

  2. dplyr: Provides a consistent set of functions for data manipulation tasks such as filtering, selecting, arranging, and summarizing data. It’s essential for efficient data wrangling.

  3. tidyr: Offers functions to reshape and tidy data, ensuring that datasets are structured optimally for analysis. It helps in converting data into a tidy format where each variable is a column, and each observation is a row.

  4. readr: Facilitates the reading of rectangular data formats like CSV and TSV into R. It’s designed for fast and friendly data import.

  5. purrr: Enhances R’s functional programming capabilities by providing tools for working with functions and vectors. It simplifies the process of applying functions to data structures.

  6. tibble: Introduces a modern take on data frames, offering a more user-friendly and consistent experience when working with tabular data.

  7. stringr: Simplifies string manipulation by providing a cohesive set of functions designed to make working with strings as straightforward as possible.

  8. forcats: Offers tools for handling categorical variables (factors) in R, making it easier to work with and analyze categorical data.
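Several of these core packages are typically used together in a single pipeline. A sketch using the built-in mtcars data:

```r
library(tidyverse)

mtcars |>
  as_tibble(rownames = "model") |>               # tibble: keep row names as a column
  filter(cyl %in% c(4, 6)) |>                    # dplyr: keep 4- and 6-cylinder cars
  mutate(model = str_to_upper(model)) |>         # stringr: string manipulation
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg), n = n()) |>    # dplyr: grouped summary
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +   # ggplot2: visualize the result
  geom_col()
```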

By integrating these packages into your R environment, you can streamline your data science workflows, ensuring consistency and efficiency across various tasks.


Brief Details on the stats package

  • Probability Distributions
    • Functions for densities, probabilities, quantiles, and random generation (d*, p*, q*, r*).
    • Examples: dnorm, pnorm, qnorm, rnorm for normal; dbinom, pbinom, qbinom, rbinom for binomial.
    • Includes distribution-family objects (Beta, Binomial, Poisson, Gamma, etc.).
  • Statistical Tests
    • Tests of location, variance, proportions, and rank-based tests.
    • Examples: t.test, wilcox.test, bartlett.test, chisq.test, fisher.test, prop.test.
    • Multiple-comparison adjustments (p.adjust).
  • Linear and Generalized Linear Models
    • Main fitting functions: lm (linear models), glm (generalized linear models).
    • Family objects: binomial, poisson, Gamma, quasi.
    • Methods: anova, summary, predict, update, and step for model selection.
  • Model Diagnostics and Influence
    • Evaluate model fit and investigate outliers or influential points.
    • Examples: cooks.distance, dfbeta, dfbetas, dffits, hatvalues, influence.
    • Tools for checking deviance, log-likelihood (deviance, logLik).
  • ANOVA and Model Summaries
    • Functions to compare nested models and produce summaries.
    • Examples: aov, anova, summary.aov, summary.glm, summary.lm.
    • Tools for investigating factor-level differences (TukeyHSD).
  • Multivariate Analyses
    • Techniques for factor analysis, principal components, canonical correlations, and clustering.
    • Examples: factanal, prcomp, princomp, cancor, hclust, kmeans.
    • Includes supporting functions like biplot, screeplot, rotation methods (varimax, promax).
  • Time Series
    • Methods for modeling, forecasting, and decomposing time-series data.
    • Examples: ts, acf, pacf, ar, arima, HoltWinters, filter, stl.
    • Diagnostic and visualization tools: ts.plot, tsdiag, cpgram, monthplot.
  • Smoothing, Interpolation, and Nonlinear Fits
    • loess, lowess, spline, splinefun, approx, density, smooth.spline.
    • Nonlinear least squares fitting with nls and self-starting models (SSasymp, SSlogis, etc.).
  • Formulas, Terms, and Model Frames
    • Tools for building and manipulating model formulas.
    • Examples: formula, terms, update, reformulate, model.frame, model.matrix.
    • Functions for adding or dropping terms (add1, drop1) and managing contrasts.
  • Data Summaries and Utilities
    • Summaries by group, reshaping, and weighting.
    • Examples: aggregate, ave, weighted.mean, reshape.
    • Diagnostics for missing data: na.omit, na.exclude, na.pass, complete.cases.
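A brief sketch touching several of these clusters at once, on simulated data: random generation and cumulative probabilities, a one-sample t-test, and a linear model with its diagnostics.

```r
set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)   # random generation (r* function)
y <- 3 + 0.5 * x + rnorm(50)

pnorm(12, mean = 10, sd = 2)        # cumulative probability (p* function)
t.test(x, mu = 10)                  # one-sample t-test of location

fit <- lm(y ~ x)                    # linear model
summary(fit)                        # coefficients, R-squared
anova(fit)                          # analysis-of-variance table
cooks.distance(fit)[1:5]            # influence diagnostics for the first 5 points
```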

These clusters capture the main themes of the stats package. Each group helps streamline data analysis workflows.


Brief Details on the base package

  • Basic Data Types and Coercion
    • Creating and coercing vectors: vector, logical, numeric, integer, complex, raw, plus the as.* and is.* families.
    • Converting between types: as.character, as.numeric, as.logical, as.complex, as.factor.
    • Checking object attributes: mode, typeof, storage.mode, attributes, attr.
  • Data Structures
    • Vectors: c, rep, unique, duplicated, rev, match, %in%.
    • Lists: list, lapply, sapply, vapply, unlist, as.list.
    • Matrices and Arrays: matrix, array, dim, dimnames, aperm, as.matrix, t, rbind, cbind, colSums, rowSums.
    • Data Frames: data.frame, as.data.frame, merge, [.data.frame, $<-.data.frame.
    • Factors: factor, as.factor, ordered, levels, droplevels, is.factor.
    • Tables: table, xtabs, prop.table, margin.table.
  • Subsetting and Replacement
    • Operators: [, [[, $, and their replacements ([<-, [[<-, $<-).
    • Removing duplicates or dropping dimensions: unique, drop.
    • Splitting and recombining: split, unsplit, cut, tapply.
  • String Handling
    • Simple manipulations: paste, paste0, nchar, substr, strsplit, tolower, toupper, trimws.
    • Case folding and translation: chartr, casefold.
    • Matching and substitution: grep, grepl, regexpr, gregexpr, regmatches, gsub, sub, pmatch, agrep.
    • Encoding-related: iconv, Encoding, enc2utf8, enc2native.
  • File and Directory Operations
    • File paths: file.path, path.expand, basename, dirname, normalizePath.
    • File manipulation: file.create, file.copy, file.remove, file.rename, file.append.
    • Directories: dir, dir.create, unlink.
    • File info: file.info, file.exists, file.size, Sys.readlink.
  • Input/Output and Connections
    • Reading/writing text: readLines, writeLines, cat, print, message, warning.
    • Binary data: readBin, writeBin, readChar, writeChar.
    • Connections (files, URLs, pipes): file, gzfile, bzfile, xzfile, url, pipe, socketConnection.
    • Special connections: textConnection, rawConnection.
    • Managing connections: open, close, flush, seek, isOpen, truncate.
  • Math and Special Functions
    • Arithmetic: +, -, *, /, ^, %/%, %%, %*% (matrix multiply).
    • Rounding: round, ceiling, floor, trunc, signif.
    • Trigonometric and hyperbolic: sin, cos, tan, asin, acos, atan, sinh, cosh, tanh.
    • Exponential and logs: exp, log, expm1, log1p.
    • Cumulative/math: cumsum, cumprod, cummax, cummin.
    • Special: gamma, lgamma, beta, choose, factorial, digamma, psigamma.
  • Dates and Times
    • Date classes and conversion: Date, POSIXct, POSIXlt, as.Date, as.POSIXct, as.POSIXlt.
    • Time intervals: difftime, as.difftime.
    • Extracting components: weekdays, months, quarters, julian.
    • Rounding/truncation: round.POSIXt, trunc.POSIXt.
  • Control Flow
    • Conditionals: if, ifelse.
    • Loops: for, while, repeat.
    • Exiting loops: break, next.
    • Exiting functions: return.
  • Environment and Scope
    • Environment access: environment, globalenv, parent.env, new.env.
    • Assigning objects: assign, <-, <<-, env$var, rm, exists.
    • Searching and attaching: search, attach, detach.
    • Namespace handling: loadNamespace, unloadNamespace, isNamespaceLoaded.
  • Evaluations and Expressions
    • Expression parsing: parse, eval, evalq, expression, quote, bquote, substitute.
    • Calls and arguments: call, match.call, args, formals, do.call.
    • Delayed evaluation: delayedAssign, force.
  • Condition Handling and Recovery
    • Generating conditions: stop, warning, message, signalCondition.
    • Handling conditions: try, tryCatch, withCallingHandlers, conditionMessage, invokeRestart.
    • Checking equality: all.equal, identical.
  • Miscellaneous
    • System information and environment variables: Sys.getenv, Sys.setenv, getwd, setwd.
    • Memory and garbage collection: gc, memory.profile, object.size.
    • Object summaries: summary, str, print.
    • Searching for objects: ls, objects, apropos.
    • Higher-order functions: Filter, Find, Map, Position, Reduce, Negate.
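A compact sketch of base-R staples from several of these clusters (data structures, subsetting, strings, regex matching, and higher-order functions); the values are illustrative:

```r
v <- c(a = 1, b = 2, c = 3)
v[c("a", "c")]                             # subsetting a named vector

s <- "  Center for Computational Analytics  "
toupper(trimws(s))                         # string handling

m <- matrix(1:6, nrow = 2)
rowSums(m)                                 # row sums of a matrix
t(m)                                       # transpose

lst <- list(x = 1:5, y = 6:10)
sapply(lst, mean)                          # apply a function over a list

grepl("^tidy", c("tidyverse", "broom"))    # regex matching: TRUE FALSE
Reduce(`+`, 1:5)                           # higher-order function: 15
```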

These clusters highlight the main functionalities in the base package.

What’s Next?