A brief history
●The R project was born in the Department of Statistics of the University of Auckland, New Zealand;
●The founders of the project are Robert Gentleman and Ross Ihaka, now associate professors;
●The project started in 1991, but the first release was in 1996;
●R is now widely regarded as one of the most powerful languages for statistical computing;
The Computing Environment
●Cross-platform (Windows, macOS, Linux);
●Open-source (software, manuals, reference cards, all downloadable from the www.r-project.org website);
●It has numerous integrated tools for data analysis;
●Supports matrix computation;
●Data are easy to manipulate and store;
●The term environment is intended to distinguish R as a fully planned and coherent system, rather than a collection of extremely specific and inflexible tools.
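As a small illustration of the matrix computation mentioned above, the following sketch uses base R only. The matrix A and the vector b are made-up values chosen for the example, not data from the course.

```r
# Hypothetical 2 x 2 matrix and right-hand side (illustrative values only)
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(3, 5)

At   <- t(A)         # transpose of A
AA   <- A %*% A      # matrix product (%*% is the matrix multiplication operator)
Ainv <- solve(A)     # inverse of A
x    <- solve(A, b)  # solution of the linear system A x = b

print(AA)
print(x)
```

Note that `*` would multiply the matrices element by element; `%*%` is the operator for the usual row-by-column product.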
Statistical Analysis Techniques
Most of the statistical techniques, from the most classic to the most recent, have been implemented in the R environment.
Only some of these are integrated into the basic environment; many others are provided as packages through the family of websites called CRAN (Comprehensive R Archive Network).
⮚A community of over 2 million users and developers provides time and technical expertise to maintain, support and develop the R language and environment, tools and infrastructure.
⮚At the heart of the community, the R Core group, of about 20 members, takes care of the maintenance and guides the evolution of R.
⮚The official public structure is provided by the R Foundation, a non-profit organization that ensures the financial stability of the R project and administers the copyright of the software and documentation.
●Go to the site https://www.r-project.org/
●Click Download R
●Choose the CRAN mirror you want (the physical location from which to download the software)
●Choose the operating system for which to download the program (Windows, Linux, macOS)
●Click "install R for the first time"
●Start the download
⮚The most commonly used and most accessible interface is RStudio, downloadable from https://www.rstudio.com/
⮚RStudio provides a user-friendly interface that makes R easier to use;
⮚Click on Download (RStudio);
⮚Choose the free version;
⮚Start the download;
⮚Integrated Development Environment (IDE) for R
⮚The RStudio working environment consists of 4 windows:
Code window (write/execute scripts)
Multi Tab Window
⮚Packages: allows you to download packages for performing statistical analyses, such as Principal Component Analysis.
Example: click Install and install the ggplot2 package
⮚Help: shows the description of a package.
Example: Type ggplot2
⮚Files: allows you to quickly access saved files after creating an R project
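The Packages and Help tabs have console equivalents. The sketch below is one possible way to do the same steps from code; the `interactive()` guard is an assumption added here so that the installation (which needs an internet connection) only runs when you are working at the console.

```r
# Check whether ggplot2 is already installed (TRUE or FALSE).
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

# Install from CRAN only if it is missing and we are in an interactive session.
if (!has_ggplot2 && interactive()) {
  install.packages("ggplot2")
}

# If the package is available, load it and open its help index
# (the same information shown in the Help tab).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  help(package = "ggplot2")
}
```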
⮚With RStudio you can create a project that defines the working directory and keeps all the data, packages and code in one place.
⮚To create a new project, go to the menu at the top left and select File -> New Project
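Once a project is open, the working directory can be checked from the console. This is a minimal sketch using base R functions only.

```r
# Show the current working directory
# (inside an RStudio project this is the project folder).
wd <- getwd()
print(wd)

# List the files stored in the working directory.
print(list.files(wd))
```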
⮚Getting Started: Loading Data
⮚R can read different types of data (TXT, CSV, XLS, XLSX, SPSS, STATA), but the simplest and most immediate option is the CSV format (Comma-Separated Values).
⮚To load a CSV file, select Environment in the top-right pane -> Import Dataset -> From Text File, then select the directory and file.
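The same import can be done from code with `read.csv`. The sketch below writes a tiny made-up CSV file to a temporary location so that it is self-contained; the file content and the column name `Paese` are assumptions for illustration, while `nazioni` and `PIL.pro.capite` follow the names used later in these slides.

```r
# Create a small, made-up CSV file (illustrative values only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Paese,PIL.pro.capite",
             "A,10000",
             "B,20000",
             "C,30000"), csv_path)

# Read the file into a data frame; header = TRUE takes the
# column names from the first row.
nazioni <- read.csv(csv_path, header = TRUE)

str(nazioni)          # column names and types
print(head(nazioni))  # first rows of the dataset
```

With a real file, replace `csv_path` with the path of the CSV file inside the project folder.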
⮚Scripts and notebooks allow you to keep track of the code and analyses carried out within the R project and to save them on the PC for later consultation.
R Notebook: allows you to create a report of a project, including all the steps, operations and graphs produced.
R Notebook: commands must be placed inside special chunks (ALT + CTRL + I); descriptive text goes outside them.
Create a file in which to insert all the code needed for the analysis
⮚Lines of code can be selected together and run at once
A first exploration of the distribution of the variables contained in the countries dataset (nazioni) is obtained through the summary command, which must be typed in the window called Console.
summary(nazioni) # or, for a single variable: summary(nazioni$PIL.pro.capite)
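To see what `summary` returns, here is a short sketch on a made-up stand-in for the `nazioni` dataset. The four rows and the `Paese` column are assumptions for illustration only.

```r
# Hypothetical stand-in for the countries dataset (illustrative values only).
nazioni <- data.frame(
  Paese          = c("A", "B", "C", "D"),
  PIL.pro.capite = c(10000, 20000, 30000, 40000)
)

# Summary of every column of the dataset.
print(summary(nazioni))

# Summary of a single quantitative variable:
# minimum, quartiles, median, mean and maximum.
s <- summary(nazioni$PIL.pro.capite)
print(s)
```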
Other Descriptive Statistics
You can assign a name to each column of interest, for example PIL <- nazioni$PIL.pro.capite
The main summary statistics for quantitative variables are:
⮚Mean: mean(PIL) or mean(nazioni$PIL.pro.capite) or mean(nazioni[,3])
⮚Variance: var(PIL) or var(nazioni$PIL.pro.capite) or var(nazioni[,3])
⮚Standard deviation: sd(PIL) or sd(nazioni$PIL.pro.capite) or sd(nazioni[,3])
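The three ways of referring to a column give the same result, as this sketch on made-up data shows. The values are illustrative, and in this toy data frame the variable is in column 2 rather than column 3.

```r
# Hypothetical stand-in for the countries dataset (illustrative values only).
nazioni <- data.frame(
  Paese          = c("A", "B", "C", "D"),
  PIL.pro.capite = c(10000, 20000, 30000, 40000)
)

# Assign a short name to the column of interest.
PIL <- nazioni$PIL.pro.capite

m <- mean(PIL)  # arithmetic mean
v <- var(PIL)   # sample variance (denominator n - 1)
s <- sd(PIL)    # standard deviation, the square root of the variance

print(c(mean = m, variance = v, sd = s))

# The same mean, selecting the column by name or by position.
print(mean(nazioni$PIL.pro.capite))
print(mean(nazioni[, 2]))
```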
The box-plot describes a quantitative variable through the graphical representation of the minimum, maximum, quartiles and median.
⮚boxplot(nazioni$PIL.pro.capite, main = "Box-Plot del PIL pro capite")
⮚boxplot(nazioni[,4], main = "Box-Plot del PIL pro capite")
⮚boxplot(PIL, main = "Box-plot del PIL pro capite")
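The values drawn by the box-plot can also be inspected numerically. In this sketch the vector `PIL` contains made-up values, and the first line is an assumption added so the code also runs as a script without opening a plot window.

```r
if (!interactive()) pdf(NULL)  # no plot window when run as a script

# Hypothetical values for the variable (illustrative only).
PIL <- c(10000, 20000, 30000, 40000, 50000)

# boxplot() draws the chart and invisibly returns the statistics it used.
b <- boxplot(PIL, main = "Box-plot del PIL pro capite")

# Lower whisker, first quartile, median, third quartile, upper whisker.
print(b$stats)
```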
⮚Performing an exploratory analysis on the type of relationship between two variables
⮚Example from the dataset: analyze the relationship between average age and life expectancy. Is there a relationship?
⮚1) Name the variables of interest
The command to prepare the scatterplot is:
plot(asp, eta, xlab="Aspettativa di vita", ylab="Età media")
SCATTER DIAGRAM: What can you say?
From the scatterplot there appears to be a relationship between the variables Life expectancy and Average age.
Specifically, as the average age increases, life expectancy increases.
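The whole procedure can be sketched as follows. The data and the column names `Aspettativa.di.vita` and `Eta.media` are hypothetical, since the actual column names of the dataset are not shown here; the correlation coefficient is an addition that gives a numerical check of what the scatterplot suggests.

```r
if (!interactive()) pdf(NULL)  # no plot window when run as a script

# Hypothetical stand-in for the countries dataset (illustrative values only).
nazioni <- data.frame(
  Aspettativa.di.vita = c(60, 65, 70, 75, 80),
  Eta.media           = c(20, 25, 30, 35, 42)
)

# 1) Name the variables of interest.
asp <- nazioni$Aspettativa.di.vita
eta <- nazioni$Eta.media

# 2) Draw the scatterplot.
plot(asp, eta, xlab = "Aspettativa di vita", ylab = "Età media")

# 3) Pearson correlation: values close to +1 indicate a strong
#    positive linear relationship.
r <- cor(asp, eta)
print(r)
```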
⮚Load the ANAG dataset
⮚Name the Sesso (gender) column -> sesso <- ANAG$Sesso
⮚For qualitative variables, the first description concerns the frequency distribution analysis.
Create the frequency distribution for the variable «sesso» -> table(sesso)
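A minimal sketch of the frequency distribution, using a made-up stand-in for the ANAG dataset (the six responses below are illustrative only):

```r
# Hypothetical stand-in for the ANAG dataset (illustrative values only).
ANAG <- data.frame(
  Sesso = c("Femmina", "Maschio", "Femmina", "N/A", "Maschio", "Femmina")
)

# Name the column of interest.
sesso <- ANAG$Sesso

# Absolute frequency of each category.
x <- table(sesso)
print(x)
```

The table `x` is the object passed to the charting commands for qualitative variables.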
⮚One way to represent the distribution of a qualitative variable graphically is the pie chart, whose slices are proportional to the frequencies of each category.
⮚Pie chart without percentages:
x <- table(sesso) # frequency table of the variable
pie(x, main = "Grafico a torta sul sesso")
PIE CHART WITH PERCENTAGES
labels <- c("Femmina", "Maschio", "N/A") # ADD LABELS (SAME ORDER AS THE TABLE)
n <- nrow(ANAG) # SAMPLE SIZE (NUMBER OF ROWS OF THE DATASET)
pct <- round(x/n*100) # CALCULATION OF PERCENTAGES
lbls <- paste(labels, pct) # ADD PERCENTAGES TO LABELS
lbls <- paste(lbls, "%", sep="") # ADD THE % SYMBOL TO LABELS
pie(x, labels = lbls, col = rainbow(length(lbls)), main = "Grafico a torta del genere dei rispondenti")
⮚The bar chart is useful for qualitative variables and highlights the absolute frequencies of each category.
barplot(x, main="Genere dei rispondenti", border="blue", ylab="Frequenze Assolute")
BAR CHART: Calculate relative frequencies
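Relative frequencies can be obtained by dividing the table of absolute frequencies by the sample size. The sketch below uses `prop.table` on made-up responses; the data and the axis label "Frequenze Relative" are assumptions for illustration.

```r
if (!interactive()) pdf(NULL)  # no plot window when run as a script

# Hypothetical responses (illustrative values only).
sesso <- c("Femmina", "Maschio", "Femmina", "N/A", "Maschio", "Femmina")

x   <- table(sesso)   # absolute frequencies
rel <- prop.table(x)  # relative frequencies, i.e. x / sum(x)
print(round(rel, 2))

# Bar chart of the relative frequencies.
barplot(rel, main = "Genere dei rispondenti", border = "blue",
        ylab = "Frequenze Relative")
```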
This course introduces R and RStudio. We will cover the history of R, its computing environment, the statistical analysis techniques it offers and its community, see how to install it, and explore RStudio by creating a project and a notebook.