As an illustrative example, we solve the problem of classifying transportation mode based on age and income using LDA in R. This is easily done with the “lda” function from the “MASS” library. For all the analyses presented here, we need to install and load the following R packages:
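A minimal sketch of this setup step, assuming that MASS is the only package strictly required for the LDA itself (any additional plotting packages are optional and not shown here):

```r
# Install MASS only if it is not already available, then load it.
# MASS ships with most R distributions, so the install step is usually skipped.
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}
library(MASS)

# The lda() function is now available in the session.
is.function(lda)
```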
The data come in a csv file (called “trasnpor_example”), which can be imported into R by running this piece of code:
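One possible way to write the import is sketched below. The ".csv" extension, the location of the file in the working directory, and the object name `transport` are assumptions made for illustration.

```r
# File name taken from the text; extension and location are assumptions.
csv_path <- "trasnpor_example.csv"

if (file.exists(csv_path)) {
  # Read the file, turning text columns (e.g. the transport mode) into factors.
  transport <- read.csv(csv_path, stringsAsFactors = TRUE)
  str(transport)  # inspect column names and types before going further
} else {
  message("File not found: ", csv_path)
}
```

Checking the structure with `str()` right after the import is a cheap way to confirm that the grouping variable was read as a factor.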
In order to get a first impression of the data, we can display the sample as a scatter plot:
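A scatter plot of this kind could be drawn with base graphics as in the sketch below. So that the snippet runs on its own, it uses simulated stand-in data rather than the real sample. The column names `age`, `income` and `mode`, the three mode labels and all numeric values are hypothetical.

```r
# Simulated stand-in data (hypothetical values, NOT the real sample).
set.seed(1)
n <- 50
transport <- data.frame(
  age    = c(rnorm(n, 25, 5), rnorm(n, 45, 8), rnorm(n, 35, 6)),
  income = c(rnorm(n, 20, 5), rnorm(n, 60, 10), rnorm(n, 35, 7)),
  mode   = factor(rep(c("bus", "car", "train"), each = n))
)

# Scatter plot of income against age, one colour per transportation mode.
plot(transport$age, transport$income,
     col = as.integer(transport$mode), pch = 19,
     xlab = "Age", ylab = "Income")
legend("topleft", legend = levels(transport$mode),
       col = seq_along(levels(transport$mode)), pch = 19)
```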
The code lines above produce the scatterplot shown in the introductory section of this document. Alternatively, we could plot the data as a series of histograms:
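One way to obtain stacked histograms by group is the `ldahist` function from MASS, sketched here on simulated stand-in data. The column names, mode labels and numeric values are hypothetical, and the plotting function may differ from the one originally used.

```r
library(MASS)

# Simulated stand-in data (hypothetical values, NOT the real sample).
set.seed(1)
n <- 50
transport <- data.frame(
  age    = c(rnorm(n, 25, 5), rnorm(n, 45, 8), rnorm(n, 35, 6)),
  income = c(rnorm(n, 20, 5), rnorm(n, 60, 10), rnorm(n, 35, 7)),
  mode   = factor(rep(c("bus", "car", "train"), each = n))
)

# One histogram per transportation mode, stacked on a common horizontal axis.
ldahist(data = transport$age, g = transport$mode)
ldahist(data = transport$income, g = transport$mode)
```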
By running either of these two lines, we can see at a glance how transportation mode is distributed across values of age and income. For example:
Obtaining:
Or:
LDA is conducted by simply running:
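A minimal sketch of the call, run on simulated stand-in data so that it is self-contained. The formula assumes hypothetical column names `age`, `income` and `mode`, and the simulated values are not the real sample.

```r
library(MASS)

# Simulated stand-in data (hypothetical values, NOT the real sample).
set.seed(1)
n <- 50
transport <- data.frame(
  age    = c(rnorm(n, 25, 5), rnorm(n, 45, 8), rnorm(n, 35, 6)),
  income = c(rnorm(n, 20, 5), rnorm(n, 60, 10), rnorm(n, 35, 7)),
  mode   = factor(rep(c("bus", "car", "train"), each = n))
)

# Fit the LDA: transportation mode explained by age and income.
fit <- lda(mode ~ age + income, data = transport)

# Printing the fit shows priors, group means, LD coefficients and
# the proportion of trace.
print(fit)
```

With three groups and two predictors the fit contains two discriminant functions, stored column-wise in `fit$scaling`.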
The typical output shows the prior probabilities and means of each group, the coefficients of the LD projections, and the proportion of the between-group variance (trace) that each LD coordinate explains:
In our example, the first LD coordinate is positively correlated with income and negatively correlated with age, and accounts for almost 90% of the between-class variability. The second LD function shows a positive but weaker correlation with both variables, and accounts for only about 10% of the between-class variability.
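The quantities discussed above can also be extracted from the fitted object directly. The sketch below does so on simulated stand-in data (hypothetical column names and values), so its numbers will not match the 90% / 10% split reported for the real sample.

```r
library(MASS)

# Simulated stand-in data (hypothetical values, NOT the real sample).
set.seed(1)
n <- 50
transport <- data.frame(
  age    = c(rnorm(n, 25, 5), rnorm(n, 45, 8), rnorm(n, 35, 6)),
  income = c(rnorm(n, 20, 5), rnorm(n, 60, 10), rnorm(n, 35, 7)),
  mode   = factor(rep(c("bus", "car", "train"), each = n))
)
fit <- lda(mode ~ age + income, data = transport)

# Proportion of trace: share of between-group variability per LD function.
prop_trace <- fit$svd^2 / sum(fit$svd^2)
print(round(prop_trace, 3))

# Correlation between the original variables and the LD scores.
scores <- predict(fit)$x
ld_cor <- cor(transport[, c("age", "income")], scores)
print(round(ld_cor, 3))
```

Reading the correlation matrix column by column shows which original variable each LD coordinate tracks most closely.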
The new coordinates are obtained by projecting the original data points onto the LD coefficient vectors, so that each LD score is a linear combination of age and income weighted by the corresponding coefficients. In these new coordinates, observations are more clearly separated across groups. In our example, each individual has two LD coordinates, determined by their age and income. The coordinates corresponding to the first LD function have the largest discriminant power. We can easily see this discriminant power by plotting a histogram in R, now putting the first LD coordinate on the horizontal axis:
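The projection and the histogram can be sketched as follows, again on simulated stand-in data with hypothetical column names and values. The manual computation mirrors what `predict` does for an `lda` fit in MASS: the data are centred at the prior-weighted average of the group means and then multiplied by the coefficient matrix.

```r
library(MASS)

# Simulated stand-in data (hypothetical values, NOT the real sample).
set.seed(1)
n <- 50
transport <- data.frame(
  age    = c(rnorm(n, 25, 5), rnorm(n, 45, 8), rnorm(n, 35, 6)),
  income = c(rnorm(n, 20, 5), rnorm(n, 60, 10), rnorm(n, 35, 7)),
  mode   = factor(rep(c("bus", "car", "train"), each = n))
)
fit <- lda(mode ~ age + income, data = transport)

# LD scores as returned by predict().
scores <- predict(fit)$x

# The same scores computed by hand: centre, then project.
X <- as.matrix(transport[, c("age", "income")])
centre <- colSums(fit$prior * fit$means)
manual <- scale(X, center = centre, scale = FALSE) %*% fit$scaling
same <- isTRUE(all.equal(unname(manual), unname(scores)))

# Histogram of the first LD coordinate, one panel per transportation mode.
ldahist(data = scores[, "LD1"], g = transport$mode)
```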
Obtaining:
This plot shows that the overlap between groups diminishes considerably. In other words, the first LD coordinate (remember that it is a “composite” that correlates negatively with age and positively with income) discriminates adequately among the transportation categories.