Featured image for the Principal Component Analysis (PCA) 102 blog.

Principal Component Analysis (PCA) 102

This is the second entry in our principal components analysis (PCA) in Excel series. In this tutorial, we will resume our discussion on dimension reduction using a subset of the principal components with a minimal loss of information. We will use NumXL and Excel to carry out our analysis, closely examining the different output elements in an attempt to develop a solid understanding of PCA, which will pave the way to more advanced treatment in future issues.

In this tutorial, we will continue to use the socioeconomic data provided by Harman (1976). The five variables represent the total population (“Population”), median school years (“School”), total employment (“Employment”), miscellaneous professional services (“Services”), and median house value (“House Value”). Each observation represents one of twelve census tracts in the Los Angeles Standard Metropolitan Statistical Area.

Process

Now we are ready to conduct our principal component analysis in Excel. First, select an empty cell in your worksheet where you wish the output to be generated, then locate and click on the principal component (PCA)icon in the NumXL tab (or toolbar).

The Regression Wizard will appear.

Select the cells to range for the five input variable values.

Next, select the “Options” tab.

Initially, the tab is set to the following values:

  • “Standardize Inputs” is checked. Leave this option checked.
  • “Principal Component Output” is checked. Uncheck it.
  • The significance level (aka. α) is set to 5%.
  • “Input Variables” is unchecked. Check this option.
  • Set “No. of PCs included” to 3. This action can be done now or altered later in the output tables, as our formulas are dynamic.
  • Under “Input Variables”, check the “Values” option, so the generated output tables include a fitted value for the input variables using a reduced set of components.


Now, click the “Missing Values” tab.

In this tab, you can select an approach to handle missing values in the data set (X and Y). By default, any missing value found in any observation would exclude the observation from the analysis.

This treatment is a good approach for our analysis, so let’s leave it unchanged.

Now, click “OK” to generate the output tables.

Analysis

1. Statistics

In this table, we show the percentage of the variance of each input variable accounted for (a.k.a final communality) using the first three (3) factors. Unlike the cumulative proportion, this statistic is related to one input variable at a time.

Using this table, we can detect which input variables are poorly presented (i.e., adversely affected) by our dimension reduction. In this example, the “median school years” has the lowest value, yet the final communality is still around 92%.

2. Loadings

In the loading table, we outline the weights of the principal component in each input variable:

To compute the values of an input variable using PC values, we use the weights above to linearly transform them back. For example, the population factor is expressed as follows:

$$\hat X_1=0.23PC_1-0.66PC_2-0.64PC_3$$

3. Fitted Values

Using the first three (3) principal components, NumXL calculates the fitted value for each input variable:

Let’s plot the population (highest final communality) and median school years (lowest final communality) for the original data and for the fitted one.

Conclusion

In this tutorial, we examined the dimension reduction proposition from 5 PCs to 3 PCs without significant loss of information.

What do we do now?

In the first two tutorials, we focused on delivering the key ideas behind the principal component analysis and, to some extent, the rationale behind the dimension reduction proposition. The cross-section socio-economic sample data, although not a time series, served to demonstrate the theory and to show NumXL’s different output tables.

In the third entry of this series, we are ready to look into a set of correlated time series and apply the PCA technique to derive a reduced core set of uncorrelated drivers. Next, we forecast the values (mean and standard error) for the uncorrelated drivers and using the PCA Loadings, imply the corresponding forecast (mean and error) for each input variable.

Leave a Reply

Your email address will not be published. Required fields are marked *

We are glad you have chosen to leave a comment. Please keep in mind that comments are moderated according to our comment policy.