Principal Component Analysis

A user's guide for the categorization method used in the paper Øieroset, M., M. Yamauchi, L. Liszka, and B. Hultqvist, 'Energetic ion outflow from the dayside ionosphere: Categorization, classification, and statistical study', Journal of Geophysical Research, 104, 24,915, 1999

by Marit Øieroset, February 24, 1999

Principal component analysis is a useful tool for categorizing data, since it separates the dominating features in the data set. Here we present, step by step, the procedure used for the categorization of the Viking ion data presented in the papers Øieroset et al. [JGR 1999a,b]. The procedure can easily be generalized to similar data sets. The programs were developed at the Swedish Institute of Space Physics in Sörfors, Umeå. Free copies are available for both UNIX and PC (contact L. Liszka, e-mail ludwik@irf.se). The following programs are needed for the analysis:
  • corr; the principal component analysis.
  • mxmult; the inverse principal component analysis.
  • drop; for manipulating rows and columns in big files.
The programs are self-explanatory. The available options are displayed by typing the program name.

The original data matrix 'viking.dat' consisted of approximately 23,000 rows (satellite spins) and 61 columns, where 9 of the columns were so-called 'support variables' and 52 of the columns were differential flux values (for 4 energy levels and 13 pitch angles). By support variables we mean orbit number, date, time of day, MLT, ILAT, etc. Principal component analysis cannot be done on a matrix containing support variables, so the first step is to remove these. This is done with 'drop', which leaves a 52-column matrix 'flux.dat' containing differential flux values only.
    drop viking.dat flux.dat -e''10-61'' -ki
Option '-e' selects the columns (here: 10-61), and option '-ki' means that the selected columns will be kept (using option '-k' instead would have meant that the selected columns would be removed).

To further prepare for the principal component analysis, it is necessary to add some noise to the matrix, again using 'drop'. Noise is added because principal component analysis cannot be done on a matrix containing zeros, which would sometimes be the case for the Viking ion data. Adding noise to a data set will not lead to loss of information.
    drop flux.dat fluxn.dat -n -e''1-52''
Option '-n' adds Gaussian noise to the selected columns 1-52. We are now ready to perform the principal component analysis on the matrix 'fluxn.dat'.
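The two preparation steps above (keeping only the flux columns, then adding Gaussian noise) can be sketched in Python with NumPy. This is only an illustration of the idea, not a reimplementation of 'drop': the function names, the random stand-in data, and the noise amplitude `sigma` are assumptions, since the amplitude used by option '-n' is not stated here.

```python
import numpy as np

def keep_columns(data, first, last):
    """Keep columns first..last (1-based, inclusive), as in the range '10-61'."""
    return data[:, first - 1:last]

def add_gaussian_noise(data, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every entry."""
    return data + rng.normal(0.0, sigma, size=data.shape)

rng = np.random.default_rng(0)

# Random stand-in for 'viking.dat': 100 spins, 9 support columns + 52 flux columns.
viking = rng.random((100, 61))
viking[:5, 9:] = 0.0  # a few spins with zero flux, as can occur in real data

flux = keep_columns(viking, 10, 61)           # plays the role of 'flux.dat'
fluxn = add_gaussian_noise(flux, 1e-3, rng)   # plays the role of 'fluxn.dat' (sigma assumed)
```

The noise amplitude is chosen small compared with the data values, so the zero entries disappear while the overall structure of the matrix is preserved.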
    corr -a -e -f -z fluxn.dat
Option '-a' computes all possible principal components, '-e' the eigenvalues, and '-f' the transformation matrix, while '-z' ensures that all rows are terminated correctly. 'corr' produces four output files:
  • fluxn.eig; the vector of eigenvalues for the old matrix ('fluxn.dat').
  • fluxn.fsc; the transformation matrix between the old and new coordinate system (called component score coefficients).
  • fluxn.sco; the matrix of the projections of the old variables on the new coordinate axes (called component scores).
  • fluxn.imp; used for causal analysis (not used here).
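As a rough picture of what the eigenvalues, score coefficients, and scores are, the following Python sketch performs a textbook correlation-based principal component analysis with NumPy. It assumes that the analysis is done on standardized columns and that the transformation matrix is the plain eigenvector matrix; the internals and exact conventions of 'corr' are not documented here, so the function `pca_correlation` and the test data are purely illustrative.

```python
import numpy as np

def pca_correlation(data):
    """Return eigenvalues, score coefficients and scores of a correlation-based PCA."""
    # Standardize each column to zero mean and unit variance.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = (z.T @ z) / (len(z) - 1)           # correlation matrix of the columns
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder so the largest comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = z @ eigvecs                      # projections on the new axes
    return eigvals, eigvecs, scores

rng = np.random.default_rng(1)
# Illustrative data: 500 rows, 6 correlated columns.
data = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
eigvals, coeffs, scores = pca_correlation(data)
```

In this formulation the eigenvalues sum to the number of columns, and the variance of each score column equals the corresponding eigenvalue, which is why an eigenvalue can be read as the share of total variance described by its component.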
The new coordinate system is 'principal component space' (here 52-dimensional). There are 52 principal components, but not all of them are significant. The number of significant components can be found by checking the eigenvalues in 'fluxn.eig'. The first row of this file contains the eigenvalue of each principal component for our data set, while the second row contains the eigenvalues for a matrix containing pure noise. The eigenvalue of a principal component corresponds to the amount of the total variance in the data described by that component. A principal component is significant if its eigenvalue is greater than the corresponding value for the case of pure noise. Hence we compare the first and second rows of 'fluxn.eig'; in the study presented here, the first six components were significant. This indicates that we have six categories in the data.

The next step is to identify these six categories using the inverse principal component analysis, 'mxmult'. The inverse transform shows how the particle flux would vary for each component alone. This is done by setting to zero the contributions from all components except the one we are interested in.
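The significance test described above (comparing the eigenvalues of the data with those of a pure-noise matrix) can be sketched in Python as follows. How 'corr' constructs its noise reference is not documented here, so the sketch assumes a Gaussian noise matrix of the same shape as the data; the function names and the synthetic data with three built-in factors are illustrative only.

```python
import numpy as np

def correlation_eigenvalues(data):
    """Eigenvalues of the column correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.linalg.eigvalsh(corr)[::-1]

def count_significant(data, rng):
    """Count leading components whose eigenvalue exceeds the pure-noise value."""
    noise = rng.normal(size=data.shape)       # assumed noise reference
    above = correlation_eigenvalues(data) > correlation_eigenvalues(noise)
    # Stop counting at the first component that does not exceed the noise level.
    return len(above) if above.all() else int(np.argmin(above))

rng = np.random.default_rng(2)
# Synthetic data: 3 independent factors, each driving 4 of the 12 columns.
latent = rng.normal(size=(2000, 3))
mixing = np.kron(np.eye(3), np.ones((1, 4)))  # shape (3, 12)
data = latent @ mixing + 0.1 * rng.normal(size=(2000, 12))

n_significant = count_significant(data, rng)  # the three built-in factors
```

For real data the two eigenvalue rows would be read from 'fluxn.eig' rather than recomputed; the comparison itself is the same element-by-element test.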
    mxmult -a -i -f fluxn.fsc -b1 fluxn.sco
Option '-a' means all components, '-i' inverse transform, and '-f' the transformation matrix, while '-b1' selects the first principal component and sets all others to zero. The output from 'mxmult' is a file 'fluxn.tad', which should be renamed to, e.g., 'c1.dat' to keep track of each component. We can now do the principal component analysis on 'c1.dat'.
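The idea of an inverse transform that keeps a single component and zeroes all others can be sketched in Python as below. This is the standard mean-centered, correlation-based form and is an assumption; it is not claimed to reproduce the numerical output of 'mxmult', whose conventions are not described here. The function name and the test data are hypothetical.

```python
import numpy as np

def reconstruct_single_component(data, k):
    """Rebuild the data matrix using only principal component k (0-based)."""
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)
    z = (data - mean) / std                          # standardized data
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(data, rowvar=False))
    eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]  # largest eigenvalue first
    scores = z @ eigvecs
    kept = np.zeros_like(scores)
    kept[:, k] = scores[:, k]                        # zero all other components
    return (kept @ eigvecs.T) * std + mean           # back to original units

rng = np.random.default_rng(3)
data = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5)) + 10.0
parts = [reconstruct_single_component(data, k) for k in range(5)]
```

Because the transform is linear, the single-component reconstructions add up to the original data once the column means, which each reconstruction carries, are counted only once. In this centered form every reconstruction has the same column means as the data, so differences between components show up in how the values vary from row to row.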
    corr c1.dat
The output files will be 'c1.mvd', with 52 rows (4 energy levels times 13 pitch angles) containing the average differential flux values characterizing this component, and 'c1.imp', used for causal analysis (not used here). In the present study we plotted the average differential flux values in a pitch angle versus energy plot, as shown in Figure 3. We could then recognize the categories, as explained in the categorization section above. The procedure using 'mxmult' was repeated for each principal component, and the results were plotted in Figure 3.

Not all data sets are well suited for categorization with principal component analysis. A large number of rows is needed, and the columns should all be values of the same physical quantity (all columns should have the same unit). The interpretation of the principal components in terms of categories must be done with great care. It is important to take into account the magnitude of the flux values characterizing each component. This was done above by comparing the maximum average flux values for each component, and from this we found that some components described the variations of others. The theoretical flux values from the inverse transform may not correspond to the real flux values. However, comparing the theoretical flux values from each component has proved to be a well-suited method for classification into the different categories.

Only a few years ago principal component analysis was time-consuming and not well suited for practical implementations. This has changed with the rapid development in computer technology over the last ten years. For the data matrix used in the present study (23,000 rows by 52 columns), the full analysis took only a couple of minutes. The method should also be fast for larger data sets and provide a useful tool for categorization and classification.
    Marit Oieroset
    Last modified: Mon Nov 8 14:36:51 PST 1999