# Fast Sparse Regression and Classification

GPS with  R

(10/13/08)

Model building :  gpsbridge

Prediction:  gpspred

Get GPS solution :  gpssoln

Display coefficient paths :  gpspaths

Print model properties :  gpsmodel.info

Other  :  gpsversion

## Installation Instructions:

Open R

At the R command prompt enter:

> platform = "PLATFORM"
> gpshome = "GPSHOME"
> source("GPSHOME/GPS.r")

Here "PLATFORM" is either the text string "windows" or "linux" depending on the running operating system, and GPSHOME is a text string indicating the full path name (using forward slashes / ) of the directory where  GPS.r, GPStrain.exe and GPSpred.exe are stored. This will be the GPS home directory. (examples: gpshome = "/R_GPS"; gpshome = "/home/jhf/R_GPS")

Note: The above commands need not be entered on successive R sessions provided that on exit from R the "yes" option is selected at the "Save workspace image" prompt.

Overview:

GPS implements bridge regression with the generalized elastic net family of penalties for both squared-error and binary logistic regression loss, as described in  Fast Sparse Regression and Classification. Some familiarity with this paper is recommended.

The  R/GPS interface consists of the R procedures described below. The principal procedure is gpsbridge. Given the input data and various procedure parameters, it estimates the best penalty, among those specified, and it's corresponding path point.  The resulting linear model is invisibly returned  to R as a GPS model object (list) to be input to other GPS procedures for prediction and interpretation.

## gpsbridge: build a linear model via bridge regression using GPS

Usage:

gpsmod = gpsbridge (x, y, mode="regres", pens=c(0.0,0.1,0.2,0.5,1.0,1.5), nfold=-5, std="yes", sel="no", modsel=2, costs=c(1.0,1.0), maxin=ncol(x), obswts="equal", varpens="equal", xmiss=9.9e35, stepsize=0.01, maxstep=10000, qntl=0.025, impl="update", speed=1 , quiet=F, data.check=T)

Required arguments:

x = input predictor data matrix . Rows are observations (design points) and columns are variables. Must be a numeric matrix.

y = input response values. For regression (mode="regres") it is a numeric vector containing the response value at each design point. For classification (mode="class") it is a numeric matrix with two colums respectively containing the number of each of the two classes at the design points (rows).

Optional arguments:

mode = regression /classification flag: mode="regres" implies regression; the outcome or response variable is regarded as numerically valued and the  linear model is used to predict the value of the response. mode="class" implies binary classification. The model produces a numeric score representing the log-odds of  observing the class represented by the first column of y.

pens = numeric scalar/vector containing index/indicies of the generalized elastic net penalties to be considered. The procedure chooses the estimated best penalty among those specified in pens. All values must be 0.0 <= pens <= 2.0

nfold = number of cross-validation iterations (folds). Positive values imply ordinary multifold cross-validation. Negative values imply the use one-fold with validation test set size = nrow(x)/abs(nfold) (see comments below). Must be integer valued.

std = input variable standardization flag: The value std="yes" implies the use standardized variables to construct the path. The value std="no" implies the use of the original (unscaled) variables. In either case the solution output always references the original variables' locations and scales.

sel = selector flag. The value sel="yes" implies the use of the  regularized procedure as variable selection method, followed by an unregularized regression to estimate non zero coefficient values at each path point. The value sel="no" implies the use of the regularized (shrinking) procedure to directly estimate all coefficient values.

modsel  = model selection criterion: For regression (mode="regres"), modsel=1 implies using average absolute error as the criterion to chose the model (penalty and path point); modsel=2 implies using average squared error. For classification (mode="class"), modsel=1 implies using misclassification risk for model selection; modsel=2 implies using average squared error on the probability scale.

costs = misclassification costs: relative costs for misclassifying each respective class (column of y). Used only for classification (mode="class") and modsel=1. Ignored otherwise.

maxin  = maximum number of non zero coefficients (entered predictor variables) allowed.

obswts = training observation weights. The value obswts="equal" implies the use of equal weighting for all observations. Otherwise obswts is a numeric vector of length nrow(x) containing the user specified weights for the observations.

varpens = predictor variable coefficient relative penalties. The value varpens="equal" implies that all predictor variable coefficients are to be equally penalized. Otherwise varpens is a numeric vector of length ncol(x) containing the user specified relative penalty weights for each variable coefficient. In this context two values have special meaning: varpens(j) <= 0 specifies that the coefficient of the jth variable is not to be penalized; varpens(j) >= 9.9e30 specifies that the jth variable is not to be used (coefficient constrained to be zero).

xmiss = predictor variable missing value flag. Must be numeric and larger than any non missing predictor variable value in x.  Predictor variable values greater than or equal to xmiss are regarded as missing. Predictor variable data values of NA are internally set to the value of xmiss and thereby regarded as missing.

stepsize = step size scale factor: maximum fractional reduction of risk allowed at each step (see comments below).

maxstep = maximum number of steps. Increase maxstep and/or stepsize in response to "end of path not reached "warning (see comments below).

qntl = input predictor variable conditioning: this provides robustness to outliers among the predictor variables. Each input predictor variable value x(i, j) is replaced by max(low(j), min(high(j), x(i, j))) where low(j) is the qntl quantile of  x(1:nrow(x), j) and high(j) is the corresponding (1-qntl) quantile. A value of 0.0 produces no conditioning.

impl = algorithm implementation: impl="update" implies use covariance updating algorithm (usually faster); impl="naive" implies use of the straightforward algorithm. Used for regression (mode="regres") only. Ignored for classification (mode="class") which only uses the naive algorithm.

speed = speed/exactness trade-off parameter: increasing value increases speed of the algorithm by factor of speed, possibly decreasing fidelity to the exact path. Must be integer valued.

quiet = T/F implies do/don't print model summary at command line upon completion.

data.check = T/F implies do/don't check the input data for consistency.

Output:

gpsmod = GPS model object to be input to other GPS procedures (see below) for prediction and interpretation.

Printed output at the command line (quiet=F) summarizing the GPS bridge solution: the index of the estimated optimal generalized elastic net penalty, path point as expressed in terms of explainable risk on the training data (sel="no"), number of non zero coefficients, and the minimized value of the model selection criterion.

Examples:

# read in predictor data and convert to data matrix

# bridge least-squares regression

# read in response data and convert to a vector

reggps = gpsbridge(x, y)

# bridge least-squares regression: ten-fold cross-validation

reggps = gpsbridge(x, y, nfold=10)

# bridge logistic regression

# read in response data and convert to a data matrix

loggps = gpsbridge(x, y, mode="class")

# lasso logistic regression: ten-fold cross-validation

lasloggps= gpsbridge(x, y, mode=class, pens=1.0, nfold=10)

Comments:  The only  input parameter whose value might require care is stepsize. It regulates the size of the steps at each iteration of path construction. This in turn controls the number of iterations required to reach the end of the path. Larger values require fewer iterations thereby increasing speed. As the value of  stepsize approaches zero the GPS path approaches a smooth continuous curve in parameter space. Larger values produce a sequence of points that lie close to this path, where the degree of closeness (smoothness) depends on the value of stepsize and the number of non zero coefficients along the path. More non zero coefficients generally require smaller stepsize for the same smoothness. Accuracy is insensitive to stepsize provided it is not overly large. As a diagnostic one can view the coefficient paths with gpspaths to assess path smoothness.

The parameter maxstep controls the maximum number of steps. For sufficiently small stepsize, this number can be exceeded before the end of the path is reached. This is indicated by the warning message "end of path not reached". In this case, if the estimated optimal path point is close to the end of the generated path one can increase either the value of  maxstep or stepsize.

The parameter nfold controls the cross-validation procedure used for model selection. Setting it to a negative value uses a single test set drawn from the training data set of size nrow(x)/abs(nfold) for model evaluation. This is much faster than full cross-validation and can work well for large training data sets. For smaller training data sets, especially with a large number of predictor variables,  it can be unreliable and full cross-valation (nfold > 0) will likely produce better results.

Reference:

gpspred: predict using the GPS model

Usage:

yp = gpspred (gpsmod, xp, xmiss=9.9e35, quiet=T )

Required arguments:

gpsmod = GPS model object returned from gpsbridge.

xp = values of the  input variables for the observation(s) to be predicted. Must be a numeric vector (single observation) or  a numeric matrix (multiple observations).

Optional arguments:

xmiss = predictor variable missing value flag. Must be numeric and larger than any non missing predictor variable value in xp.  Predictor variable values greater than or equal to xmiss are regarded as missing. Predictor variable data values of NA are internally set to the value of xmiss and thereby regarded as missing.

quiet = T/F implies do/don't print messages at command line upon completion.

Output:

yp = vector of length nrow(xp) containing the output predictions for each of the observations.

Regression:  yp is used to predict the  response value(s).

Classification: yp is a numeric score representing the estimated log-odds of  observing the class represented by the first column of y input to gpspred. Corresponding probability estimates can be obtained by probs = 1.0/(1.0+exp(-yp)).

Examples:

yp = gpspred (reggps, xp)

probs=1.0/(1.0+exp(-gpspred (loggps, xp)))

gpssoln: obtain GPS solution

Usage:

soln = gpssoln (gpsmod, vars=1:20, ord="entry")

Required argument:

gpsmod = GPS model object returned from gpsbridge.

Optional arguments:

vars = vector specifying the variables whose solution coefficients are to be returned. Entries must be integer valued.

ord = flag specifying how the entries in vars are to be interpreted. The value ord="entry" references the order of entry along the path. The value ord="org" references the column numbers of the predictor matrix x.

Output: list

soln\$order = vector specifying the the variables  (data matrix columns) whose coefficient values are listed in soln\$coeffs, in the order of listing.

soln\$coeffs = vector of  coefficients for variables specified in vars listed in the order specified by ord.

soln\$intercept = solution intercept value.

Examples:

# get solution coefficient values for the first 20 variables that entered the model in entry order

soln = gpssoln (loggps)

# get solution coefficient values for the variables corresponding to the first 20 columns of the predictor data matrix

soln = gpssoln (loggps, ord="org")

gpspaths: plot solution coefficient values along regularization path

Usage:

gpspaths(gpsmod, vars=1:10, ord="entry", first=T, col="rainbow", title="GPS  paths", cex=0.9)

Required argument:

gpsmod = GPS model object returned from gpsbridge.

Optional arguments:

vars = vector specifying the variables whose solution coefficients are to be plotted. Entries must be integer valued.

ord = flag specifying how the entries in vars are to be interpreted. The value ord="entry" references the order of entry along the path. The value ord="org" references the column numbers of the predictor matrix x.

first = new plot flag. The value first=T specifies start a new plot. The value first=F specifies superimposing the paths over an existing plot.

title = title for plot

col = plotting color of the displayed lines (coefficient values along path).

cex= relative size of the numbers that identify the variables corresponding to each line (coefficient).

Output: plot showing the coefficient values specified in vars along the GPS generated path as a function of explainable risk on the training data (sel="no"), or as a function of the number of entered variables (sel="yes").

Examples:

# plot paths for the first 10 variables that entered the model

gpspaths(regmod)

# plot paths for the next 10 variables that entered the model

gpspaths(regmod, vars=11:20)

# plot paths of coefficients for selected variables (columns of predictor matrix)

gpspaths(logmod, vars=c(1,4,122,265), ord="org")

gpsmodel.info: print GPS model information

Usage:

gpsmodel.info (gpsmod, inputs=T)

Required argument:

gpsmod = GPS model object returned from gpsbridge.

Optional argument:

inputs = T/F implies do/don't print input parameter values that were used to create the model.

Output: Model information printed at the command line.

Example: gpsmodel.info (regmod)

gpsversion: print date and version number of current GPS installation

Usage:

gpsversion ()

arguments: none

Output:  date and version number printed at the command line.

Example: gpsversion ()

www@stat.stanford.edu