Two elements of sparse logistic regression (SLR): (a) the (multinomial) logistic regression model and (b) automatic relevance determination (ARD). (a) Each class, or label, has its own discriminant function, which computes the inner product of that label's weight parameter vector (θ) and an input feature vector (x). The softmax function transforms the outputs of the discriminant functions into the probability of observing each label, and the label with the maximum probability is chosen as the output label. Binary logistic regression differs slightly from the multinomial case: the probability can be calculated by the logistic transformation of a single discriminant function that separates the two classes (corresponding to (θ₁ − θ₂)ᵗx). SLR uses this conventional model for (multinomial) logistic regression, but estimates the weight parameters with a novel algorithm based on automatic relevance determination. (b) SLR treats the weight parameters as random variables with prior distributions. The prior of each parameter θᵢ is assumed to be a Gaussian distribution with mean 0. The precision (inverse variance) of this Gaussian distribution is regarded as a hyper-parameter αᵢ, called a relevance parameter, whose hyper-prior is a gamma distribution. The relevance parameter controls the range of the corresponding weight parameter. If the relevance parameter is large, the prior probability is sharply peaked at zero (left panel), and thus the estimated weight parameter tends to be biased toward zero even after observation. If, on the other hand, the relevance parameter is small, the prior probability is broadly distributed (right panel), and thus the estimated weight parameter can take a large value after observation. As our iterative algorithm computes the posterior distributions of the model, most relevance parameters diverge to infinity.
Thus, the corresponding weight parameters become effectively zero and can be pruned from the model. This process of determining the relevance of parameters is called ARD. For details of the algorithm, see Appendix A.
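
The two elements described in this caption can be illustrated with a minimal numerical sketch. It is not the SLR algorithm of Appendix A. It only checks that, for two labels, the softmax output equals the logistic transformation of (θ₁ − θ₂)ᵗx, and it shows how the precision of a zero-mean Gaussian prior shapes the prior and shrinks an estimate. The feature values, weights, observations, and all function names are hypothetical. The shrinkage example uses a toy Gaussian likelihood with unit noise variance as a stand-in for the logistic likelihood, since that case has a closed-form posterior mean.

```python
import numpy as np

def softmax(z):
    """Map discriminant-function outputs to label probabilities."""
    z = z - np.max(z)          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def logistic(a):
    """Logistic transformation of a single discriminant value."""
    return 1.0 / (1.0 + np.exp(-a))

def gaussian_prior_pdf(theta_i, alpha):
    """Density of a zero-mean Gaussian prior with precision alpha."""
    return np.sqrt(alpha / (2.0 * np.pi)) * np.exp(-0.5 * alpha * theta_i ** 2)

def toy_posterior_mean(y, alpha):
    """Posterior mean of theta for y_n ~ N(theta, 1) and theta ~ N(0, 1/alpha).

    Assumption: a Gaussian likelihood replaces the logistic one so that the
    posterior is available in closed form; this is not the SLR update.
    """
    return y.sum() / (len(y) + alpha)

# (a) Two labels: softmax over two discriminant functions versus the
# logistic transformation of their difference.
rng = np.random.default_rng(0)
x = rng.normal(size=5)             # illustrative input feature vector
theta = rng.normal(size=(2, 5))    # one illustrative weight vector per label

p_softmax = softmax(theta @ x)                    # probabilities of both labels
p_logistic = logistic((theta[0] - theta[1]) @ x)  # probability of label 1
predicted_label = int(np.argmax(p_softmax))       # label with maximum probability

# (b) Effect of the relevance parameter (precision) on the prior and on
# the estimate after observation.
peak_large = gaussian_prior_pdf(0.0, alpha=100.0)  # sharply peaked at zero
peak_small = gaussian_prior_pdf(0.0, alpha=0.01)   # broadly distributed

y = np.array([1.8, 2.1, 2.4])                      # illustrative observations
shrunk_large = toy_posterior_mean(y, alpha=100.0)  # pulled strongly toward zero
shrunk_small = toy_posterior_mean(y, alpha=0.01)   # close to the sample mean

print(p_softmax[0], p_logistic)
print(shrunk_large, shrunk_small)
```

In this toy setting the estimate tends to zero as the precision grows, which mirrors the pruning behaviour described above for relevance parameters that diverge to infinity.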