2010 FactorizationMachines

From GM-RKB

Subject Headings: Factorization Machines.

Notes

Cited By

Quotes

Keywords: factorization machine; sparse data; tensor factorization; support vector machine

Abstract

In this paper, we introduce Factorization Machines (FM), a new model class that combines the advantages of Support Vector Machines (SVM) with factorization models. Like SVMs, FMs are a general predictor working with any real valued feature vector. In contrast to SVMs, FMs model all interactions between variables using factorized parameters. Thus they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail. We show that the model equation of FMs can be calculated in linear time and thus FMs can be optimized directly. So, unlike nonlinear SVMs, a transformation into the dual form is not necessary and the model parameters can be estimated directly without the need for any support vectors in the solution. We show the relationship to SVMs and the advantages of FMs for parameter estimation in sparse settings.

On the other hand, there are many different factorization models like matrix factorization, parallel factor analysis, or specialized models like SVD++, PITF or FPMC. The drawback of these models is that they are not applicable to general prediction tasks but work only with special input data. Furthermore, their model equations and optimization algorithms are derived individually for each task. We show that FMs can mimic these models just by specifying the input data (i.e. the feature vectors). This makes FMs easily applicable even for users without expert knowledge in factorization models.

I. INTRODUCTION

Support Vector Machines are one of the most popular predictors in machine learning and data mining. Nevertheless, in settings like collaborative filtering, SVMs play no important role and the best models are either direct applications of standard matrix/tensor factorization models like PARAFAC [1] or specialized models using factorized parameters [2], [3], [4]. In this paper, we show that the only reason why standard SVM predictors are not successful in these tasks is that they cannot learn reliable parameters ('hyperplanes') in complex (non-linear) kernel spaces under very sparse data. On the other hand, the drawback of tensor factorization models, and even more so of specialized factorization models, is that (1) they are not applicable to standard prediction data (e.g. a real valued feature vector in $\mathbb{R}^n$) and (2) specialized models are usually derived individually for a specific task, requiring effort in modelling and the design of a learning algorithm.

In this paper, we introduce a new predictor, the Factorization Machine (FM), that is a general predictor like SVMs but is also able to estimate reliable parameters under very high sparsity. The factorization machine models all nested variable interactions (comparable to a polynomial kernel in SVM), but uses a factorized parametrization instead of a dense parametrization like in SVMs. We show that the model equation of FMs can be computed in linear time and that it depends only on a linear number of parameters. This allows direct optimization and storage of model parameters without the need of storing any training data (e.g. support vectors) for prediction. In contrast to this, non-linear SVMs are usually optimized in the dual form and computing a prediction (the model equation) depends on parts of the training data (the support vectors). We also show that FMs subsume many of the most successful approaches for the task of collaborative filtering including biased MF, SVD++ [2], PITF [3] and FPMC [4].

In total, the advantages of our proposed FM are:

1) FMs allow parameter estimation under very sparse data where SVMs fail.

2) FMs have linear complexity, can be optimized in the primal and, unlike SVMs, do not rely on support vectors. We show that FMs scale to large datasets like Netflix with 100 million training instances.

3) FMs are a general predictor that can work with any real valued feature vector. In contrast to this, other state-of-the-art factorization models work only on very restricted input data. We will show that just by defining the feature vectors of the input data, FMs can mimic state-of-the-art models like biased MF, SVD++, PITF or FPMC.

II. PREDICTION UNDER SPARSITY

The most common prediction task is to estimate a function $y: \mathbb{R}^n \to T$ from a real valued feature vector $\mathbf{x} \in \mathbb{R}^n$ to a target domain $T$ (e.g. $T = \mathbb{R}$ for regression or $T = \{+,-\}$ for classification). In supervised settings, it is assumed that a training dataset $D = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots\}$ of examples of the target function $y$ is given. We also investigate the ranking task where the function $y$ with target $T = \mathbb{R}$ can be used to score feature vectors $\mathbf{x}$ and sort them according to their score. Scoring functions can be learned with pairwise training data [5], where a feature tuple $(\mathbf{x}^{(A)}, \mathbf{x}^{(B)}) \in D$ means that $\mathbf{x}^{(A)}$ should be ranked higher than $\mathbf{x}^{(B)}$. As the pairwise ranking relation is antisymmetric, it is sufficient to use only positive training instances.
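As a small illustration (not from the paper) of the pairwise ranking setup just described, the sketch below scores two feature vectors with a hypothetical linear scorer and penalizes rank violations with a pairwise logistic loss; the scorer, names and data are assumptions for the example only.

```python
import numpy as np

# A tuple (x_a, x_b) in D means x_a should be ranked higher than x_b,
# so we penalize the score difference with a pairwise logistic loss.
def pairwise_logistic_loss(score_a, score_b):
    return np.log1p(np.exp(-(score_a - score_b)))

# Hypothetical linear scoring function y(x) = w @ x for illustration.
w = np.array([0.5, -0.2, 0.1])
x_a = np.array([1.0, 0.0, 2.0])
x_b = np.array([0.0, 1.0, 1.0])
loss = pairwise_logistic_loss(w @ x_a, w @ x_b)  # small if x_a outranks x_b
print(loss)
```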

In this paper, we deal with problems where $\mathbf{x}$ is highly sparse, i.e. almost all of the elements $x_i$ of a vector $\mathbf{x}$ are zero. Let $m(\mathbf{x})$ be the number of non-zero elements in the feature vector $\mathbf{x}$ and $\bar{m}_D$ be the average number of non-zero elements $m(\mathbf{x})$ over all vectors $\mathbf{x} \in D$. Huge sparsity ($\bar{m}_D \ll n$) appears in much real-world data, such as feature vectors of event transactions (e.g. purchases in recommender systems) or text analysis (e.g. the bag-of-words approach). One reason for huge sparsity is that the underlying problem deals with large categorical variable domains.
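To make these sparsity statistics concrete, here is a minimal sketch (not from the paper; the matrix and names are hypothetical) that computes $m(\mathbf{x})$ per row and the average $\bar{m}_D$ for a sparse design matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical sparse design matrix: one row per feature vector x, n columns.
X = csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.3, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.7],
    [1.0, 0.0, 0.5, 0.0, 0.0],
]))

m_x = X.getnnz(axis=1)   # m(x): number of non-zero elements in each row
m_D = m_x.mean()         # average number of non-zeros over all vectors in D
n = X.shape[1]
print(m_x, m_D, n)       # huge sparsity means m_D is much smaller than n
```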

Example 1. Assume we have the transaction data of a movie review system. The system records which user $u \in U$ rates a movie (item) $i \in I$ at a certain time $t \in \mathbb{R}$ with a rating $r \in \{1, 2, 3, 4, 5\}$. Let the users $U$ and items $I$ be:

U = {Alice (A), Bob (B), Charlie (C), . . .}
I = {Titanic (TI), Notting Hill (NH), Star Wars (SW), Star Trek (ST), ...}

Let the observed data S be:

S = {(A, TI, 2010-1, 5), (A, NH, 2010-2, 3), (A, SW, 2010-4, 1), (B, SW, 2009-5, 4), (B, ST, 2009-8, 5), (C, TI, 2009-9, 1), (C, SW, 2009-12, 5)}

An example of a prediction task using this data is to estimate a function $\hat{y}$ that predicts the rating behaviour of a user for an item at a certain point in time.

Figure 1 shows one example of how feature vectors can be created from $S$ for this task. To simplify readability, we will use categorical levels (e.g. Alice (A)) instead of numbers (e.g. 1) to identify elements in vectors wherever it makes sense (e.g. we write $x_A$ or $x_{\text{Alice}}$ instead of $x_1$). Here, first there are $|U|$ binary indicator variables (blue) that represent the active user of a transaction – there is always exactly one active user in each transaction $(u, i, t, r) \in S$, e.g. user Alice in the first one ($x^{(1)}_A = 1$). The next $|I|$ binary indicator variables (red) hold the active item – again there is always exactly one active item (e.g. $x^{(1)}_{TI} = 1$). The feature vectors in figure 1 also contain indicator variables (yellow) for all the other movies the user has ever rated. For each user, these variables are normalized such that they sum up to 1; e.g. Alice has rated Titanic, Notting Hill and Star Wars. Additionally, the example contains a variable (green) holding the time in months starting from January 2009. Finally, the vector contains information about the last movie (brown) the user has rated before (s)he rated the active one – e.g. for $\mathbf{x}^{(2)}$, Alice rated Titanic before she rated Notting Hill. In section V, we show how factorization machines using such feature vectors as input data are related to specialized state-of-the-art factorization models.
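As an illustration of this construction, the rough sketch below (a hypothetical helper, not the paper's code) builds feature vectors in the style of Figure 1 from transactions $(u, i, t, r)$, with blocks for the active user, the active item, the normalized set of all movies the user rated, the time in months, and the last rated movie:

```python
from collections import defaultdict

import numpy as np

def encode(transactions, users, items):
    """Rough sketch of the Figure 1 encoding (illustration only).
    transactions: list of (user, item, month_index, rating) tuples."""
    u_idx = {u: j for j, u in enumerate(users)}
    i_idx = {m: j for j, m in enumerate(items)}
    rated = defaultdict(list)                      # every item each user has rated
    for u, m, t, r in transactions:
        rated[u].append(m)

    n = len(users) + 3 * len(items) + 1            # user | item | other rated | time | last item
    X, y, last = [], [], {}
    for u, m, t, r in sorted(transactions, key=lambda s: s[2]):
        x = np.zeros(n)
        x[u_idx[u]] = 1.0                          # active user (blue block)
        x[len(users) + i_idx[m]] = 1.0             # active item (red block)
        for other in rated[u]:                     # all movies the user rated, normalized (yellow)
            x[len(users) + len(items) + i_idx[other]] = 1.0 / len(rated[u])
        x[len(users) + 2 * len(items)] = float(t)  # time in months (green)
        if u in last:                              # last movie rated before this one (brown)
            x[len(users) + 2 * len(items) + 1 + i_idx[last[u]]] = 1.0
        last[u] = m
        X.append(x)
        y.append(float(r))
    return np.array(X), np.array(y)

# Usage with the data of Example 1 (month index counted from January 2009):
users = ["A", "B", "C"]
items = ["TI", "NH", "SW", "ST"]
S = [("A", "TI", 13, 5), ("A", "NH", 14, 3), ("A", "SW", 16, 1),
     ("B", "SW", 5, 4), ("B", "ST", 8, 5),
     ("C", "TI", 9, 1), ("C", "SW", 12, 5)]
X, y = encode(S, users, items)
```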

We will use this example data throughout the paper for illustration. However, please note that FMs are general predictors like SVMs and thus are applicable to any real valued feature vectors and are not restricted to recommender systems.

Fig. 1. Example for sparse real valued feature vectors $\mathbf{x}$ that are created from the transactions of Example 1. Every row represents a feature vector $\mathbf{x}^{(i)}$ with its corresponding target $y^{(i)}$. The first 4 columns (blue) represent indicator variables for the active user; the next 5 (red) indicator variables for the active item. The next 5 columns (yellow) hold additional implicit indicators (i.e. other movies the user has rated). One feature (green) represents the time in months. The last 5 columns (brown) have indicators for the last movie the user has rated before the active one. The rightmost column is the target – here the rating.

III. FACTORIZATION MACHINES (FM)

In this section, we introduce factorization machines. We discuss the model equation in detail and briefly show how to apply FMs to several prediction tasks.

A. Factorization Machine Model

1) Model Equation: The model equation for a factorization machine of degree d = 2 is defined as:

$$\hat{y}(\mathbf{x}) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j \qquad (1)$$

where the model parameters that have to be estimated are:

$$w_0 \in \mathbb{R}, \quad \mathbf{w} \in \mathbb{R}^n, \quad \mathbf{V} \in \mathbb{R}^{n \times k} \qquad (2)$$

and $\langle \cdot, \cdot \rangle$ is the dot product of two vectors of size $k$:

$$\langle \mathbf{v}_i, \mathbf{v}_j \rangle := \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f} \qquad (3)$$

A row $\mathbf{v}_i$ within $\mathbf{V}$ describes the $i$-th variable with $k$ factors. $k \in \mathbb{N}_0^+$ is a hyperparameter that defines the dimensionality of the factorization.

A 2-way FM (degree d = 2) captures all single and pairwise interactions between variables:
• $w_0$ is the global bias.
• $w_i$ models the strength of the $i$-th variable.
• $\hat{w}_{i,j} := \langle \mathbf{v}_i, \mathbf{v}_j \rangle$ models the interaction between the $i$-th and $j$-th variable. Instead of using its own model parameter $w_{i,j} \in \mathbb{R}$ for each interaction, the FM models the interaction by factorizing it. We will see later on that this is the key point which allows high quality parameter estimates of higher-order interactions ($d \geq 2$) under sparsity.
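Read literally, eq. (1) can be evaluated with a double loop over all pairs. The short sketch below (hypothetical names, not the paper's code) does exactly that, at $O(k\,n^2)$ cost; the linear-time reformulation is shown later in this section:

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Evaluate eq. (1) literally: global bias, linear terms, and all
    pairwise interactions weighted by <v_i, v_j>. Cost is O(k * n^2)."""
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += np.dot(V[i], V[j]) * x[i] * x[j]
    return y

# Tiny usage example with random parameters (n = 5 features, k = 3 factors).
rng = np.random.default_rng(0)
n, k = 5, 3
x = np.array([1.0, 0.0, 0.0, 0.3, 0.0])   # a sparse feature vector
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
print(fm_predict_naive(x, w0, w, V))
```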

2) Expressiveness: It is well known that for any positive definite matrix $\mathbf{W}$, there exists a matrix $\mathbf{V}$ such that $\mathbf{W} = \mathbf{V} \cdot \mathbf{V}^t$, provided that $k$ is sufficiently large. This shows that an FM can express any interaction matrix $\mathbf{W}$ if $k$ is chosen large enough. Nevertheless, in sparse settings, typically a small $k$ should be chosen because there is not enough data to estimate complex interactions $\mathbf{W}$. Restricting $k$ – and thus the expressiveness of the FM – leads to better generalization and thus improved interaction matrices under sparsity.
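The $\mathbf{W} = \mathbf{V} \cdot \mathbf{V}^t$ claim can be checked numerically on a small example; the sketch below (illustration only, not from the paper) uses a Cholesky factor as $\mathbf{V}$ with $k = n$:

```python
import numpy as np

# Any positive definite W can be written as V @ V.T when k is large enough
# (here k = n, via the Cholesky factor of W).
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
W = A @ A.T + 4 * np.eye(4)      # a positive definite interaction matrix
V = np.linalg.cholesky(W)        # V with k = n factors
print(np.allclose(W, V @ V.T))   # True: an FM with k = n can express this W
```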

3) Parameter Estimation Under Sparsity: In sparse settings, there is usually not enough data to estimate interactions between variables directly and independently. Factorization machines can estimate interactions well even in these settings because they break the independence of the interaction parameters by factorizing them. In general this means that the data for one interaction also helps to estimate the parameters for related interactions. We will make the idea more clear with an example from the data in figure 1. Assume we want to estimate the interaction between Alice (A) and Star Trek (ST) for predicting the target $y$ (here the rating). Obviously, there is no case $\mathbf{x}$ in the training data where both variables $x_A$ and $x_{ST}$ are non-zero, and thus a direct estimate would lead to no interaction ($w_{A,ST} = 0$). But with the factorized interaction parameters $\langle \mathbf{v}_A, \mathbf{v}_{ST} \rangle$ we can estimate the interaction even in this case. First of all, Bob and Charlie will have similar factor vectors $\mathbf{v}_B$ and $\mathbf{v}_C$ because both have similar interactions with Star Wars ($\mathbf{v}_{SW}$) for predicting ratings – i.e. $\langle \mathbf{v}_B, \mathbf{v}_{SW} \rangle$ and $\langle \mathbf{v}_C, \mathbf{v}_{SW} \rangle$ have to be similar. Alice ($\mathbf{v}_A$) will have a different factor vector from Charlie ($\mathbf{v}_C$) because she has different interactions with the factors of Titanic and Star Wars for predicting ratings. Next, the factor vector of Star Trek is likely to be similar to the one of Star Wars because Bob has similar interactions for both movies for predicting $y$. In total, this means that the dot product (i.e. the interaction) of the factor vectors of Alice and Star Trek will be similar to the one of Alice and Star Wars – which also makes intuitive sense.

4) Computation: Next, we show how to make FMs applicable from a computational point of view. The complexity of straightforward computation of eq. (1) is in $O(k\,n^2)$ because all pairwise interactions have to be computed. But by reformulating it, the complexity drops to linear runtime.

Lemma 3.1: The model equation of a factorization machine (eq. (1)) can be computed in linear time $O(k\,n)$.

Proof: Due to the factorization of the pairwise interactions, there is no model parameter that directly depends on two variables (e.g. a parameter with an index $(i, j)$). So the pairwise interactions can be reformulated:

$$\begin{aligned}
\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j
&= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j - \frac{1}{2} \sum_{i=1}^{n} \langle \mathbf{v}_i, \mathbf{v}_i \rangle\, x_i x_i \\
&= \frac{1}{2} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{f=1}^{k} v_{i,f}\, v_{j,f}\, x_i x_j - \sum_{i=1}^{n} \sum_{f=1}^{k} v_{i,f}\, v_{i,f}\, x_i x_i \right) \\
&= \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right) \left( \sum_{j=1}^{n} v_{j,f}\, x_j \right) - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2 \right) \\
&= \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2 \right)
\end{aligned}$$

This equation has only linear complexity in both $k$ and $n$ – i.e. its computation is in $O(k\,n)$.

Moreover, under sparsity most of the elements in $\mathbf{x}$ are 0 (i.e. $m(\mathbf{x})$ is small) and thus the sums only have to be computed over the non-zero elements. Thus in sparse applications, the computation of the factorization machine is in $O(k\,\bar{m}_D)$ – e.g. $\bar{m}_D = 2$ for typical recommender systems like MF approaches (see section V-A).

B. Factorization Machines as Predictors

FMs can be applied to a variety of prediction tasks. Among them are:

• Regression: $\hat{y}(\mathbf{x})$ can be used directly as the predictor and the optimization criterion is e.g. the minimal least square error on $D$.
• Binary classification: the sign of $\hat{y}(\mathbf{x})$ is used and the parameters are optimized for hinge loss or logit loss.
• Ranking: the vectors $\mathbf{x}$ are ordered by the score of $\hat{y}(\mathbf{x})$ and optimization is done over pairs of instance vectors $(\mathbf{x}^{(a)}, \mathbf{x}^{(b)}) \in D$ with a pairwise classification loss (e.g. like in [5]).

In all these cases, regularization terms like L2 are usually added to the optimization objective to prevent overfitting.

C. Learning Factorization Machines

As we have shown, FMs have a closed model equation that can be computed in linear time. Thus, the model parameters ($w_0$, $\mathbf{w}$ and $\mathbf{V}$) of FMs can be learned efficiently by gradient descent methods – e.g. stochastic gradient descent (SGD) – for a variety of losses, among them square, logit or hinge loss. The gradient of the FM model is:

$$\frac{\partial}{\partial \theta}\, \hat{y}(\mathbf{x}) = \begin{cases} 1, & \text{if } \theta \text{ is } w_0 \\ x_i, & \text{if } \theta \text{ is } w_i \\ x_i \sum_{j=1}^{n} v_{j,f}\, x_j - v_{i,f}\, x_i^2, & \text{if } \theta \text{ is } v_{i,f} \end{cases} \qquad (4)$$

The sum $\sum_{j=1}^{n} v_{j,f}\, x_j$ is independent of $i$ and thus can be precomputed (e.g. when computing $\hat{y}(\mathbf{x})$). In general, each gradient can be computed in constant time $O(1)$, and all parameter updates for a case $(\mathbf{x}, y)$ can be done in $O(k\,n)$ – or $O(k\,m(\mathbf{x}))$ under sparsity. We provide a generic implementation, LIBFM (http://www.libfm.org), that uses SGD and supports both element-wise and pairwise losses.
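The reformulation of lemma 3.1 and the gradient in eq. (4) map almost directly onto array operations. The following sketch (hypothetical names, not the LIBFM implementation) computes $\hat{y}(\mathbf{x})$ in $O(k\,n)$ and performs one SGD update for squared loss with L2 regularization:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Linear-time FM prediction via the reformulation of lemma 3.1.
    Also returns the per-factor sums q_f = sum_i v_{i,f} x_i for reuse."""
    q = V.T @ x                                   # shape (k,), reused by the gradient
    interactions = 0.5 * np.sum(q ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + w @ x + interactions, q

def sgd_step(x, y, w0, w, V, lr=0.01, reg=0.01):
    """One SGD update for squared loss with L2 regularization,
    using the FM gradients of eq. (4). Illustration only."""
    y_hat, q = fm_predict(x, w0, w, V)
    err = y_hat - y                               # derivative of 0.5*(y_hat - y)^2 w.r.t. y_hat
    w0 -= lr * err                                # gradient of y_hat w.r.t. w0 is 1
    w -= lr * (err * x + reg * w)                 # gradient of y_hat w.r.t. w_i is x_i
    grad_V = err * (np.outer(x, q) - V * (x ** 2)[:, None])  # x_i*q_f - v_{i,f}*x_i^2
    V -= lr * (grad_V + reg * V)
    return w0, w, V

# Tiny usage example: n = 5 features, k = 3 factors, one training case (x, y).
rng = np.random.default_rng(0)
n, k = 5, 3
w0, w, V = 0.0, np.zeros(n), 0.1 * rng.normal(size=(n, k))
x, y = np.array([1.0, 0.0, 0.0, 0.3, 0.0]), 5.0
for _ in range(10):
    w0, w, V = sgd_step(x, y, w0, w, V)
```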

D. d-way Factorization Machine

The 2-way FM described so far can easily be generalized to a d-way FM:

$$\hat{y}(\mathbf{x}) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{l=2}^{d} \sum_{i_1=1}^{n} \cdots \sum_{i_l = i_{l-1}+1}^{n} \left( \prod_{j=1}^{l} x_{i_j} \right) \left( \sum_{f=1}^{k_l} \prod_{j=1}^{l} v_{i_j,f}^{(l)} \right) \qquad (5)$$

where the interaction parameters for the $l$-th interaction are factorized by the PARAFAC model [1] with the model parameters:

$$\mathbf{V}^{(l)} \in \mathbb{R}^{n \times k_l}, \quad k_l \in \mathbb{N}_0^+ \qquad (6)$$

The straightforward complexity for computing eq. (5) is $O(k_d\, n^d)$. But with the same arguments as in lemma 3.1, one can show that it can be computed in linear time.

E. Summary

FMs model all possible interactions between values in the feature vector $\mathbf{x}$ using factorized interactions instead of fully parametrized ones. This has two main advantages:

1) The interactions between values can be estimated even under high sparsity. In particular, it is possible to generalize to unobserved interactions.

2) The number of parameters as well as the time for prediction and learning is linear. This makes direct optimization using SGD feasible and allows optimizing against a variety of loss functions.

In the remainder of this paper, we will show the relationships between factorization machines and support vector machines as well as matrix, tensor and specialized factorization models.

IV. FMS VS. SVMS

A. SVM Model

The model equation of an SVM [6] can be expressed as the dot product between the transformed input $\mathbf{x}$ and model parameters $\mathbf{w}$: $\hat{y}(\mathbf{x}) = \langle \phi(\mathbf{x}), \mathbf{w} \rangle$, where $\phi$ is a mapping from the feature space $\mathbb{R}^n$ into a more complex space $\mathcal{F}$. The mapping $\phi$ is related to the kernel with: …

References


Steffen Rendle (2010). "Factorization Machines." In: Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM 2010). doi:10.1109/ICDM.2010.127