# Difference between revisions of "2016 FieldAwareFactorizationMachines"

Subject Headings: Field-Aware Factorization Machines (FFMs), CTR Prediction Algorithm, CTR Prediction.

## Quotes

### Author Keywords

Machine learning; Click-through rate prediction; Computational advertising; Factorization machines


### Abstract

Click-through rate (CTR) prediction plays an important role in computational advertising. Models based on degree-2 polynomial mappings and factorization machines (FMs) are widely used for this task. Recently, a variant of FMs, field-aware factorization machines (FFMs), outperforms existing models in some world-wide CTR-prediction competitions. Based on our experiences in winning two of them, in this paper we establish FFMs as an effective method for classifying large sparse data including those from CTR prediction. First, we propose efficient implementations for training FFMs. Then we comprehensively analyze FFMs and compare this approach with competing models. Experiments show that FFMs are very useful for certain classification problems. Finally, we have released a package of FFMs for public use.

### 1. INTRODUCTION

Click-through rate (CTR) prediction plays an important role in advertising industry [ Chappele et al., 2015, McMahan et al., 2013, Richardson et al., 2007 ]. Logistic regression is probably the most widely used model for this task (Richardson et al., 2007). Given a data set with m instances (yi; xi); i = 1;:::; m, where yi is the label and x_i is an n-dimensional feature vector, the model w is obtained by solving the following optimization problem.

min w � 2 kwk22 + Xm i=1 log(1 + exp(􀀀yi�LM(w; xi))): (1)

         Publisher Advertiser
+80 􀀀20 ESPN Nike
+10 􀀀90 ESPN Gucci
+15 􀀀85 Vogue Nike
+90 􀀀10 Vogue Gucci
+85 􀀀15 NBC Nike
+0 􀀀0 NBC Gucci

Table 1: An arti�cial CTR data set, where + (􀀀) represents the number of clicked (unclicked) impressions.

In problem (1), � is the regularization parameter, and in the loss function we consider the linear model:

�LM(w; x) = w � x:


Learning the e�ect of feature conjunctions seems to be crucial for CTR prediction; see, for example, [1]. Here, we consider an arti�cial data set in Table 1 to have a better understanding of feature conjunctions. An ad from Gucci has a particularly high CTR on Vogue. This information is however di�cult for linear models to learn because they learn the two weights Gucci and Vogue separately. To address this problem, two models have been used to learn the e�ect of feature conjunction. The first model, degree-2 polynomial mappings (Poly2) [4, 5], learns a dedicate weight for each feature conjunction. The second model, factorization machines (FMs) [6], learns the e�ect of feature conjunction by factorizing it into a product of two latent vectors. We will discuss details about Poly2 and FMs in Section 2.

A variant of FM called [[pairwise interaction tensor factorization (PITF)]] [7] was proposed for personalized tag recommendation. In KDD Cup 2012, a generalization of PITF called "factor model" was proposed by "Team Opera Solutions" [8]. Because this term is too general and may easily be confused with factorization machines, we refer to it as "field-aware factorization machines" (FFMs) in this paper. The difference between PITF and FFM is that PITF considers three special �elds including \user,"\item," and \tag," while FFM is more general. Because [8] is about the overall solution for the competition, its discussion of FFM is limited. We can conclude the following results in [8]:

1. They use stochastic gradient method (SG) to solve the optimization problem. To avoid over-fitting, they only train with one epoch.
2. FFM performs the best among six models they tried.

In this paper, we aim to concretely establish FFM as an e�ective approach for CTR prediction. Our major results are as follows.

• Though FFM is shown to be e�ective in [8], this work

may be the only published study of applying FFMs on CTR prediction problems. To further demonstrate the e�ectiveness of FFMs on CTR prediction, we present the use of FFM as our major model to win two world-wide CTR competitions hosted by Criteo and Avazu.

• We compare FFMs with two related models, Poly2 and

FMs. We first discuss conceptually why FFMs might be better than them, and conduct experiments to see the difference in terms of accuracy and training time.

• We present techniques for training FFMs. They include

an e�ective parallel optimization algorithm for FFMs and the use of early-stopping to avoid over-fitting.

• To make FFMs available for public use, we release an open

source software.

This paper is organized as follows. Before we present FFMs and its implementation in Section 3, we discuss the two existing models Poly2 and FMs in Section 2. Experiments comparing FFMs with other models are in Section 4.

Finally, conclusions and future directions are in Section 5.

Code used for experiments in this paper and the package LIBFFM are respectively available at:

http://www.csie.ntu.edu.tw/~cjlin/�m/exps
http://www.csie.ntu.edu.tw/~cjlin/lib�m

### 2. POLY2 AND FM

Chang et. al [4] have shown that a degree-2 polynomial mapping can often e�ectively capture the information of feature conjunctions. Further, they show that by applying a linear model on the explicit form of degree-2 mappings, the training and test time can be much faster than using kernel methods. This approach, referred to as Poly2, learns a weight for each feature pair:

�Poly2(w; x) = Xn j1=1 Xn j2=j1+1 wh(j1;j2)xj1xj2 ; (2)

where h(j1; j2) is a function encoding j1 and j2 into a natural number. The complexity of computing (2) is O(�n2), where �n is the average number of non-zero elements per instance. FMs proposed in [6] implicitly learn a latent vector for each feature. Each latent vector contains k latent factors, where k is a user-speciffed parameter. Then, the e�ect of feature conjunction is modelled by the inner product of two latent vectors:

�FM(w; x) = Xn j1=1 Xn j2=j1+1 (wj1 � wj2 )xj1xj2 : (3)

The number of variables is n � k, so directly computing (3) costs O(�n2k) time. Following [6], by re-writing (3) to

�FM(w; x) = 1 2 Xn j=1 (s 􀀀 wjxj) � wjxj ; where s = Xn j0=1 wj0xj0 ; the complexity is reduced to O(�nk).

Rendle [6] explains why FMs can be better than Poly2 when the data set is sparse. Here we give a similar illustration using the data set in Table 1. For example, there is only one negative training data for the pair (ESPN, Adidas). For Poly2, a very negative weight wESPN;Adidas might be learned for this pair. For FMs, because the prediction of (ESPN, Adidas) is determined by wESPN � wAdidas, and because wESPN and wAdidas are also learned from other pairs (e.g., (ESPN, Nike), (NBC, Adidas)), the prediction may be more accurate. Another example is that there is no training data for the pair (NBC, Gucci). For Poly2, the prediction on this pair is trivial, but for FMs, because wNBC and wGucci can be learned from other pairs, it is still possible to do meaningful prediction.

Note that in Poly2, the naive way to implement h(j1; j2) is to consider every pair of features as a new feature [4].1 This approach requires the model as large as O(n2), which is usually impractical for CTR prediction because of very large n. VowpalWabbit (VW) [9], a widely used machine learning package, solves this problem by hashing j1 and j2.2 Our implementation is similar to VW's approach. Speci�cally, h(j1; j2) = ( 1 2 (j1 + j2)(j1 + j2 + 1) + j2) mod B; where the model size B is a user-speciffed parameter. In this paper, for the simplicity of formulations, we do not include linear terms and bias term. However, in Section 4, we include them for some experiments.

### 3. FFM

The idea of FFM originates from PITF [7] proposed for recommender systems with personalized tags. In PITF, they assume three available �elds including User, Item, and Tag, and factorize (User, Item), (User, Tag), and (Item, Tag) in separate latent spaces. In [8], they generalize PITF for more �elds (e.g., AdID, AdvertiserID, UserID, QueryID) and effectively apply it on CTR prediction. Because [7] aims at recommender systems and is limited to three speci�c �elds (User, Item, and Tag), and [8] lacks detailed discussion on FFM, in this section we provide a more comprehensive study of FFMs on CTR prediction. For most CTR data sets like that in Table 1, \features" can be grouped into \�elds." In our example, three features ESPN, Vogue, and NBC, belong to the �eld Publisher, and the other three features Nike, Gucci, and Adidas, belong to the �eld Advertiser. FFM is a variant of FM that utilizes this information. To explain how FFM works, we consider the following new example:

Clicked Publisher (P) Advertiser (A) Gender (G) Yes ESPN Nike Male Recall that for FMs, �FM(w; x) is wESPN � wNike + wESPN � wMale + wNike � wMale:

1More precisely, [4] includes the original features as well, though we do not consider such a setting until the experiments.

2See http://github.com/JohnLangford/vowpal wabbit/wiki/Feature-interactions for details.

1. variables complexity

LM n O(�n) Poly2 B O(�n2) FM nk O(�nk) FFM nfk O(�n2k)

Table 2: Comparison of the number of variables and the complexity for prediction among LM, Poly2, FM, and FFM. In FMs, every feature has only one latent vector to learn the latent e�ect with any other features. Take ESPN as an example, wESPN is used to learn the latent e�ect with Nike (wESPN � wNike) and Male (wESPN � wMale). However, because Nike and Male belong to different �elds, the latent e�ects of (EPSN, Nike) and (EPSN, Male) may be different. In FFMs, each feature has several latent vectors. Depending on the �eld of other features, one of them is used to do the inner product. In our example, �FFM(w; x) is wESPN;A � wNike;P + wESPN;G � wMale;P + wNike;G � wMale;A: We see that to learn the latent e�ect of (ESPN, NIKE), wESPN;A is used because Nike belongs to the �eld Advertiser, and wNike;P is used because ESPN belongs to the �eld Publisher. Again, to learn the latent e�ect of (EPSN, Male), wESPN;G is used because Male belongs to the �eld Gender, and wMale;P is used because ESPN belongs to the �eld Publisher. Mathematically,

�FFM(w; x) = Xn j1=1 Xn j2=j1+1 (wj1;f2 � wj2;f1 )xj1xj2 ; (4)

where f1 and f2 are respectively the �elds of j1 and j2. If f is the number of �elds, then the number of variables of FFMs is nfk, and the complexity to compute (4) is O(�n2k). It is worth noting that in FFMs because each latent vector only needs to learn the e�ect with a speci�c �eld, usually kFFM � kFM:

Table 2 compares the number of variables and the computational complexity of different models.

3.1 Solving the Optimization Problem

The optimization problem is the same as (1) except that �LM(w; x) is replaced by �FFM(w; x). Following [7, 8], we use stochastic gradient methods (SG). Recently, some adaptive learning-rate schedules such as [10, 11] have been proposed to boost the training process of SG. We use AdaGrad [10] because [12] has shown its e�ectiveness on matrix factorization, which is a special case of FFMs.

At each step of SG a data point (y; x) is sampled for updating wj1;f2 and wj2;f1 in (4). Note that because x is highly sparse in our application, we only update dimensions with non-zero values. First, the sub-gradients are gj1;f2 � rwj1;f2 f(w) = � � wj1;f2 + � � wj2;f1xj1xj2 ; (5) gj2;f1 � rwj2;f1 f(w) = � � wj2;f1 + � � wj1;f2xj1xj2 ; (6) where � = @log(1 + exp(􀀀y�FFM(w; x))) @�FFM(w; x) = 􀀀y 1 + exp(y�FFM(w; x))

Algorithm 1 Training FFM using SG

1: Let G 2 Rn�f�k be a tensor of all ones 2: Run the following loop for t epochs 3: for i 2 f1; � � � ;mg do 4: Sample a data point (y; x) 5: caclulate � 6: for j1 2 non-zero terms in f1; � � � ; ng do 7: for j2 2 non-zero terms in fj1 + 1; � � � ; ng do 8: calculate sub-gradient by (5) and (6) 9: for d 2 f1; � � � ; kg do 10: Update the gradient sum by (7) and (8) 11: Update model by (9) and (10)

Second, for each coordinate d = 1; : : : ; k, the sum of squared gradient is accumulated:

(Gj1;f2 )d (Gj1;f2 )d + (gj1;f2 )2 d (7) (Gj2;f1 )d (Gj2;f1 )d + (gj2;f1 )2 d (8) Finally, (wj1;f2 )d and (wj2;f1 )d are updated by: (wj1;f2 )d (wj1;f2 )d 􀀀 � p (Gj1;f2 )d (gj1;f2 )d (9) (wj2;f1 )d (wj2;f1 )d 􀀀 � p (Gj2;f1 )d (gj2;f1 )d; (10) where � is a user-speciffed learning rate. The initial values of w are randomly sampled from a uniform distribution between [0; 1 = p k]. The initial values of G are set to one in order to prevent a large value of (Gj1;f2 ) 􀀀1 2 d . The overall procedure is presented in Algorithm 1.

Empirically, we find that normalizing each instance to have the unit length makes the test accuracy slightly better and insensitive to parameters.

3.2 Parallelization on Shared-memory Systems

Modern computers are widely equipped with multi-core CPUs. If these cores are fully utilized, the training time can be signi�cantly reduced. Many parallelization approaches for SG have been proposed. In this paper, we apply Hog- wild! [13], which allows each thread to run independently without any locking. Speci�cally, the for loop at line 3 of Algorithm 1 is parallelized. In Section 4.4 we run extensive experiments to investigate the e�ectiveness of parallelization.

Consider the widely used LIBSVM data format: label feat1:val1 feat2:val2 � � � , where each (feat, val) pair indicates feature index and value. For FFMs, we extend the above format to label �eld1:feat1:val1 �eld2:feat2:val2 � � � That is, we must assign the corresponding �eld to each feature. The assignment is easy on some kinds of features, but may not be possible for some others. We discuss this issue on three typical classes of features.

Categorical Features

For linear models, a categorical feature is commonly transformed to several binary features. For a data instance Yes P:ESPN A:Nike G:Male, we generate the following LIBSVM format. Yes P-ESPN:1 A-Nike:1 G-Male:1

Note that according to the number of possible values in a categorical feature, the same number of binary features are generated and every time only one of them has the value 1. In the LIBSVM format, features with zero values are not stored. We apply the same setting to all models, so in this paper, every categorical feature is transformed to several binary ones. To add the �eld information, we can consider each category as a �eld. Then the above instance becomes Yes P:P-ESPN:1 A:A-Nike:1 G:G-Male:1.

Numerical Features

Consider the following example to predict if a paper will be accepted by a conference. We use three numerical features \accept rate of the conference (AR),"\h-index of the author (Hidx)," and \number of citations of the author (Cite):" Accepted AR Hidx Cite Yes 45.73 2 3 No 1.04 100 50,000

There are two possible ways to assign �elds. A naive way is to treat each feature as a dummy �eld, so the generated data is:

Yes AR:AR:45.73 Hidx:Hidx:2 Cite:Cite:3 However, the dummy �elds may not be informative because they are merely duplicates of features.

Another possible way is to discretize each numerical feature to a categorical one. Then, we can use the same setting for categorical features to add �eld information. The generated data looks like:

Yes AR:45:1 Hidx:2:1 Cite:3:1, where the AR feature is rounded to an integer. The main drawback is that usually it is not easy to determine the best discretization setting. For example, we may transform 45.73 to \45.7," \45," \40," or even \int(log(45.73))." In addition, we may lose some information after discretization.

Single-field Features

On some data sets, all features belong to a single �eld and hence it is meaningless to assign �elds to features. Typically this situation happens on NLP data sets. Consider the following example of predicting if a sentence expresses a good mood or not: good mood sentence Yes Hooray! Our paper is accepted! No Well, our paper is rejected..

In this example the only �eld is \sentence." If we assign this �eld to all words, then FFMs is reduced to FMs. Readers may ask about assigning dummy �elds as we do for numerical features. Recall that the model size of FFMs is O(nfk). The use of dummy �elds is impractical because f = n and n is often huge.

### 4. EXPERIMENTS

In this section, we first provide the details about the experimental setting in Section 4.1. Then, we investigate the impact of parameters. We find that unlike LM or Poly2, FFM is sensitive to the number of epochs. Therefore, in Section 4.3, we discuss this issue in detail before proposing an early stopping trick. The speedup of parallelization is studied in Section 4.4.

After checking various properties of FFMs, in Sections 4.5- 4.6, we compare FFMs with other models including Poly2 and FMs. They are all implemented by the same SG method, so besides accuracy we can fairly compare their training time. Further in the comparison we include state-of-theart packages LIBLINEAR [14] and LIBFM [15] for training LM/Poly2 and FMs, respectively.

4.1 Experiment Settings

Data Sets

We mainly consider two CTR sets Criteo and Avazu from Kaggle competitions,3 though in Section 4.6 more sets are considered. For feature engineering, we mainly apply our winning solution but remove complicated components.4 For example, our winning solution for Avazu includes the ensemble of 20 models, but here we only use the simplest one. For other details please check our experimental code. A hashing trick is applied to generate 106 features. The statistics of the two data sets are:

Data Set # instances # features # �elds Criteo 45,840,617 107 39 Avazu 40,428,967 107 33

For both data sets, the labels in the test sets are not publicly available, so we split the available data to two sets for training and validation. The data split follows from how test sets are obtained: For Criteo, the last 6,040,618 lines are used as the validation set; for Avazu, we select the last 4,218,938 lines. We use the following terms to represent different sets of a problem.

� Va: The validation set mentioned above.

� Tr: The new training set after excluding the validation set from the original training data.

� TrVa: The original training set.

� Te: The original test set. The labels are not released, so we must submit our prediction to the original evaluation systems to get the score. To avoid over-fitting the test set, the competition organizers divide this data set to two subsets \public set" on which the score is visible during the competition and \private set" on which the score is available after the end of competition. The final rank is determined by the private set.

For example, CriteoVa means the validation set from Criteo.

3Criteo Display Advertising Challenge: http://www. kaggle.com/c/criteo-display-ad-challenge. Avazu Click- Through Rate Prediction: http://www.kaggle.com/c/ avazu-ctr-prediction.

4The code and documents of our winning solution to the two competitions can be found in http://github. com/guestwalk/kaggle-2014-criteo and http://github.com/ guestwalk/kaggle-avazu

Platform

All experiments are conducted on a Linux workstation with 12 physical cores on two Intel Xeon E5-2620 2.0GHz processors and 128 GB memory.

Evaluation

Depending on the model, we change �(w; x) in (1) to �LM(w; x), �Poly2(w; x), �FM(w; x), or �FFM(w; x) introduced in Sections 1-3. For the evaluation criterion, we consider the logistic loss de�ned as logloss = 1 m Xm i=1 log(1 + exp(􀀀yi�(w; xi))); where m is the number of test instances.

Implementation

We implement LMs, Poly2, FMs, and FFMs all in C++. For FMs and FFMs, we use SSE instructions to boost the e�ciency of inner products. The parallelization discussed in Section 3.2 is implemented by OpenMP [16]. Our implementations include linear terms and bias term as they improve performance in some data sets. These terms should be used in general as we seldom see them to be harmful.

Note that for code extensibility the �eld information is stored regardless of the model used. For non-FFM models, the implementation may become slightly faster by a simpler data structure without �eld information, but our conclusions from experiments should remain the same.

4.2 Impact of Parameters

We conduct experiments to investigate the impact of k, �, and �. The results can be found in Figure 1. Regarding the parameter k, results in Figure 1a show that it does not a�ect the logloss much. In Figure 1b, we present the relationship between � and logloss. If � is too large, the model is not able to achieve a good performance. On the contrary, with a small �, the model gets better results, but it easily over- �ts the data. We observe that the training logloss keeps decreasing. For the parameter �, Figure 1c shows that if we apply a small �, FFMs will obtain its best performance slowly. However, with a large �, FFMs are able to quickly reduce the logloss, but then over-fitting occurs. From the results in Figures 1b and 1c, there is a need of early-stopping that will be discussed in Section 4.3.

4.3 Early Stopping

Early stopping, which terminates the training process before reaching the best result on training data, can be used to avoid over-fitting for many machine learning problems [17, 18, 19]. For FFM, the strategy we use is:

1. Split the data set into a training set and a validation set.

2. At the end of each epoch, use the validation set to calculate the loss.

3. If the loss goes up, record the number of epochs. Stop or go to step 4.

4. If needed, use the full data set to re-train a model with the number of epochs obtained in step 3.

A di�culty in applying early stopping is that the logloss is sensitive to the number of epochs. Then the best epoch on the validation set may not be the best one on the test set. We have tried other approaches to avoid the overfitting k time logloss

1 27.236 0.45773 2 26.384 0.45715 4 27.875 0.45696 8 40.331 0.45690 16 70.164 0.45725

(a) The average running time (in seconds) per epoch and the best logloss with different values of k. Because we use SSE instructions, the running time of k = 1; 2; 4 is roughly the same.

Epochs 20 40 60 80 100 120 140 Logloss 0.44 0.45 0.46 0.47 0.48 0.49 0.5 6 = 1e − 6 6 = 1e − 5 6 = 1e − 4 6 = 1e − 3

(b) The impact of � Epochs

1 5 9 13 17 21 25 Logloss 0.45 0.46 0.47 0.48 0.49 0.5 2 = 0.01 2 = 0.02 2 = 0.05 2 = 0.1 2 = 0.2 2 = 0.5

(c) The impact of � Figure 1: The impact of �, �, and k on FFMs. To make experiments faster, we randomly select 10% instances from CriteoTr and CriteoVa as the training and the test sets, respectively.

such as lazy update5 and ALS-based optimization methods. However, results are not as successful as that by early stopping of using a validation set.

4.4 Speedup

Because the parallelization of SG may cause a different convergence behavior, we experiment with different numbers of threads in Figure 2. Results show that our parallelization still leads to similar convergence behavior. With this property we can de�ne the speedup as:

Running time of one epoch with a single thread Running time of one epoch with multiple threads

The result in Figure 3 shows a good speedup when the number of threads is small. However, if many threads are used, the speedup does not improve much. An explanation is that if two or more threads attempt to access the same memory address, one must wait for its term. This kind of con icts can happen more often when more threads are used.

Figure 2: The convergence of using different number of threads.

1 2 4 6 8 10 12 Speedup 1 1.5 2 2.5 3 3.5 4

Figure 3: The speedup of using multi-threading. We respectively use CriteoTr and CriteoVa as the training and the test sets.

4.5 Comparison with LMs, Poly2, and FMs on Two CTR Competition Data Sets

To have a fair comparison, we implement the same SG method for LMs, Poly2, FMs, and FFMs. Further, we compare with two state-of-the-art packages:

� LIBLINEAR: a widely used package for linear models. For L2-regularized logistic regression, it implements two optimization methods: Newton method to solve the primal problem, and coordinate descent (CD) method to solve the dual problem. We used both for checking how optimization methods a�ect the performance; see the discussion in the end of this sub-section. Further, the existing Poly2 extension of LIBLINEAR does not support the hashing trick,6 so we conduct suitable modi�cations and denote it as LIBLINEAR-Hash in this paper.

� LIBFM: As a widely used library for factorization machines, it supports three optimization approaches including stochastic gradient method (SG), alternating least squares (ALS), and Markov Chain Monte Carlo (MCMC). We tried all of them and found that ALS is signi�cantly better than the other two in terms of logloss. Therefore, we consider ALS in our experiments.

For the parameters in all models, from a grid of points we select those that lead to the best performance on the validation sets. Every optimization algorithm needs a stopping condition; we use the default setting for Newton method and 6https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#fast training testing for polynomial mappings of data coordinate descent (CD) method by LIBLINEAR. For each of the other models, we need a validation set to check which iteration leads to the best validation score. After we obtain the best number of iterations, we re-train the model up to that iteration. Results on Criteo and Avazu with the list of parameters used can be found in Table 3. Clearly, FFMs outperform other models in terms of logloss, but it also requires longer training time than LMs and FMs. On the other end, though the logloss of LMs is worse than other models, it is signi�cantly faster. These results show a clear trade-o� between logloss and speed. Poly2 is the slowest among all models. The reason might be the expensive computation of (2). FM is a good balance between logloss and speed. For LIBFM, it performs closely to our implementation of FMs in terms of logloss on Criteo.7 However, we see that our implementation is signi�cantly faster. We provide three possible reasons:

� The ALS algorithm used by LIBFM is more complicated than the SG algorithm we use.

� We use an adaptive learning rate strategy in SG.

� We use SSE instructions to boost inner product operations.

Because logistic regression is a convex problem, ideally, for either LM or Poly2, the three optimization methods (SG, Newton, and CD) should generate exactly the same model if they converge to the global optimum. However, practically results are slightly different. In particular, LM by SG is better than the two LIBLINEAR-based models on Avazu. In our implementation, LM via SG only loosely solves the optimization problem. Our experiments therefore indicate that the stopping condition of optimization methods can a�ect the performance of the resulting model even if the problem is convex.

4.6 Comparison on More Data Sets

In the previous section we focus on two competition data sets, but it is important to see how FFMs perform on other data sets. To answer this question, we consider more data sets for the comparison, where most of them are not CTR data. Note that following the discussion in Section 3.3, we do not consider data sets with single-�eld features. The reason is that depending on how we assign �elds, FFMs either become equivalent to FMs, or generate a huge model. Here we brie y introduce the data sets used.

� KDD2010-bridge:8 This data set includes both numerical and categorical features.

� KDD2012:9 This set contains both numerical and categorical features. Because our evaluation is logloss, we transform the original target value \number of clicks" to a binary value \clicked or not."

� cod-rna:10 This set contains only numerical features.

� ijcnn:10 This set contains only numerical features.

7The performance of LIBFM on AvazuVa is as good as the FM we have implemented, but the performance on AvazuTe is poor. It is not entirely clear what happened, so further investigation is needed.

9http://www.kddcup2012.org/c/kddcup2012-track2/data

10http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

Model and implementation parameters training time public set private set (seconds) logloss rank logloss rank

LM-SG � = 0:2; � = 0; t = 13 527 0.46262 93 0.46224 91 LM-LIBLINEAR-CD s = 7; c = 2 1,417 0.46239 91 0.46201 89 LM-LIBLINEAR-Newton s = 0; c = 2 7,164 0.46602 225 0.46581 222 Poly2-SG � = 0:2; � = 0;B = 107; t = 10 12,064 0.44973 14 0.44956 14 Poly2-LIBLINEAR-Hash-CD s = 7; c = 2 24,771 0.44893 13 0.44873 13 FM � = 0:05; � = 2 � 10􀀀5; k = 40; t = 8 2,022 0.44930 14 0.44922 14 FM � = 0:05; � = 2 � 10􀀀5; k = 100; t = 9 4,020 0.44867 11 0.44847 11 LIBFM � = 40; k = 40; t = 20 23,700 0.45012 14 0.45000 15 LIBFM � = 40; k = 40; t = 50 131,000 0.44904 14 0.44887 14 LIBFM � = 40; k = 100; t = 20 54,320 0.44853 11 0.44834 11 LIBFM � = 40; k = 100; t = 50 398,800 0.44794 9 0.44778 8 FFM � = 0:2; � = 2 � 10􀀀5; k = 4; t = 9 6,587 0.44612 3 0.44603 3

(a) Criteo

Model and implementation parameters training time public set private set (seconds) logloss rank logloss rank

LM-SG � = 0:2; � = 0; t = 10 164 0.39018 57 0.38833 64
LM-LIBLINEAR-CD s = 7; c = 1 417 0.39131 115 0.38944 119
LM-LIBLINEAR-Newton s = 0; c = 1 650 0.39269 182 0.39079 183
Poly2-SG � = 0:2; � = 0;B = 107; t = 10 911 0.38554 10 0.38347 10
Poly2-LIBLINEAR-Hash-CD s = 7; c = 1 1,756 0.38516 10 0.38303 9
Poly2-LIBLINEAR-Hash-Newton s = 0; c = 1 27,292 0.38598 11 0.38393 11
FM � = 0:05; � = 2 � 10􀀀5; k = 40; t = 8 574 0.38621 11 0.38407 11
FM � = 0:05; � = 2 � 10􀀀5; k = 100; t = 9 1,277 0.38740 17 0.38531 15
LIBFM � = 40; k = 40; t = 20 18,712 0.39137 122 0.38963 127
LIBFM � = 40; k = 40; t = 50 41,720 0.39786 935 0.39635 943
LIBFM � = 40; k = 100; t = 20 39,719 0.39644 747 0.39470 755
LIBFM � = 40; k = 100; t = 50 91,210 0.40740 1,129 0.40585 1,126
FFM � = 0:2; � = 2 � 10􀀀5; k = 4; t = 4 340 0.38411 6 0.38223 6


(b) Avazu

Table 3: Comparison among models and implementations on data sets Criteo and Avazu. The training sets used here are CriteoTrVa and AvazuTrVa, and the test sets used here are CriteoTe and AvazuTe. For all experiments, a single thread is used. The public set is around 20% of the test data, while the private set contains the rest. For Criteo, we do not list the result of Poly2-LIBLINEAR-Hash-Newton, because the experiment does not finish after more than 10 days. Note that the we use different stopping conditions for different algorithms, so the training time is only for reference.

� phishing:11 This set contains only categorical features.

� adult:12 This data set includes both numerical and categorical features.

For KDD2010-bridge, KDD2012, and adult, we simply discretize all numerical features into 29, 13, and 94 bins respectively. For cod-rna and ijcnn, where features are all numerical, we try both approaches mentioned in Section 3.3 to obtain �eld information: applying dummy �elds and discretization. For the parameter selection, we follow the same procedure in Section 4.5. We split each set into training, validation, and test sets; then for predicting the test set, we use the model trained with parameters that achieve the best logloss on the validation set.

The statistics and experimental results of each data set are described in Table 4. FFMs signi�cantly outperform other models on KDD2010-bridge and KDD2012. The common characteristic among these data sets are:

� Most features are categorical.

11http://archive.ics.uci.edu/ml/datasets/Phishing+Websites

� The resulting set is highly sparse after transforming categorical features into many binary features.

However, on phishing and adult, FFM is not signi�cantly better. For phishing, the reason might be that the data is not sparse so FFM, FM, and Poly2 have close performance; for adult, it seems feature conjunction is not useful because all models perform similarly to the linear model. When a data set contains only numerical features, FFMs may not have an obvious advantage. If we use dummy �elds, then FFMs do not out-perform FMs, a result indicating that the �eld information is not helpful. On the other hand, if we discretize numerical features, though FFMs is the best among all models, the performance is much worse than that of using dummy �elds. We summarize a guideline of applying FFMs on different kinds of data sets: � FFMs should be e�ective for data sets that contain categorical features and are transformed to binary features. � If the transformed set is not sparse enough, FFMs seem to bring less bene�t. � It is more di�cult to apply FFMs on numerical data sets.

statistics logloss
Data set # instances # features # �elds LM Poly2 FM FFM
KDD2010-bridge 20,012,499 651,166 9 0.27947 0.2622 0.26372 0.25639
KDD2012 149,639,105 54,686,452 11 0.15069 0.15099 0.15004 0.14906
phishing 11,055 100 30 0.14211 0.11512 0.09229 0.1065
adult 48,842 308 14 0.3097 0.30655 0.30763 0.30565
cod-rna (dummy �elds) 331,152 8 8 0.13829 0.12874 0.12580 0.12914
cod-rna (discretization) 331,152 2,296 8 0.16455 0.17576 0.16570 0.14993
ijcnn (dummy �elds) 141,691 22 22 0.20093 0.08981 0.07087 0.0692
ijcnn (discretization) 141,691 69,867 22 0.21588 0.24578 0.20223 0.18608

Table 4: Comparison between LM, Poly2, FM, and FFMs. The best logloss is underlined.

### 5. CONCLUSIONS AND FUTURE WORKS

In this paper we discuss e�cient implementations of FFMs. We demonstrate that for certain kinds of data sets, FFMs outperform three well-known models, LM, Poly2, and FM, in terms of logloss, with a cost of longer training time.

For the future work, the over-fitting problem discussed in Section 4.3 is an issue that we plan to investigate. Besides, for the ease of implementation we use SG as the optimization method. It is interesting to see how other optimization methods (e.g., Newton method) work on FFMs.

## References

• 1. (Chappele et al., 2015) => Olivier Chapelle, Eren Manavoglu, Romer Rosales, Simple and Scalable Response Prediction for Display Advertising, ACM Transactions on Intelligent Systems and Technology (TIST), v.5 n.4, January 2015 doi:10.1145/2532128
• 2. (McMahan et al., 2013) => H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, Jeremy Kubica, Ad Click Prediction: A View from the Trenches, Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 11-14, 2013, Chicago, Illinois, USA doi:10.1145/2487575.2488200
• 3. (Richardson et al., 2007) => Matthew Richardson, Ewa Dominowska, Robert Ragno, Predicting Clicks: Estimating the Click-through Rate for New Ads, Proceedings of the 16th International Conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada doi:10.1145/1242572.1242643
• 4. Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin, Training and Testing Low-degree Polynomial Data Mappings via Linear SVM, The Journal of Machine Learning Research, 11, p.1471-1490, 3/1/2010
• 5. Taku Kudo, Yuji Matsumoto, Fast Methods for Kernel-based Text Analysis, Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, p.24-31, July 07-12, 2003, Sapporo, Japan doi:10.3115/1075096.1075100
• 6. Steffen Rendle, Factorization Machines, Proceedings of the 2010 IEEE International Conference on Data Mining, p.995-1000, December 13-17, 2010 doi:10.1109/ICDM.2010.127
• 7. Steffen Rendle, Lars Schmidt-Thieme, Pairwise Interaction Tensor Factorization for Personalized Tag Recommendation, Proceedings of the Third ACM International Conference on Web Search and Data Mining, February 04-06, 2010, New York, New York, USA doi:10.1145/1718487.1718498
• 8. M. Jahrer, A. Töscher, J.-Y. Lee, J. Deng, H. Zhang, and J. Spoelstra, "Ensemble of Collaborative Filtering and Feature Engineered Model for Click through Rate Prediction," in KDD Cup 2012 Workshop, ACM, 2012.
• 9. J. Langford, L. Li, and A. Strehl, "Vowpal Wabbit," 2007. Https://github.com/JohnLangford/vowpal_wabbit/wiki.
• 10. John Duchi, Elad Hazan, Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, The Journal of Machine Learning Research, 12, p.2121-2159, 2/1/2011
• 11. H. B. McMahan, "Follow-the-regularized-leader and Mirror Descent: Equivalence Theorems and L1 Regularization," in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
• 12. W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin, "A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2015.
• 13. Feng Niu, Benjamin Recht, Christopher Re, Stephen J. Wright, HOGWILD!: A Lock-free Approach to Parallelizing Stochastic Gradient Descent, Proceedings of the 24th International Conference on Neural Information Processing Systems, p.693-701, December 12-15, 2011, Granada, Spain
• 14. Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, Chih-Jen Lin, LIBLINEAR: A Library for Large Linear Classification, The Journal of Machine Learning Research, 9, p.1871-1874, 6/1/2008
• 15. Steffen Rendle, Factorization Machines with LibFM, ACM Transactions on Intelligent Systems and Technology (TIST), v.3 n.3, p.1-22, May 2012 doi:10.1145/2168752.2168771
• 16. Leonardo Dagum, Ramesh Menon, OpenMP: An Industry-Standard API for Shared-Memory Programming, IEEE Computational Science & Engineering, v.5 n.1, p.46-55, January 1998 doi:10.1109/99.660313
• 17. Christopher M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, 2006
• 18. Garvesh Raskutti, Martin J. Wainwright, Bin Yu, Early Stopping and Non-parametric Regression: An Optimal Data-dependent Stopping Rule, The Journal of Machine Learning Research, v.15 n.1, p.335-366, January 2014
• 19. T. Zhang and B. Yu, "Boosting with Early Stopping: Convergence and Consistency," The Annals of Statistics, Vol. 33, No. 4, Pp. 1538--1579, 2005.

}};

volumeDate ValuetitletypejournaltitleUrldoinoteyear
2016 FieldAwareFactorizationMachinesField-aware Factorization Machines for CTR Prediction10.1145/2959100.29591342016
 Author Yuchin Juan +, Yong Zhuang +, Wei-Sheng Chin + and Chih-Jen Lin + doi 10.1145/2959100.2959134 + title Field-aware Factorization Machines for CTR Prediction + year 2016 +