The Review of Socionetwork Strategies 
Noise–Robust Sampling for Collaborative Metric Learning

Table of contents:
Noise–Robust Sampling for Collaborative Metric Learning
Abstract
1 Introduction
2 Preliminaries
2.1 Problem Formulation and Notation
2.2 Matrix Factorization (MF)
2.3 Collaborative Metric Learning (CML)
2.4 Weighted Negative Sampling
2.5 2-Stage Negative Sampling
3 Proposed Method
3.1 Estimate Latent Probability & Calculate Noise Rate
3.2 Proposed Method A: Double-Weighted Sampling
3.3 Proposed Method B: Double-Clean Sampling
4 Theoretical Analysis
4.1 Validity of the Reliability
4.2 Scalability
5 Experiments
5.1 Datasets
5.2 Baselines and Proposed Model
5.3 Metrics
5.4 Hyperparameters
5.5 Results
5.6 Explainability of Embeddings
5.7 Discussion
6 Conclusion
References


The Review of Socionetwork Strategies (2022) 16:307–332 https://doi.org/10.1007/s12626-022-00131-x ARTICLE

Noise–Robust Sampling for Collaborative Metric Learning

Ryo Matsui¹ · Suguru Yaginuma² · Taketo Naito³ · Kazuhide Nakata¹

Received: 23 March 2022 / Accepted: 15 September 2022 / Published online: 9 October 2022
© Springer Japan KK, part of Springer Nature 2022

Abstract
Abundant research has been conducted recently on recommendation systems. A recommendation system is a subfield of information retrieval and machine learning that aims to identify the value of objects and information for everyone. For example, recommendation systems on web services learn the latent preferences of users based on their behavioral history and display content that the system considers a user favorite. Specifically, recommendation systems for web services have two main requirements: (1) use the embeddings of users and items for predictions and (2) use implicit feedback data, which does not require users' active actions, when learning. Recently, a method called collaborative metric learning (CML) has been developed to satisfy the first requirement. However, this method does not address the noisy-label issues caused by implicit feedback data in the second requirement. This study proposes a comprehensive and effective method to deal with noise in CML. The experimental results show that the proposed method significantly improves performance on both requirements compared with existing methods.

Keywords: Recommendation systems · Machine learning · Collaborative filtering · Metric learning · Implicit feedback

Mathematics Subject Classification: 68T99

* Ryo Matsui [email protected]
Suguru Yaginuma [email protected]
Taketo Naito [email protected]
Kazuhide Nakata [email protected]

1 Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8550, Japan
2 MC Digital, Inc., Tokyo, Japan
3 SMN, Corp., Tokyo, Japan




1 Introduction

We distribute and receive a variety of content on the web, such as videos, music, news, and books. Users with Internet access find it convenient to consume such content, as it can be reached easily anytime and anywhere. However, with a vast amount of content, users face difficulties finding their favorites. This is where recommendation systems become necessary. A recommendation system is a subfield of information retrieval and machine learning that aims to identify the value of objects and information for everyone. Recommendation systems in web services aim to support users in finding items (content) that match their individual preferences based on their attribute information and behavioral history. Moreover, modern recommendation systems are required to possess two significant features.

The first requirement of recommendation systems on web services is to use embeddings of users and items when predicting their relations. There are two main advantages of using embeddings. The first advantage is low latency. If we do not use embeddings, we must predict user preferences for all items and sort them, which is impractical for a service with a large number of items. However, if we use embeddings, we can find item vectors similar to the user vector by an approximate nearest-neighbor search [1, 2] within a time frame that does not damage the user experience. In particular, we can avoid calculating preferences for all items and still find recommended items from a large item set. The second advantage is that we can also represent the similarity between user–user and item–item pairs. By embedding users and items into a shared vector space, we can capture not only the similarity of user–item pairs but also that of user–user and item–item pairs. Although kNN-type methods using raw preference vectors can capture user–user or item–item relationships, the embedding-based method can capture all relationships: user–user, item–item, and user–item. This advantage enables a wide range of applications, such as marketing to capture customer segments, friend recommendations on SNS, and similar-item recommendations on e-commerce sites [3–5].

The standard method for recommendations with embeddings is the matrix factorization (MF)-based model. However, previous research has indicated that this approach may fail to capture fine-grained preference information owing to the loss function of MF [6–8]. This method is optimized for predicting user–item interactions; however, it neglects the similarity between user–user and item–item pairs. To solve this issue, Hsieh et al. developed an algorithm called collaborative metric learning (CML) [6]. CML learns a distance metric on the joint vector space of users and items. As described in Sect. 2, the triangle inequality of distance metrics enables CML to learn the relationships between user–user pairs and item–item pairs. Given that this approach addresses the user–user and item–item relationship problem, it has been applied and extended [9–14].

The second requirement of recommendation systems for web services is the use of implicit feedback when learning the relationship between users and items. In today's web services, it is easy to collect log data without any active action by the user [15].


Thus, implicit feedback, such as clicking on a webpage, has been widely used in recommendation systems [16–20]. However, this approach of using implicit feedback suffers from noisy-label issues. Prior studies [21–23] have indicated differences between implicit feedback and actual user satisfaction. More precisely, there are two types of noise. The first is (1) positive noise (false-positive): an event in which a user interacts with an item but is not interested in it. Such an event occurs because the first impression of a caption or headline causes users to click [23, 24]. This type of noise can lead to low-quality recommendations because the system tends to suggest items that the user does not like. The second is (2) negative noise (false-negative): an event in which users miss their favorite items. This is also referred to as the positive-unlabeled problem [20, 25, 26]. In particular, a lack of interaction does not always indicate a negative response from the user. Thus, the system should not treat such potentially positive but unobserved items as entirely negative.

In summary, to build a modern recommendation system, we should develop an algorithm that captures user–item, user–user, and item–item relationships and that considers the noisy labels associated with implicit feedback.

Tran et al. [27] addressed the abovementioned challenge by considering noise during CML training. As CML is based on a triplet loss, it repeats training with batches of triplets selected via uniform sampling. This, however, is a disadvantage of CML: uniform sampling is inefficient for learning user–item relationships because it selects noisy data with a certain probability. To resolve this problem, [27] proposed a two-stage negative sampling strategy weighted by the noise rate of the negative items.

However, there are two main problems with [27]. The first problem is that weighted sampling might not be optimal for CML. Although weighted sampling is less likely to select noisy negative pairs, a negative sample set that includes even one noisy pair degrades the overall performance owing to the triangle inequality. In particular, because CML exploits the triangle inequality, the harmful effects of noise are likely to spread to surrounding users and items. Therefore, it may be better not to use extremely noisy negative pairs for training at all. The second problem is that previous works do not consider positive noise. Although the extension of negative sampling can reduce negative noise, positive noise also has significantly harmful effects on learning. Therefore, we need an algorithm that handles both negative and positive pairs.

To solve these problems, we propose two approaches. One is called "double-weighted sampling," which weighs positive and negative sampling by the latent observed probability estimated by another model. The other is called "double-clean sampling," which eliminates user–item pairs with extreme noise before CML training. To judge extremely noisy pairs, we use the latent observed probability to estimate the noise rate. These methods make it easy to handle positive and negative noise. Confident learning [28], a state-of-the-art method for image classification with noisy labels, supports this approach of estimating the noise rate from noisy data and cleaning the data. In addition, we theoretically analyze the validity of using the latent observed probability and discuss its applicability to large-scale data.


In our experiments, these methods were applied to datasets whose training data contain noise but whose test data do not. The results indicate that the proposed methods improved the three ranking metrics by more than 10%. In comparing "double-weighted sampling" and "double-clean sampling," the performance of "double-clean sampling" was, as expected, more stable.

Beyond improving the metrics for the usual user–item recommendations, we investigated how well the methods capture user–user and item–item relationships. After calculating the recommendation metrics, we visualized how close the embeddings of users and items with similar features were and evaluated how accurately a support vector machine (SVM) could classify those features from the embeddings. The results show that although the embeddings of naive methods did not represent the features well, the proposed method alleviates this problem. In addition, the classification error rate improved from 7.3 to 5.6% for user gender and from 1.7 to 0.7% for the target gender of clothing.

2 Preliminaries

This section describes the formulation of a recommendation system with implicit feedback and summarizes related work.

2.1 Problem Formulation and Notation

Let U be the set of users and I be the set of items. We assume that the system recommends the top K items i_1, i_2, …, i_K ∈ I that are relevant to user u ∈ U when the user arrives. We consider Y_{u,i} = +1 if an interaction between user u ∈ U and item i ∈ I occurs before the arrival of the user, and Y_{u,i} = −1 otherwise. The set of all interactions collected at this point is denoted by S = {(u, i) | Y_{u,i} = +1}, and the set of all user–item pairs not included in this set is denoted by D = U × I \ S = {(u, i) | Y_{u,i} = −1}. Furthermore, let D_u = {i | Y_{u,i} = −1} be the set of items that are not included in S for user u. Using this information, the system learns the embeddings v_u ∈ ℝ^l for all u ∈ U and v_i ∈ ℝ^l for all i ∈ I. We denote the set of all embeddings to be trained as Θ = {v_c}_{c∈U∪I}. Then, the system recommends items based on the similarity Sim(v_u, v_i). However, note that the observed interaction Y_{u,i} is not necessarily equal to the ground-truth relevance R_{u,i} ∈ {+1, −1} between user u and item i. We assume that the relevance R_{u,i} is not observable and that Y_{u,i} is observed with noise. Therefore, it is necessary to consider noise when learning the embeddings. We summarize the symbols used in the formulation and their definitions in Table 1.

2.2 Matrix Factorization (MF)

Collaborative filtering based on matrix factorization is commonly used in recommendation systems. This method considers the similarity function Sim(u, i) = v_u^T v_i.


Table 1  Notations and definitions

  Symbol                        Definition
  U                             User set
  I                             Item set
  u ∈ U                         User
  i ∈ I                         Item
  K ∈ ℕ                         Number of recommended items
  R_{u,i} ∈ {+1, −1}            Ground-truth rating (= relevance) of item i for user u
  Y_{u,i} ∈ {+1, −1}            Observed rating (= interaction) of item i for user u
  v_u                           Embedding of user u
  v_i                           Embedding of item i
  Θ = {v_c}_{c∈U∪I}             Set of all embeddings
  S = {(u, i) | Y_{u,i} = +1}   Set of user–item pairs with positive observed ratings
  D = U × I \ S                 Set of user–item pairs with negative observed ratings
  D_u = {i | Y_{u,i} = −1}      Set of items with negative observed ratings for user u
  Sim(·, ·)                     Similarity function
  d(·, ·)                       Euclidean distance function

Hu et al. formulated weighted matrix factorization based on c_{u,i}, the reliability of Y_{u,i}, as follows [21]:

\min_{\Theta} \sum_{u \in U} \sum_{i \in I} c_{u,i} \left( Y_{u,i} - \hat{Y}_{u,i} \right)^2, \quad \text{where} \quad \hat{Y}_{u,i} = v_u^T v_i.
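This weighted objective is typically minimized by alternating least squares or gradient descent. The following is a minimal sketch, not the implementation used in the paper, of one stochastic gradient step on this loss; the array names and the toy confidence weights are hypothetical.

```python
import numpy as np

def wmf_sgd_step(V_user, V_item, Y, C, u, i, lr=0.01):
    """One SGD step of weighted MF: minimize c_ui * (Y_ui - v_u^T v_i)^2 for one pair."""
    err = Y[u, i] - V_user[u] @ V_item[i]          # residual Y_ui - v_u^T v_i
    grad_u = -2.0 * C[u, i] * err * V_item[i]      # d/dv_u of c_ui * err^2
    grad_i = -2.0 * C[u, i] * err * V_user[u]      # d/dv_i of c_ui * err^2
    V_user[u] -= lr * grad_u
    V_item[i] -= lr * grad_i

# toy usage: 5 users, 7 items, 4-dimensional embeddings (all values illustrative)
rng = np.random.default_rng(0)
V_user, V_item = rng.normal(size=(5, 4)), rng.normal(size=(7, 4))
Y = rng.choice([-1.0, 1.0], size=(5, 7))           # observed ratings
C = 1.0 + 4.0 * (Y > 0)                            # simple confidence weights
wmf_sgd_step(V_user, V_item, Y, C, u=0, i=3)
```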

2.3 Collaborative Metric Learning (CML)

The above matrix factorization-based methods mainly attempt to represent the user–item relationship by the inner product of the user and item embeddings. However, the inner products of user–user pairs and item–item pairs do not represent the strength of their relationships. Hsieh et al. [6] pointed out that this property strongly affects the quality and interpretability of recommendations, and proposed CML as a method to solve this problem. CML learns embeddings by setting the loss function as

L(\Theta) = \sum_{(u,i) \in S} \sum_{j \in D_u} w_{u,i} \left[ m + d(v_u, v_i)^2 - d(v_u, v_j)^2 \right]_+    (1)

and solving

\min_{\Theta} L(\Theta), \quad \text{s.t.} \ \|v\|^2 \le 1, \ \forall v \in \Theta.

Here, d represents the Euclidean distance, which is the similarity function in CML, w_{u,i} is a weight, and m represents the maximum margin.


Fig. 1  Illustration of the effect of triangle inequality in CML

The triangle inequality of the distance function enables CML to capture user–user and item–item relationships. This property comes from

d(u_1, i_1) + d(u_2, i_1) \ge d(u_1, u_2) \quad \text{(user–user pair)},
d(u_1, i_1) + d(u_1, i_2) \ge d(i_1, i_2) \quad \text{(item–item pair)}.

If we observe that (u_1, i_1), (u_1, i_2), (u_2, i_1) are positive and reduce the distances of these pairs, the distances of the user–user pair (u_1, u_2) and the item–item pair (i_1, i_2) are also reduced. Moreover, the triangle inequality allows CML to learn unobserved user–item pairs as relevant pairs, since

d(u_1, i_1) + d(u_1, i_2) + d(u_2, i_1) \ge d(i_1, i_2) + d(u_2, i_1) \ge d(u_2, i_2).

As Fig. 1 shows, the distance of the unobserved pair (u_2, i_2) is also reduced when Y_{u_1,i_1} = Y_{u_1,i_2} = Y_{u_2,i_1} = +1.

However, it is practically difficult to determine the global optimum of the loss in Eq. (1) over the large number of combinations in S and D. Consequently, CML finds a local optimum by stochastic gradient descent. Starting from initialized embedding vectors, CML repeatedly (1) samples mini-batches, (2) calculates the loss, and (3) updates the parameters until the loss converges.

There are two processes in the mini-batch sampling phase: positive and negative sampling. First, in positive sampling, we randomly extract a fixed number of user–item pairs (u, i) with positive feedback from S. We denote this set of pairs as B ⊂ S and call it a positive sample. Next, for each user–item pair (u, i) in B, we randomly extract a fixed number of items j from D_u. We denote the set of negative items for a pair (u, i) ∈ B as N_{u,i} ⊂ D_u and the collection of these sets as N = {N_{u,i}}_{(u,i)∈B}, and call it a negative sample. Note that both are samplings with replacement.

The loss function of the mini-batch is also modified for stochastic gradient descent. Specifically, the loss function is defined as

L_{\mathrm{batch}}(\Theta, B, N) = \sum_{(u,i) \in B} w_{u,i} \left[ d^2(v_u, v_i) - \min_{j \in N_{u,i}} d^2(v_u, v_j) + m \right]_+    (2)

using the B and N obtained through positive and negative sampling. The difference from Eq. (1) is the loss on item j. While Eq. (1) sums the losses over all items in D_u for user u, Eq. (2) only considers the loss for the negative item j located nearest to user u. This formulation leverages the fact that we obtain a better solution if we focus on moving the negative items that are located closest to the user.
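A minimal sketch of the mini-batch loss of Eq. (2) follows, assuming NumPy arrays of user and item embeddings; the weights w_{u,i} are fixed to 1 and the toy negatives are drawn uniformly, which the later sections refine.

```python
import numpy as np

def cml_batch_loss(V_user, V_item, batch, negatives, margin=0.5):
    """Hinge loss of Eq. (2): [d^2(u, i) - min_j d^2(u, j) + m]_+ summed over the batch (w_ui = 1)."""
    loss = 0.0
    for (u, i), neg_items in zip(batch, negatives):
        d_pos = np.sum((V_user[u] - V_item[i]) ** 2)                       # d^2(v_u, v_i)
        d_neg = min(np.sum((V_user[u] - V_item[j]) ** 2) for j in neg_items)
        loss += max(0.0, d_pos - d_neg + margin)
    return loss

# toy usage: B = {(0, 1), (2, 3)} with three negatives each (here drawn uniformly at random;
# in practice the negatives must come from D_u)
rng = np.random.default_rng(0)
V_user, V_item = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
batch = [(0, 1), (2, 3)]
negatives = [rng.integers(0, 6, size=3), rng.integers(0, 6, size=3)]
print(cml_batch_loss(V_user, V_item, batch, negatives))
```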


2.4 Weighted Negative Sampling

The abovementioned method neglects the causes of noise. By contrast, weighted negative sampling additionally considers the negative noise caused by the popularity of items. This method weighs the negative sampling probability using the number of observed interactions of an item in the dataset. Previous studies on representation learning have demonstrated that this method is effective [29, 30]. If the number of observations of item j in the dataset is f(j), the sampling probability of item j in negative sampling, Pr(j), is expressed as

\Pr(j) = \frac{f(j)^{\beta}}{\sum_{j'} f(j')^{\beta}},

where β represents the weight hardness. We can interpret this method as assuming that P(R_{u,i} = −1 | Y_{u,i} = −1) depends on the popularity of the item.
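A minimal sketch of this popularity weighting follows; the exponent β = 0.75 is a common choice in representation learning and is an assumption here, as are the toy counts.

```python
import numpy as np

def popularity_weights(item_counts, beta=0.75):
    """Sampling probabilities Pr(j) ∝ f(j)^beta over all items (Sect. 2.4)."""
    w = np.asarray(item_counts, dtype=float) ** beta
    return w / w.sum()

# toy usage: item j=2 is very popular and is therefore drawn more often as a negative
counts = [5, 1, 120, 30, 2]                        # f(j): observed interactions per item
probs = popularity_weights(counts, beta=0.75)
rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=10, p=probs, replace=True)
print(probs.round(3), negatives)
```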

2.5 2-Stage Negative Sampling

Tran et al. [27] extended weighted negative sampling inspired by global orthogonal regularization (GOR). GOR is a regularization term proposed by Zhang et al. [31] based on the idea that the embeddings should spread out sufficiently in the latent space. They showed that the probability density function (pdf) of the inner product of two vectors p_1 and p_2 distributed uniformly on a d-dimensional hypersphere is

p(p_1^T p_2 = s) =
\begin{cases}
\dfrac{(1 - s^2)^{\frac{d-1}{2} - 1}}{\mathrm{Beta}\!\left( \frac{d-1}{2}, \frac{1}{2} \right)} & \text{if } -1 \le s \le 1, \\
0 & \text{otherwise},
\end{cases}    (3)

and that the expected values of the inner product, E[p_1^T p_2], and of its square, E[(p_1^T p_2)^2], are 0 and 1/d, respectively. Here, Beta(a, b) denotes the beta function with parameters a and b. Zhang et al. further introduced a regularization into representation learning such that the inner product of any pair of embeddings approaches 0 and the square of the inner product is close to 1/d. This technique allows the embeddings to spread out sufficiently and improves their representation. We can introduce this regularization in CML as well.

Tran et al. introduced this regularization to CML and proposed a negative sampling method based on the inverse of the pdf in Eq. (3). This sampling method enables us to focus on negative items that are close in distance to positive items, resulting in more effective learning of embeddings. However, since it is challenging to calculate the inner product for all item combinations, the method performs popularity-based weighted sampling in the first stage and sampling based on GOR in the second. Specifically, the two stages are as follows:

1. Select a set of negative sample candidates, C_{u,i} ⊂ I, by weighted negative sampling.


2. Select a highly informative sample, N_{u,i} ⊂ C_{u,i}, among the candidates on the basis of the probability

\Pr(j \mid v_i) \propto
\begin{cases}
\dfrac{1}{p(v_i^T v_j = s)} & 0 \le s \le 1, \\
0 & \text{otherwise}.
\end{cases}

This sampling reduces the harmful effects of false-negative noise, because the method finds more informative negative items by using the information of the positive items. However, this method only considers false-negative noise and not false-positive noise.
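A minimal sketch of the second-stage weights follows, assuming (approximately) unit-norm embeddings so that the density of Eq. (3) applies; scipy.special.beta supplies the Beta function in the denominator, and the candidate indices are hypothetical.

```python
import numpy as np
from scipy.special import beta as beta_fn

def inner_product_pdf(s, d):
    """Density of Eq. (3) for the inner product of two uniform unit vectors in d dimensions."""
    s = np.clip(s, -1.0 + 1e-6, 1.0 - 1e-6)
    return (1.0 - s ** 2) ** ((d - 1) / 2.0 - 1.0) / beta_fn((d - 1) / 2.0, 0.5)

def second_stage_probs(v_pos, V_item, candidates):
    """Second-stage weights of the 2-stage sampler: Pr(j | v_i) ∝ 1 / p(v_i^T v_j = s) for s >= 0."""
    d = v_pos.shape[0]
    s = np.array([v_pos @ V_item[j] for j in candidates])       # inner products with the positive item
    w = np.where(s >= 0.0, 1.0 / inner_product_pdf(s, d), 0.0)  # only 0 <= s <= 1 receives mass
    return w / w.sum() if w.sum() > 0 else np.full(len(candidates), 1.0 / len(candidates))

# toy usage: weights for four candidate negatives given one positive item
rng = np.random.default_rng(0)
V_item = rng.normal(size=(5, 10))
V_item /= np.linalg.norm(V_item, axis=1, keepdims=True)
print(second_stage_probs(V_item[0], V_item, candidates=[1, 2, 3, 4]).round(3))
```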

3 Proposed Method

Weighted negative sampling and 2-stage negative sampling have two problems in terms of noise. First, weighted sampling cannot eliminate the harmful effects of negative noise. Second, these methods cannot handle positive noise. To address these problems, we propose sampling methods that weigh or clean both positive and negative feedback data based on confident learning [28].

Confident learning is a method for cleaning data using the results of learning from noisy data in image classification. First, this method builds a model that predicts the probability of belonging to each label,

\hat{p}(y \mid x_i) = g(x_i; \Phi), \quad g: \mathbb{R}^d \rightarrow [0, 1]^M,

using a dataset D = \{(x_i, \tilde{y}_i)\}_{i=1}^{N} whose instances are features x_i ∈ ℝ^d and noisy labels \tilde{y}_i ∈ {1, 2, …, M}. Next, it removes samples with a high noise rate

\rho_{j, x_i} = \hat{p}(y = j \mid x_i) - \hat{p}(y = \tilde{y}_i \mid x_i), \quad (j \ne \tilde{y}_i),

from D. Finally, it performs training again using the cleaned data.

We consider similar steps for CML with implicit feedback. In the first step, we estimate the latent probability and calculate the noise rate. In the second step, we propose two approaches. One is called "double-weighted sampling," which weighs positive and negative sampling by the latent observed probability estimated by another model. The other is called "double-clean sampling," which eliminates user–item pairs with extreme noise before the CML.

3.1 Estimate Latent Probability & Calculate Noise Rate

In the first step, which is common to both approaches, we estimate the noise rate. The positive noise rate ρ^+_{u,i} in the positive sample S = {(u, i) | Y_{u,i} = +1} and the negative noise rate ρ^-_{u,i} in the negative sample D = {(u, i) | Y_{u,i} = −1} can be expressed as

\rho^+_{u,i} = \hat{P}(Y_{u,i} = -1) - \hat{P}(Y_{u,i} = +1) = 1 - 2\hat{P}(Y_{u,i} = +1),    (4)

\rho^-_{u,i} = \hat{P}(Y_{u,i} = +1) - \hat{P}(Y_{u,i} = -1) = 2\hat{P}(Y_{u,i} = +1) - 1,    (5)


which is similar to confident learning. Thus, we can calculate these noise rates by estimating P̂(Y_{u,i} = +1) for all user–item pairs (u, i) ∈ U × I.

To calculate the noise rates, we estimate P̂(Y_{u,i} = +1) using the given labels Y_{u,i}. In other words, for any model g: U × I → ℝ parameterized by Φ, a learning algorithm fits Φ such that

\hat{P}(Y_{u,i} = +1) = \sigma(g(u, i; \Phi)).

Note that σ is the sigmoid function. Assuming that P(Y_{u,i} = +1) follows a binomial distribution, the maximum likelihood estimation is the minimization problem

\min_{\Phi} \sum_{u \in U} \sum_{i \in I} -Y_{u,i} \log \sigma(g(u, i; \Phi)) - (1 - Y_{u,i}) \log \sigma(-g(u, i; \Phi)).    (6)

We can model g more flexibly than the CML model because it need not satisfy the requirements of a recommendation system in a web service. Thus, classical recommendation algorithms can be used, such as logistic matrix factorization (LMF) [32] and factorization machines [33]. For example, LMF defines the model

g_{\mathrm{lmf}}(u, i; \Phi_{\mathrm{lmf}}) = v_u^T v_i + b_u + b_i, \quad \text{where} \ \Phi_{\mathrm{lmf}} = \{v_c, b_c\}_{c \in U \cup I},

with embeddings v_u ∈ ℝ^d (∀u ∈ U), v_i ∈ ℝ^d (∀i ∈ I) and bias terms b_u ∈ ℝ (∀u ∈ U), b_i ∈ ℝ (∀i ∈ I). It uses a gradient descent method such as Adam [34] to find a local optimum. When the user set U and the item set I are extremely large, optimizing over all user–item pairs is challenging; in that case, we learn by stochastic gradient descent with sampling, as in the CML mini-batch. Although it is inefficient to deploy LMF as the recommendation model in a modern web service because it cannot be applied to approximate nearest-neighbor searches, it is sufficient as a latent probability estimation model. These procedures allow us to compute P̂(Y_{u,i} = +1) and the noise rates ρ^+_{u,i}, ρ^-_{u,i}.
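A minimal sketch of this estimation step follows, assuming the intent of Eq. (6) is the usual logistic loss −log σ(Y_{u,i} · g(u, i; Φ)) for Y_{u,i} ∈ {+1, −1}; the function and array names are hypothetical, and the noise rates follow Eqs. (4) and (5).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lmf_sgd_step(V_user, V_item, b_user, b_item, u, i, y, lr=0.05):
    """One SGD step on the logistic loss -log sigma(y * g) with g = v_u^T v_i + b_u + b_i (LMF)."""
    g = V_user[u] @ V_item[i] + b_user[u] + b_item[i]
    coeff = -y * sigmoid(-y * g)          # d/dg of -log sigma(y * g)
    grad_vu = coeff * V_item[i]
    grad_vi = coeff * V_user[u]
    V_user[u] -= lr * grad_vu
    V_item[i] -= lr * grad_vi
    b_user[u] -= lr * coeff
    b_item[i] -= lr * coeff

def noise_rates(p_hat_pos):
    """Eqs. (4) and (5): rho+ = 1 - 2*P_hat(Y=+1), rho- = 2*P_hat(Y=+1) - 1."""
    return 1.0 - 2.0 * p_hat_pos, 2.0 * p_hat_pos - 1.0
```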

3.2 Proposed Method A: Double-Weighted Sampling

The second step of the first proposed method is double-weighted sampling. Here, the objective of weighting positive sampling is to focus more on user–item pairs that are more likely to have R_{u,i} = +1, indicating that user u potentially likes item i. Similarly, the objective of weighting negative sampling is to focus more on user–item pairs that are more likely to have R_{u,i} = −1, indicating that user u potentially does not like item i. However, we cannot observe R_{u,i}; we only obtain the noisy Y_{u,i}. Therefore, we must estimate the probabilities P(R_{u,i} = +1) and P(R_{u,i} = −1) using only the data on Y_{u,i}.

Previous studies focused only on negative sampling and assumed that P(R_{u,i} = −1) is proportional to the popularity of item i; in other words, they weigh item j by its popularity f(j) = |{u | Y_{u,j} = +1}|. However, this method implicitly assumes that there is no noise in the positive labels Y_{u,j} = +1; therefore, it might not work in noisy situations. We therefore propose estimating the probability P(Y_{u,i} = ±1) as an alternative to P(R_{u,i} = ±1). The validity of this alternative is discussed in Sect. 4.

Because the estimator P̂(Y_{u,i} = ±1) was obtained in the previous subsection, we can simply implement the weighting method. In positive sampling, we set the probability Pr(u, i) of extracting a user–item pair (u, i) into the mini-batch from the positive feedback set S to

\Pr(u, i) = \frac{\hat{P}(Y_{u,i} = +1)}{\sum_{(u', i') \in S} \hat{P}(Y_{u',i'} = +1)}.    (7)

In negative sampling, for each user–item pair (u, i) drawn in positive sampling, we set the probability \Pr_u(j) of extracting item j to

\Pr_u(j) = \frac{\hat{P}(Y_{u,j} = -1)}{\sum_{j' \in D_u} \hat{P}(Y_{u,j'} = -1)}.    (8)

Equations (7) and (8) set the weights proportional to P̂(Y_{u,i} = ±1) in the numerators, and the denominators are normalization terms. This weighting allows us to focus more on reliable user–item pairs in both positive and negative sampling. The above procedure is shown in Algorithm 1.
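Algorithm 1 is not reproduced in this text version; the following is a minimal sketch, under hypothetical array names, of the two weighted sampling steps of Eqs. (7) and (8), with P̂(Y_{u,i} = +1) given as a dense matrix for simplicity.

```python
import numpy as np

def double_weighted_sample(p_hat, S, n_pos, n_neg, rng):
    """Sketch of double-weighted sampling (Eqs. (7) and (8)), sampling with replacement.

    p_hat[u, i] is the estimated P_hat(Y_ui = +1); S is the list of observed positive pairs."""
    # positive sampling: Pr(u, i) ∝ P_hat(Y_ui = +1) over (u, i) in S
    w_pos = np.array([p_hat[u, i] for (u, i) in S])
    w_pos /= w_pos.sum()
    batch = [S[k] for k in rng.choice(len(S), size=n_pos, p=w_pos, replace=True)]

    # negative sampling: for each (u, i) in the batch, Pr_u(j) ∝ P_hat(Y_uj = -1) over j in D_u
    positives = {}
    for (u, i) in S:
        positives.setdefault(u, set()).add(i)
    negatives = []
    for (u, _) in batch:
        D_u = [j for j in range(p_hat.shape[1]) if j not in positives.get(u, set())]
        w_neg = np.array([1.0 - p_hat[u, j] for j in D_u])      # P_hat(Y_uj = -1)
        w_neg /= w_neg.sum()
        negatives.append([D_u[k] for k in rng.choice(len(D_u), size=n_neg, p=w_neg, replace=True)])
    return batch, negatives
```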

3.3 Proposed Method B: Double-Clean Sampling

The other approach for the second step of the proposed method is double-clean sampling. We use the noise rates ρ^+_{u,i}, ρ^-_{u,i} calculated in the previous section to determine the user–item pairs to be removed from positive and negative sampling in the CML. First, we remove noisy user–item pairs from S. We set the proportion p of the data to be deleted and calculate the threshold t^+ as


t^+ = \mathrm{Percentile}(\{\rho^+_{u,i}\}_{(u,i) \in S}, p),

where Percentile(A, p) denotes the top p-percentile of the set of scalars A. Then, the set of clean positive user–item pairs, S', is defined as

S' = \{(u, i) \mid (u, i) \in S \wedge \rho^+_{u,i} \le t^+\}.

Next, we remove noisy user–item pairs from D. As on the positive side, the threshold t^- for the proportion q of the data to be removed is obtained as

t^- = \mathrm{Percentile}(\{\rho^-_{u,i}\}_{(u,i) \in D}, q).

Then, we obtain the clean negative user–item pairs

D' = \{(u, i) \mid (u, i) \in D \wedge \rho^-_{u,i} \le t^-\}.

Furthermore, the subset D'_u ⊂ I of low-noise items with no observed interaction with user u is denoted by

D'_u = \{i \mid (u, i) \in D'\}.

Sampling from the clean sets S' and D'_u (∀u ∈ U) obtained in this manner prevents us from drawing extremely noisy data into B and N_{u,i}. This approach is called double-clean sampling, and it is shown in Algorithm 2.


4 Theoretical Analysis

This section discusses the theoretical properties of the proposed methods. In Subsect. 4.1, we discuss the validity of using P̂(Y_{u,i} = ±1) as an alternative to P(R_{u,i} = ±1) for weighing and cleaning, and in Subsect. 4.2 we discuss scalability to large-scale data.

4.1 Validity of the Reliability

This subsection discusses the validity of the reliability P̂(Y_{u,i} = ±1) employed as an alternative to P(R_{u,i} = ±1). As mentioned earlier, the first proposed method, double-weighted sampling, weighs the sampling probability by P̂(Y_{u,i} = ±1) instead of P(R_{u,i} = ±1). For double-clean sampling, we determine the user–item pairs to be removed from the sampling targets by the noise rates in Eqs. (4) and (5). By these definitions, ρ^+_{u,i} monotonically decreases with P̂(Y_{u,i} = +1) and ρ^-_{u,i} monotonically increases with P̂(Y_{u,i} = +1), so the method removes user–item pairs with small P̂(Y_{u,i} = +1) from positive sampling and user–item pairs with large P̂(Y_{u,i} = +1) from negative sampling. Therefore, both proposed methods estimate P(Y_{u,i} = ±1) and apply it to noise processing as if it represented P(R_{u,i} = ±1).

We now show how P(Y_{u,i} = ±1) works as an alternative to P(R_{u,i} = ±1). In particular, we show upper and lower bounds on the error P(R_{u,i} = ±1) − P(Y_{u,i} = ±1) and summarize their implications for weighing and cleaning. For this purpose, we first define the ground-truth potential noise rates as

\sigma^+_{u,i} = P(Y_{u,i} = +1 \mid R_{u,i} = -1) \in [0, 0.5),
\sigma^-_{u,i} = P(Y_{u,i} = -1 \mid R_{u,i} = +1) \in [0, 0.5).

Using these notations, we can derive

-\sigma^+_{u,i} \le P(R_{u,i} = +1) - P(Y_{u,i} = +1) \le \sigma^-_{u,i},    (9)

-\sigma^-_{u,i} \le P(R_{u,i} = -1) - P(Y_{u,i} = -1) \le \sigma^+_{u,i}.    (10)

Proof (Equations (9) and (10)) We can express P(Y_{u,i} = +1) using P(R_{u,i} = +1), σ^+_{u,i}, and σ^-_{u,i} as follows:

P(Y_{u,i} = +1) = P(R_{u,i} = +1) P(Y_{u,i} = +1 \mid R_{u,i} = +1) + P(R_{u,i} = -1) P(Y_{u,i} = +1 \mid R_{u,i} = -1)
             = P(R_{u,i} = +1)(1 - \sigma^-_{u,i}) + (1 - P(R_{u,i} = +1)) \sigma^+_{u,i}
             = P(R_{u,i} = +1)(1 - \sigma^+_{u,i} - \sigma^-_{u,i}) + \sigma^+_{u,i}.    (11)

We can also express P(Y_{u,i} = -1) as follows:

P(Y_{u,i} = -1) = P(R_{u,i} = -1) P(Y_{u,i} = -1 \mid R_{u,i} = -1) + P(R_{u,i} = +1) P(Y_{u,i} = -1 \mid R_{u,i} = +1)
             = P(R_{u,i} = -1)(1 - \sigma^+_{u,i}) + (1 - P(R_{u,i} = -1)) \sigma^-_{u,i}
             = P(R_{u,i} = -1)(1 - \sigma^+_{u,i} - \sigma^-_{u,i}) + \sigma^-_{u,i}.    (12)

By solving Eq. (11) for P(R_{u,i} = +1), we get

P(R_{u,i} = +1) = \frac{P(Y_{u,i} = +1) - \sigma^+_{u,i}}{1 - \sigma^+_{u,i} - \sigma^-_{u,i}} \ge P(Y_{u,i} = +1) - \sigma^+_{u,i}
\Leftrightarrow P(R_{u,i} = +1) - P(Y_{u,i} = +1) \ge -\sigma^+_{u,i},    (13)

because 0 < 1 - \sigma^+_{u,i} - \sigma^-_{u,i} \le 1. Similarly, by solving Eq. (12) for P(R_{u,i} = -1), we get

P(R_{u,i} = -1) = \frac{P(Y_{u,i} = -1) - \sigma^-_{u,i}}{1 - \sigma^+_{u,i} - \sigma^-_{u,i}} \ge P(Y_{u,i} = -1) - \sigma^-_{u,i}
\Leftrightarrow P(R_{u,i} = -1) - P(Y_{u,i} = -1) \ge -\sigma^-_{u,i}.    (14)

Furthermore, substituting

P(R_{u,i} = -1) = 1 - P(R_{u,i} = +1), \quad P(Y_{u,i} = -1) = 1 - P(Y_{u,i} = +1)

into Eq. (14), we get

\{1 - P(R_{u,i} = +1)\} - \{1 - P(Y_{u,i} = +1)\} \ge -\sigma^-_{u,i}
\Leftrightarrow P(R_{u,i} = +1) - P(Y_{u,i} = +1) \le \sigma^-_{u,i}.    (15)

Then, substituting

P(R_{u,i} = +1) = 1 - P(R_{u,i} = -1), \quad P(Y_{u,i} = +1) = 1 - P(Y_{u,i} = -1)

into Eq. (13), we get

\{1 - P(R_{u,i} = -1)\} - \{1 - P(Y_{u,i} = -1)\} \ge -\sigma^+_{u,i}
\Leftrightarrow P(R_{u,i} = -1) - P(Y_{u,i} = -1) \le \sigma^+_{u,i}.    (16)

Thus, Eq. (9) is obtained from Eqs. (13) and (15), and Eq. (10) is obtained from Eqs. (14) and (16). Note that equality holds when \sigma^+_{u,i} = \sigma^-_{u,i} = 0. ◻
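As a quick numerical sanity check (not part of the paper), the following sketch draws random values of P(R_{u,i} = +1), σ^+_{u,i}, and σ^-_{u,i}, generates P(Y_{u,i} = ±1) through the noise model of Eqs. (11) and (12), and verifies the bounds of Eqs. (9) and (10).

```python
import numpy as np

rng = np.random.default_rng(0)
p_r = rng.uniform(0.0, 1.0, size=100_000)          # P(R_ui = +1)
s_pos = rng.uniform(0.0, 0.5, size=100_000)        # sigma+ = P(Y=+1 | R=-1)
s_neg = rng.uniform(0.0, 0.5, size=100_000)        # sigma- = P(Y=-1 | R=+1)

p_y = p_r * (1.0 - s_neg) + (1.0 - p_r) * s_pos    # P(Y=+1), Eq. (11)
err_pos = p_r - p_y                                # P(R=+1) - P(Y=+1)
err_neg = (1.0 - p_r) - (1.0 - p_y)                # P(R=-1) - P(Y=-1)

assert np.all((-s_pos <= err_pos + 1e-12) & (err_pos <= s_neg + 1e-12))   # Eq. (9)
assert np.all((-s_neg <= err_neg + 1e-12) & (err_neg <= s_pos + 1e-12))   # Eq. (10)
print("bounds of Eqs. (9) and (10) hold on all sampled points")
```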

Equations (9) and (10) show that the error of P(Y_{u,i} = ±1) relative to P(R_{u,i} = ±1) is bounded by the potential noise rates σ^+_{u,i}, σ^-_{u,i}. This property implies that the difference between P(R_{u,i} = ±1) and P(Y_{u,i} = ±1) becomes negligible when the potential noise rate is sufficiently low. Therefore, by replacing P(R_{u,i} = ±1) with P(Y_{u,i} = ±1) for weighing and cleaning, we can sufficiently remove the noise that is generated even under a low potential noise rate. However, user–item pairs with a severe potential noise rate result in a significant error: a large σ^+_{u,i} can make P(Y_{u,i} = +1) overestimate P(R_{u,i} = +1), and a large σ^-_{u,i} can make P(Y_{u,i} = −1) overestimate P(R_{u,i} = −1). In particular, the methods may be vulnerable to noise that is bound to occur, that is, noise arising from pairs with a high potential noise rate. Unfortunately, we cannot measure the potential noise rate in general, and it is difficult to perform simulations because the mechanism of noise generation in the real world is nontrivial. Therefore, this study evaluates the practicality of these methods by verifying their performance on several datasets collected in the real world.

4.2 Scalability

This subsection discusses the scalability of the proposed methods. We should pay attention to time and space complexity when applying recommendation algorithms to large-scale data. In the context of a general recommendation system, O(|S|) is accepted as a feasible complexity, so it is necessary to check whether the proposed methods are also acceptable. The baseline of this study, CML, is scalable to large-scale data and consists mainly of

– sampling of mini-batches,
– calculation of the loss function,
– updates of the parameters.

Of these components, the proposed methods change only the first and are the same as CML for the other two. The following discussion therefore focuses on the computational complexity of mini-batch sampling.

First, in terms of time complexity, the proposed methods differ from the standard CML in that they estimate the reliability to calculate the sampling probabilities and switch to weighted sampling. Although the time complexity of the reliability estimation depends on the choice of the estimation model g(u, i; Φ), the logistic matrix factorization employed in our experiments has the same time complexity as the standard CML. Thus, the time complexity of the proposed method is at most twice that of the standard CML, which is acceptable.

Table 2  Time complexity of positive and negative sampling for each method

                     Positive sampling     Negative sampling
  CML                O(|B|)                O(|B|)
  Double–weighted    O(|B| log |S|)        O(|B| · |N_{u,i}| log |I|)
  Double–clean       O(|B|)                O(|B|)

Table 3  Space complexity of positive and negative sampling for each method

                     Positive sampling     Negative sampling
  CML                O(1)                  O(|S|)
  Double–weighted    O(|S|)                O(|S| + |B| · |I|)
  Double–clean       O(1)                  O((1 − q) · |S| + q · |U| · |I|)

In addition, weighted sampling requires O(m log n) time for n candidates and m extracted samples, whereas uniform sampling requires O(m). Therefore, double–weighted sampling requires O(|B| log |S|) for positive sampling and O(|B| · |N_{u,i}| log |I|) for negative sampling. However, because the sizes of B and N_{u,i} are set to be sufficiently small, the proposed method is also acceptable in terms of computation time. Table 2 summarizes the time complexity of each sampling method.

Next, we examine space complexity. For positive sampling, only the memory for the sampling probabilities in double–weighted sampling differs from the standard CML. Therefore, the space complexity required for positive sampling is O(|S|) for double–weighted sampling and O(1) for the others. For negative sampling, all methods sample from the set of items D_u that have no observed interaction with user u. A naive implementation would need to store, for every user, the items they have not interacted with, which would require \sum_{u∈U} |D_u| space. Fortunately, however, S is generally sparse, and the item set S_u = {i | Y_{u,i} = +1} = I \ D_u is sufficiently small compared with the item set I. Because of this sparseness, instead of storing D_u for every user u, it is more efficient to store S_u, which achieves the same purpose with a smaller capacity. In other words, it is only necessary to store \sum_{u∈U} |S_u| = |S| entries as the "blocklist" of items that must not be extracted in negative sampling. In addition, negative sampling in double–weighted requires memory of size |B| · |I| for the probabilities of extracting each item for each user–item pair (u, i) ∈ B. By contrast, negative sampling in double–clean requires memory of size q · \sum_{u∈U} |D_u| = q · (|U| · |I| − |S|) to remember the removed pairs, where q is the proportion to delete. Therefore, the space complexity required for negative sampling is O(|S| + |B| · |I|) for double–weighted and O(|S| + q · (|U| · |I| − |S|)) = O((1 − q) · |S| + q · |U| · |I|) for double–clean. Table 3 summarizes the space complexity of each sampling method.

As shown in this table, positive sampling poses no problem in terms of space complexity, but we need to be careful about negative sampling. In double–weighted sampling, if the item set I is large, the size of the positive sample B must be reduced.


In stochastic gradient descent, the larger the mini-batch, the more stable the learning becomes, so this is a disadvantage compared with the standard CML. In addition, double–clean is affected by the deletion rate q. In particular, when the numbers of users and items are large, a complexity of O(|U| · |I|) becomes infeasible. Therefore, it is necessary to set the deletion ratio q to the same level as the sparsity r = |S| / (|U| · |I|). If q = r, the space complexity is O((2 − r) · |S|) = O(|S|), and even large-scale data can be handled. In conclusion, we can apply the proposed methods to large-scale data if we choose the mini-batch size and deletion rate appropriately.
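A minimal sketch of the "blocklist" idea described above for the uniform case follows: only S_u is stored, and negatives are drawn by rejection sampling; the weighted variants additionally require the probability tables whose sizes appear in Table 3.

```python
import numpy as np

def sample_negatives_blocklist(S_u, n_items, n_neg, rng):
    """Rejection-sample negatives for one user, storing only S_u (the 'blocklist')
    instead of the much larger D_u."""
    blocklist = set(S_u)              # items the user has interacted with; |S_u| << |I|
    negatives = []
    while len(negatives) < n_neg:
        j = int(rng.integers(0, n_items))
        if j not in blocklist:        # accept only items with no observed interaction
            negatives.append(j)
    return negatives

# toy usage: a user who interacted with 3 of 1,000 items; draw 5 uniform negatives
rng = np.random.default_rng(0)
print(sample_negatives_blocklist([7, 42, 999], n_items=1000, n_neg=5, rng=rng))
```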

5 Experiments

This section describes the experiments conducted to verify the performance of the proposed methods using real-world datasets. In particular, we discuss whether the proposed methods improve the ranking metrics compared with 2-stage negative sampling and weighted sampling. The experiments were conducted from April to May 2021 on a Google Colaboratory GPU instance with regular memory, where the GPU was always an NVIDIA Tesla P100. The code for reproducing the experimental results is available online.¹

5.1 Datasets

The experiments used the Yahoo! R3² and coat³ datasets, which have been used to evaluate recommendation systems with bias and noise removal. These datasets are already divided into training and test data. The training data contain ratings of items that were freely selected by each user, whereas the test data contain ratings of items that were randomly selected for each user. In particular, the training data contain noise and bias, similar to other datasets, whereas the test data contain almost no noise or bias. As these datasets have five-level ratings, we performed the common preprocessing steps described below. This enabled us to simulate the positive and negative noise of implicit feedback in the training dataset and then evaluate whether we can learn accurately with noisy labels. The preprocessing was carried out as follows:

1. First, we generated the relevance data for implicit feedback by applying the following treatment to the five-level ratings of the training and test datasets:
   – R_{u,i} = +1 for (u, i) with a rating of 4 or 5;
   – R_{u,i} = −1 for (u, i) with a rating of 1 to 3;

1. https://colab.research.google.com/drive/1_4hegtR2BORSN8a9PjLTfTsHdiN55Z4p?usp=sharing
2. https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
3. https://www.cs.cornell.edu/~schnabts/mnar/


Table 4  Statistics of the datasets after preprocessing

  Dataset     # Users   # Items   # Interactions
  Yahoo! R3   15,400    1,000     311,704
  Coat        290       300       6,960

   – R_{u,i} is missing for (u, i) with no rating.
   We denote the relevance in the training and test data as R^{train}_{u,i} and R^{test}_{u,i}, respectively.
2. Next, we created the noisy label Y_{u,i} observed in the training data using

Y_{u,i} =
\begin{cases}
+1 & (R^{train}_{u,i} \text{ is not missing}), \\
-1 & (R^{train}_{u,i} \text{ is missing}).
\end{cases}

These preprocessing steps introduce two types of noise into the training data:
– Positive noise (false-positive): Y_{u,i} = +1 and R^{train}_{u,i} = −1.
– Negative noise (false-negative): Y_{u,i} = −1 and R^{test}_{u,i} = +1.

Because we used R^{test}_{u,i} for validation, we were able to evaluate the recommendation system on data that contain no noise. Table 4 shows the dataset statistics after preprocessing.

5.2 Baselines and Proposed Model

In the experiments, we used the following existing methods as baselines:
– Uniform: completely uniform sampling.
– 1stage: weighted negative sampling by popularity.
– 2stage: 2-stage negative sampling.

The first proposed method, double–weighted sampling, and its ablations are as follows:
– pos–weighted: positive sampling weighted by the latent probability.
– neg–weighted: negative sampling weighted by the latent probability.
– double–weighted: the first proposed method; CML with both of the above.

The second proposed method, double–clean sampling, and its ablations are as follows:
– pos–clean: the top p% noisiest pairs are excluded from S.


– neg–clean: the top q% noisiest pairs are excluded from D.
– double–clean: the second proposed method; CML with both of the above.

5.3 Metrics

We used three ranking metrics that represent the accuracy of ranking items with R^{test}_{u,i} = +1 for each user. We computed the ranking using the distance between the user and item vectors produced by the CML model. Following previous studies [15, 20] that used the same datasets, we adopted the following three ranking metrics:

\mathrm{DCG@}K = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|S^{\mathrm{test}}_u|} \sum_{i \in S^{\mathrm{test}}_u} \frac{\mathbb{I}\{\hat{Z}_{u,i} \le K\}}{\log(\hat{Z}_{u,i} + 1)},

\mathrm{MAP@}K = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|S^{\mathrm{test}}_u|} \sum_{k=1}^{K} \frac{1}{k} \sum_{i \in S^{\mathrm{test}}_u} \mathbb{I}\{\hat{Z}_{u,i} \le k\},

\mathrm{Recall@}K = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|S^{\mathrm{test}}_u|} \sum_{i \in S^{\mathrm{test}}_u} \mathbb{I}\{\hat{Z}_{u,i} \le K\},

where \hat{Z}_{u,i} represents the predicted rank of item i for user u, and S^{test}_u = {i | R^{test}_{u,i} = +1}. In addition, we use nDCG@K to denote DCG@K normalized by its value at the ideal ranking. Furthermore, we removed users with |S^{test}_u| = 0, because we judged that the recommendation system cannot improve the experience of these users. Previous studies that used the same datasets set the scores for such users to zero; thus, our experimental scores are higher overall, but this does not change the ordering of the methods.

5.4 Hyperparameters

We set the learning rate to 1e-3, the dimension l of v ∈ Θ to 10, and M^+ = 256 for the training data. For the Yahoo! R3 dataset, we used 50k mini-batch iterations and M^- = 10; for the coat dataset, we used 10k mini-batch iterations and M^- = 5. We used LMF to estimate the noise rate and implemented it in a unified manner by changing the CML loss from Eq. (1) to Eq. (6). For LMF, we performed 20k and 3k mini-batch iterations for the Yahoo! R3 and coat datasets, respectively. Finally, we set the percentage of data to be eliminated in the proposed method to p = q = 5%. Because it is difficult to prepare two or more test datasets, hyperparameter tuning is left for future work.
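A minimal per-user sketch of the three metrics of Sect. 5.3 follows; the reported scores additionally average over users, the logarithm base (2 here) is an assumption, and the toy ranks are hypothetical.

```python
import numpy as np

def ranking_metrics(rank_hat, S_test_u, K):
    """DCG@K, MAP@K and Recall@K for a single user.

    rank_hat[i] is the predicted rank Z_hat_{u,i} of item i (1 = closest to the user);
    S_test_u is the (non-empty) set of items with R_test_{u,i} = +1."""
    hits = [rank_hat[i] for i in S_test_u]
    dcg = sum(1.0 / np.log2(r + 1) for r in hits if r <= K) / len(S_test_u)
    map_k = sum((1.0 / k) * sum(1 for r in hits if r <= k) for k in range(1, K + 1)) / len(S_test_u)
    recall = sum(1 for r in hits if r <= K) / len(S_test_u)
    return dcg, map_k, recall

# toy usage: two relevant items ranked 1st and 4th among the user's candidates
print(ranking_metrics({10: 1, 11: 4, 12: 9}, S_test_u={10, 11}, K=3))
```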


Table 5  Ranking metrics and run time for each method on the Yahoo! R3 dataset

  Method            nDCG@3   MAP@3   Recall@3   nDCG@5   MAP@5   Recall@5   Run time [s]
  Uniform           0.494    0.550   0.523      0.567    0.564   0.702      3.90 × 10²
  1stage            0.450    0.493   0.502      0.534    0.516   0.704      4.26 × 10²
  2stage            0.416    0.459   0.464      0.505    0.487   0.680      5.05 × 10²
  pos–weighted      0.539    0.591   0.572      0.609    0.600   0.746      6.02 × 10²
  neg–weighted      0.530    0.585   0.559      0.599    0.594   0.730      6.51 × 10²
  double–weighted   0.541    0.596   0.571      0.613    0.605   0.748      6.60 × 10²
  pos–clean         0.551    0.604   0.582      0.620    0.611   0.755      7.21 × 10²
  neg–clean         0.559    0.612   0.590      0.628    0.620   0.763      7.43 × 10²
  double–clean      0.559    0.612   0.590      0.626    0.619   0.760      7.53 × 10²

Bold and underlined values in the original table indicate the highest and second-highest scores, respectively.

Table 6  Ranking metrics and run time for each method on the coat dataset

  Method            nDCG@3   MAP@3   Recall@3   nDCG@5   MAP@5   Recall@5   Run time [s]
  Uniform           0.271    0.355   0.219      0.312    0.371   0.337      0.61 × 10²
  1stage            0.253    0.326   0.209      0.296    0.349   0.333      0.63 × 10²
  2stage            0.296    0.390   0.234      0.347    0.416   0.375      0.62 × 10²
  pos–weighted      0.341    0.446   0.282      0.393    0.466   0.432      1.00 × 10²
  neg–weighted      0.347    0.453   0.294      0.394    0.469   0.439      1.02 × 10²
  double–weighted   0.347    0.445   0.289      0.394    0.467   0.431      1.02 × 10²
  pos–clean         0.340    0.448   0.281      0.393    0.462   0.432      1.19 × 10²
  neg–clean         0.346    0.449   0.290      0.397    0.467   0.438      1.06 × 10²
  double–clean      0.346    0.450   0.290      0.402    0.471   0.450      1.18 × 10²

Bold and underlined values in the original table indicate the highest and second-highest scores, respectively.

5.5 Results

Table 5 shows the scores of the three ranking metrics for two values of K and the runtime (including noise rate estimation) of the baseline and proposed methods on the Yahoo! R3 dataset. Table 6 shows the scores and runtime on the coat dataset. Bold and underlined values indicate the highest and second-highest scores, respectively. These scores are averages over five training runs, each performed on 90% of the positive interactions randomly selected from the training dataset. The results show that weighting or cleaning based on the estimated noise rate considerably improves the scores of all metrics over uniform sampling and the existing techniques (1stage and 2stage).

Figures 2 and 3 depict the means and confidence intervals of the nDCG@3 and Recall@5 scores of each method over the learning process. As shown in these figures, the confidence intervals at the end of training are small, and the performance improvements are significant. Furthermore, these results show that all the methods converge by the end of the iterations, and no further improvement is expected even if the training time is increased.


Fig. 2  Means and confidence intervals of nDCG@3 over the learning progress of each method. The horizontal axis represents the percentage of progress from the beginning to the end of the CML training. (left) Results for the Yahoo! R3 dataset. (right) Results for the coat dataset

Fig. 3  Means and confidence intervals of Recall@5 over the learning progress of each method. The horizontal axis represents the percentage of progress from the beginning to the end of the CML training. (left) Results for the Yahoo! R3 dataset. (right) Results for the coat dataset

Therefore, based on the runtime results in Tables 5 and 6, we can conclude that the proposed methods improve performance at the cost of approximately twice the computational time of uniform sampling.

More specifically, neg–clean and double–clean are the best and second-best methods on the Yahoo! R3 dataset, whereas neg–weighted and double–clean are the best and second-best methods on the coat dataset, respectively. By contrast, neg–weighted, which is the best on the coat dataset, is not better than any of the cleaning methods (pos–clean, neg–clean, and double–clean) on the Yahoo! R3 dataset, and neg–clean, which is the best on the Yahoo! R3 dataset, is not the best on the coat dataset. Thus, the most stable and efficient method is double–clean. In many situations, it is more effective to use both the positive and negative noise rates to remove extremely noisy data rather than to weight by them.

The scores of 1stage and 2stage on the Yahoo! R3 dataset are worse than those of uniform sampling because of false-positive noise. In previous experiments that highlighted the effectiveness of methods using item popularity, there was no noise in the positive feedback, so the correct popularity was available. In our experiment, however, only noisy feedback is available, so the correct popularity cannot be calculated, and noisy items are treated as overly popular. As a result, the model heavily samples noisy user–item pairs during negative sampling.


Fig. 4  Scatterplot of user embeddings color-coded by gender. (Left) User embedding when uniform is used. (Right) User embedding using double–clean 

Fig. 5  Scatterplot of item embeddings color-coded by target gender. (Left) Item embeddings using uniform. (Right) Item embeddings using double–clean 

This result suggests that it is not adequate to rely solely on implicit feedback to calculate popularity in the real world.

5.6 Explainability of Embeddings

This subsection examines whether the performance improvement of the proposed methods discussed in the previous subsection is due to capturing user–user and item–item relationships. For this purpose, we evaluated whether users and items with similar attributes are embedded at short distances. This experiment used the user's gender and the item's target gender in the coat dataset as attributes; note that the items in this dataset are clothing. Figure 4 shows scatter plots of the user embeddings learned by uniform and double–clean, compressed into two dimensions by t-SNE [35] and color-coded by gender. Figure 5 presents scatter plots of the item embeddings color-coded by target gender. There are few interactions across genders for clothing, and thus the attributes should be easy to express; nevertheless, these figures show that the embedding generated by uniform is not ideal in certain areas.
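A minimal sketch of the two evaluations in this subsection follows, assuming scikit-learn; the paper does not specify the t-SNE settings or the SVM kernel, so library defaults are used, and the toy embeddings are random.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def visualize_and_score(embeddings, labels, random_state=0):
    """2-D t-SNE projection of the embeddings plus a 5-fold cross-validated SVM error rate
    on the attribute labels."""
    coords = TSNE(n_components=2, random_state=random_state).fit_transform(embeddings)
    acc = cross_val_score(SVC(), embeddings, labels, cv=5)
    return coords, 1.0 - acc.mean()      # (x, y) points for the scatter plot, error rate

# toy usage with random "embeddings" and binary attribute labels
rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 10))
lab = rng.integers(0, 2, size=60)
coords, err = visualize_and_score(emb, lab)
print(coords.shape, round(err, 3))
```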


Table 7  Error rate (%) when an SVM classifies attribute labels using the embeddings as features

  Method            User Avg.   User Max.   Item Avg.   Item Max.
  Uniform           7.24        12.07       1.67        5.00
  2stage            8.28        13.79       1.67        3.33
  double–weighted   5.52        8.62        1.33        3.33
  double–clean      5.52        8.62        0.67        1.67

Bold values in the original table indicate the best scores.

To quantify this evaluation, we used an SVM [36] to evaluate how well the attributes can be classified. Specifically, we computed the error rate of classifying the attribute labels using the user and item embeddings as features. As the SVM is a classifier that predicts labels based on the distances between samples in a vector space, it is a suitable algorithm for evaluating the embeddings produced by CML. Table 7 shows the classification error rates evaluated by 5-fold cross-validation. The two columns on the left are for user classification, and the two columns on the right are for item classification; in each case, the left column is the average error rate and the right column is the maximum (worst-case) error rate. These results show that the error rate of double–clean is reduced to approximately 3/4 of that of uniform for users and approximately 1/2 for items. Notably, in the worst case, the improvement is even more significant. Thus, the improvement in the ranking metrics discussed in the previous subsection can be explained by better capturing of the user–user and item–item relationships. This result implies that the proposed methods, double–weighted and double–clean, meet the requirements of modern recommendation systems.

5.7 Discussion

This experiment has two limitations owing to the scarcity of datasets that match the problem setting: hyperparameter tuning is impossible, and the variety of datasets to be validated is limited.

Hyperparameter tuning is challenging mainly because of the small test data size. The proposed method is a recommendation system that works even when the training data contain noise. Therefore, to demonstrate the superiority of our method, we must validate it on training data with noise and test data without noise. However, collecting noiseless data is generally expensive. The Yahoo! R3 and coat datasets used in this experiment are rare datasets satisfying this requirement; nevertheless, both still have small test sets. The main ways to create a validation set are (1) randomly splitting the test data and (2) randomly splitting the training data, but both are problematic. For (1), neither the validation set nor the test set would have sufficient data. For (2), the validation data, like the training data, do not indicate whether each feedback is noise, so performance on the validation data is not robust to noise.


The limited variety of datasets is attributable to the same factors: there are only two public datasets that contain real-world feedback with noise-free test data. We could also generate synthetic data, but as the mechanism of noise generation has not been clarified, a realistic validation is almost impossible. Nevertheless, there are three main reasons why our results are valuable despite these limitations:

– The hyperparameters p and q are not tuned in this experiment and are fixed at 5%, a ratio widely used to indicate an abnormality in hypothesis testing.
– This study shows that, when large test data can be obtained, tuning is easy and good hyperparameters most likely exist.
– Research on unbiased recommendation systems has demonstrated excellent results with these datasets [15, 20, 37–39].

6 Conclusion

This study proposed algorithms that satisfy two requirements of recommendation systems in web services: (1) using embeddings of users and items when predicting the relationships between users and items, and (2) using implicit feedback when learning these relationships. Specifically, we introduced sampling methods that comprehensively consider the noise caused by implicit feedback (the second requirement) in CML, which was developed to satisfy the first requirement. Two methods were proposed: "double–weighted sampling," which weighs the sampling probabilities by the reliability, and "double–clean sampling," which removes data with a high noise rate from the sampling targets. Because they involve a trade-off between general information loss and vulnerability to extreme noise, we evaluated the algorithms on real-world datasets and settings. The two methods significantly outperformed the previous methods with respect to requirements (1) and (2). Additionally, the performance of "double–clean" was more stable on the datasets used in our experiments.

There are two main future directions for this study. One is to develop a method that overcomes the limitations of the datasets discussed in Subsect. 5.7. We recognize the tuning of the deletion ratios p and q with noisy training data as the most significant barrier to practical use, and we will investigate a method for determining optimal ratios based on the trade-off between information loss and noise. The other direction is to develop an estimator of P(R_{u,i} = ±1) with better statistical properties, such as unbiasedness and consistency. This study estimated P(Y_{u,i} = ±1) as an alternative to P(R_{u,i} = ±1); however, as shown in Sect. 4, bias occurs when the potential noise rate is large, and we could not demonstrate consistency. Therefore, if we can develop a better estimator that can be computed only from the observed values Y_{u,i}, it will be possible to handle the noise more suitably.


References

1. Wieschollek, P., Wang, O., Sorkine-Hornung, A., & Lensch, H. (2016). Efficient large-scale approximate nearest neighbor search on the GPU. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, 2027–2035.
2. Huang, J.-T., Sharma, A., Sun, S., Xia, L., Zhang, D., Pronin, P., Padmanabhan, J., Ottaviano, G., & Yang, L. (2020). Embedding-based retrieval in Facebook search. In: KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2553–2561.
3. Grbovic, M., & Cheng, H. (2018). Real-time personalization using embeddings for search ranking at Airbnb. In: KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 311–320.
4. Guo, D., Xu, J., Zhang, J., Xu, M., Cui, Y., & He, X. (2017). User relationship strength modeling for friend recommendation on Instagram. Neurocomputing, 239, 9–18.
5. Brovman, Y. M., Jacob, M., Srinivasan, N., Neola, S., Galron, D., Snyder, R., & Wang, P. (2016). Optimizing similar item recommendations in a semi-structured marketplace to maximize conversion. In: RecSys '16: Proceedings of the 10th ACM Conference on Recommender Systems, pp. 199–202.
6. Hsieh, C.-K., Yang, L., Cui, Y., Lin, T.-Y., Belongie, S., & Estrin, D. (2017). Collaborative metric learning. In: WWW '17: Proceedings of the 26th International Conference on World Wide Web, pp. 193–201.
7. Yu, L., Zhang, C., Pei, S., Sun, G., & Zhang, X. (2018). WalkRanker: A unified pairwise ranking model with multiple relations for item recommendation. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 2596–2603.
8. Chen, C.-M., Wang, C.-J., Tsai, M.-F., & Yang, Y.-H. (2019). Collaborative similarity embedding for recommender systems. In: WWW '19: The World Wide Web Conference, pp. 2637–2643.
9. Campo, M., Espinoza, J., Rieger, J., & Taliyan, A. (2018). Collaborative metric learning recommendation system: Application to theatrical movie releases. arXiv preprint arXiv:1803.00202.
10. Tay, Y., Tuan, L. A., & Hui, S. C. (2018). Latent relational metric learning via memory-based attention for collaborative ranking. In: WWW '18: Proceedings of the 2018 World Wide Web Conference, pp. 729–739.
11. Zhou, X., Liu, D., Lian, J., & Xie, X. (2019). Collaborative metric learning with memory network for multi-relational recommender systems. In: IJCAI '19: Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 4454–4460.
12. Park, C., Kim, D., Xie, X., & Yu, H. (2018). Collaborative translational metric learning. IEEE International Conference on Data Mining (ICDM), 2018, 367–376.
13. Vinh, T. L., Tay, Y., Zhang, S., Cong, G., & Li, X. (2020). HyperML: A boosting metric learning approach in hyperbolic space for recommender systems. In: WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 609–617.
14. Tran, V.-A., Salha-Galvan, G., Hennequin, R., & Moussallam, M. (2021). Hierarchical latent relation modeling for collaborative metric learning. arXiv preprint arXiv:2108.04655.
15. Saito, Y. (2020). Unbiased pairwise learning from biased implicit feedback. In: ICTIR '20: Proceedings of the 2020 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 5–12.
16. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural collaborative filtering. In: WWW '17: Proceedings of the 26th International Conference on World Wide Web, pp. 173–182.
17. Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009). BPR: Bayesian personalized ranking from implicit feedback. In: UAI '09: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 452–461.
18. Covington, P., Adams, J., & Sargin, E. (2016). Deep neural networks for YouTube recommendations. In: RecSys '16: Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198.
19. Chen, J., Zhang, H., He, X., Nie, L., Liu, W., & Chua, T.-S. (2017). Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In: SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–344.


20. Saito, Y., Yaginuma, S., Nishino, Y., Sakata, H., & Nakata, K. (2020). Unbiased recommender learning from missing-not-at-random implicit feedback. In: WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 501–509.
21. Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In: ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272.
22. Lu, H., Zhang, M., & Ma, S. (2018). Between clicks and satisfaction: Study on multi-phase user preferences and satisfaction for online news reading. In: SIGIR '18: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 435–444.
23. Wang, W., Feng, F., He, X., Nie, L., & Chua, T.-S. (2021). Denoising implicit feedback for recommendation. In: WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 373–381.
24. Wang, W., Feng, F., He, X., Zhang, H., & Chua, T.-S. (2021). Clicks can be cheating: Counterfactual recommendation for mitigating clickbait issue. In: SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
25. Yu, W., & Qin, Z. (2020). Sampler design for implicit feedback data by noisy-label robust learning. In: SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 861–870.
26. Wang, N., Qin, Z., Wang, X., & Wang, H. (2021). Non-clicks mean irrelevant? Propensity ratio scoring as a correction. In: WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 481–489.
27. Tran, V.-A., Hennequin, R., Royo-Letelier, J., & Moussallam, M. (2019). Improving collaborative metric learning with efficient negative sampling. In: SIGIR '19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1201–1204.
28. Northcutt, C. G., Jiang, L., & Chuang, I. L. (2021). Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70, 1373–1411.
29. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, 26, 3111–3119.
30. Barkan, O., & Koenigstein, N. (2016). Item2vec: Neural item embedding for collaborative filtering. IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016, 1–6.
31. Zhang, X., Yu, F. X., Kumar, S., & Chang, S.-F. (2017). Learning spread-out local feature descriptors. IEEE International Conference on Computer Vision (ICCV), 2017, 4605–4613.
32. Johnson, C. C. (2014). Logistic matrix factorization for implicit feedback data. In: Advances in Neural Information Processing Systems 27: 28th Annual Conference on Neural Information Processing Systems 2014 (NIPS), pp. 1–9.
33. Rendle, S. (2010). Factorization machines. In: ICDM '10: Proceedings of the 2010 IEEE International Conference on Data Mining, pp. 995–1000.
34. Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings.
35. Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579–2605.
36. Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.
37. Lee, J.-W., Park, S., & Lee, J. (2021). Dual unbiased recommender learning for implicit feedback. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1647–1651.
38. Yu, J., Zhu, H., Chang, C.-Y., Feng, X., Yuan, B., He, X., & Dong, Z. (2020). Influence function for unbiased recommendation. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1929–1932.
39. Wang, Y., Liang, D., Charlin, L., & Blei, D. M. (2020). Causal inference for recommender systems. In: Fourteenth ACM Conference on Recommender Systems, pp. 426–431.

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
