
Instructor’s Solution Manual for “Outlier Ensembles: An Introduction”

Charu C. Aggarwal and Saket Sathe
IBM T. J. Watson Research Center
Yorktown Heights, NY

August 31, 2016

Contents

1 An Introduction to Outlier Ensembles
2 Theory of Outlier Ensembles
3 Variance Reduction in Outlier Ensembles
4 Bias Reduction in Outlier Ensembles: The Guessing Game
5 Model Combination in Outlier Ensembles
6 Which Outlier Detection Algorithm Should I Use?

Chapter 1

An Introduction to Outlier Ensembles

1. Consider a randomized algorithm that predicts whether a data point is an outlier or non-outlier correctly with a probability of 0.6. Furthermore, all executions of this algorithm are completely independent of one another. A majority vote of 5 independent predictions of the algorithm is used as the final prediction. What is the probability that the correct answer is returned for a given data point?

The correct answer is returned if the prediction is correct a majority of the time. Therefore, we need to add the probabilities that the prediction is correct exactly 3 times, 4 times, and 5 times. The probability of being correct three times is $\binom{5}{3}(0.6)^3(0.4)^2 = 0.3456$. The probability of being correct four times is $\binom{5}{4}(0.6)^4(0.4) = 5(0.6)^4(0.4) = 0.2592$. The probability of being correct five times is $(0.6)^5 = 0.07776$. The sum of these values is 0.68256. Note that the prediction accuracy has increased from 0.6 to about 0.68 by using five ensemble components.

2. Discuss the similarities and differences between rare-class detection and outlier detection. What do these similarities imply in terms of being able to adapt the ensembles for rare-class detection to outlier analysis?

The outlier detection problem is similar to the rare-class detection problem, except that the labels are unobserved. This means that any ensemble algorithm that does not explicitly use the ground truth can be generalized to outlier detection. Therefore, most variance-reduction techniques like bagging and subsampling can be generalized to outlier detection.

3. Discuss how one can use clustering ensembles to measure similarities between data points. How can these computed similarities be useful for outlier detection?

One can repeatedly apply a randomized clustering algorithm to a data set and compute the fraction of times that two points lie in the same cluster. Such an approach provides a data-dependent similarity measure, and it can be used for distance-based outlier detection.
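The arithmetic in Exercise 1 of this chapter can be checked numerically. The following is a minimal sketch; the function name `majority_vote_accuracy` and its parameters are illustrative choices rather than anything defined in the book, and the function assumes an odd number of votes so that ties cannot occur.

```python
from math import comb


def majority_vote_accuracy(p_correct, num_votes):
    """Probability that a strict majority of independent votes is correct.

    Assumes num_votes is odd, so ties cannot occur.
    """
    min_correct = num_votes // 2 + 1
    return sum(
        comb(num_votes, k) * p_correct**k * (1 - p_correct) ** (num_votes - k)
        for k in range(min_correct, num_votes + 1)
    )


accuracy = majority_vote_accuracy(0.6, 5)
print(f"{accuracy:.5f}")  # prints 0.68256
```

Calling the same function with larger odd values of `num_votes` shows how the accuracy keeps improving as more independent components are added, as long as each component is correct with probability above 0.5.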

Chapter 2

Theory of Outlier Ensembles

1. Consider a randomized outlier detection algorithm, g(X, D), which is almost ideal in the sense that it correctly learns the function f(X) most of the time. The value of f(X) is known to be finite. At the same time, because of a small bug in the program, the randomized detector g(X, D) outputs an ∞ score about 0.00001% of the time. Furthermore, every test point is equally likely to receive such a score, although this situation occurs only 0.00001% of the time for any particular test instance. What is the model-centric bias of the bug-infested base detector g(X, D)?

Even though each run of the base detector is extremely accurate, the model-centric bias of such a base detector is ∞. This is because each point receives an infinite score in expectation, whereas the true function to be learned is finite.

2. Would you recommend running the randomized base detector multiple times, and averaging the predictions of the test instance? How about using the median?

I would not recommend running the randomized base detector multiple times and averaging the scores. This is because a single infinite score makes the average infinite, so running the base detector more times only leads to more points receiving infinite scores. However, the use of the median can lead to more robust predictions, because the effect of the infinite scores would be minimal in such a case.

3. Does the data-centric variance of an average k-nearest neighbor outlier detector increase or decrease with k? What about the bias?

The variance of an average k-nearest neighbor detector always decreases with k because of the increased number of points over which the distance is calculated. The bias is unpredictable and might increase or decrease with k.
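The contrast between averaging and the median in Exercise 2 can be illustrated with a small sketch. The helper names `prob_any_infinite` and `combine` are hypothetical, and the score values are made up for illustration; the only quantity taken from the exercise is the bug rate of 0.00001%, that is, 1e-7.

```python
import math
import statistics


def prob_any_infinite(bug_rate, num_runs):
    """Probability that at least one of num_runs independent runs returns infinity."""
    return 1.0 - (1.0 - bug_rate) ** num_runs


def combine(scores):
    """Return (mean, median) of the scores from repeated runs on one test point."""
    mean_score = sum(scores) / len(scores)
    median_score = statistics.median(scores)
    return mean_score, median_score


# Five runs on one test point; one run hit the bug (illustrative values).
scores = [0.42, 0.40, math.inf, 0.41, 0.43]
mean_score, median_score = combine(scores)
print(mean_score, median_score)

# The chance that the average of a point is corrupted grows with the number of runs.
for num_runs in (1, 100, 10000):
    print(num_runs, prob_any_infinite(1e-7, num_runs))
```

A single infinite run makes the mean infinite, whereas the median stays finite unless at least half of the runs are affected, which is why more runs hurt the average but not the median.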


Chapter 3

Variance Reduction in Outlier Ensembles

1. The variable subsampling box-plots in the chapter show that one obtains a much larger improvement with LOF as compared to the average k-nearest neighbor detector. Explain the possible reasons for this phenomenon.

The average k-nearest neighbor detector is a more stable detector than LOF. As a result, there is very little variance to reduce in the first place. Therefore, the box-plots are thin, and there is very little jump in the performance.

2. Explain the intuitive similarity between the isolation forest and density-based detectors.

In an isolation tree, the volume of a node reduces by an expected factor of 2 in each iteration. Furthermore, a leaf node contains the fraction of the expected maximal volume expected to contain a single point. Therefore, the number of splits can be viewed as a surrogate for the log-likelihood density.

3. Show how both random forests and isolation forests can be used to compute similarities between pairs of points. What is the difference in these computed similarities? How does this result relate an isolation forest to distance-based detectors?

In an isolation forest, the similarity between a pair of points is equal to the average length, over the trees of the forest, of the common path shared by the two points from the root. A similar approach can be used for random forests, and the result is shown in Breiman and Cutler’s original manual [1] on the random forest. The main difference is that random forests are constructed in a supervised way, and therefore the distances are supervised as well. This also directly relates the isolation forest to distance-based detectors because the length of a path to isolate a point is reported as the outlier score.

[1] https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf

4. Implement an expectation maximization clustering algorithm in which the log-likelihood fit of a point is reported as the outlier score. Implement an ensemble variant in which the scores over multiple instantiations of the algorithms with different numbers of mixture components are averaged.

This is an implementation exercise.
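One possible starting point for Exercise 4 is sketched below. It is a simplified illustration under several assumptions that are not taken from the book: the data are one-dimensional, the mixture is Gaussian, the negative log-likelihood is used so that larger scores mean more outlying points, and a variance floor `min_var` guards against degenerate components. All function names are hypothetical.

```python
import math
import random
import statistics


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, num_components, rng, num_iters=50, min_var=1e-2):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    n = len(data)
    means = rng.sample(data, num_components)  # random data points as initial means
    start_var = max(statistics.pvariance(data), min_var)
    variances = [start_var] * num_components
    weights = [1.0 / num_components] * num_components
    for _ in range(num_iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            if total > 0.0:
                resp.append([d / total for d in dens])
            else:  # density underflowed; fall back to uniform responsibilities
                resp.append([1.0 / num_components] * num_components)
        # M-step: re-estimate weights, means, and variances.
        for c in range(num_components):
            mass = sum(r[c] for r in resp)
            weights[c] = mass / n
            if mass > 1e-12:
                means[c] = sum(r[c] * x for r, x in zip(resp, data)) / mass
                spread = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / mass
                variances[c] = max(spread, min_var)
    return weights, means, variances


def neg_log_likelihood(x, model):
    """Outlier score of x: negative log of the mixture density (higher = more outlying)."""
    weights, means, variances = model
    density = sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return -math.log(max(density, 1e-300))


def em_ensemble_scores(data, component_counts, seed=0):
    """Average the scores of mixtures fitted with different numbers of components."""
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    for num_components in component_counts:
        model = fit_gmm_1d(data, num_components, rng)
        for i, x in enumerate(data):
            totals[i] += neg_log_likelihood(x, model)
    return [t / len(component_counts) for t in totals]


# Illustrative usage on synthetic data with one planted outlier at the end.
data_rng = random.Random(42)
sample = [data_rng.gauss(0.0, 1.0) for _ in range(30)] + [8.0]
ensemble_scores = em_ensemble_scores(sample, component_counts=[1, 2, 3])
```

A mixture with many components can dedicate a narrow component to an isolated point and thereby give it a high likelihood. This is one reason the sketch uses a variance floor and averages over several component counts instead of relying on a single fit.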

Chapter 4

Bias Reduction in Outlier Ensembles: The Guessing Game

1. Suppose you had an oracle, which could tell you the ROC AUC of any particular outlier detection algorithm and you are allowed only 10 queries to this oracle. Show how you can combine this oracle with a subsampling method to reduce bias. [The answer to this question is not unique.]

Consider the case in which one uses a k-nearest neighbor detector at a particular value of k. The bias will depend on the subsampling rate. One can try 10 different subsampling rates, query the oracle once for each resulting ensemble, and pick the result with the lowest ensemble error (that is, the highest ROC AUC).

2. Implement the biased subsampling method discussed in this chapter.

This is an implementation exercise.
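The answer to Exercise 1 can be sketched as follows. This is only one of many possible designs. The oracle is simulated here from hidden labels, and `make_oracle`, `subsample_ensemble_scores`, and `pick_subsampling_rate` are hypothetical names; the detector is assumed to be an average k-nearest neighbor detector whose neighbors are drawn from a random subsample.

```python
import math
import random


def subsample_ensemble_scores(data, rate, k, num_rounds, rng):
    """Average k-NN distance scores, averaged over num_rounds random subsamples."""
    n = len(data)
    size = max(k + 1, round(rate * n))  # always leave at least k neighbors
    totals = [0.0] * n
    for _ in range(num_rounds):
        sample = rng.sample(range(n), size)
        for i in range(n):
            dists = sorted(math.dist(data[i], data[j]) for j in sample if j != i)
            totals[i] += sum(dists[:k]) / k
    return [t / num_rounds for t in totals]


def make_oracle(labels, max_queries=10):
    """Simulated oracle: reports the ROC AUC of a score vector, with a query budget."""
    state = {"queries": 0}

    def oracle(scores):
        if state["queries"] >= max_queries:
            raise RuntimeError("oracle query budget exhausted")
        state["queries"] += 1
        pos = [s for s, y in zip(scores, labels) if y == 1]
        neg = [s for s, y in zip(scores, labels) if y == 0]
        # AUC = fraction of (outlier, inlier) pairs ranked correctly; ties count half.
        wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
        return wins / (len(pos) * len(neg))

    return oracle, state


def pick_subsampling_rate(data, oracle, rates, k=3, num_rounds=20, seed=0):
    """Spend one oracle query per candidate rate; return (best_auc, best_rate)."""
    rng = random.Random(seed)
    results = []
    for rate in rates:
        scores = subsample_ensemble_scores(data, rate, k, num_rounds, rng)
        results.append((oracle(scores), rate))
    return max(results)


# Illustrative usage: 40 Gaussian inliers and 3 planted outliers.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
points += [(6.0, 6.0), (-7.0, 5.0), (8.0, -6.0)]
labels = [0] * 40 + [1] * 3
rates = [i / 10 for i in range(1, 11)]  # exactly 10 candidates, one query each
oracle, state = make_oracle(labels)
best_auc, best_rate = pick_subsampling_rate(points, oracle, rates)
```

Because every candidate rate costs exactly one query, the 10-query budget is used up by the 10 rates, and no further tuning (for example of k) is possible without additional queries.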


Chapter 5

Model Combination in Outlier Ensembles

1. Discuss how the factorized consensus scheme is a generalization of the averaging scheme for outlier ensembles.

The factorized consensus approach weights the different models based on the learned factors. The averaging scheme simply uses weights of one unit.

2. Implement an algorithm that takes as input a set of scores from a set of algorithms and then (i) normalizes them using standardization, (ii) combines them using the averaging and maximization schemes, and (iii) combines them using the AOM and Thresh schemes.

This is an implementation exercise.
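A compact sketch of Exercise 2 is given below. The definitions used here are assumptions that should be checked against the chapter: AOM is taken to mean that the detectors are randomly divided into buckets, the maximum is taken within each bucket, and the bucket maxima are averaged; Thresh is taken to mean that standardized scores below a threshold (0 by default) are raised to that threshold before being summed. All function names are illustrative.

```python
import random
import statistics


def standardize(scores):
    """Convert one detector's scores to Z-scores (zero mean, unit variance)."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0.0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]


def combine_average(z_scores):
    """z_scores: one list per detector; returns the per-point average."""
    return [statistics.fmean(column) for column in zip(*z_scores)]


def combine_max(z_scores):
    """Per-point maximum over all detectors."""
    return [max(column) for column in zip(*z_scores)]


def combine_aom(z_scores, num_buckets, rng):
    """Average-of-maximum: max within random buckets, then average across buckets."""
    order = list(range(len(z_scores)))
    rng.shuffle(order)
    buckets = [order[b::num_buckets] for b in range(num_buckets)]
    bucket_maxima = [
        combine_max([z_scores[i] for i in bucket]) for bucket in buckets if bucket
    ]
    return combine_average(bucket_maxima)


def combine_thresh(z_scores, threshold=0.0):
    """Raise scores below the threshold to the threshold, then sum per point."""
    return [sum(max(v, threshold) for v in column) for column in zip(*z_scores)]


# Illustrative usage: three detectors scoring four points (made-up values).
raw_scores = [
    [1.0, 2.0, 3.0, 10.0],
    [2.0, 2.0, 2.0, 8.0],
    [0.0, 1.0, 0.0, 5.0],
]
z = [standardize(s) for s in raw_scores]
averaged = combine_average(z)
maximized = combine_max(z)
aom = combine_aom(z, num_buckets=2, rng=random.Random(0))
thresh = combine_thresh(z)
```

With a single bucket, AOM reduces to the maximization scheme, and with one detector per bucket it reduces to averaging, so the number of buckets controls where the combination sits between the two.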


Chapter 6

Which Outlier Detection Algorithm Should I Use?

1. The shared-nearest neighbor distance is equal to the number of neighbors among the k-nearest neighbors that are shared by two points. Discuss why the shared nearest neighbor distance is locality sensitive, and can be used with the exact k-nearest neighbor detector as an alternative to LOF. What is the disadvantage of this method in terms of the number of parameters and the computational complexity?

The shared nearest neighbor distance is sensitive to the density of the data locality. Two points have to be very close (according to Euclidean distance) in dense regions in order to share any neighbors, whereas they can be further away in sparse regions in order to share neighbors. This is the reason that one can pair the shared nearest-neighbor distance with an exact k-nearest neighbor detector as an alternative to LOF. One problem in terms of parametrization is that one now has two parameters corresponding to the distance computation and the actual implementation of the algorithm. Furthermore, the computation of the distances between a pair of points is O(n) Euclidean distance computations instead of O(1) distance computations, which slows down the approach.

2. Use ensembles to design an efficient way to compute the shared-nearest neighbor distance between two points.

One can compute the shared nearest neighbor distances with respect to a subsample of the points. The shared nearest neighbors are computed for all pairs of points, but the neighbors are counted only among the subsamples. Different subsamples are repeatedly drawn and the scores are averaged.

3. Discuss how one can use the kernel Mahalanobis method to perform outlier detection in arbitrary data types.

Once the similarity matrix between pairs of objects has been constructed, one can use exactly the same steps discussed in the chapter. The only restriction on the similarity matrix is that it should be positive semidefinite.

4. Propose an ensemble-centric variant of subsampling for time-series data.

In this case, the values at individual time-points are sampled. Any outlier detection method can be applied to the subsampled time-series data.
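The subsampling idea in the answer to Exercise 2 of this chapter can be sketched as follows. The function names, the tie-breaking rule, and the choice to exclude a point from its own neighbor list are assumptions made for illustration, not details given in the book.

```python
import math
import random


def knn_in_subsample(points, query_index, subsample, k):
    """Indices of the k subsample points nearest to the query, excluding the query itself."""
    candidates = [j for j in subsample if j != query_index]
    # Ties in distance are broken by index so that the result is deterministic.
    candidates.sort(key=lambda j: (math.dist(points[query_index], points[j]), j))
    return set(candidates[:k])


def snn_ensemble(points, i, j, k, subsample_size, num_rounds, seed=0):
    """Average number of shared k-nearest neighbors of points i and j over random subsamples."""
    rng = random.Random(seed)
    total_shared = 0
    for _ in range(num_rounds):
        subsample = rng.sample(range(len(points)), subsample_size)
        neighbors_i = knn_in_subsample(points, i, subsample, k)
        neighbors_j = knn_in_subsample(points, j, subsample, k)
        total_shared += len(neighbors_i & neighbors_j)
    return total_shared / num_rounds


# Illustrative usage on synthetic 2-D data.
data_rng = random.Random(7)
cloud = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(50)]
shared = snn_ensemble(cloud, 0, 1, k=5, subsample_size=20, num_rounds=25)
```

Each round needs distances only from the two query points to the subsample, so the cost per pair scales with the subsample size rather than with the full data size.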

http://www.springer.com/978-3-319-54764-0