FlexSketch: Estimation of Probability Density for Stationary and Non-Stationary Data Streams

Efficient and accurate estimation of the probability distribution of a data stream is an important problem in many sensor systems. It is especially challenging when the data stream is non-stationary, i.e., its probability distribution changes over time. Statistical models for non-stationary data streams demand agile adaptation for concept drift while tolerating temporal fluctuations. To this end, a statistical model needs to forget old data samples and to detect concept drift swiftly. In this paper, we propose FlexSketch, an online probability density estimation algorithm for data streams. Our algorithm uses an ensemble of histograms, each of which represents a different length of data history. FlexSketch updates each histogram for a new data sample and generates probability distribution by combining the ensemble of histograms while monitoring discrepancy between recent data and existing models periodically. When it detects concept drift, a new histogram is added to the ensemble and the oldest histogram is removed. This allows us to estimate the probability density function with high update speed and high accuracy using only limited memory. Experimental results demonstrate that our algorithm shows improved speed and accuracy compared to existing methods for both stationary and non-stationary data streams.


Introduction
Estimating the probability density function (PDF) of a random variable based on a stream of data samples from sensors is a fundamental problem arising in a broad range of areas such as machine learning [1], data structures [2], and systems [3]. There are two recent challenges in this problem.
First, the characteristics of a data stream often change. This might be due to the accidental fluctuation caused by an insufficient number of samples. In this case, a sufficient amount of data could reduce the fluctuation gradually. In some other cases, the data stream itself is non-stationary. In other words, the probability distribution of a random variable over the data stream, called concept, changes over time, which is called concept drift [4]. Concept drift occurs in many types of data such as temporal sensor data [5], video [6], and spatiotemporal data [7]. A static model constructed with the assumption of stationarity of the data stream may lead to an erroneous conclusion under the presence of concept drift. Therefore, there is a need for a method that can estimate PDFs adaptively according to concept drift for practical applications.
Second, real-world applications need to be able to handle increasing amounts of data and high-speed data streams while keeping low latency. Therefore, the demand for an online algorithm to estimate PDFs with high speed and high accuracy using only a small amount of memory is ever-increasing. While there exist online PDF estimation algorithms in literature [8][9][10][11], they usually cannot update probability models at high speed. Furthermore, they cannot adapt well to various types of data streams including those with concept drift.
In order to deal with these challenges, we propose FlexSketch, which is an online probability density estimation algorithm that achieves high update speed and high accuracy with only a small amount of memory for both stationary and non-stationary data streams. As shown in Figure 1, FlexSketch estimates the PDF by using an ensemble structure composed of several statistical models. In particular, we use histograms as the statistical models, which allow fast and low-memory operations for data streams. Each histogram represents statistics of a different length of data history. FlexSketch updates each histogram for a new data sample and builds a new histogram when it detects concept drift in the input data. By decoupling the update of each statistical model from the amendment of the composition of the ensemble, FlexSketch achieves high accuracy both for stationary and non-stationary data streams. To adapt to non-stationary data streams in an efficient way, a single operation updating the estimated PDF according to the data stream is divided into two elementary operations: "fast and minor update operation" (MINORUPDATE) and "slow and major update operation" (MAJORUPDATE). The type of operation to be performed varies depending on the characteristics of the data stream. When minor changes occur in a data stream, FlexSketch updates the PDF for the data stream at high speed by using MINORUPDATE. This operation simply updates each model in FlexSketch. In contrast, when major changes occur in the data stream, FlexSketch updates the PDF by using MAJORUPDATE, which builds a new model including the recent data stream and adds it to FlexSketch. As a result, FlexSketch has multiple versions of a model, ranging from a version representing only recent data to a version representing both recent and old data. FlexSketch constructs the PDF as a linear combination of these coarse models.
FlexSketch dynamically decides when to forget old data and to build a new statistical model by measuring the divergence between the current model and recently sampled data. This allows FlexSketch to stay stable when the concept of the data stream does not change, and to tolerate temporary outliers. In addition, FlexSketch achieves agile adaptation to sudden or incremental concept drift since MAJORUPDATE integrates a newly built histogram, which amends the statistical model. Though the histogram is a compact data structure and easy to maintain, it may provide only coarse information about the probability distribution. FlexSketch alleviates this problem by incorporating an ensemble of histograms.
The current implementation of FlexSketch only supports one-dimensional data. There are many applications relying on statistical modeling of one-dimensional data streams such as online anomaly detection [5,12], fault-detection [13] and DDoS detection [14], which are potential areas where FlexSketch is applicable. While many online sensor-based applications handle one-dimensional data, other applications need to deal with multi-dimensional data streams. Since the limited dimension of FlexSketch circumscribes the application of FlexSketch in some areas such as machine learning for high dimensional data [15] and anomaly detection based on multivariate data [16], FlexSketch needs to be expanded to handle multi-dimensional data for wider applications.
The experimental results demonstrate that FlexSketch updates PDFs for data streams up to 16× faster than the alternatives while using only a limited amount of memory. Moreover, FlexSketch adapts well to various types of concept drift, and it is more accurate than the alternatives.
The main contributions of FlexSketch are as follows:
• We propose a new method to estimate probability distribution for data streams with concept drift.
• FlexSketch decouples adapting to concept drift from adjusting the statistical model for stationary data by incorporating two separate operations.
• FlexSketch achieves low computational overhead and high throughput, which are critical for processing of stream data, using an ensemble of compact histograms.
The remainder of the paper is organized as follows. Section 2 briefly surveys the related work. In Section 3, the proposed FlexSketch is described in detail. Section 4 presents extensive experimental results. Finally, the conclusions are given in Section 5.

Related Work
There are multiple research topics that deal with data streams with concept drift [17]. Supervised classification of data streams with concept drift has also been studied extensively, e.g., [18][19][20][21]. The core questions in these studies are how to detect concept drift, how to forget old data, and how to rebuild a new statistical model. Some methods [9,22] gradually update the statistical model without explicit detection of concept drift. Other studies attempt to detect concept drift with batch-based methods [23][24][25][26] and online methods [27][28][29]. While [30,31] rely on process control, FlexSketch detects concept drift using a multiple-window-based method like [32]. Researchers have also focused on how to measure the difference in distribution between recent data and old data. The common methods are based on entropy or KL divergence [33][34][35]. We introduce an error-based metric to detect concept drift in Section 3.1.
To deal with non-stationary data, statistical models should be able to forget old data or to depreciate their contribution. Some methods [9,36,37] decay the importance of old data linearly or exponentially. This approach is good for tolerating temporal fluctuations. Since gradual decaying is slow to adapt to sudden concept drift, [38,39] use sliding window mechanisms to keep some recent data and to discard old data. FlexSketch deploys both gradual decaying and abrupt forgetting. When the input data is stationary, FlexSketch depreciates old data exponentially. However, when FlexSketch detects concept drift, it discards the oldest histogram and incorporates a new histogram, which allows FlexSketch agile adaptation.
There exist many kinds of density estimation algorithms for data streams. Traditionally, kernel density estimation is performed as a batch-processing algorithm for density estimation of a dataset. There are some algorithms [9][10][11] generalized to online processing for adaptive density estimation. They usually constitute a Gaussian mixture model by assigning a Gaussian kernel to newly added data and thereafter merge kernels based on certain rules. Particularly, the online kernel density estimation (denoted by oKDE in this paper) [9] can adapt to concept drift by enabling to forget past data. However, this method is slow in updating the estimated probability density due to the requirement of relatively massive calculation. In contrast, we focus on developing an efficient mechanism that can update the probability density adaptively by exploiting updating operations having different levels of computational complexity.
There also exist methods to estimate the distribution of a data stream based on histograms, including the streaming parallel decision tree (denoted by SPDT in this paper) [8], variations of the V-optimal histogram algorithm [40,41] and quantile summarization algorithms [42][43][44][45]. However, they cannot forget the contribution of the past data and adapt to various types of concept drift. To solve this problem, model maintenance strategies using fixed and variable size sliding windows [46][47][48] for histograms can be used. [7] proposes a histogram-based sketch mechanism with gradual forgetting. However, it is unknown whether they guarantee satisfactory performance when different types of concept drift occur. In contrast, FlexSketch efficiently updates the statistical model for both high accuracy and high efficiency by amending the composition of the ensemble.
There are some studies using ensemble methods, e.g., [49][50][51]. While some methods [33,34] use a hierarchical structure, FlexSketch uses a flat combination of compact and simple data structures like [19,52]. Most previous works focus on improving the accuracy of supervised classification for data streams with concept drift. In contrast, FlexSketch uses the ensemble technique to improve the update speed of density estimation.

Proposed Method
The goal of our method is to estimate the PDF of stationary (i.e., without concept drift) and non-stationary (i.e., with concept drift) data stream at high speed and high accuracy while using a small amount of memory. Here, the meaning of accurately estimating the PDF for a stationary data stream is straightforward. On the other hand, for a non-stationary case, it is not simple and has several aspects. We consider the accuracy of density estimation for a non-stationary data stream from three points of view. First, the estimated PDF should forget old concepts quickly after concept drift occurs. Second, the estimated PDF should adapt to the latest concept as soon as a concept drift occurs. Third, the estimated PDF should remain stable even if an accidental outlier occurs in the data stream.
To fulfil these requirements, the proposed method is built on the following ideas. (a) Our method uses an ensemble data structure consisting of several histograms. (b) The ensemble structure can compensate for the inaccuracy of a single histogram. (c) We design two adaptation techniques. If the data stream is stationary or there are only minor changes in it, FlexSketch updates the models, i.e., histograms. On the other hand, if there are major changes in the data stream, updating the models may not guarantee sufficient accuracy. To address this issue, we generate a new model that represents the changed data stream and add it to the data structure.
Let $S$ denote the data structure of the FlexSketch framework. $S$ consists of a recent data stream and multiple versions of a statistical model as follows:

$$S = \{Q, M_1, \dots, M_{N_M}, n_1, \dots, n_{N_M}\} \quad (1)$$

where $Q$ is the buffer for the recent dataset given through the input data stream, $M_i$ is the $i$th histogram, $N_M \geq 2$ is the total number of histograms, and $n_i$ is the number of data used to update $M_i$. Since we build a new histogram when a major change in the data stream is detected, the histograms in $S$ are created at different times. As a convention, $M_1$ is the most recently added one and $M_{N_M}$ is the oldest one. This means that $M_{i+1}$ is older than $M_i$ and thus undergoes more updates. Therefore, $n_{i+1} > n_i$. Each histogram is a set of disjoint intervals called bins ($I_j$) and the frequency count ($m_j$) for each bin, where $j = 1, \dots, N_B$:

$$M = \{(I_1, m_1), \dots, (I_{N_B}, m_{N_B})\} \quad (2)$$

Two important operations of FlexSketch are (a) the update operation for the data stream online and (b) the query operation to obtain the probability for a certain data, which are explained below.
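The data structure described above can be pictured with a minimal Python sketch. The class and field names (`Histogram`, `Sketch`, `edges`, `counts`, `buffer`, `models`, `n_seen`) are hypothetical and chosen only for illustration; representing the bins $I_j$ by a sorted list of boundaries is likewise an assumption, not the layout of the published implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Histogram:
    """One model M: N_B disjoint bins I_j with frequency counts m_j."""
    edges: List[float]   # N_B + 1 increasing boundaries; bin j is [edges[j], edges[j+1])
    counts: List[float]  # m_j, one entry per bin


@dataclass
class Sketch:
    """S = {Q, M_1..M_NM, n_1..n_NM}; list index 0 holds the newest model M_1."""
    buffer: List[float] = field(default_factory=list)      # Q
    models: List[Histogram] = field(default_factory=list)  # M_1 .. M_NM
    n_seen: List[int] = field(default_factory=list)        # n_1 .. n_NM

    def is_consistent(self) -> bool:
        # At least two models, one counter per model, and older models saw more data.
        same_len = len(self.models) == len(self.n_seen)
        ordered = all(a < b for a, b in zip(self.n_seen, self.n_seen[1:]))
        return same_len and len(self.models) >= 2 and ordered
```

The `is_consistent` check only encodes the two invariants stated in the text, $N_M \geq 2$ and $n_{i+1} > n_i$.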

Update Operation
Algorithm 1 summarizes the operation to update $S$ with a given data sub-stream $X$ whose number of data is $|X|$. First, $S$ is updated at a minor level using operation MINORUPDATE and $X$ is appended to the buffer. If the size of the buffer $Q$ exceeds a threshold $N_Q$, i.e., if only the minor update has been performed with a certain number of data, the adequacy of the most recently added model $M_1$ is examined by operation DIAGNOSE. If the operation finds a large discrepancy between the recent data stream stored in $Q$ and $M_1$, the major update operation (MAJORUPDATE) is performed and the buffer is cleared.

Algorithm 1: Update operation.
Input: $S = \{Q, M_1, \dots, M_{N_M}, n_1, \dots, n_{N_M}\}$, sub-stream: $X$, size limit of $Q$: $N_Q$, number of models: $N_M$

The MINORUPDATE operation works as follows. First, it searches for the bin $I_j$ whose interval contains each datum $x$ in $X$. For fast search, we use a red-black tree. Second, it increases the count $m_j$ of $I_j$ by 1. If there is no bin for $x$, $x$ is ignored. This causes a discrepancy between the histogram and $X$, which is resolved by the MAJORUPDATE operation.
The DIAGNOSE operation measures the amount of discrepancy between the most recent model $M_1$ and the data stream in $Q$. If the output of the operation is larger than a threshold $\gamma$, we consider that concept drift occurs and the MAJORUPDATE operation needs to be performed.
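The bin search and count increment of the minor update can be sketched as follows. This is only an illustration under stated assumptions: the bins are stored as a sorted list of boundaries and located with binary search from Python's `bisect` module, which stands in for the red-black tree mentioned in the text, and the function name `minor_update` is hypothetical.

```python
from bisect import bisect_right


def minor_update(edges, counts, xs):
    """Increment the count of the bin containing each x.

    edges  : sorted bin boundaries; bin j covers [edges[j], edges[j+1])
    counts : frequency counts m_j, modified in place
    Returns the number of samples that fell outside every bin and were ignored.
    """
    ignored = 0
    for x in xs:
        if x < edges[0] or x >= edges[-1]:
            ignored += 1  # no bin for x: the sample is skipped
            continue
        j = bisect_right(edges, x) - 1  # index of the bin whose interval contains x
        counts[j] += 1
    return ignored
```

Both a balanced tree and binary search over a sorted array give logarithmic lookup time in the number of bins, which is why the substitution does not change the character of the operation.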
There are some requirements for the DIAGNOSE operation. First, its result should be invariant under scaling transformation of the data, so that the threshold is independent of the scale of the data. Second, the result of DIAGNOSE should be stable even when only a small number of data are given; otherwise, MAJORUPDATE is performed too frequently, which results in increased computational complexity, and the PDF estimation becomes inaccurate. Third, the result of DIAGNOSE must be finite even if the input dataset is not included in the domain of the histogram; otherwise, the result will diverge whenever accidental outliers deviate from the domain.
To design a DIAGNOSE operation satisfying these requirements, we first define the error function $\Delta(x)$ between $X$ and $M$ as the absolute difference between the cumulative distribution function (CDF) of $M$, $\mathrm{CDF}_M(x)$, and the empirical distribution function (EDF) for $X$, $\mathrm{EDF}_X(x)$ (Figure 2a):

$$\Delta(x) = \left| \mathrm{CDF}_M(x) - \mathrm{EDF}_X(x) \right| \quad (3)$$

Then, a representative value of the error function serves as the output of the DIAGNOSE operation. We consider two options, i.e., the maximum value given by

$$\check{\varepsilon} = \max_x \Delta(x) \quad (4)$$

and the mean value given by

$$\bar{\varepsilon} = \int_0^1 \Delta \, p(\Delta) \, d\Delta \quad (5)$$

where $p(\Delta)$ is the PDF for $\Delta$. $\check{\varepsilon}$ or $\bar{\varepsilon}$ may be used as the result of DIAGNOSE directly. Note that since $\Delta(x)$ is the absolute difference between two cumulative distributions, it can take values only between 0 and 1, which is the range of the integration. However, we note that the range of $\check{\varepsilon}$ and $\bar{\varepsilon}$, which is between 0 and 1, is too narrow for practical use. Thus, for convenience, we scale them to obtain the final output of DIAGNOSE as follows:

$$\delta = \frac{\varepsilon}{1 - \varepsilon} \quad (6)$$

Note that $\delta = 0$ for $\varepsilon = 0$, $\delta \approx \varepsilon$ for small $\varepsilon$ (i.e., $\varepsilon \ll 1$), and $\delta = \infty$ for $\varepsilon = 1$. Therefore, the output of DIAGNOSE ranges from 0 to $\infty$ through this scaling.

The MAJORUPDATE operation is shown in lines 18 to 21 in Algorithm 1. It first builds a new histogram $M_1$ with $X$ through operation BUILD in order to accommodate a big change (i.e., concept drift) in the data stream. Then, the new histogram is enqueued to $S$ as the first model of the ensemble and the oldest one ($M_{N_M}$) is dequeued from $S$.
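The DIAGNOSE computation described above (CDF/EDF difference, a representative value, then the scaling $\delta = \varepsilon / (1 - \varepsilon)$) can be sketched in Python as follows. Several choices here are assumptions made for illustration: the histogram CDF is taken to be linear inside each bin, the error function is evaluated only at the buffered sample points, and the mean is a plain average over those points. The names `hist_cdf` and `diagnose` are hypothetical.

```python
from bisect import bisect_right


def hist_cdf(edges, counts, x):
    """CDF of a histogram at x, assuming uniform density inside each bin."""
    if x <= edges[0]:
        return 0.0
    if x >= edges[-1]:
        return 1.0
    j = bisect_right(edges, x) - 1
    frac = (x - edges[j]) / (edges[j + 1] - edges[j])
    return (sum(counts[:j]) + frac * counts[j]) / sum(counts)


def diagnose(edges, counts, xs, use_max=False):
    """Scaled discrepancy delta between a histogram and the samples xs."""
    xs = sorted(xs)
    n = len(xs)
    # Delta(x) = |CDF_M(x) - EDF_X(x)|, evaluated at each sample point.
    deltas = [abs(hist_cdf(edges, counts, x) - bisect_right(xs, x) / n) for x in xs]
    eps = max(deltas) if use_max else sum(deltas) / n
    # Scaling delta = eps / (1 - eps); eps = 1 maps to infinity.
    return float("inf") if eps >= 1.0 else eps / (1.0 - eps)
```

For example, a uniform histogram on [0, 1] diagnosed against the samples 5, 6, 7, 8 gives a mean error of 0.375 and a scaled output of 0.6, which would exceed a threshold such as $\gamma = 0.4$ and trigger a major update.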
The BUILD operation is shown in Algorithm 2. It basically creates a new histogram that covers the data ranges of both the existing histograms in S and the recent data, so that the new histogram can account for the characteristics of the recent data. First, the boundaries of the bins of the new histogram are obtained so as to uniformly split the range of the combined CDF (CDF of S and EDF of X) (lines 2 to 4 of Algorithm 2). Then, the EDF of X is used to obtain the count in each bin (lines 5 to 8 of Algorithm 2).
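A simplified version of the bin construction can be sketched as follows. It is not the BUILD operation of Algorithm 2: as a simplifying assumption, the boundaries are placed at equal-probability quantiles of the EDF of the recent data only, whereas the text combines the CDF of $S$ with the EDF of $X$. The function name `build_histogram` and the parameter `n_bins` are hypothetical.

```python
def build_histogram(xs, n_bins):
    """Equal-frequency histogram built from samples only (simplified sketch).

    Returns (edges, counts); the last bin is closed on the right so that the
    largest sample is counted.
    """
    xs = sorted(xs)
    n = len(xs)
    # Boundaries at the k / n_bins quantiles of the empirical distribution.
    edges = [xs[(k * n) // n_bins] for k in range(n_bins)] + [xs[-1]]
    # Drop repeated boundaries so that every bin has positive width.
    edges = sorted(set(edges))
    if len(edges) < 2:
        raise ValueError("need at least two distinct sample values")
    counts = [0] * (len(edges) - 1)
    j = 0
    for x in xs:
        # Advance to the bin containing x (samples are visited in sorted order).
        while j < len(counts) - 1 and x >= edges[j + 1]:
            j += 1
        counts[j] += 1
    return edges, counts
```

Splitting the probability range uniformly, rather than the value range, gives narrow bins where the data are dense and wide bins in the tails.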

Query Operation
The query operation obtains the probability of a certain input data using $S$, which has been established from the past data through the update operations explained above. As mentioned before, we employ an ensemble approach for this using the histograms contained in $S$. In other words, the probability of a given data $x$ is calculated from $S$ as a linear combination of the PDFs represented by the histograms $M_i$, i.e.,

$$p_S(x) = \sum_{i=1}^{N_M} \alpha_i \, p_{M_i}(x) \quad (7)$$

where $p_{M_i}(x)$ is the probability of $x$ from histogram $M_i$ and $\alpha_i$ is the weight of $M_i$. The former is given by the proportion of the data count for the bin to which $x$ belongs, i.e.,

$$p_M(x) = \frac{m_j}{|I_j| \sum_{k=1}^{N_B} m_k} \quad \text{for } x \in I_j \quad (8)$$

where $|I_j|$ is the size of an interval and we omit the subscript $i$ for simplicity. The weight $\alpha_i$ is determined in such a way that a newer histogram receives a higher weight. Then, the final probability (7) depends more on the histograms that have been built more recently. For this, we use the fact that an older histogram has been updated with more data, i.e., $n_{i+1} > n_i$. Specifically, the weight $\alpha_i$ is set to decrease with $n_i$, where the exponential function is used to ensure that the weight value remains positive and, at the same time, to forget an old concept exponentially (Equation (9)), with $n_0 = 0$. Here, $\lambda$ is a hyperparameter. Note that $\sum_{i=1}^{N_M} \alpha_i = 1$ due to the normalization and $1 > \alpha_i > \alpha_{i+1} > 0$ because $n_i < n_{i+1}$.
To see how this works, let us consider the situation where concept drift occurs continuously so that we can assume that $n = n_{i+1} - n_i$ and $\alpha = \alpha_{i+1} / \alpha_i = \exp(-n\lambda / N_Q)$ for all $i$. Let the PDF of concept $C_a$, which changes over time, be $p_{C_a}$ for positive integer $a$, with $C_1$ being the latest one. Then, $p_{M_1} = p_{C_1}$, $p_{M_2} = p_{C_1} + p_{C_2}$, ..., $p_{M_{N_M}} = p_{C_2} + \cdots$ since MINORUPDATE does not update the oldest model. Then, $p_S \propto (1 + \alpha + \cdots + \alpha^{N_M - 1}) \, p_{C_1} + (\alpha + \cdots + \alpha^{N_M}) \, p_{C_2} + \cdots$ holds. In other words, the contribution of a concept decreases at a rate of $\alpha$. This means that FlexSketch forgets an old concept exponentially.
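The query described above can be sketched in Python as a weighted sum of histogram densities. The weight formula used here, normalized $\exp(-\lambda n_i / N_Q)$, is an assumption: it is one form that is positive, sums to one, and decreases with $n_i$ as the text requires, but it is not taken from the original weight equation. The function names are hypothetical.

```python
import math
from bisect import bisect_right


def hist_pdf(edges, counts, x):
    """Density of one histogram at x: bin count / (total count * bin width)."""
    if x < edges[0] or x >= edges[-1]:
        return 0.0
    j = bisect_right(edges, x) - 1
    return counts[j] / (sum(counts) * (edges[j + 1] - edges[j]))


def weights(n_seen, lam, n_q):
    """Assumed weights: normalized exp(-lam * n_i / n_q), so newer models weigh more."""
    raw = [math.exp(-lam * n / n_q) for n in n_seen]
    total = sum(raw)
    return [r / total for r in raw]


def query(models, n_seen, lam, n_q, x):
    """Ensemble density: linear combination of the per-histogram densities."""
    alphas = weights(n_seen, lam, n_q)
    return sum(a * hist_pdf(e, c, x) for a, (e, c) in zip(alphas, models))
```

Here `models` is a list of `(edges, counts)` pairs ordered from newest to oldest, and `n_seen` holds the matching update counts $n_i$.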
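As a rough numerical illustration of this decay rate, the worked example below plugs in the parameter values reported in the implementation section ($N_Q = 30$, $\lambda = 2.5$). The choice $n = N_Q$, i.e., one major update per buffer fill, is an assumption made only for this illustration.

```latex
\alpha = \exp\!\left(-\frac{n\lambda}{N_Q}\right)
       = \exp\!\left(-\frac{30 \times 2.5}{30}\right)
       = e^{-2.5} \approx 0.082,
\qquad
\alpha^{2} = e^{-5} \approx 0.0067 .
```

Under these assumed values, a concept that is two major updates old would carry less than 1% of the weight of the latest one.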

Experiments
We evaluate the computation time, accuracy, and memory usage of the proposed method for various types of stationary and non-stationary data streams. In particular, we compare our method with the two representative existing density estimation algorithms, oKDE [9] and SPDT [8].

Datasets
Stationary datasets: We consider three distributions. The first is a standard normal distribution, $\mathcal{N}(0, 1)$, which appears frequently. The second is a bimodal distribution, $\tfrac{1}{2}\mathcal{N}(-2, 1) + \tfrac{1}{2}\mathcal{N}(2, 1)$, which is used to test if a density estimation algorithm can recognize multiple modes. The third is a log-normal distribution, $\ln \mathcal{N}(0, 1)$, which is used to test if an algorithm can estimate a long-tailed distribution. For all cases, one million data are randomly generated to follow the distributions.
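For readers who want to reproduce streams of this kind, the three stationary distributions can be sampled with Python's standard `random` module, as in the sketch below. The function names, the `kind` labels, and the seeding scheme are hypothetical; the sketch does not reproduce the exact data used in the experiments.

```python
import random


def stationary_sample(kind, rng):
    """Draw one sample from one of the three stationary distributions."""
    if kind == "normal":
        return rng.gauss(0.0, 1.0)
    if kind == "bimodal":
        # Equal-weight mixture of N(-2, 1) and N(2, 1).
        return rng.gauss(-2.0 if rng.random() < 0.5 else 2.0, 1.0)
    if kind == "lognormal":
        return rng.lognormvariate(0.0, 1.0)
    raise ValueError(kind)


def make_stream(kind, size, seed=0):
    """Generate a reproducible list of `size` samples."""
    rng = random.Random(seed)
    return [stationary_sample(kind, rng) for _ in range(size)]
```

Calling `make_stream("bimodal", 1_000_000)` would produce a stream of the same length as those described above.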
Non-stationary datasets: For non-stationary datasets, we consider three types of concept drift as follows. For each case, we use one million data randomly sampled from the distribution.

(a) Sudden concept drift is defined as the case where the distribution of the data stream changes suddenly. It is used to test how well a density estimation algorithm forgets old concepts after concept drift occurs. The underlying distribution is a normal distribution whose mean value changes abruptly, i.e., $\mathcal{N}(x(t), 1)$, where $x(t) = 0$ for $t < t_1$ and $x(t) = x_1$ for $t \geq t_1$. We consider $t_1 = 300$ and $x_1 = 5$, as shown in Figure 3a.
(b) Incremental concept drift is defined as the case where the distribution of the data stream changes incrementally. It is used to test how well a density estimation algorithm adapts to the latest concept. The underlying distribution is a normal distribution whose mean value moves at a constant speed, i.e., $\mathcal{N}(x(t), 1)$, where $x(t) = 0$ for $t \leq t_1$ and $x(t) = v_1 \cdot (t - t_1)$ for $t > t_1$. We set $t_1 = 300$ and $v_1 = 0.01$, as shown in Figure 3b.
(c) Blip concept drift is defined as the case where the distribution of the data stream suddenly changes and returns to the original state in a short time. It is used to test how well the estimated PDF remains stable even if an outlier occurs. The underlying distribution is a normal distribution whose mean value changes suddenly and returns, i.e., $\mathcal{N}(x(t), 1)$, where $x(t) = x_1$ for $t_1 < t \leq t_1 + \Delta t$ and $x(t) = 0$ otherwise. $\Delta t$ is the duration of blip concept drift, which is set to $\Delta t = 3$. We also set $x_1 = 5$ and $t_1 = 300$, as shown in Figure 3c.
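The three mean trajectories $x(t)$ defined above can be written as a single Python function. The function name `drift_mean`, the `kind` labels, and the parameter name `duration` are hypothetical; the default values are the ones stated in the text.

```python
def drift_mean(kind, t, t1=300, x1=5.0, v1=0.01, duration=3):
    """Mean x(t) of the unit-variance normal distribution at time t."""
    if kind == "sudden":
        # Jumps from 0 to x1 at t1 and stays there.
        return 0.0 if t < t1 else x1
    if kind == "incremental":
        # Moves away from 0 at constant speed v1 after t1.
        return 0.0 if t <= t1 else v1 * (t - t1)
    if kind == "blip":
        # Jumps to x1 for a short duration, then returns to 0.
        return x1 if t1 < t <= t1 + duration else 0.0
    raise ValueError(kind)
```

A sample at time `t` can then be drawn as `random.gauss(drift_mean(kind, t), 1.0)`.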

Implementation
The parameters of FlexSketch are selected such that it has similar accuracy to oKDE and SPDT for the stationary data streams as follows: $N_M = 3$, $N_Q = 30$, $\lambda = 2.5$, and $\gamma = 0.4$. FlexSketch is implemented in Scala, and the implementation is publicly available on GitHub [53]. For oKDE, we use the Java implementation available in [54]. For SPDT, we use the Scala implementation in [55]. Note that the accuracy of SPDT decreases when concept drift occurs since SPDT stores the frequencies of the entire data stream. To address this issue, we modify SPDT by using a sliding window, which is referred to as SPDTw. The window size is set to 100, for which SPDTw shows similar accuracy to FlexSketch for non-stationary data streams. This value was tuned such that SPDTw would exhibit similar accuracy to SPDT for stationary data. A larger window favors accuracy for stationary (or slowly changing) data streams over adaptation to sudden concept drift, whereas a smaller window has the opposite effect. Therefore, we calibrated these comparative methods for equitable results.
We perform all experiments on a machine with 4-core Intel CPU i7-7700K @ 4.2 GHz and 16 GB memory. The experiments run on a single thread. The version of Scala is 2.12.5 and the version of Java is 1.8.0.

Performance Metrics
Throughput: We evaluate the computation times of the update and query operations of FlexSketch in million operations per second (Mops), which indicates the number of times per second our benchmark operation can be executed. The JVM shows degraded performance in the first few iterations. Thus, we start to record the throughput after 20 warm-up iterations. Then, we record the mean value of the throughputs for the subsequent 30 iterations to minimize accidental deviations.
Error: We measure the discrepancy between the estimated PDF and the ground truth distribution. We adopt the scaled mean absolute error (scaled MAE) of the CDF, which is defined in (5) with the scaling in (6), i.e., $\delta = \bar{\varepsilon} / (1 - \bar{\varepsilon})$.

Adaptability: When concept drift occurs, the PDF estimated by a density estimation algorithm changes over time, and so does the error. Thus, the mean of the error for a given time interval is not a sufficient metric for the accuracy of the algorithm for non-stationary data streams. Instead, we measure the adaptability of the algorithm using how the error changes over time. For this, we introduce a damped harmonic oscillator model from classical mechanics (e.g., a vibrating mass connected to a spring under damping) to represent the change in the error of the density estimation. In other words, the stability against outliers is equivalent to the resistance force (or frictional force) and the UPDATE operation is equivalent to the restoring force. The density estimation algorithm tries to make the error smaller as the error increases and to keep the error unchanged when the error suddenly increases. Then, the governing equation for the time-dependent error $\delta(t)$ can be written as:

$$\left( \ddot{\delta}(t) + \ddot{\delta}_0(t) \right) + \underbrace{c \cdot \left( \dot{\delta}(t) + \dot{\delta}_0(t) \right)}_{\text{resistance force}} + \underbrace{k \cdot \delta(t)}_{\text{restoring force}} = 0 \quad (10)$$

where $\delta_0(t)$ is the error between before and after the data distribution changes, $\dot{\delta}$ and $\ddot{\delta}$ are the first- and second-order time derivatives of $\delta$, respectively, and $k$ and $c$ are model coefficients. $k$ and $c$ are determined by fitting the observed values of $\delta(t)$ and $\delta_0(t)$ to the model (10) under the assumption of over-damped oscillation (i.e., $(c/2)^2 > k$). The solution is given by (11), where $A_1$ and $A_2$ are constants. Based on the fitted model, the following four performance metrics are derived.
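The warm-up-then-measure protocol described above can be sketched as follows. This is a generic Python illustration, not the benchmark harness used in the experiments (which run on the JVM, where warm-up matters because of just-in-time compilation); the function name `mean_throughput_mops` and its parameters are hypothetical.

```python
import time


def mean_throughput_mops(op, batch, warmup=20, measured=30):
    """Mean throughput of `op` in million operations per second.

    The first `warmup` iterations over `batch` are discarded; the mean is
    taken over the following `measured` iterations.
    """
    for _ in range(warmup):
        for x in batch:
            op(x)
    rates = []
    for _ in range(measured):
        start = time.perf_counter()
        for x in batch:
            op(x)
        elapsed = time.perf_counter() - start
        rates.append(len(batch) / elapsed / 1e6)
    return sum(rates) / len(rates)
```

Averaging over several measured iterations reduces the influence of a single slow or fast run on the reported figure.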

(a) Half-life: In order to measure the adaptability of an algorithm under sudden concept drift, we measure the time taken until the error at the time of concept drift is reduced by half, which is denoted as the half-life. This metric basically measures how quickly an old concept is forgotten in the short term.
(b) Lifetime: Similarly, we also quantify how long the contribution of the past data stays, or equivalently, how quickly the old concept is forgotten in the long term. The lifetime is defined as the time required for the long-lived term in (11) to reduce to $1/e$ times its initial value.
(c) Lag: The lag measures how well the estimated PDF adapts to the data stream under incremental concept drift. It is defined as the absolute ratio of $\delta$ and the derivative of $\delta_0$ at $t \to \infty$, i.e., $\left| \delta / \dot{\delta}_0 \right|$ as $t \to \infty$. If an algorithm does not adapt well to the concept drift, the accumulated error makes the algorithm lag behind more and more.
(d) Instability: The instability measures how fast the estimated PDF moves for a short duration when blip concept drift occurs. It is defined as the velocity of the error.

Memory Usage: The PDF estimated using the density estimation algorithm continues to use memory. After the estimation, this result or its changing history is recorded on disk if necessary. Therefore, we record only the memory usage of the estimated PDF, but not the whole memory usage consumed by the UPDATE or QUERY operation.
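Returning to the damped-oscillator model used for the adaptability metrics, the following worked equations recall the textbook over-damped solution. They assume the simplest case in which the derivatives of $\delta_0$ vanish after the drift (for example, a step change), so that the governing equation reduces to its unforced form. The symbols $r_1$, $r_2$, and $\tau$ are introduced here for illustration, and the result may differ in detail from the solution referred to in the text.

```latex
\ddot{\delta}(t) + c\,\dot{\delta}(t) + k\,\delta(t) = 0,
\qquad \left(\tfrac{c}{2}\right)^{2} > k > 0,\; c > 0
\\[4pt]
\delta(t) = A_1 e^{-r_1 t} + A_2 e^{-r_2 t},
\qquad
r_{1,2} = \frac{c}{2} \pm \sqrt{\left(\frac{c}{2}\right)^{2} - k}
\\[4pt]
\text{long-lived term: } A_2 e^{-r_2 t},
\qquad
\tau = \frac{1}{r_2} = \frac{1}{\dfrac{c}{2} - \sqrt{\left(\dfrac{c}{2}\right)^{2} - k}}
```

In this form both exponents are positive, the $r_1$ term dies out first, and $\tau$ is the time for the slower term to fall to $1/e$ of its initial value, which matches the verbal definition of the lifetime.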

Throughput
We compare the throughput performance of the existing and our methods for the two key operations, i.e., updating the estimated density and producing the probability for a given data, which correspond to the UPDATE and QUERY operations in our method, respectively.
UPDATE: Figure 4a shows the throughputs of the update operation of different density estimation algorithms for different types of data streams. The throughput of FlexSketch is 1.1 Mops, which is 16×, 16×, and 1800× higher than that of oKDE, SPDT, and SPDTw, respectively, for the stationary data streams. For the non-stationary data streams, the throughput of FlexSketch is 0.37 Mops, which is 5.8×, 5.7×, and 570× higher than that of oKDE, SPDT, and SPDTw, respectively. For the mixture data streams, the throughput of FlexSketch is 0.61 Mops, which is 9.7×, 9.3×, and 1000× higher than that of oKDE, SPDT, and SPDTw, respectively. We also perform the one-sample Wilcoxon signed-rank test under the null hypothesis that the median of the throughput differences between the proposed method and the existing methods is zero, which confirms the significance of the differences ($p < 0.005$). This result demonstrates that the additional computation time to manage multiple models is significantly smaller than the computation time to represent the data stream elaborately. This effect becomes more prominent when major concept drift does not occur. However, FlexSketch shows a noticeable improvement even for a data stream with frequent major concept drift.

QUERY: Figure 4b shows the throughputs of different density estimation algorithms for the query operation. The throughput of FlexSketch is 0.47 Mops, which is similar to that of SPDT and SPDTw and lower than that of oKDE. As shown in (7), FlexSketch linearly combines multiple models for each query. In order to improve the querying speed, we can add a caching algorithm, although it consumes 20-30% more memory. With the cache, the throughput of FlexSketch improves significantly, up to 4.1 Mops, which is 1.2×, 9.2×, and 9.4× higher than that of oKDE, SPDT, and SPDTw, respectively.

Accuracy (Error and Adaptability)
We compare the accuracy of FlexSketch with that of the alternatives by measuring errors for the stationary and non-stationary data streams. Figure 5 compares the estimation error of each algorithm after performing the update operation for three different stationary data streams. The error of FlexSketch for the normal distribution is 0.012, which is 0.75×, 1.2×, and 3.2× less than those of oKDE, SPDT, and SPDTw, respectively. It is intuitive that oKDE records the lowest error because it estimates the distribution by using a mixture of Gaussian distributions. SPDTw is less accurate than FlexSketch since the number of data used for update by SPDTw is limited to a fixed size within its window (note that the parameters of SPDTw are deliberately selected so as to have similar accuracy to FlexSketch when concept drift occurs, as mentioned in Section 4.2).

[Figure 5: estimation errors for the three stationary data streams; panels: Normal, Bimodal, Log-normal.]

The error of FlexSketch for the bimodal distribution is 0.018, which is 0.98×, 1.2×, and 2.5× smaller than those of oKDE, SPDT, and SPDTw, respectively. oKDE using a Gaussian kernel shows the best result, as in the case of the normal distribution, and the performance of FlexSketch is equal to that of oKDE within a margin of error. This indicates that the BUILD operation successfully constructs a new model that recognizes different modes well. Again, SPDTw is less accurate than FlexSketch for the aforementioned reason.
The error of FlexSketch for the log-normal distribution is 0.025, which is 1.4×, 0.93×, and 2.0× smaller than those of oKDE, SPDT, and SPDTw, respectively. Contrary to the results for the normal and bimodal distributions, FlexSketch and SPDT, which have high degrees of freedom, show smaller errors than oKDE.

Figure 6 shows the error and adaptability performance of different methods under sudden concept drift, i.e., the errors over time between the PDFs estimated using different algorithms and the underlying distribution of the data stream in Figure 6a, and the half-life and lifetime in Figure 6b. When the concept drift occurs at $t = 300$, the errors jump to 1.0 or higher for all methods. As soon as the estimated PDF adapts to the new concept, it forgets the old concept and the errors slowly fall to zero. In the short term, the error of oKDE decreases more quickly compared to SPDTw and FlexSketch, resulting in the shortest half-life for oKDE. However, oKDE shows the longest lifetime, indicating that it is affected by the old concept for a long time. In the long term, FlexSketch shows the smallest error in Figure 6a and also the shortest lifetime in Figure 6b. We also measure the accuracy of the three methods over data generated by the MOA framework [56] for sudden concept change. Functions 2 and 3 of the SEA generator [57] are used with a narrow drift window (100 samples) to produce data streams, and the error is measured against the mean of each cluster at the 2000-sample point after the drift. FlexSketch, SPDTw and oKDE exhibit errors of 0.27, 0.21 and 0.57, respectively. Though the overall trend is similar, FlexSketch shows a slightly higher error than SPDTw since the window size of SPDTw is small enough to avoid the effect of old data.

Non-Stationary Case
In Figure 7, the performance of the different methods under incremental drift is shown. The error of FlexSketch saturates at 0.11, while those of SPDTw and oKDE saturate at 0.14 and 0.58, respectively, as shown in Figure 7a. In addition, Figure 7b shows that FlexSketch has the smallest lag (5.3× and 1.2× smaller than those of oKDE and SPDTw, respectively). This demonstrates that FlexSketch not only speeds up computation but also adapts more accurately to changes in the data stream. This is also consistent with the observation for sudden concept drift that FlexSketch forgets the past concept faster than oKDE in the long term.
Figure 9 compares the amount of memory used by the data structures of the different density estimation algorithms for the stationary data stream. FlexSketch consumes 6.2 kbytes of memory, which is 1.3×, 1.4×, and 0.68× that of oKDE, SPDT, and SPDTw, respectively. Since we set N_M = 3, one could expect FlexSketch to require 3× more memory than the others. However, the actual increase is much smaller than this expectation owing to efficient histogram computation.
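The memory comparison can be checked with a few lines of arithmetic. The sketch below assumes that each reported factor is the ratio of FlexSketch's memory to that of the other method. The per-method values it derives are only implied by the rounded figures in the text and are not measurements reported in the paper.

```python
# Figures quoted in the text.
flexsketch_kb = 6.2
ratios = {"oKDE": 1.3, "SPDT": 1.4, "SPDTw": 0.68}  # FlexSketch / other (assumed reading)
n_m = 3  # number of histograms in the ensemble

# Memory of each baseline implied by the rounded ratios (approximate).
implied_kb = {name: flexsketch_kb / r for name, r in ratios.items()}
for name, kb in implied_kb.items():
    print(f"{name}: about {kb:.1f} kbytes")

# The naive expectation is an n_m-fold increase; the largest observed
# factor stays well below it.
largest_ratio = max(ratios.values())
print(f"largest observed factor: {largest_ratio}x, naive expectation: {n_m}x")
```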

Effects of Parameters
We investigate the effects of the algorithm parameters, i.e., N_M, N_Q, λ, and γ, on the performance in terms of the throughput of the UPDATE operation, the error, and the memory usage for the stationary data stream and for the non-stationary data stream with incremental concept drift. The ranges of the parameters are as follows: 2 to 10 for N_M, 10 to 150 for N_Q, 0.2 to 3.0 for λ, and 0.01 to 2.3 for γ. Experimental results for combinations of these parameter values are analyzed below.
Figure 10a is a three-dimensional representation of the throughput, error, and memory usage for the stationary data stream as the parameters are changed. Many of the data points are located in the upper left region, indicating that FlexSketch achieves high throughput and low error over various combinations of the parameter values. Figure 11 presents the effect of each parameter separately by increasing one of the four parameters while the others remain fixed. The following observations can be made. First, as N_M increases, the throughput tends to decrease because of the increased computational cost of maintaining more histograms, while the error does not change (Figure 11a). Second, increasing N_Q or γ results in decreased errors and increased throughput (Figure 11b,d, respectively). The throughput improves because a larger N_Q or a larger threshold γ allows the computationally intensive MAJORUPDATE operation to be performed less frequently. Since the MAJORUPDATE operation adds a model representing the latest data to the data structure, performing fewer MAJORUPDATE operations reduces the dependence on the latest data, which improves the accuracy for the stationary data stream. Third, the value of λ has little effect on the performance (Figure 11c).
Figure 10b shows the throughput, error, and memory usage for the non-stationary data stream with incremental concept drift as the parameters are changed. Depending on the values of the parameters, the performance of FlexSketch may degrade (i.e., larger errors, lower throughput, or larger memory consumption). Figure 12 shows how each parameter affects the performance, from which we draw the following observations. First, as shown in Figure 12a, increasing N_M lowers the throughput without changing the error much, which is due to the increased number of histograms, as in the case of the stationary data stream. Second, when N_Q or γ increases, a trade-off is observed, i.e., the throughput increases but the error also increases (Figure 12b,d). Suppressing the MAJORUPDATE operation with an increased N_Q or γ improves the throughput but prevents FlexSketch from adapting accurately to the concept drift. Third, increasing λ reduces the contribution of past data and can thus reduce the error.
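The interplay between N_Q, γ, and the frequency of MAJORUPDATE can be illustrated with a deliberately simplified stand-in for the algorithm. The sketch below is not the FlexSketch implementation. It assumes fixed equal-width bins, an L1 distance as the divergence, exponential weights exp(-λ · age) for combining the ensemble, and a diagnosis step after every N_Q samples. All function names, the bin range, and the synthetic streams are hypothetical.

```python
import math
import random


def histogram(samples, lo, hi, bins):
    """Count samples into equal-width bins; out-of-range values are clamped."""
    counts = [0] * bins
    for x in samples:
        idx = int((x - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1
    return counts


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def count_major_updates(stream, n_m, n_q, lam, gamma, lo=-5.0, hi=10.0, bins=15):
    """Count how often a new histogram is added to the ensemble."""
    models = []   # ensemble of count histograms, oldest first
    buffer = []   # recent samples awaiting diagnosis
    major_updates = 0
    for x in stream:
        buffer.append(x)
        if len(buffer) < n_q:
            continue
        recent = histogram(buffer, lo, hi, bins)
        buffer = []
        if not models:
            models.append(recent)  # bootstrap model, not counted
            continue
        # Combine the ensemble; newer histograms weigh more (assumed form).
        weights = [math.exp(-lam * age) for age in range(len(models) - 1, -1, -1)]
        combined = [0.0] * bins
        for w, model in zip(weights, models):
            for i, p in enumerate(normalize(model)):
                combined[i] += w * p
        combined = normalize(combined)
        # Stand-in divergence: L1 distance, which lies between 0 and 2.
        divergence = sum(abs(a - b) for a, b in zip(normalize(recent), combined))
        if divergence > gamma:
            # Drift detected: add a model for the latest data, drop the oldest.
            models.append(recent)
            if len(models) > n_m:
                models.pop(0)
            major_updates += 1
        else:
            # No drift: fold the recent samples into every histogram.
            for model in models:
                for i, c in enumerate(recent):
                    model[i] += c
    return major_updates


rng = random.Random(0)
stationary = [rng.gauss(0.0, 1.0) for _ in range(3000)]
# Incremental drift: the mean moves linearly from 0 to 5 (illustrative only).
drifting = [rng.gauss(5.0 * t / 3000, 1.0) for t in range(3000)]

for gamma in (0.01, 0.5, 2.3):
    print(gamma,
          count_major_updates(stationary, n_m=3, n_q=50, lam=1.0, gamma=gamma),
          count_major_updates(drifting, n_m=3, n_q=50, lam=1.0, gamma=gamma))
```

In this sketch the number of new models can never exceed the number of full N_Q-sample blocks, so a larger N_Q caps the update count directly, and a larger γ filters out more of the remaining candidates. Because the L1 distance is bounded by 2, γ = 2.3 suppresses every update here. That bound is a property of the stand-in divergence and need not hold for the divergence used by FlexSketch.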

Conclusions
In this paper, we have proposed the FlexSketch framework, an online algorithm that is based on an ensemble of histograms and consists of three operations: MINORUPDATE, MAJORUPDATE, and DIAGNOSE. Since it dynamically determines when to forget old data by observing divergence, it estimates probability distributions stably for stationary data streams without invoking the MAJORUPDATE operation. For non-stationary data streams, FlexSketch adapts to concept drift swiftly by updating the underlying model rapidly using MAJORUPDATE. As shown in Section 4.5.2, FlexSketch estimates the probability distribution with high accuracy for data streams with sudden and incremental concept drift. Because FlexSketch utilizes a simple histogram as its elemental data structure, it achieves high-throughput update and query operations using only limited memory. The experimental results demonstrate the advantages of the proposed method. FlexSketch exhibits significantly improved speed compared to its alternatives. Moreover, FlexSketch adapts well to various non-stationary data streams while maintaining stability over temporal fluctuations.
Nevertheless, FlexSketch has the disadvantage of having multiple parameters. As discussed in Section 4.7, its throughput and accuracy vary with the parameter values, which may complicate design choices in domain-specific applications. In addition, FlexSketch currently supports one-dimensional data only, which could be a drawback for some applications.
In our future work, we plan to extend our method to multi-dimensional data streams. Because the histogram is a simple and efficient underlying data structure for ensemble methods, as shown in this paper, we will try to incorporate multi-dimensional histograms [58,59] to accommodate multi-dimensional data. In addition, we will explore applications that utilize probability estimation as a core building block. Drifting data stream classification [35] and anomaly detection in non-stationary data streams [60] would be good candidates for deploying FlexSketch in practical applications.