The project hast three main thrusts:
- Sketch based Aproximate Query Processing
- Approximation of Aggregates in Sensor Networks
- Analysis of Classification Methods based on Moment Analysis
- Histograms
Details for each sub-project are provided below. The papers produced as a result of the research in these sub-projects can be found on the publications page.
Sketch based Aproximate Query Processing
AGMS sketches are an interesting approximation technique that consist in combining +-1 pseudo random numbers with the data in a manner that allows computation of aggregates. Our contributions to this research area are:
- We have analyzed the problem of generating large classes of +-1 random numbers efficiently on modern processors. This is a fundamental problem for approximating results of queries over streaming or distributed data when using AMS sketches and influences most of the work on this subject (including our own previous work). The research outcome is theoretical and empirical proof that a previously proposed scheme, EH3, has 'ideal' properties for sketch computation: it can be implemented very efficiently on modern architectures and it is fast range-summable (this is required for a number of applications). All this properties were unknown and not exploited in previous work. A side benefit, for example, is a factor of 8 improvement in relative error for approximation of spatial size of join over the state of the art. A direct result of this work is making sketch based methods accessible to other researchers since the problem of generating these random numbers prevented other research groups from using such techniques.
- We extended the above work to a simpler generation scheme, BCH3, for which the fast-range summation can be performed very fast (about 20 times faster than for the EH3 scheme). The downside of BCH3 is that the error increases considerably if the distribution is uniform but it still has applications for skewed data where the boost in speed is significant. This second piece of work complements the first and presents a complete picture of all practical fast range-summable random number generation schemes for sketch approximation.
- We have conducted a detailed statistical and empirical study of all existing sketching techniques for approximate aggregate estimation: AGMS, Fast-AGMS, Fast-Count and Count-Min. The existing work only provides theoretical error guarantees for these sketching techniques, guarantees that are based on distribution independent bounds. Such guarantees are usually pessimistic; the suspicion is that they are too conservative and thus result in wasted resources (the memory requirements are usually set up such that a specific precision is achieved; it the bounds are conservative, more memory than necessary is used). To this end, we analyzed the four sketching techniques as statistical estimators and revealed an interesting conclusion. The Fast-AGMS sketches are superior to the other types. When the skew is small they have about the same performance as Fast-Count and AGMS and much better performance than Count-Min. For large skew they are much better than AGMS and Fast-Count and about as accurate as Count-Min. The Fast-AGMS sketches 'naturally' adapt to the situation and provide an overall better solution.
- We have further enhanced the sketch techniques by combining them with sampling methods. This is work performed by the PI with Florin Rusu. The main idea is to sketch only a sample of the data, instead of the entire data-stream. While this idea is simple, sophisticated analysis is required to characterize the hybrid sketch-sampling estimator. Interestingly, we were able to show theoretically that the variance of the hybrid estimator is the sum of the variance of the sampling estimator, sketch estimator and an interaction term. Theoretical results of this form were obtained for all three major types of uniform sampling: Bernoulli sampling, sampling without replacement and sampling with replacement.
Approximation of Aggregates in Sensor Networks
The work on aggregation in sensor networks is a colaboration with Laukik Chitnis (student funded by this project) and Sanjay Ranka. The interest for this line of work comes from the fact that the data-streaming algorithms readily provide distributed computation algorithms that is useful in performing queries in sensor networks.
- We investigating how aggregation can be performed efficiently in sensor networks. The main challenge in sensor networks is the fact that the communication and computation are severely restricted. The work focused on enhancing Gossip algorithms that allow the computation of aggregates in sensor networks. The main idea is to combine Gossip algorithms and tree based algorithms to obtain the best performance. Thee aggregation has great efficiency if the number of failures is small but performance degrades exponentially when nodes start to fail. Gossiping can work even under heavy node failures but is substantially slower than tree aggregation. We have shown that, by building a hybrid protocol with Gossip on groups of nodes and tree aggregation among the groups, the performance improves overall throughout the spectrum of failure rates.
- The work on sensor networks was extended with to heterogeneous networks. In such situations, some more capable nodes are available. These nodes have more computational power and better connectivity. The fundamental question we asked is how to better use these super-nodes to design better performing networks. This work is part of Laukik Chitnis' thesis and was published in JPDC 2009.
Analysis of Classification Methods based on Moment Analysis
The type of mathematical analysis required for sketch analysis can be useful for characterizing machine learning algorithms as well. To this end, the PI and Amit Dhurandhar made significant progress in analyzing classification algorithms by computing moments of the generalization error and of empirical estimates of the generalization error. The method is surprisingly powerful and allows interesting insight into how learning methods work. The more surprising thing is how universal the mathematical analysis used to analyze sketches is and how far reaching impact it can have. The analysis has been used successfully to analyze the naive Bayesian classifier, random decision trees and k-nearest neighbors.