Amortizing a distribution (and percentile) calculation, applicable to App Engine?

This is asked in the context of Google App Engine, but isn't limited to it.

In Google App Engine the datastore is not relational, so there are no aggregate functions (sum, average, etc.); each row is independent of the others. To maintain a sum or an average, the application has to amortize the calculation itself, updating the running totals as each new record is written so they are always current.
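To make "amortize" concrete, here is a minimal sketch of what I mean, assuming the Python runtime and the ndb datastore API; the kind and property names (RunningStats, total, count) are made up for the example:

    from google.appengine.ext import ndb

    class RunningStats(ndb.Model):
        # Hypothetical singleton entity holding the running aggregates.
        count = ndb.IntegerProperty(default=0)
        total = ndb.FloatProperty(default=0.0)

    @ndb.transactional
    def record_value(value):
        # Update the aggregates each time a new record is written, so the
        # sum and average are always current without scanning the dataset.
        stats = RunningStats.get_by_id('global') or RunningStats(id='global')
        stats.count += 1
        stats.total += value
        stats.put()

    def average():
        stats = RunningStats.get_by_id('global')
        return stats.total / stats.count if stats else 0.0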

How would I go about calculating the percentile and frequency distribution (i.e. the density)? I would like to plot the density of one field's values, and the set of values is probably on the order of millions. I could loop through the entire dataset (the limit for each request is 1000 rows) and compute it from that, but I would prefer a saner approach.

Is there an algorithm for calculating or approximating the density / frequency / percentile distribution that can be computed incrementally over time?

One complication: the data is open-ended, so the maximum and minimum could end up anywhere. The distribution would therefore probably have to cover roughly the middle 95% of the data and base the density on that alone.

+1




2 answers


Fetching every entity (with that limit of 1000 at a time...) over and over just to get one number per entity is clearly unattractive. So denormalize the data: write that single number into a separate entity that holds a list of numbers (up to the size limit, which I believe is 1 MB per entity, so with 4-byte numbers no more than about 250,000 numbers per list).

So whenever you add a number, also retrieve the latest "added data values" entity; if it's already full, start a new one instead; append the number and save. This probably doesn't even need to be transactional, unless a tiny error in the statistics is a killer, which you seem to suggest it isn't.
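A rough sketch of that scheme, assuming Python and ndb; the kind name AddedValues, the append_value helper, and the 250,000-entry cap (taken from the estimate above) are all illustrative rather than prescriptive:

    from google.appengine.ext import ndb

    MAX_PER_ENTITY = 250000  # roughly 1 MB of 4-byte numbers, per the estimate above

    class AddedValues(ndb.Model):
        # Hypothetical kind: each entity holds one chunk of the denormalized numbers.
        values = ndb.FloatProperty(repeated=True)
        created = ndb.DateTimeProperty(auto_now_add=True)

    def append_value(kind, value):
        # Fetch the latest chunk of the given kind; start a new one if it is full.
        latest = kind.query().order(-kind.created).get()
        if latest is None or len(latest.values) >= MAX_PER_ENTITY:
            latest = kind()
        latest.values.append(value)
        latest.put()

    # Recording a new data value:
    # append_value(AddedValues, 42.0)

The exact cap depends on how the property is encoded, so treat the 250,000 figure as a back-of-the-envelope estimate rather than a hard limit.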



If the data for an item can change, have separate entities of the same kind record "deleted" data values: to change one item's value from 23 to 45, append 23 to the latest "deleted values" list and 45 to the latest "added values" list; deleting an item is handled the same way.
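Continuing the sketch above, the change/delete case could reuse the same helper with a parallel DeletedValues kind (again a made-up name):

    class DeletedValues(ndb.Model):
        # Same layout as AddedValues; holds values that should no longer count.
        values = ndb.FloatProperty(repeated=True)
        created = ndb.DateTimeProperty(auto_now_add=True)

    def change_value(old, new):
        # Changing an item from 23 to 45: retire the old value, record the new one.
        append_value(DeletedValues, old)
        append_value(AddedValues, new)

    def delete_value(old):
        append_value(DeletedValues, old)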

+2




I could loop through the entire dataset (the limit for each request is 1000 rows) and compute it from that, but I would prefer a saner approach.



This is the most obvious approach to me; why are you trying to avoid it?
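For reference, a hedged sketch of what that full scan could look like with ndb query cursors, fetching 1,000 entities per request and accumulating a simple fixed-bin histogram; the Sample kind and its value property are assumptions standing in for your real model:

    from google.appengine.ext import ndb

    class Sample(ndb.Model):
        # Hypothetical kind whose `value` field we want the density of.
        value = ndb.FloatProperty()

    def histogram(lo, hi, bins=50):
        # Walk the whole dataset in 1,000-row pages and bucket each value.
        counts = [0] * bins
        width = (hi - lo) / float(bins)
        cursor, more = None, True
        while more:
            page, cursor, more = Sample.query().fetch_page(1000, start_cursor=cursor)
            for sample in page:
                if lo <= sample.value < hi:
                    counts[int((sample.value - lo) / width)] += 1
        return counts

On millions of entities this would need to run as background work (e.g. from a task queue) rather than in a user-facing request, which is presumably why you want to amortize it instead.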

0








