How many items make a "large" dataset?

Assume infinite storage, so that physical size (gigabytes / terabytes) does not matter, only the number of elements and their labels. A statistical pattern should already start to appear with about 30 subsets, but would you agree that fewer than 1,000 subsets is too small for testing, and that at least 10,000 distinct subsets / items / records / entities counts as a "big dataset"? Or is the threshold higher? Thanks in advance.





1 answer


I'm not sure I understand your question, but it sounds like you are asking how many elements of a dataset you need to sample in order to ensure a certain degree of accuracy (30 is the "magic number" from the Central Limit Theorem that often comes into play)...

If so, the required sample size depends on the confidence level and the margin of error. If you want a 95% confidence level with a 5% margin of error (i.e. you want to be 95% sure that the fraction you determine from your sample is within 5% of the fraction in the full dataset), you need a sample size of at least 385 elements (from the standard formula n = z²·p(1−p)/e² with the worst-case proportion p = 0.5). The higher the confidence level and the narrower the interval you want, the larger the sample size you need.
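If you would rather compute this than use a calculator, here is a minimal sketch of that formula in Python, with an optional finite-population correction. The function name and defaults are illustrative, not from the linked calculator; it only uses the standard library.

```python
import math
from statistics import NormalDist


def required_sample_size(confidence=0.95, margin_of_error=0.05,
                         proportion=0.5, population=None):
    """Minimum sample size for estimating a proportion.

    confidence      -- desired confidence level, e.g. 0.95
    margin_of_error -- half-width of the interval, e.g. 0.05
    proportion      -- expected fraction; 0.5 is the most conservative choice
    population      -- total dataset size; None means "effectively infinite"
    """
    # z-score for a two-sided interval at the given confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Cochran's formula for an (effectively) infinite population
    n = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite-population correction: a small dataset needs a smaller sample
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


print(required_sample_size())                       # 385 for 95% / ±5%
print(required_sample_size(confidence=0.99,
                           margin_of_error=0.01))   # far larger sample
```

Note that for a large population the answer barely depends on the total dataset size, which is why the 385 figure comes up so often.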



Here's a good discussion on sample size math and a handy rough sizing calculator if you just want to run numbers.









