Data in the context of Statistics

Understanding the data types we deal with on a regular basis

Statistics improves the ability to make timely and efficient decisions.

Without statistical procedures it is challenging to make sense of the huge volume of data being generated. Statistics makes better decisions possible by using information obtained from data, and the purpose of statistics is to extract that information.

Data Collection

Collecting data is the most crucial, difficult, expensive and time-consuming part of the whole process. It's like collecting the best ingredients to make the best food. Collecting the data is hard, but it's worth the time because it is the foundation for the entire project. You can cook really good dishes with fresh eggs, but a few expired eggs can ruin the whole dish with their rotten smell. So it's crucial to go the extra mile to get the best ingredients.

Types of Data

Quantitative data

Variables whose values are measured using numbers hold quantitative data. Examples: height, number of hours.

Qualitative data

Qualitative data is stored in variables that classify. Unlike quantitative data, it cannot be measured. Example: you can't measure a blood group, but you can measure a volume of blood. The blood type is qualitative data and the volume of blood is quantitative data. It's important to understand the difference between the two.

Further categories of Data:

  1. Qualitative
    1. Nominal: The data consists of categories with no logical order. Examples: name, gender, address. Categorical data does not need to have only a few categories, as 'gender' does; 'name' and 'address' are still nominal even if every name and address in the dataset is unique.
    2. Ordinal: The data consists of categories with a logical order. A popular example is a rating of Good, Average or Poor: these are categories, but they also have a logical order, i.e. Good > Average > Poor.
  2. Quantitative
    Here the measurements are larger or smaller in relation to one another, and measurements that are close in value are close in nature.
    1. Interval: An interval scale has numerical values where the distance between any two successive numbers is a constant, measurable size. Data measured on an interval scale has an arbitrary zero point. Example: the years 2015, 2020 and 2025. We know not only which year it is but also how many years separate two of them.
    2. Ratio: A ratio scale is the same as an interval scale but has a fixed, non-arbitrary zero point, so the ratio of two values is meaningful. Example: the lifetime of a bulb; we can say that one bulb lasted twice as long as another.

The hierarchy of scales of measurement:

  • Ratio
  • Interval
  • Ordinal
  • Nominal

As we go down the levels, the amount of information drops, so we can convert from one scale to another in one direction only. Ratio carries the most information and nominal the least. Example: the lifetime of a bulb can be converted to ordinal with the values fair, average and good, and then converted further to nominal as good or bad.
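The one-way conversion can be sketched in a few lines of Python. The lifetimes and the cut-offs of 1000 and 2000 hours are made-up values chosen only for illustration, and `to_ordinal` and `to_nominal` are hypothetical helper names.

```python
# Hypothetical bulb lifetimes in hours (ratio scale); illustrative values only.
lifetimes_hours = [450, 1200, 2100, 800, 1750]


def to_ordinal(hours):
    """Collapse a ratio-scale lifetime into an ordered category.

    The cut-offs (1000 and 2000 hours) are arbitrary assumptions.
    """
    if hours < 1000:
        return "fair"
    if hours < 2000:
        return "average"
    return "good"


def to_nominal(rating):
    """Collapse the ordered rating into two labels treated as plain categories."""
    return "bad" if rating == "fair" else "good"


ordinal = [to_ordinal(h) for h in lifetimes_hours]
nominal = [to_nominal(r) for r in ordinal]

print(ordinal)  # ['fair', 'average', 'good', 'fair', 'average']
print(nominal)  # ['bad', 'good', 'good', 'bad', 'good']
```

Each step throws information away: from "good" alone there is no way to recover whether the bulb lasted 1200 or 2100 hours, which is why the conversion only works in one direction.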

Analysts often convert ordinal data to numerical data by assigning a value to each category without checking whether the equal-intervals condition holds. A rating column can be converted to 1, 2, 3, 4, 5 in the respective order, but applying methods that assume equal intervals to categories that are unequally spaced can lead to errors.
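A small sketch of why this matters, using a three-level rating and two made-up codings that keep the same order but space the categories differently. The responses and both mappings are assumptions for illustration.

```python
from statistics import mean, median_low

# Hypothetical survey responses on an ordinal scale.
responses = ["Poor", "Good", "Average", "Good", "Poor", "Good"]

# Assumed coding: equally spaced integers.
equal_spacing = {"Poor": 1, "Average": 2, "Good": 3}
# Alternative coding: same order, unequal gaps.
unequal_spacing = {"Poor": 1, "Average": 2, "Good": 10}


def encode(values, mapping):
    """Replace each category label with its numeric code."""
    return [mapping[v] for v in values]


codes_a = encode(responses, equal_spacing)
codes_b = encode(responses, unequal_spacing)

# An order-based statistic gives the same category under both codings.
print(median_low(codes_a), median_low(codes_b))

# The mean depends on the spacing that was chosen.
print(mean(codes_a), mean(codes_b))
```

The lower median lands on the code for "Average" in both cases, while the mean changes with the coding. Statistics that rely only on order are safe for ordinal data; statistics that rely on distances are only as trustworthy as the spacing assigned to the categories.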

It's important to know the type of data to apply appropriate statistical methods.

Sources of Data

Primary Data

Primary data is data that you must collect yourself, from scratch. Some common techniques for gathering it are:

  • Focus groups: a diverse group of 10 to 20 people that represents your population.
  • Telephone: fast, inexpensive and easy, but the calls cannot take long.
  • Mail questionnaire: useful when you have a mailing list; expect low return rates.
  • Door-to-door: people-intensive.
  • Mall intercept: randomly selecting people in popular malls and asking them questions.
  • Registration: collected when someone buys a new product or service.
  • Observation: designing conditions to observe the effects of added changes.
  • Interviews: for in-depth opinions; high quality but costly.
  • Experiments: varying one variable to study its effect on another variable.

Care must be taken with all of these to avoid bias, particularly with mail questionnaires, registrations and telephone surveys: not everyone takes the time to fill in a survey, so the people who do respond may not represent the population.

Secondary Data

Secondary data is data that is already available, for example from open sources such as government statistics websites or Kaggle.

Population and Samples

Population, sample and census are terms you will often hear in a study that involves data. A population is the entire set of items of interest for your study. A census is an attempt to measure every item in the population of interest. A sample is a subset of the population. Unless the population is small or easy to measure, a census is difficult, expensive and sometimes impossible to carry out. That is why we choose a sample that best represents the population. The important thing here is choosing the right sample. Example: oversampling anomalies in anomaly detection systems, where the cases of interest are rare.
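To make the three terms concrete, here is a minimal sketch on synthetic data. The population of 10,000 heights, its mean and spread, and the sample size of 200 are all assumptions made for the example.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 synthetic heights in cm.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Census: measure every item in the population.
census_mean = mean(population)

# Sample: measure only a subset of the population.
sample = rng.sample(population, 200)
sample_mean = mean(sample)

print(f"census mean: {census_mean:.2f}")
print(f"sample mean: {sample_mean:.2f}")
```

With a representative sample the two means should land close together, even though the sample measured only 2% of the items.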

Sampling Techniques

There are many methods of sampling a population. They fall into two categories: random (probability) sampling and non-random (non-probability) sampling.

Random Sampling

Simple Random Sampling

Simple random sampling is simple and convenient, and it does not introduce sampling bias. Each item is chosen by chance, like drawing names from a hat. A simple random sample results when n elements are selected from a population such that every possible combination of n elements has an equal probability of being selected.
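In Python, drawing names from a hat could look like the sketch below, which uses the standard library's `random.sample` to pick n items without replacement. The function name `simple_random_sample` and the list of names are made up for the example.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every combination of n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


# Hypothetical population: the names in the hat.
names = ["Asha", "Ben", "Chen", "Dina", "Emil", "Farah", "Gus", "Hana"]

picked = simple_random_sample(names, 3, seed=7)
print(picked)
```

The `seed` argument is only there to make the draw repeatable; leave it as `None` for a different sample on every run.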

Systematic Sampling

Systematic sampling selects items at regular intervals from the population. The population should be ordered in a random sequence. The sampling interval is given by N/n. For example, to choose a sample of size 10 (n) from a population of size 100 (N), we would select every 10th item in the population.
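A minimal sketch of the N/n rule follows. The random starting point within the first interval is an assumption added here (a common convention, not something stated above), and `systematic_sample` is a hypothetical helper name.

```python
import random


def systematic_sample(population, n, seed=None):
    """Pick every k-th item, where k = N // n, from a random starting offset."""
    N = len(population)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the population size")
    k = N // n  # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)  # assumed: random start within the first interval
    return [population[start + i * k] for i in range(n)]


population = list(range(1, 101))  # N = 100 items labelled 1..100
sample = systematic_sample(population, 10, seed=3)
print(sample)
```

Whatever the starting point, consecutive picks are exactly 10 items apart, so the sample spreads evenly across the ordered population.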

Stratified Sampling

The population is broken down into subgroups, or strata, with similar characteristics, and each stratum is randomly sampled. For example, we may stratify the population by sex, income level or age. This improves accuracy by representing the variables of interest in their correct proportions.
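One way to sketch this is proportional allocation: sample the same fraction from every stratum so the sample keeps the population's proportions. The 60/40 population, the 10% fraction and the helper name `stratified_sample` are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=None):
    """Randomly sample the same fraction from every stratum."""
    rng = random.Random(seed)

    # Group the items into strata using the key function.
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, size))
    return sample


# Hypothetical population: 60 people labelled "F" and 40 labelled "M".
people = [("F", i) for i in range(60)] + [("M", i) for i in range(40)]

sample = stratified_sample(people, key=lambda p: p[0], fraction=0.1, seed=1)
counts = {label: sum(1 for p in sample if p[0] == label) for label in ("F", "M")}
print(counts)  # {'F': 6, 'M': 4}
```

The sample of 10 keeps the 60/40 split exactly, which a simple random sample of the same size would only match by chance.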

Clustered Sampling

In clustered sampling the population is divided into clusters, each of which should represent the population. Instead of individual items, whole clusters of items are selected for the sample.
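A short sketch of the idea, assuming the clusters are already known and stored in a dictionary. The branch names, their contents and the helper name `cluster_sample` are invented for the example.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every item inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [item for name in chosen for item in clusters[name]]
    return chosen, sample


# Hypothetical clusters: customers grouped by store branch.
clusters = {
    "north": ["n1", "n2", "n3"],
    "south": ["s1", "s2"],
    "east": ["e1", "e2", "e3", "e4"],
    "west": ["w1", "w2", "w3"],
}

chosen, sample = cluster_sample(clusters, 2, seed=5)
print(chosen)
print(sample)
```

The random choice happens at the cluster level only: once a branch is chosen, all of its customers go into the sample.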

Non-Random Sampling

Judgement Sampling

Items are selected based on the researcher's judgement.

Convenience Sampling

Whichever sample is most convenient to obtain is taken. The resulting sample is prone to significant bias.

Quota Sampling

Items are selected according to specific guidelines on which items, and how many of them, should be included.
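As a rough sketch, a quota can be filled by accepting items in the order they arrive until each group has reached its limit. The arrivals, the quotas and the helper name `quota_sample` are assumptions for illustration; note that nothing here is random.

```python
def quota_sample(items, key, quotas):
    """Accept items in arrival order until each group's quota is filled."""
    counts = {group: 0 for group in quotas}
    sample = []
    for item in items:
        group = key(item)
        if group in counts and counts[group] < quotas[group]:
            sample.append(item)
            counts[group] += 1
    return sample


# Hypothetical respondents in the order they were approached.
arrivals = [("F", 1), ("M", 2), ("M", 3), ("F", 4), ("M", 5), ("F", 6)]

# Assumed guideline: two "F" respondents and one "M" respondent.
sample = quota_sample(arrivals, key=lambda p: p[0], quotas={"F": 2, "M": 1})
print(sample)  # [('F', 1), ('M', 2), ('F', 4)]
```

The quotas control how many items come from each group, but which items are picked depends on who happens to come first, so the result is not a probability sample.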

Snowball Sampling

Existing subjects are asked to nominate further subjects known to them.
