Abstract:
The analysis of representativeness of a data set belongs to the standard quality assurance procedures in survey research. This FORS Guide challenges current practices of the analysis of representativity and suggests a framework to analyse the risk for representation bias taking into account different uses of data.
Keywords:
bias, inference, population, data quality, representation
Recommendations for researchers:
- Avoid the term “representative”. If it needs to be used, explain clearly what is meant, revealing the context for which the statement is made. Only use it when it refers to probability sampling and do not make a general claim.
- Be creative. Instead of trusting one indicator, use several indicators linked to the analysis that is or will be made.
- Be specific. If you have to report on a data set in general terms, cover multiple uses of the data, never make general claims, and base recommendations on the findings of the analysis.
- Be prudent. Reflect on how possible biases may affect the results of substantive analyses.
- Be scientific. Make plausible assumptions, be consistent, be simple and comprehensible, do not over-generalise, and remain within the scope of the analysis.
- Stay focused. Keep the goal in view: how strongly are the test variables correlated with the variables and statistics of interest? This correlation limits the influence that the test variables can have on the statistics of interest.
- Be inclusive. Use as much information as is available. Whenever possible use advanced statistical models to account for uncertainty due to (unit and item) nonresponse, such as full information maximum likelihood or multiple imputation.
- Big data are not representative of a general population. The goal of analysing big data is usually not to draw conclusions about the general population but to analyse all available data on a subject matter. Such data are not a sample, and certainly not a probabilistic one, so inference to the general population cannot be made. Big data are very useful, but not for claims regarding the general population. For example, an analysis of gender-neutral pronouns in Twitter data is very interesting but does not reflect the use of gender-neutral pronouns in other contexts, nor does the same analysis using the complete set of the most prestigious newspaper articles in the same time period. However, such data can be used to start formulating theories regarding the general population, which is then studied using other data; or the results from Twitter and the newspapers can be compared and interpreted fully. This is highly interesting, but the concept of representativeness does not make sense in such contexts.
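
The recommendations to use several indicators and to check how test variables correlate with the variable of interest could be sketched roughly as follows. This is a minimal illustration, not a procedure prescribed by the guide: the respondent data, the variable names (`female`, `urban`, `trust`), the benchmark shares and the helper functions `pearson` and `share_gap` are all hypothetical assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def share_gap(indicator, benchmark_share):
    """Sample share of a 0/1 indicator minus a known population share."""
    return sum(indicator) / len(indicator) - benchmark_share


# Hypothetical respondents: two 0/1 test variables and one variable of interest.
respondents = {
    "female": [1, 0, 1, 1, 0, 1, 1, 0],
    "urban": [1, 1, 0, 1, 1, 1, 0, 1],
    "trust": [7, 4, 6, 8, 5, 7, 6, 3],  # variable of interest (assumed 0-10 scale)
}

# Assumed population shares for the test variables (illustrative only).
benchmarks = {"female": 0.5, "urban": 0.5}

# For each test variable, report both its deviation from the benchmark
# and its correlation with the variable of interest.
report = {}
for name, benchmark in benchmarks.items():
    report[name] = {
        "gap": share_gap(respondents[name], benchmark),
        "corr_with_outcome": pearson(respondents[name], respondents["trust"]),
    }

for name, row in report.items():
    print(name, round(row["gap"], 3), round(row["corr_with_outcome"], 3))
```

Reading the two columns together is the point of the sketch: a large gap on a test variable that is only weakly correlated with the variable of interest may matter less for the substantive analysis than a smaller gap on a strongly correlated one.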
© the authors 2021. This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0)