
Making sense of data: Understanding complications with data collection 


Making sense of data: Understanding complications with data collection is the third in a series, presented by our partner SAS, exploring the role of data in understanding our world. SAS is a pioneer in the data management and analytics field.

As we have seen, statistics and visual representations of data can be misleading. But what happens when the data itself is misleading? And if data is supposed to be based on fact, you might wonder how data can be misleading. It comes down to the way it is collected. It is essential to have a strict process of collecting data before analyzing or presenting it. To ensure the data is accurate and as representative as possible, we must pay special attention to how data is collected.  

Here are some of the most important questions to consider when understanding how data is collected:  

  1. Who or what is represented in this data?  
  2. What questions are being asked?  

Sample selection and data collection

Unless we collect data on an entire population, it’s nearly impossible to describe that population with complete accuracy. Suppose we want to better understand the eating habits of Americans. The only way to ensure we have an accurate picture of American eating habits is to monitor every single American, every second of the day, and record everything they eat. Since this is impossible, researchers often use a sample, or a small portion of the population of interest. When the sample selected isn’t representative of the larger group, you get misleading data.

Consider how this might play out if someone were conducting a dietary study of Americans. In this case, the study asks 100 people about their eating habits. But how are those people selected? The options are endless:

  • Collect data from 100 friends. That’s a convenience sample, but most people’s friends are about their age and eat similar types of foods.
  • Gather data from a local restaurant or grocery store. Again, this might impact the type of data collected. For example, surveying people in a fast-food restaurant may give very different answers than surveying people in an upscale restaurant or a health food store.  
  • Conduct surveys at a non-food establishment, such as a library. This could be problematic, as librarygoers might eat differently than the rest of the population. But even more concerning, those librarygoers all come from the same area. The type of food people eat varies by locale. Those who live in cities likely eat different foods than those who live in rural areas. Food preferences can also vary depending on a person’s background or culture.

Each of these options introduces confounding factors or other possible issues with the data. If we want a representative sample, we need to gather data from a cross section of ages, genders, races, places of residence, income levels, and so on. Finding such a representative sample can be incredibly difficult, so it doesn’t often happen. Researchers typically report the population used in their samples. This helps the reader understand who is reflected in the sample and the impact that might have on the results. As a consumer of data, it’s important to pay close attention to this piece of information. Ask yourself if the results presented by the researchers apply to the whole population or only to the population sampled.
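For readers who like to experiment, the library example above can be tried out in a few lines of Python. This is only a rough sketch, not real research: the population, the share of people living near the library, and the serving counts are all invented for illustration. It compares the average from 100 randomly chosen people with the average from 100 people who all live near the library.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch gives the same result each run

# Hypothetical population of 10,000 people. Every number here is invented.
# Assumption: 20% live near the library and eat more vegetables on average.
population = []
for _ in range(10_000):
    near_library = random.random() < 0.20
    typical = 4.0 if near_library else 2.5  # assumed daily vegetable servings
    servings = max(0.0, random.gauss(typical, 1.0))
    population.append({"near_library": near_library, "servings": servings})


def average_servings(people):
    """Average daily vegetable servings for a group of people."""
    return mean(person["servings"] for person in people)


true_average = average_servings(population)

# Sample 1: 100 people chosen at random from the whole population.
random_sample = random.sample(population, 100)

# Sample 2: 100 people chosen only from those who live near the library.
library_goers = [p for p in population if p["near_library"]]
library_sample = random.sample(library_goers, 100)

random_estimate = average_servings(random_sample)
library_estimate = average_servings(library_sample)

print(f"Whole population:  {true_average:.2f} servings per day")
print(f"Random sample:     {random_estimate:.2f} servings per day")
print(f"Library-only sample: {library_estimate:.2f} servings per day")
```

Under these made-up assumptions, the random sample should land close to the figure for the whole population, while the library-only sample overstates it, because everyone in that sample comes from the same unusually vegetable-friendly area.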


Figure 1. Biased Sample


Problems can also arise when the way the data is collected, or the questions asked, tell only part of the story. We said before that the best way to see what people are eating is to continuously monitor what they do, but getting firsthand access to information like this is often impossible or unethical. Instead, researchers design studies or questions to gather similar information. Consider the following scenarios:

  1. Researchers ask participants to keep a food log for a week that details everything they eat and track total servings of fruits, vegetables, meat, etc. 
  2. Researchers ask participants “In general, how many servings per day do you have of fruits, vegetables, meat, etc.?”
  3. Researchers ask participants “What kind of foods do you usually eat?” 

Each of these scenarios is trying to answer the same question: What do people eat? But the information is being gathered in very different ways.  

Scenario 1 seems closest to direct observation, but the data may still be biased in some ways. One concern is that people know they’re recording their foods, and this may lead them to eat differently for the duration of the study. The data could also vary depending on the time of year. Many people make different food choices in the summer compared to the winter.

Scenario 2 also presents problems. This question asks people to think more holistically but relies on memory and judgment. Individual estimates of what is typical may vary from what is actually eaten. People may intentionally or accidentally make themselves appear to be healthier eaters than they really are. It can also be difficult to accurately judge your own behavior.   

In Scenario 3, the question isn’t specific enough to gather good information. While people might report the amount of fruits and vegetables they eat, the question leaves room for general or unrelated answers, such as cuisine type (Italian, Mexican or others), or a preference for eating out or at home.
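The concern raised in Scenario 2, that people may describe themselves as healthier eaters than they are, can also be sketched in Python. This is a simplified illustration with invented numbers: it assumes people overstate their daily vegetable servings by a little under one serving on average, which is an assumption for the sake of the example and not a research finding. The hypothetical `survey` function compares what simulated participants actually eat with what they report.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the sketch gives the same result each run


def survey(sample_size):
    """Simulate one survey; return (average eaten, average reported)."""
    eaten, reported = [], []
    for _ in range(sample_size):
        # Invented figure: what a participant actually eats per day.
        actual = max(0.0, random.gauss(2.5, 1.0))
        # Assumption: answers from memory run about 0.8 servings too high.
        overstatement = random.gauss(0.8, 0.5)
        eaten.append(actual)
        reported.append(max(0.0, actual + overstatement))
    return mean(eaten), mean(reported)


small_eaten, small_reported = survey(100)
large_eaten, large_reported = survey(5_000)

print(f"100 people:   eaten {small_eaten:.2f}, reported {small_reported:.2f}")
print(f"5,000 people: eaten {large_eaten:.2f}, reported {large_reported:.2f}")
```

In this sketch the reported average stays well above the true one whether 100 or 5,000 people are surveyed. Asking more people does not correct a problem that comes from the question itself.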


Figure 2. Conducting a Survey


As you can see, the way questions are asked, and who is asked those questions, make a big difference in the kind of information collected. Some questions are better than others. When interpreting data, see if you can find the questions asked by the researchers. Are they good questions? And are the results influenced by how the researchers asked them or how they gathered the data?



About SAS: Through innovative analytics software and services, SAS helps customers around the world transform data into intelligence.
