We love data! But, we also know that collecting, analyzing, and reporting with data can be daunting. The person we turn to when we have questions is Diana Aleman - our Editor Extraordinaire for SAGE Stats and U.S. Political Stats. And now Diana is bringing her trials, tribulations, and expertise with data to you in a brand new monthly blog, Tips with Diana. Stay tuned for Diana's experiences, tips, and tricks with finding, analyzing and visualizing data.
July 28, 2017
The Three Stages of Data Analysis
#2 – Cleaning your data
The term “data cleaning,” the second stage of the data analysis process, is usually met with some confusion. I mentioned to a friend that the most recent SAGE Stats data update required a lot of cleaning, which was taking up a significant amount of time. She asked, “So what exactly is data cleaning?” An excellent question!
Data cleaning or “scrubbing” consists of taking disorganized, messy data and transforming it into a format that enables easier analysis and visualizations. Depending on your formatting or metadata requirements and how big the data file is, it can take days to clean a file into submission.
Since I began working on SAGE Stats, I’ve learned many Excel tricks that can be applied to any kind of data cleaning situation. To avoid information overload, I’ll stick to the tricks I’ve successfully used in the past two years.
Top 10 Tips on Cleaning Your Data
- Read the data documentation. This will tell you what each component of the data file represents and help you identify what data is most relevant to your research interests and what data you can avoid.
- Excel’s “Text-to-Columns” feature. Especially large data files are often stored in “csv” or “comma-separated values” format. When such a file lands in Excel with each record squeezed into a single column, this handy feature splits it into separate columns at each comma.
- VLOOKUP formula. My holy grail of Excel formulas. Do you want to pull multiple values from a workbook into another workbook? VLOOKUP has your back.
- COUNTIF formula. Are you looking for duplicate values in a range or checking whether values in one workbook are present in another workbook? COUNTIF counts the number of times a value occurs in a range!
- LEFT and RIGHT formulas. These are very useful when you need to parse out specific characters from the beginning or end of a value. For instance, if “092017” represents September 2017 but I only need the year, I can use the RIGHT formula to collect the last four digits, e.g. =RIGHT(A1,4).
- TRIM formula. Frustrated by inexplicable extra spaces that follow the value you want? This formula “trims” those out for you.
- CONCATENATE formula = “&”. Concatenate is a fancy word for linking two values together – you can use the formula for this or insert an ampersand between the two cell references, e.g. =A1&B1.
- I don’t think Excel’s filters get enough credit. Are you looking for multiple misspellings of New York? The filters help you quickly identify and correct them.
- Nest your formulas. Find ways to combine formulas to reduce the number of steps you have to complete! For instance, do you need to look up values in Workbook 1 that are associated with the last five characters of a value in Workbook 2? Nest the RIGHT and VLOOKUP formulas to quickly get your answer.
- Work off a copy of the original data file. You don’t want to be in a situation where you have mistakenly deleted data values and then have to download the data file again. Keep the original version handy as a backup.
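If you ever want to script these steps instead of doing them in Excel, several of the formulas above have close counterparts in Python. The sketch below is only an illustration: the helper names `left`, `right`, and `trim`, the sample values, and the lookup table are all made up for this example, and the helpers only approximate what the Excel formulas do.

```python
from collections import Counter


def left(text, n):
    """Rough stand-in for Excel's LEFT(text, n)."""
    return text[:n]


def right(text, n):
    """Rough stand-in for Excel's RIGHT(text, n)."""
    return text[-n:] if n > 0 else ""


def trim(text):
    """Rough stand-in for TRIM: drop leading/trailing spaces, collapse repeated spaces."""
    return " ".join(part for part in text.split(" ") if part)


# LEFT / RIGHT on the "092017" example from the list above
month = left("092017", 2)
year = right("092017", 4)

# TRIM, then "&"-style concatenation
place = trim("  New   York  ")
label = place + " " + year

# COUNTIF: how many times does a value occur in a range?
places = ["New York", "Ohio", "New York", "Texas"]  # made-up sample range
new_york_count = Counter(places)["New York"]
duplicates = [value for value, count in Counter(places).items() if count > 1]

# Nested RIGHT + VLOOKUP: look up a name by the last five characters of an ID.
# Both the ID and the lookup table are hypothetical.
lookup_table = {"00042": "Region A", "00077": "Region B"}
record_id = "ID-00042"
region = lookup_table.get(right(record_id, 5))  # None if there is no exact match
```

A dictionary lookup behaves like an exact-match VLOOKUP: a key that is missing returns `None` here, where Excel would show #N/A.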
This is a lot of work. Why do I need to clean the data file at all?
Sometimes a data set is so simple that it requires no cleaning at all; however, that’s not usually the case. These days you will typically encounter a file with all data merged into one column, which you then have to unmerge or parse out by yourself. Then you find that you need to concatenate some values back together. And then you realize that some values occur multiple times and you want to find out how many times each one occurs in the file. All this when you only want a snippet of that information! Data cleaning is a necessary evil at times in order to get your data in shape for easier visualizations and more accurate information.
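The split, rejoin, and count routine described above can be sketched in a few lines of Python. The rows are invented for illustration, and the underscore used to join the fields is an arbitrary choice.

```python
import csv
from collections import Counter

# Made-up rows where every field has landed in one column, comma-delimited
merged_column = [
    "New York,2016,10",
    "New York,2017,12",
    "Ohio,2016,7",
]

# Step 1: split the single column into fields
# (csv.reader also handles commas inside quoted values)
rows = list(csv.reader(merged_column))

# Step 2: concatenate two of the fields back together into one key
keys = [state + "_" + year for state, year, _value in rows]

# Step 3: count how many times each state occurs in the file
state_counts = Counter(state for state, _year, _value in rows)
```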
The best way to learn these tricks (and even more advanced tricks) is to dive in head first and try them out with a specific data set. In Excel’s case, doing is better than reading or listening. After all, no one starts out as an expert, and I am no exception! My tips above are suggestions and may not work with your specific needs, but they can be applied in almost every kind of data situation. If you use them often enough, then they practically become muscle memory.
June 7, 2017
The Three Stages of Data Analysis
#1 – Evaluating raw data
A friend I haven’t seen in a while asked me what I do for a living, and I talked about SAGE Stats and the work that goes into maintaining and building the collection. Instead of his eyes glazing over (like most people’s would) he asked me, “Ok. Not to seem like an idiot, but what is data analysis? Like what does it cover?” If you’ve had similar thoughts, never fear! I think I can safely say I’ve received multiple variations of this question before. My typical answer: what doesn’t it cover?
Data analysis covers everything from reading the source methodology behind a data collection to creating a data visualization of the statistic you have extracted. All the steps in between include deciphering variable descriptions, performing data quality checks, correcting spelling irregularities, reformatting the file layout to fit your needs, figuring out which statistic is best to describe the data, and figuring out the best formulas and methods to calculate the statistic you want. Phew. Still with me?
These steps and many others fall into three stages of the data analysis process: evaluate, clean, and summarize.
Let’s take some time with Stage 1: Evaluate. We’ll get into Stages 2 and 3 in upcoming posts. Ready? Here we go…
The breakdown: Evaluate
Evaluating a data file is kind of like an episode of House Hunters: you need to explore a data file for structural or other flaws that would be a deal breaker for you. How old is this house? Is the construction structurally sound? Is there a blueprint that I can look at?
Similarly, when evaluating a raw data file you have collected, you should consider the following questions and tips:
- Read through the data dictionary, codebook, or record layout, which should detail what each field represents. Don’t start playing with the data until you know what you’re looking at. You wouldn’t start renovation in your new house without reading the blueprints, right? You gotta know if that wall is load-bearing!
- What irregularities does the methodology documentation detail, and how may they have affected the data? Which methodology notes should I make transparent to the reader?
- Is the raw data complete? That is, are there missing values for any records? (Missing values in the raw data can distort your calculations.)
- What outliers exist in the data set? Do they make sense in the context of the data? For instance, a house price of $1.8 million in a neighborhood where houses don’t exceed $200K is probably a red flag.
- Spot check the raw data. If the data set provides totals, then sum the values and check that they match. If they don’t, then does the documentation explain why they may not add up to the totals?
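A few of the checks above (missing values, outliers, and totals) are easy to script. Here is a minimal sketch in Python using invented numbers. The "more than three times the median" outlier rule is an arbitrary threshold chosen for this example, not a standard, so adjust it to fit your own data.

```python
from statistics import median

# Hypothetical house prices; None marks a missing value
prices = [150_000, 175_000, None, 200_000, 1_800_000]

# Is the raw data complete?
missing_count = sum(1 for price in prices if price is None)
present = [price for price in prices if price is not None]

# Crude outlier flag: anything more than 3x the median (arbitrary threshold)
typical = median(present)
outliers = [price for price in present if price > 3 * typical]

# Spot check: do the parts add up to the total the source reports?
components = [120, 80, 50]   # made-up component values
reported_total = 250         # made-up total from the same file
totals_match = sum(components) == reported_total
```

A flagged value is not automatically wrong. It is only a prompt to go back to the documentation and see whether it makes sense in context.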
So if the source is good, then the data must be good too. Right?
It’s a mistake to assume the data is authoritative or fine as is just because it’s a published government source or another source you consider just as reliable. Data reporting is susceptible to manipulation and simple mistakes despite the best efforts and intentions of the responsible organizations. Assume nothing and evaluate the data to ensure it checks out! The next stage of data analysis is how to clean raw data to fit your needs. Stay tuned for my next post, where I will review the most effective Excel tips and tricks I’ve learned to help you in your own work!
May 1, 2017
Data and Statistics 101
The fundamental difference between data and statistics (because who knew!)
Before I started working on SAGE Stats, the idea of working with a large data set was quite intimidating. Shout out to the USDA’s Food Access Research Atlas! In the two years since, working regularly with our platform has really opened my eyes to how empowering and beautiful data is once you understand how to pull usable information from it.
My experience has also taught me how overwhelming and confusing data can be. What is a data set and how is it different from a time series? How can I tell if data content is reliable or not? What the heck is a data dictionary and why do I need it? Unless you are consistently elbows deep in data, it can be difficult to know where to even start. So let’s begin with the very basics: what is the difference between data and statistics?
The two terms are often used interchangeably – even within the same breath. I have even caught myself using both terms in explaining SAGE Stats to team members and close friends without a second thought. Although it is easy to synonymize the two, they are in fact very different.
Data are collected and organized information typically provided in massive files with detailed records and a data dictionary to decode the variable information. The records in those data files do not communicate significant meaning to the naked eye, so time and analysis are needed to read through the data collection methodology, decipher variable information, and determine which variables are of interest to you.
Statistics are clear and understandable explanations or summaries of data based on analysis. Statistics are generally available in tables and represented graphically. For example, the median state unemployment rate in the U.S. was 4.0% in 2016. This is a statistic derived from analysis of sample data collected by the U.S. federal government.
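To make the distinction concrete, here is a tiny Python sketch that turns a handful of records (the data) into a single summary number (the statistic). The state names and rates are entirely hypothetical and are not real unemployment figures.

```python
from statistics import median

# The "data": made-up unemployment rates (percent) for five hypothetical states
unemployment_rates = {
    "State A": 3.1,
    "State B": 4.5,
    "State C": 5.2,
    "State D": 4.8,
    "State E": 2.9,
}

# The "statistic": one number that summarizes all of the records
median_rate = median(unemployment_rates.values())
```

The dictionary on its own does not say much at a glance, but the median is a single number you can read and report right away.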
So statistics are better than data, right?
Not necessarily. Whether you need data or statistics really depends on your research question. Data is needed when your research question addresses a new issue that hasn’t been explained or thoroughly explored yet – this requires a deep dive into data where you must analyze and derive meaningful knowledge that can answer your question.
A more straightforward research question, however, can be more quickly answered with statistics because the question has been asked before and so the analysis to answer that question has also already been done. For instance, a student who needs information on unemployment across the Rust Belt states can easily find an answer because that information is frequently processed by the federal government for its own assessment of the economic climate.
The difference between data and statistics lies in the analysis. Data needs to be analyzed to be understood, but a statistic can be understood right away. The next question is: how do I begin to analyze data to get the statistics I need? Stay tuned for my next blog post for tips on just that!