We love data! But, we also know that collecting, analyzing, and reporting with data can be daunting. The person we turn to when we have questions is Diana Aleman - our Editor Extraordinaire for SAGE Stats and U.S. Political Stats. And now Diana is bringing her trials, tribulations, and expertise with data to you in a brand new monthly blog, Tips with Diana. Stay tuned for Diana's experiences, tips, and tricks with finding, analyzing and visualizing data.
July 19, 2018
Data in the News: Gaining an Understanding of Gun Violence
The issue of gun control continues to build steam as the media reports more gun violence incidents across the United States. One recent incident is the gunman attack on the Capital Gazette newspaper office in Maryland this past June, which fueled renewed calls and protests for stronger gun control laws. As decisive as the ucrrent environment is about this particular issue, it is also a reminder to keep yourself informed on issues in which you are interested. If you are looking to use this as a topic of academic research, it is especially important to gain a basic understanding by reviewing and comparing the information you find.
You can begin understanding your topic of interest by writing out your current assumptions and reviewing the information to see if it supports those assumptions. One way to check your assumptions is by creating a simple data comparison. For example, let's assume that a majority of murders are committed with firearms than with no weapon at all. You can compare murders committed with firearms to murders committed without a weapon. Since we menioned Maryland earlier, let's focus on that state. The FBI reported that in 2016, 76.3 percent of Maryland murders were committed with firearms and 4.9 percent of murders were committed barehanded (i.e. with a person's hands, fists, or feet). Based on this comparison, we can confirm that a majority of reported murders in Maryland were committed with firearms that without.
Additionally, seeing this comparison visually can facilitate further research by inviting additional questions. Here are a few questions that arise when you see this data in action: Why are there more murders with firearms than without? How many of these firearm murders were mass shootings? Is this possible to identify? If there were any mass shootings included in these murders, what types of firearms were used? Why did the percentage of murders with firearms drop between 1997 and 1998? What was different about 1997 compared to 1998? Were there any unreported murders in Maryland? How do these percentages for Maryland compare to neighboring states or the U.S. average?
As you can see, a simple comparison has turned into an in-depth research topic that has raised numerous questions that allow for multiple avenues of investigation. When reviewing information gathered from your research, the best thing you can is to ask questions about what you read or observe. This not only adds to your initial knowledge of an issue, but broadens your perspective by encouraging you to consider other factors that perhaps you hadn't thought of before and which may affect or be affected by the issue. Broadening your perspective of what factors or players are in play is imperative to building your understanding of a big issue like gun violence.
June 29, 2018
The Different Angles of Gerrymandering
The recent news that Justice Anthony Kennedy will retire from the Supreme Court this summer has thrown everyone for a loop. As I write this, politicians, news pundits, and voters are debating the implications of Justice Kennedy’s retirement on decisive topics such as abortion and same-sex marriage as President Trump considers candidates for the court's vacancy. One decisive and unresolved issue that the president’s nominee is likely to influence with his or her vote is partisan gerrymandering and the extent to which it dilutes voting power and therefore impacts U.S. elections.
However, unless you’re a political science major it may be difficult to grasp why federal and state politicians are fighting over congressional gerrymandering. What is gerrymandering and how exactly do changing congressional boundaries affect who we elect? Gerrymandering is the practice of manipulating district boundaries to benefit one group over another. Historically, gerrymandering has been driven by racial and political motivations to control who is in power, and more importantly, who is not in power. Although reading about the history of gerrymandering is informative, visualizing the physical changes in congressional districts is a fantastic way to learn and understand the practice.
The breakdown: North Carolina as an example
North Carolina, for example, is a great case study for understanding the electoral impact of gerrymandering. By updating the selected year for the district map on the left, you can observe that prior to the mid-1990s North Carolina elected a majority of Democratic candidates before turning red in 1994. This change in power was due to Republican gerrymandering that went into effect in 1994. Since that time, North Carolina flipped has been for the most reliably red except for a brief number of years between 2008 and 2010.
Gerrymandering is also well known for creating odd-shaped districts. In fact, the 12th congressional district in North Carolina is usually cited as one of the most complex districts in the country. It has even been the subject of multiple legal challenges alleging racially-motivated gerrymandering. Using the map below, you can see how District 12 boundaries have changed over the past two decades. Observing this political struggle in action is a powerful way to understand how the Democratic and Republican parties have successfully used gerrymandering as a tool to achieve their own political interests.
June 14, 2018
Data in the News: Cost of Hurricane Damages
Hurricane season has begun and the United States already has one significant storm under its belt, subtropical depression Alberto. Although initially Alberto decreased in magnitude as it approached the U.S. and made landfall, it caused substantial damage across the southeast. Since it takes about a year or two to gather all data about weather destruction, the extent of Alberto’s damages remains unclear. However, news reports indicate the storm caused significant property damage and some loss of life including the deaths of two news reporters in North Carolina, flash flooding, and more. To this day, more data is arriving about additional property damage and missing people. So one may wonder, what will the damages be for this 2018 hurricane season given that a subtropical storm like Alberto has caused not insignificant damage already? What portion of that will the federal government help with?
We can gain an idea of what 2018 costs will look like using historical data from a variety of sources. Alberto swept through a majority of the southeast region, but we can focus on a particular state to narrow this analysis. Let’s look at Florida in particular: According to FEMA data on assistance funding for hurricane damages, Florida counties were collectively issued approximately $4 million in 2016 and $1.6 million in 2017. These two figures give us some context of FEMA’s aid to Florida for hurricane damages and what to expect for the 2018 season. Forecasts predict five to nine hurricanes this year and one to four major hurricanes in 2018.
In this case, more research on 2016-2017 Florida hurricanes is needed to make a real comparison. However, if this year’s hurricane season is similar to the past two years we can estimate that FEMA aid for hurricane damages occurring in Florida will range between one to four million dollars, barring any major hurricane like Hurricane Sandy or Katrina. This of course does not account for insurance payouts, federal funding via other agencies, or other kinds of funding outside of FEMA. Identifying other actors that are not accounted for in the dataset you’re using for estimation purposes is a good to keep in mind and to communicate to avoid overstating your conclusion.
May 31, 2018
How to Cite Data
If there is anything that school has ingrained in our minds, it’s that we should always always always cite our sources. A detailed citation is important not only to acknowledge how others’ ideas have contributed to your work, but also for readers to see and follow on their own time and for their own purposes. For example, a Public Health student may be interested in tracking down the original information cited in a news article she recently read, but will have an incredibly difficult time doing this if there is no citation provided or if the citation isn’t detailed enough.
Thankfully, we can easily generate accurate and detailed citations with the help of citation managers like EndNote and Zotero. However, some sources can be difficult to cite because of how different they are from traditional text sources like textbooks and journal articles. Citing data for instance can be a tricky business because it often comes in the form of an Excel download or is presented online in a table wizard of some sort. Because of these kinds of differences, you’ve probably found yourself asking several questions: I found this data online so do I cite it as a website? What do I use for the author name if there is no author mentioned?
The breakdown: Elements of a Data Citation
A lot of questions that come up when citing data sets are answered by the International Association for Social Sciences Information Services (IASSIST), which developed a guide to help researchers correctly and comprehensively cite datasets. Below are the fundamental elements you should always include in a citation for data sets.
- Author: if the creator of the data is an organization, then insert the organization’s name here. E.g. U.S. Census Bureau.
- Date of Publication: when was the data first published?
- Title: the name of the dataset. If there is a specific table identification code, then I would include that as well!
- Publisher/Distributor: if the publisher/distributor is the same as the author, then enter “Author” in place of the name.
- URL: ideally the more direct the URL the better. Make sure it's a stable URL!
The source I am interested in using provides a preferred citation, but it doesn’t follow the IASSIST template above. Is it best to honor the source’s request or use the IASSIST citation template?
According to Hailey Mooney, Psychology & Sociology Librarian for the University of Michigan Library, “You should honor the spirit of the preferred citation and include all of the relevant components. Verify that the preferred citation is complete and correct. It is likely that you may need to rearrange elements anyhow, in order to put it into a particular citation style format.”
Be sure to include all relevant information whether you use APA, MLA, Chicago, or another style. If in doubt, always include the elements outlined by IASSIST!
May 9, 2018
Excel Tips: Navigating an Excel table
As simple as it may sound, navigating yourself around an Excel worksheet takes some practice. Scrolling your way through a table of say 10 records is no big deal, but this can be incredibly cumbersome when you are dealing with a dataset that contains hundreds or thousands of records. This is an issue I often came across when collecting and cleaning data for SAGE Stats and while I strongly believe that Excel is best learned by practice than by seeing, I’ll outline the quick shortcuts all Excel users should familiarize themselves with in order to quickly navigate their way around an Excel table.
This will hopefully not come as a shock to most of you, but an Excel worksheet is comprised of columns (represented by letters) and rows (represented by numbers). The last column you’ll see in Excel is column “XFD” and the last row in Excel is row 1,048,576. Imagine having a dataset that occupies a fraction of those limits – yep, it is no fun! Below are the best keyboard shortcuts to navigating up and down an Excel table instead of clicking and scrolling your way into a massive state of frustration.
|Keystroke||Where does it take you?|
|Ctrl + End||The last cell of a data set|
|Ctrl + Up or Down Arrow Keys||The top or bottom of the data set|
|Ctrl + Left or Right Arrow Keys||The left-most or right-most cell of a data set|
|Ctrl + Shift + Arrow Keys||Selects cells in the same column/row as the active cell. A great shortcut when you want to quickly select and copy data.|
|Page Up and Down||Moves one Excel screen up or down in a worksheet|
|Alt + Page Up or Page Down||Moves one Excel screen to the left or right in a worksheet|
I’ve never had an Excel dataset that I couldn’t quickly scroll through on my own.
Suit yourself, but these shortcuts are an excellent way to save yourself the time and effort it even currently takes you to find the information you need. Save yourself some eye strain and practice these shortcuts! Once you’ve gotten into the habit of using them, you’ll wonder how you lived without them. Check out Microsoft's dedicated page for additional keyboard shortcut suggestions!
April 19, 2018
Data in the News: Teacher Salary Protests
If you've visited CNN, NPR or the New York Times in the past few weeks, you may have heard about the current teacher strikes in certain states demanding higher salaries. Oklahoma, Kentucky, West Virginia, and Arizona are among the key states where teachers are protesting what they believe to be unfairly low salaries compared to their colleagues in other states. When considering teacher salary data, it is interesting to examine how these numbers have changed over time and how they vary by state.
Overall, average public school teacher salaries increased by nearly 60% between 1995 and 2017. However, by using the data set above to calculate this change by state, it’s clear that some states have experienced slower salary growth than others. For instance, teacher salaries in Oklahoma, Arizona, and West Virginia have increased by 38%, 48%, and 43%, respectively, whereas salaries in states such as New York have risen as much as 68% in the same time period.
As with any analysis, it is important to consider external factors that may influence the real-world implications we observe in data. When comparing data such as salaries among states, factors such as regional cost of living and state averages must be included. It’s unlikely that the average cost of living is the same between Oklahoma and Manhattan, for example, which may account for the differences in salary growth. At the same time, data can never tell the entire story, and news stories reporting teachers who work multiple jobs to pay rent illustrate that there is a problem beyond just differences in cost of living.
Therefore, this case illustrates the intersection between a data set, external factors, and real-world implications. While it may be easier to draw conclusions based on numbers alone, it is crucial to contextualize an analysis by considering underlying factors and then examining their impact on society. Working with both hard data and first-hand news articles is a good first step to getting closer to the full story for any data challenge.
March 31, 2018
Evaluating a Data Source
Previously, I’ve discussed factors you should consider when evaluating a data set that meets your information needs. This included reading through the data documentation, noting any data outliers, and so on. However, like all other kinds of content, numbers can be just as easily manipulated to paint a rosier or different picture than actually exists. For this reason, it is equally important to evaluate the source organization that is responsible for collecting and distributing the data set you’ve found and want to use.
So what are some ways you can evaluate a data source? Like the evaluation of an actual data file, you should go into the evaluation of a data source with a few questions in mind.
- What survey questions were used to collect this data? These are usually provided by the source and reading through these on your own can help you note any subtle wording that may have influenced the respondent’s answers or unclear wording that many respondents could have interpreted differently.
- What was the sample size and is it appropriate for the population discussed? A sample size of 50 people for the analysis of a population of 50,000 is not quite reliable.
- How and when was this data collection carried out? Is the data based on a telephone survey that was conducted five years ago? The application of that data to the present is not a judicious decision.
- Why did the organization carry out the survey and share their results? This is key to understanding what motivations or incentives the organization may have in disseminating or even suppressing the information.
But the data I found comes from a major organization! It must be fine, right?
Thanks to the internet, we are presented now more than ever with an infinite amount of information from a myriad of sources that all claim authority. However, these claims, the brand name of the organization, or size of the data should not by itself validate its authority. As much as major organizations are perceived as reliable and trustworthy, all organizations have interests in mind that may influence what they included in the survey and how they carried that survey out. With that in mind, it’s always best to evaluate a data source you’ve come across with a healthy degree of skepticism.
February 28, 2018
Data in the News: Flu Mortality Statistics
With spring upon us, it seems the current flu season may be slowly drawing to a close. Current reports from the Center for Disease Control and Prevention (CDC) indicate that the hospitalization rate for flu diagnoses was 59.9 per 100,000 persons during the first week in February. The U.S. has not experienced a rate this high since the 2014-2015 flu season, which reached 50.9 per 100,000 that same week.  In reviewing and discussing these hospitalization rates, it is natural to wonder how these hospitalization rates compare to flu death rates.
Death rate statistics for the current flu season have been widely reported; however, while reading these articles remember that these are estimates. Like most health statistics, final mortality data lags by a year or two and so what we currently see in the news today about the flu season are based on estimates of reported flu deaths. What does this mean? It means that these figures are based on preliminary evidence of cause of death, which may be revised once the CDC receives more complete data. That’s not to say the current CDC statistics are wrong, but that they are estimates until the reporting data is finalized which will not happen until much later in 2018.
So what do the annual flu death rates look like? Based on the chart below, we can observe that the average U.S. flu and pneumonia death rate has gradually decreased between 1998 and 2014. Browsing from year to year in the map view, we can also see that Arkansas and West Virginia in particular have experienced consistently high death rates compared to the U.S. average.
Therefore, while the 2018 season had higher estimated rates of hospitalization and death than in recent years, the overall trends show that deaths are declining.
Overall, when reading about data in the news it’s important to examine the information the same way you might when collecting data to use in a class or other project. By acknowledging when data are estimates or preliminary, and seeking out additional information on overall trends, it will be easier to obtain a complete picture of what story the data is telling us.
 Centers for Disease Control and Prevention. (2018, February 9). CDC Update on Widespread Flu Activity. [press release]. Retrieved from https://www.cdc.gov/media/releases/2018/a0209-widespread-flu-activity.html
January 30, 2017
So many data sources, so little time...
It’s the beginning of a new year, which means that hundreds of government agencies and bureaus are releasing 2017 data updates for their numerous data sets. And by “hundreds” I mean so many we actually do not know how many federal agencies exist.
This is why many people’s first instinct is to visit the Census Bureau to gather statistics on all sorts of topics. It’s a centralized resource that provides data sets such as the American Community Survey which cover an array of demographic and socioeconomic topics. However, as you advance in your research or if your information needs require a more specialized focus, you may need to turn to one of those “hundreds” of federal agencies for more detailed statistics.
Sifting through all the federal agencies for a data set that meets your needs can feel a lot like looking for a needle in a haystack. Thankfully, I’ve spent enough time looking for new data and updating our current SAGE Stats data to identify the federal agencies that will help your more focused research get started on the right foot.
|Agriculture||U.S. Department of Agriculture||There are several options that range from food safety to agricultural trade.|
|Crime||Federal Bureau of Investigation (FBI)||The Uniform Crime Report is one of the first go-to resources for crime statistics.|
|Economy||Bureau of Economic Analysis (BEA)||You know all those GDP figures news outlets report? They get those from the U.S. Economic Accounts resource.|
|Education||The National Center for Education Statistics (NCES)||Multiple data tools are available here depending on your interest in academic levels.|
|Employment||Bureau of Labor Statistics||Employment data can be sliced several ways and the BLS provides more than several options.|
|Health||Centers for Disease Control||Oh, boy where to start. CDC Wonder is a great resource for researchers who want to customize their data download files. For more summarized statistics, check out the Data & Statistics page.|
|Populations||The Census Burea strikes again||Although the Bureau has population data up the wazoo, its Population resource focuses solely on population counts.|
|Transportation||Bureau of Transportation Statistics||The BTS simplifies your research by providing information via multiple reports and tools.|
These agencies provide ready-to-use data files, right?
Aw, bless you. I mentioned in previous posts that data cleaning is a necessary evil to get your data in shape for analysis and visualization – and federal agency data sets are no exception. The resources outlined in the section above provide data and statistics in all sorts of formats and sizes so your work remains to be completed. The clean-up work may be quick or extensive depending on the size of the file and the data’s complexity. For immediate results, check out our Advanced Search on SAGE Stats to find statistics from these same agencies in convenient Excel format!
December 5, 2017
Understanding Different Census Geography Types
I received an excellent question on my previous blog post about the American Community Survey (ACS): Does the ACS, or Census Bureau more generally, provide statistics by urban area? The answer to that is a big fat YES. There are actually several different geography types that are specifically used to analyze urban areas and their surroundings. For now, I’ll focus on Core Based Statistical Areas (CBSAs) (I also call them metro areas more generally) because they fit a broader definition of what an urban area is.
Core Based Statistical Areas (CBSAs) are comprised of at least one core area with a population of 10,000 or more and surrounding counties that exhibit a high degree of social and economic integration with the core area based on work commutes. CBSAs are great units of analysis if you are studying areas that are influenced by the economic and social activity of one or more cities or urban areas.
For instance, Washington, D.C. is a major employer hub for the surrounding counties in Maryland and Virginia – the proximity of the federal government makes these areas ideal for all kinds of companies, which require many employees, which then require more housing construction, which requires more public roads, which means more car buyers, which means more banking loans, and so on. Soon enough, it becomes difficult to distinguish where the domino effect of the city’s economic influence begins and ends.
Like counties, CBSAs have boundary definitions all of which are outlined by the Office of Management and Budget (OMB) and are updated approximately every decade. The last major definition update occurred in 2013, but the OMB is known to modify a CBSA in-between updates. Currently, there are two types of CBSAs that differ only in the population size of their core areas:
|CBSA type||Core area population requirement||Geographic building blocks|
|Metropolitan Statistical Area (MSA)||At least 50,000 people||Counties|
|Micropolitan Statistical Area||At least 10,000 but less than 50,000 people||Counties|
It’s important to note that a decade is a significant period of time and that CBSA boundary definitions are likely to change during this time. This can get tricky when analyzing one CBSA across more than 10 years because its boundary definitions are likely to have changed. You should keep this in mind when researching and analyzing CBSA statistics. Good data sources will provide the specific definition year for the CBSA’s boundaries used in a data set to avoid confusion.
How exactly do CBSA boundaries change over time?
CBSAs can gain or lose counties and sometimes new CBSAs are born and occasionally they are eliminated. Additionally, their names can change year-to-year based on the relative population of the largest cities. Again, be cautious when comparing CBSAs across any span of time greater than 10 years. For more information, visit the Census Bureau or visit our SAGE Stats Methodology page!
October 30, 2017
The American Community Survey: U.S. demographic characteristics at your fingertips
The Census Bureau is anyone’s go-to source when it comes to national and local U.S. statistics. It has anything and everything you can think of about the U.S. population. What U.S. county has the greatest number of people claiming Nepali ancestry? What is the average mortgage payment in my zip code? What is the average travel time to work in Wyoming?
You get the idea. The Census Bureau has a lot to offer, but there is one dataset in particular that is likely to provide answers to many of the different socioeconomic questions you have: the American Community Survey (ACS).
The ACS is an annual survey program that collects and provides key indicators about the American public. It covers a multitude of topics such as employment, housing costs, health insurance coverage, and so on. Think of it as the annual decennial census – only instead of collecting basic information like race and sex, the ACS collects more detailed characteristics like average rent paid, educational attainment, and much more. ACS statistics are released in batches beginning typically in the fall following the year of reference. These batches are divided into the 1-year, 1-year supplemental, and 5-year estimates. But what do these mean exactly?
|ACS Estimates||Definition||How to use it|
|1-year estimates||Data collected over a 12 month period, e.g. January 1, 2016-December 31, 2016.||Best used when analyzing areas with populations of 65,000 or more and when currency is more important than precision.|
|1-year supplemental estimates||Data collected over a 12 month period.||Best used when analyzing areas with populations of 20,000 or more and when smaller geographies are not available in the regular 1-year estimates release|
|5-year estimates||Data collected over a 60 month period, e.g. January 1, 2012-December 31, 2016.||Best used when you’re more concerned with precision than currency and when analyzing any size population. These are the best estimates to use when analyzing small population areas.|
For a complete breakdown, check out the Census Bureau.
So what's the best way to browse all the ACS statistics?
The best entry point to find the ACS statistics you want is American FactFinder, a warehouse of statistical information from surveys implemented by the Census Bureau (including the ACS). It provides multiple ways to get the information you need – from a simple location search to a mass download option. Be sure to select the American Community Survey as the specific Census program you would like to view. As I mentioned, the ACS collects information on a diverse number of topics compared to other Census surveys, which typically focus on one topic like the American Housing Survey. The ACS therefore gives you more bang for your buck! However, take note: ACS estimates are based on a smaller sample of the U.S. population than the traditional decennial census. Therefore its estimates carry a higher margin of error or are less accurate, but are timelier than the decennial census. For more information, visit the Census Bureau.
October 10, 2017
Tips on Data Viz
For most of us, the fun bit of working with data and statistics is the visualization aspect. Many of us are visual learners, or at least understand information best when it’s simplified into a picture. This is especially true when dealing with data and statistics, which hide the meaning and significance of the information it carries behind numbers and variables. No one looks at the data set below and immediately thinks, “Got it, the construction industry’s contribution to U.S. GDP is rising again after the Great Recession. Easy.”
Just looking at a data set is not enough (to the inexperienced eye) to identify patterns in the numbers – we need a representation to more quickly and easily communicate the meaning and significance to our readers. For instance, the construction data in chart form relays the same information, but in a much more understandable way.
Thanks to Excel, almost everyone who has used a computer in school or for business in the past 15 years can create a simple chart like a bar or line graph. (And if you haven’t, the online resources available are infinite.)
However, even if you are a seasoned Excel user or chart builder, below are some practices you should keep in mind as you work on your next visualization.
The breakdown: Top Tips on Creating Data Visualizations
- Ask yourself, “What do I want to show in my chart?” Do you want to show a comparison, trend over time, or relationship among data sets? All of these are great options, but one chart type will usually represent the information better than others. Here’s a quick guide compiled by Dr. A. Abela to help you narrow your options to the best choice.
- Minimize the number of variables in your chart to avoid confusion. What does the reader need to know to understand the significance of the information? Limit the chart to those items.
- Keep the focus on the data, not the visual. It’s tempting to go all out and create a complex visualization using graphic design, but sometimes simplicity is best because you avoid distorting the information or leaving it open to misinterpretation.
- Provide context. Don’t assume the reader will immediately understand the information the chart is trying to impart – provide a brief title or subtitle as needed. Be sure to indicate what is being measured (people? U.S dollars? Squirrels?) in the title or in any labels.
- Provide a detailed citation. “Census Bureau” is not going to cut it! Big sources like the Census Bureau release thousands of data sets on an annual basis – help the reader who wants to find and use the chart’s underlying data by providing the name of the report or data set and a direct URL.
For more information on how to create the best data visualization, check out Data Visualisation, A Handbook for Data Driven Design by Andy Kirk!
August 14, 2017
The Data Analysis Process
#3 – Summarizing your data
So – we have found the data and we have cleaned the data. Great! But, now what do we do with it? The third and final stage of the data analysis process really gets to what you needed to begin with – information and supporting evidence.
As I mentioned in my first post, raw data oftentimes does not make sense at face value or it at least does not provide enough context for a person to understand its significance. This requires the user to “summarize” that micro-information into straightforward intelligence. “Summarizing” data into statistics is much less about creating new information than it is translating and contextualizing the data into meaningful information for everyone.
In the midst of the 2016 campaign debates about climate change, my editor came across statistics on U.S. electrical generation by state and recommended it as a great addition to SAGE Stats. I agreed and found the original data on the Energy Information Administration’s (EIA) website. The EIA regularly releases data on electricity generation by source across the U.S; however, you’ll see below that electricity generation is measured in megawatt hours.
Not many people understand what a “megawatt hour” is – I certainly didn’t, I had to Google it! Is 1,000 megawatt hours a lot? Is it too little? How about one million?
Although I had no idea what a megawatt hour was, I understood it was measuring energy production across the U.S, which is valuable information for assessing which states are moving away from traditional energy sources like coal. But how could I translate megawatt hours into a statistic that everyone could understand? This required me to calculate statistics from the EIA data to neatly “summarize” the information it provided.
The breakdown: Summarizing data, an example.
In this scenario, I was specifically interested in electricity generation by source type and by U.S. state. The EIA provides this information as well as overall total electricity generation values. When facing raw values such as these, ask yourself, “What can I compare these values to in order to better understand their significance?” This is a great question to ask because your audience will understand information much more easily if it’s compared to other information. Here a number of statistics you can calculate to answer this question:
- Totals: summing values to get a big picture perspective is often handy.
- Percent of totals: excellent for comparing segmented data against overall totals.
- Amount change: a good option to compare how much values have changed.
- Percent change: a good way to compare the size by which values have changed.
- Averages: these include mean and median averages.
Based on the EIA’s data, I decided that comparing electricity generated by source type to the overall total electricity generated was much more meaningful. So I calculated percentages for each source type against the total number of megawatt hours. That way I could gauge how much of each state’s total electricity was generated by coal, wind, natural gas, and so on. The results were much easier to understand and particularly enlightening!
Once the data was mapped out, I saw that a large percentage of the Midwest’s electrical generation was due to wind energy – an interesting result considering that neighboring states have strongly adhered to coal and oil.
So I can just throw any raw data values together, right?
Yeah, that’s a big N-O. Use best judgement when you calculate statistics. Any of the statistics in the section above should be calculated with values of the same unit of measure. So don’t go adding dollar values and percentages together because that makes no sense. Likewise, be careful of any missing data values or incorrect data values that can throw your calculations off (although your analysis as reviewed in my first post of this series should help you become aware of those!). If in doubt, ask for help from a trusted resource such as your instructor, librarian, or colleague.
 Herzog, David. Data Literacy. Thousand Oaks: SAGE Publishing, 2016. Print. http://methods.sagepub.com/book/data-literacy
July 28, 2017
The Three Stages of Data Analysis
#2 – Cleaning your data
The term “data cleaning,” the second stage of the data analysis process, is usually met with some confusion. I mentioned to a friend that the most recent SAGE Stats data update required a lot of cleaning, which was taking up a significant amount of time. She asked, “So what exactly is data cleaning?” An excellent question!
Data cleaning or “scrubbing” consists of taking disorganized, messy data and transforming it into a format that enables easier analysis and visualizations. Depending on your formatting or metadata requirements and how big the data file is, it can take days to clean a file into submission.
Since I began working on SAGE Stats, I’ve learned many Excel tricks that can be applied to any kind of data cleaning situation. To avoid information overload, I’ll stick to the tricks I’ve successfully used in the past two years.
Top 10 Tips on Cleaning Your Data
- Read the data documentation. This will tell you what each component of the data file represents and help you identify what data is most relevant to your research interests and what data you can avoid.
- Excel’s “Text-to-Columns” feature. Especially large data files are often stored in “csv” or “comma separated value” formats and can be imported into Excel using this handy feature.
- VLOOKUP formula. My holy grail of Excel formulas. Do you want to pull multiple values from a workbook into another workbook? VLOOKUP has your back.
- COUNTIF formula. Are you looking for duplicate values in a range or checking whether values in one workbook are present in another workbook? COUNTIF counts the number of times a value occurs in a range!
- LEFT and RIGHT formulas. These are very useful when you need to parse out specific characters from the beginning or end of a value. For instance if “092017” represents September 2017, but I only need the year, then I can use the RIGHT formula to collect the last four digits.
- TRIM formula. Frustrated by inexplicable extra spaces that follow the value you want? This formula “trims” those out for you.
- CONCATENATE formula = “&”. Concatenate is a fancy word for linking two values together – you can use the formula for this or insert an ampersand between the two cell references, e.g. =A1&B1.
- I don’t think Excel’s filters get enough credit. Are you looking for multiple misspellings of New York? The filters help you quickly identify and correct them.
- Nest your formulas. Find ways to combine formulas to reduce the number of steps you have to complete! For instance, do you need to look up values in Workbook 1 that are associated to a value’s last five characters in Workbook 2? Nest the RIGHT and VLOOKUP formulas to quickly get your answer.
- Work off a copy of the original data file. You don’t want to be in a situation where you have mistakenly deleted data values and then have to download the data file again. Keep the original version handy as a backup.
This is a lot of work. Why do I need to clean the data file at all?
Sometimes a data set is so simple that it requires no cleaning at all; however, that’s not usually the case. These days you will typically encounter a file with all data merged into one column, which you then have to unmerge or parse out by yourself. Then you find that you need to concatenate some values back together. And then you realize that some values occur multiple times and you want to find out how many times each one occurs in the file. All this when you only want a snippet of that information! Data cleaning is a necessary evil at times in order to get your data in shape for easier visualizations and more accurate information.
The best way to learn these tricks (and even more advanced tricks) is to dive in head first and try them out with a specific data set. In Excel’s case, doing is better than reading or listening. After all, no one starts out as an expert, and I am no exception! My tips above are suggestions and may not work with your specific needs, but they can be applied in almost every kind of data situation. If you use them often enough, then they practically become muscle memory.
June 7, 2017
The Three Stages of Data Analysis
#1 – Evaluating raw data
A friend I haven’t seen in a while asked me what I do for a living, and I talked about SAGE Stats and the work that goes into maintaining and building the collection. Instead of his eyes glazing over (like most people’s would) he asked me, “Ok. Not to seem like an idiot, but what is data analysis? Like what does it cover?” If you’ve had similar thoughts, never fear! I think I can safely say I’ve received multiple variations of this question before. My typical answer: what doesn’t it cover?
Data analysis covers everything from reading the source methodology behind a data collection to creating a data visualization of the statistic you have extracted. All the steps in-between include deciphering variable descriptions, performing data quality checks, correcting spelling irregularities, reformatting the file layout to fit your needs, figuring out which statistic is best to describe the data, and figuring out the best formulas and methods to calculate the statistic you want. Phew. Still with me?
These steps and many others fall into three stages of the data analysis process: evaluate, clean, and summarize.
Let’s take some time with Stage 1: Evaluate. We’ll get into Stages 2 and 3 in upcoming posts. Ready? Here we go…
The breakdown: Evaluate
Evaluating a data file is kind of like an episode of House Hunters: you need to explore a data file for structural or other flaws that would be a deal breaker for you. How old is this house? Is the construction structurally sound? Is there a blue print that I can look at?
Similarly, when evaluating a raw data file you have collected, you should consider the following questions and tips:
- Read through the data dictionary, codebook, or record layout, which should detail what each field represents. Try not to immediately start playing with the data until you know what you’re looking at. You wouldn’t start renovation in your new house without reading the blue prints, right? You gotta know if that wall is load-bearing!
- What irregularities does the methodology documentation detail and how may it have affected the data? What are the methodology notes that I should make transparent to the reader?
- Is the raw data complete? That is, are there missing values for any records? (Missing values in the raw data can distort your calculations.)
- What outliers exist in the data set? Do they make sense in the context of the data? For instance, a house price of $1.8 million in a neighborhood where houses don’t exceed $200K is probably a red flag.
- Spot check the raw data. If the data set provides totals, then sum the values and check that they match. If they don’t, then does the documentation explain why they may not add up to the totals?
So if the source is good, then the data must be good too. Right?
It’s a mistake to assume the data is authoritative or fine as is just because it’s a published government source or another source you consider just as reliable. Data reporting is susceptible to manipulation and simple mistakes despite the best efforts and intentions of the responsible organizations. Assume nothing and evaluate the data to ensure it checks out! The next stage of data analysis is how to clean raw data to fit your needs. Stay tuned for my next post, where I will review the most effective Excel tips and tricks I’ve learned to help you in your own work!
May 1, 2017
Data and Statistics 101
The fundamental difference between data and statistics (because who knew!)
Before I started working on SAGE Stats, the idea of working with a large data set was quite intimidating. Shout out to the USDA’s Food Access Research Atlas! In the two years since, working regularly with our platform has really opened my eyes to how empowering and beautiful data is once you understand how to pull usable information from it.
My experience has also taught me how overwhelming and confusing data can be. What is a data set and how is it different than a time series? How can I tell if data content is reliable or not? What the heck is a data dictionary and why do I need it? Unless you are consistently elbows deep in data, it can be difficult knowing where to even start. So let’s begin with the very basics: what is the difference between data and statistics?
The two terms are often used interchangeably – even within the same breath. I have even caught myself using both terms in explaining SAGE Stats to team members and close friends without a second thought. Although it is easy to synonymize the two, they are in fact very different.
Data are collected and organized information typically provided in massive files with detailed records and a data dictionary to decode the variable information. The records in those data files do not communicate significant meaning to the naked eye, so time and analysis are needed to read through the data collection methodology, decipher variable information, and determine which variables are of interest to you.
Statistics are clear and understandable explanations or summaries of data based on analysis. Statistics are generally available in tables and represented graphically. For example, the median state unemployment rate in the U.S. was 4.0% in 2016. This is a statistic derived from analysis of sample data collected by the U.S. federal government.
So statistics are better than data, right?
Not necessarily. Whether you need data or statistics really depends on your research question. Data is needed when your research question addresses a new issue that hasn’t been explained or thoroughly explored yet – this requires a deep dive into data where you must analyze and derive meaningful knowledge that can answer your question.
A more straightforward research question, however, can be more quickly answered with statistics because the question has been asked before and so the analysis to answer that question has also already been done. For instance, a student who needs information on unemployment across the Rust Belt states can easily find an answer because that information is frequently processed by the federal government for its own assessment of the economic climate.
The difference between data and statistics lies in the analysis. Data needs to be analyzed to be understood, but a statistic can be understood right away. The next question is: how do I begin to analyze data to get the statistics I need? Stay tuned for my next blog post for tips on just that!