Tips with Diana

We love data! But, we also know that collecting, analyzing, and reporting with data can be daunting. The people we turn to when we have questions are Diana Aleman and Raphael Jackson, editors extraordinaire who bring their trials, tribulations, and expertise with data to you in Data Basics. Stay tuned for their experiences, tips, and tricks with finding, analyzing and visualizing data.

  February 27, 2020

Data in the News: Does greater education spending result in higher levels of student success?

The state of public high schools in America is constantly in and out of the news. Strikes and complaints over low teachers’ salaries, such as those threatened in Minneapolis, Minnesota, feel like a monthly occurrence. In these labor disputes, teachers often complain that their schools are under-resourced and that they are forced to buy school supplies out of pocket so that students can learn. Teachers complain that this lack of funding negatively impacts student learning because it hampers their ability to focus on teaching. This got me thinking: are the teachers right? Do states that spend more on public education see higher rates of student success? To find out, I turned to the data.

To begin exploring this issue, I selected three data sets that will be invaluable: total revenue for public schools per state, total number of students enrolled in public schools per state, and amount of money being spent per student in each state. States that have a higher number of students will most likely require a larger budget each year, which is why the third data set will give me a clearer idea of the level of spending on education in each state.

For this blog post I chose to examine three states: Maine, New York and Florida. Starting in 2013, Maine drastically reduced its spending per pupil and I thought it could be interesting to see if there has been a subsequent dip in student success in that state.

Compared to Maine, New York and Florida have larger student populations and larger public school budgets, ranking second and seventh respectively in 2017. However, New York spends vastly more per public school student and was ranked second in 2017, whilst Florida spends far less per public school student and was ranked 37th in the U.S. in 2017. By comparing student test scores in these two states we can make a judgement as to whether New York’s increased spending per student has paid off.

Now let’s compare each state’s level of investment with the average mathematics scores achieved by its public school students in the fourth and eighth grades. Please note that the fourth grade and eighth grade mathematics exams are scored out of 500.

In both graphs, we can see a clear decrease in student success in Maine after 2013, which coincides with the education cuts in the state. Of course, more research would have to be done to prove conclusively that the cuts in per pupil spending resulted in this decrease in average test scores, but it does provide some evidence to support teachers’ claims that greater education spending will benefit students’ education.

At the fourth grade level, New York’s higher levels of investment appear to be worthless as Floridian fourth graders far outperform their New York peers. However, the position of the two states reverses among eighth grade mathematics test scores. New York eighth graders’ average math scores are far higher, suggesting that New York’s increased education spending is better able to bridge educational gaps, resulting in higher average scores later in a student’s career. Interestingly, despite the fact that in 2017 Maine was spending as much per pupil as Florida, the state’s average eighth grade math score was above even New York’s, possibly suggesting a link between student success and the number of students in a school system. Florida and New York have roughly 2.8 million students in the public school systems whilst Maine has fewer than 500,000.

  February 2, 2020

Data in the News: Investigating Sexually Transmitted Diseases and Dating Apps

In the past couple of weeks, I’ve been reading a number of newspaper articles that claim there is an increase in the number of new cases of sexually transmitted diseases (STDs) and suggest that the reason for this increase is the rise of dating apps such as Tinder. In a Daily Mail article, the newspaper claims that syphilis cases are at their highest level in Britain ever and that someone is diagnosed with a new STD every 70 seconds. In 2017, Vox also published an article saying that sexually transmitted diseases were on the rise throughout the U.S. and again blamed dating apps for this increase, especially Tinder and Grindr.

To examine this claim, I looked at the number of sexually transmitted diseases diagnosed per state and the sexually transmitted disease rate per state. These two datasets will show whether the U.S. has experienced growth in the number of STD cases and whether there has been a sudden jump in that number, so they are useful in validating or disproving the claims of the two newspaper articles. Additionally, I analyzed the number of cases of chlamydia reported by year to examine whether growth in the number of overall STI cases is also reflected at the level of an individual disease.

The first question to ask ourselves is when were dating apps invented and when did they become popular? Tinder was founded in 2012 but didn’t become popular until the end of 2014 when the app had been downloaded more than 40 million times and users were swiping a billion times a day. If the Daily Mail and Vox are correct we should then see a spike in STDs in 2015 or 2016.

The graphs both strongly support the claims made in the newspaper articles. In 2014-15 you see both trends increase sharply. In fact, between 2012 and 2013 the rate of STDs decreased from 569.2 to 554.3 per 100,000 people, only to skyrocket in 2014 to 573. Furthermore, this growth in sexually transmitted diseases has been sustained since 2014, validating Vox’s claim that the U.S. has seen a sustained increase in the number of STDs.

The trend of a small increase or even a decrease between 2011 and 2013 and then explosive growth from 2014 onwards can also be seen at the more granular level of total number of chlamydia cases per year. Again, this supports the claims of Vox and the Daily Mail that STIs are on the rise.

However, these graphs do not definitively prove that dating apps are responsible for the uptick in the number of overall STDs or chlamydia diagnoses. Yes, the graphs show that in 2014-15 there was an increase in the number of sexually transmitted diseases but this is not necessarily the fault of dating apps. The fact that dating apps also became popular in 2014-15 could be a coincidence. There are multiple other factors that could be influencing this uptick that we would need to consider in a more in-depth analysis, including: changing attitudes towards sex and relationships throughout the late 2010s, differences in sex education in the public school system across the U.S., whether new strains of STDs have been introduced and spread, and so on.

  November 18, 2019

Understanding Inferential Statistics and Descriptive Statistics

We talk a lot about statistics here and it occurred to me that we haven’t explored the types of statistics that exist. Fun stuff, right? Not exactly and I absolutely recognize that, but understanding fundamental concepts like this can give you the ability to read and talk about statistics with confidence. This confidence is particularly imperative to build during this age of data. So let’s talk basic statistical concepts. I could go any number of directions with that, but if we’re going to start anywhere we should start with explaining descriptive and inferential statistics and the differences between the two.

The breakdown

The ins-and-outs of descriptive and inferential statistics could easily take up an entire page, but I don’t plan to go into that much detail other than providing what you need to know and explaining it in layman’s terms.

Descriptive statistics vs. inferential statistics

What is it?
  Descriptive statistics: Statistics that summarize or describe simple, but key characteristics or variables observed in a population.
  Inferential statistics: Statistics based on a sample of an observed population from which inferences can be made about that population.

How is it useful?
  Descriptive statistics: It simplifies raw, observed data points into understandable and meaningful information about a population’s characteristics.
  Inferential statistics: It allows us to hypothesize and generalize about a population’s characteristics.

How is it different than the other?
  Descriptive statistics: A descriptive statistic does not state anything beyond the observed data points of a specific characteristic.
  Inferential statistics: An inferential statistic takes an extra step beyond a descriptive statistic and reasons an assumption based on the data and compared to other data.

Here’s an example.
  Descriptive statistics: The median number of homeless persons in shelters across the U.S. was 1,968 in 2018.
  Inferential statistics: The median number of sheltered homeless persons across U.S. states was 2,317 in 2007 and 1,968 in 2018. We can therefore infer that sheltered homeless populations dropped by 15% on average. However, we cannot reason or explain why it may be falling without first bringing in other data as well.

Source: Jupp, V. (2006). The SAGE dictionary of social research methods. London: SAGE Publications, Ltd. doi: 10.4135/9780857020116
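If you want to check the 15% figure in the inferential example yourself, here is a minimal sketch in Python. The two medians come straight from the example above; the function name `percent_change` is just one I made up for illustration.

```python
def percent_change(old, new):
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100


median_2007 = 2317  # median sheltered homeless persons across U.S. states, 2007
median_2018 = 1968  # the same measure in 2018

change = percent_change(median_2007, median_2018)
print(f"Change from 2007 to 2018: {change:.1f}%")
```

A negative result means a drop, so this lines up with the roughly 15% decrease quoted in the example.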

OK, but when will I actually come across descriptive or inferential statistics in the real world?

You always come across these in the real world! The statistics and data you see cited in newspaper and television news headlines, journal articles, commercial advertisements, and so on can almost always be categorized as either descriptive or inferential. Does it make a big difference if the creator or author doesn’t tell you whether they are using descriptive or inferential statistics? No, not really; however, it’s important that you recognize when someone is providing the raw, descriptive statistics or extrapolated, inferred statistics based on his or her reasoning. This will reinforce your critical reading of statistics and therefore your confidence in understanding statistics!

  November 8, 2019

Data in the News: Estimating the time cost of Public transportation

by Raphael Jackson, Assistant Editor

October 2019 saw the start of huge civil unrest in Chile sparked by a metro fare increase in the nation’s capital Santiago. These protests got me thinking about the importance and cost of public transportation in the United States and for this month’s blog post I wanted to examine this topic through the data.

I started by asking myself how important is public transportation to Americans? A good way to answer that question is to ask what would happen if we didn’t have it? The Annual Delay Increase per Auto Commuter if Public Transportation Were Discontinued (Metro) estimates how much time would be added to each commuter if public transportation was removed, i.e. how much extra time it would take for people to get to work if everybody drove or walked.

From 2007 to 2011 (the last available years of data) the median increase across the 84 metro areas was 2 hours – this means that if all public transportation were removed in these areas, people would spend an extra two hours commuting into work every day! However, this only describes a single commuter; it doesn’t tell us how much time could potentially be lost in total if public transportation were discontinued.

To delve deeper into this issue, I combined this data with population data by metro area and multiplied the increase in commute time by the population total for each metro area. I then totaled the amounts for each year to produce the graph below. As you can see, Americans in these 84 metro areas would have spent between 1.9 and 2.15 billion extra hours commuting per day if public transportation did not exist.
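For anyone who wants to repeat the multiply-then-total step, here is a rough sketch in Python. The rows below are made-up numbers for two imaginary metro areas ("A" and "B"), not figures from the datasets I used, and `total_hours_by_year` is simply a name I picked for the example.

```python
from collections import defaultdict

# Hypothetical rows: (year, metro area, extra delay in hours per commuter, population).
rows = [
    (2010, "A", 2.0, 1_000_000),
    (2010, "B", 3.0, 500_000),
    (2011, "A", 2.5, 1_010_000),
    (2011, "B", 2.0, 505_000),
]


def total_hours_by_year(rows):
    """Multiply delay by population for each metro area, then sum per year."""
    totals = defaultdict(float)
    for year, _metro, delay_hours, population in rows:
        totals[year] += delay_hours * population
    return dict(totals)


totals = total_hours_by_year(rows)
for year in sorted(totals):
    print(year, f"{totals[year]:,.0f} hours")
```

The same loop works for any number of metro areas, as long as each row pairs a delay figure with the population for the same area and year.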

Now, there are some problems with these statistics. Firstly, the population data by metro area provides the total population of that area – it doesn’t reflect the number of people who actually commute, such as working men and women or children travelling to school. To make this graph more accurate, we would need to try and find population data on the number of people commuting per day.

Additionally, in 2010 and 2011 three metro areas in the Annual Delay Increase per Auto Commuter if Public Transportation Were Discontinued (Metro) dataset did not match exactly with the population data by metro area: Honolulu, HI (26180), Los Angeles-Long Beach-Santa Ana, CA (31100) and Poughkeepsie-Newburgh-Middletown, NY (39100). When working with different data sets there is always a likelihood that the geographical regions will be different, and this is particularly true of metro areas, which change geographic boundaries every decade.

For this project, I found that I had population data for the metro area Poughkeepsie-Newburgh-Middletown, NY (39100) only up to 2009, so I used that figure when calculating the amount of time that would potentially have been lost if public transportation did not exist. Most likely the population of this metro area would have increased between 2009 and 2011, but by how much? To be conservative, I decided to keep the 2009 figure and avoid the risk of making an incorrect guess at the population of the metro area in 2010 and 2011. Furthermore, in two years the population is unlikely to have increased drastically, so using the 2009 population data most likely will not impact the total.

Finally, Honolulu, HI (26180) and Los Angeles-Long Beach-Santa Ana, CA (31100) did not appear in the population data by metro area dataset; however, I was able to find two other metro regions that roughly covered the same areas. These were Urban Honolulu, HI (46520) and Los Angeles-Long Beach-Anaheim, CA (31080), so I used their population data to calculate the increase in commuting time. Please note that you should always explain how you’ve dealt with problems in the data so that others can make a judgement for themselves as to the reliability of your calculations! The skill of the data researcher lies in explaining your data methodology and its limitations, and how you dealt with these problems in order to best represent the data.
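One way to keep track of workarounds like these is to write them down in the calculation itself. The sketch below is only an illustration: the metro codes are the ones mentioned above, but the population counts are invented and the names `SUBSTITUTE_CODES` and `population_for` are my own.

```python
# Substitute metro codes from the post: delay-dataset code -> population-dataset code.
SUBSTITUTE_CODES = {
    "26180": "46520",  # Honolulu, HI -> Urban Honolulu, HI
    "31100": "31080",  # Los Angeles-Long Beach-Santa Ana, CA -> Los Angeles-Long Beach-Anaheim, CA
}


def population_for(populations, metro_code, year):
    """Return (population, notes) for a metro area and year.

    populations maps (metro_code, year) -> population. If the exact year is
    missing, the most recent earlier year is carried forward. Every workaround
    is recorded in notes so it can be reported alongside the result.
    """
    notes = []
    code = SUBSTITUTE_CODES.get(metro_code, metro_code)
    if code != metro_code:
        notes.append(f"{metro_code}: used substitute area {code}")
    earlier_years = [y for (c, y) in populations if c == code and y <= year]
    if not earlier_years:
        raise KeyError(f"no population data for {code} up to {year}")
    latest = max(earlier_years)
    if latest != year:
        notes.append(f"{metro_code}: carried {latest} population forward to {year}")
    return populations[(code, latest)], notes


# Made-up population counts, for illustration only.
populations = {("39100", 2009): 670_000, ("46520", 2011): 960_000}

value, notes = population_for(populations, "39100", 2011)
print(value, notes)
value2, notes2 = population_for(populations, "26180", 2011)
print(value2, notes2)
```

Returning the notes together with the number means the caveats travel with the result, which makes them harder to forget when you write up your methodology.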

  September 27, 2019

Data in the News: The Race to Renewable Energy

In honor of Climate Month and Greta Thunberg’s emotional speech to the UN General Assembly, I thought that in this month’s blog post we could investigate America’s dependence on fossil fuels and how much work needs to be done to power the U.S. solely through renewable energy. Note that this blog post is predicated on the assumption that climate change is real. For a further explanation on the data on climate change see my earlier blog post.

In February 2019, Rep. Alexandria Ocasio-Cortez (D-NY) proposed the Green New Deal, a plan that would address climate change and economic inequality. The core of the plan is simple: cut U.S. greenhouse gas emissions (carbon dioxide and methane) to net zero in the next ten years and move to entirely clean and renewable energy by 2030. This is an incredibly ambitious target, and to examine how ambitious we need to look at the data.

The first step is to examine the current energy needs of the U.S. Fortunately the U.S. Energy Information Administration (EIA) publishes an annual data set on U.S. energy consumption, which we can map and analyze on SAGE Stats and that I charted here as well.

Source: Energy Information Administration (Department of Energy). (2019). EIA Renewable Energy Trends: Electricity generated through renewable sources (state) [dataset]. Washington, DC: SAGE Stats by SAGE Publishing. Available from

As you can see from the graph U.S. energy consumption has increased since the 1990s, but for the past decade has remained reasonably constant at 100,000 trillion British Thermal Units (Btu). But what proportion of that consumed energy is currently generated by renewable energy? In other words, how far does the U.S. have to go to reach 100% renewable energy by 2030?

To answer this question we turn to EIA data that details the amount of electricity generated through renewable sources. I downloaded the data set and calculated totals for every year in order to create this graph charting the amount of renewable energy produced by the U.S. from 2003 to 2017. As you can see, renewable energy generation has almost doubled, which indicates the U.S. has made significant strides just in the past two decades.

Source: Energy Information Administration (Department of Energy). (2019). EIA Renewable Energy Trends: Electricity generated through renewable sources (state) [dataset]. Washington, DC: SAGE Stats by SAGE Publishing. Available from

However, the more eagle-eyed among you will have noticed that the EIA has produced data on renewable energy generation in kilowatt-hours (kWh) whilst data on energy consumption of the U.S. is in British thermal units (Btu). We can’t directly compare data sets that are using different measures. Therefore, I needed to convert kilowatt-hours into British thermal units. One kilowatt-hour is equivalent to 3,412.141633 British thermal units. But if you don’t feel like doing the math you can always use an online calculator!
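If you would rather do the conversion in code than with an online calculator, a minimal sketch looks like this. The conversion factor is the one quoted above; the function names are just ones I chose for the example.

```python
BTU_PER_KWH = 3412.141633  # conversion factor quoted in the post


def kwh_to_btu(kwh):
    """Convert kilowatt-hours to British thermal units."""
    return kwh * BTU_PER_KWH


def btu_to_kwh(btu):
    """Convert British thermal units back to kilowatt-hours."""
    return btu / BTU_PER_KWH


print(f"{kwh_to_btu(1_000):,.3f} Btu in 1,000 kWh")
```

Converting one data set into the other's unit before comparing them is the key step; which direction you convert in doesn't matter as long as both end up in the same unit.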

Source: Energy Information Administration (Department of Energy). (2019). EIA State Energy Data: Energy consumption (state) [dataset]. Washington, DC: SAGE Stats by SAGE Publishing. Available from

Thus, according to my calculations based on the EIA data I’ve collected here, in 2016 the U.S. produced 2,079,512,908,842,100 Btu through renewable energy generation out of a total consumption of 97,314,700,000,000,000 Btu. That is just 2.1 percent of America’s energy needs, which suggests that the Green New Deal will be a vast undertaking. But here’s where things get a little more positive! According to the Percent of Electricity Generated through Renewables Sources dataset, the U.S. generated 14.9% of its energy need through renewable sources in 2016 and this number increased to 16.3% in 2017.
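Here is the 2.1 percent calculation written out as a short Python sketch, using the two Btu totals quoted above. The variable names are my own.

```python
renewable_btu = 2_079_512_908_842_100  # 2016 renewable generation in Btu (from the post)
total_btu = 97_314_700_000_000_000     # 2016 total energy consumption in Btu (from the post)

# Share of total consumption covered by renewable generation, as a percentage.
share = renewable_btu / total_btu * 100
print(f"Renewable share of consumption in 2016: {share:.1f}%")
```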

So why the discrepancy? One possible explanation is in the differing definitions of “renewable energy” and what energy sources are included in this term for the data sets I’ve used. For instance, nuclear energy, whilst not emitting greenhouse gases, does produce different forms of pollution. Additionally, much of the fissionable material for generating electricity comes from uranium, a non-renewable metal that needs to be extracted through mines.

  September 11, 2019

What are percentages and ratios and how are they different?

One of the more familiar types of statistics we have all used at some point or another is the percentage. At the very least, we have all seen percentages used in journal or news articles as evidence to support an argument. For instance, this PBS article on the fires in the Amazon rainforest in South America lists a few facts like how 6 percent of the planet’s oxygen comes from the rainforest.

This percentage makes sense at face value. But what about when you bring in other statistics that are similar to percentages, like ratios? For instance, another statistic cited in the same PBS article is that a ratio of 1 in 10 known species live in the Amazon. This also makes sense and is equally informative. But what if I asked you to explain the difference between percentages and ratios?

The breakdown

Let’s start with why these two statistics are helpful. Percentages and ratios help you understand the relationship of a slice to the whole. That is, they are proportional measures. They help you answer questions like, “Looking at the big picture, how important is this piece of the picture?” or “How does this amount compare to these other amounts?” Percentages and ratios summarize how one number relates to another, which helps us quickly understand the significance of the two numbers and the relationship between them.

So what is the difference between percentages and ratios and how should they be used? According to Data Literacy by David Herzog, the definitions are quite straightforward:

Percentage vs. ratio

How is it different than the other?
  Percentage: Compares a portion of a total to the total.
  Ratio: Compares the difference between numbers from different groups.

How should it be used?
  Percentage: Use when you want to assess how significant a portion or amount is to an established total.
  Ratio: Use when you want to understand and compare the relationship between two groups.

Here’s an example.
  Percentage: Women make up 25% of the U.S. Senate in 2019.
  Ratio: For every 3 male U.S. Senators, there is 1 female U.S. Senator in 2019.

You’re measuring the number of women in the U.S. Senate in the percentage and ratio examples. Wouldn’t you only use either a percentage or ratio to measure and compare the number of women in the Senate?

The percentage and ratio examples are comparing the number of women to two different groups: the percentage example is comparing the number of women to the total Senate population and the ratio example is comparing the number of women to the number of men in the Senate. In the percentage example we are highlighting the fact that women comprise a not insignificant chunk of the Senate, but in the ratio example we are highlighting the fact that male Senators significantly outnumber female Senators. As you can tell, the angle of the story is different between the percentage and ratio examples!
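To see how the same head count produces both statistics, here is a small Python sketch. It assumes 25 women in a 100-seat Senate, which is what the 25% figure implies; the variable names are mine.

```python
from fractions import Fraction

total_seats = 100            # seats in the U.S. Senate
women = 25                   # assumed from the 25% figure in the example
men = total_seats - women    # everyone who is not counted among the women

# Percentage: a portion compared with the total.
pct_women = women / total_seats * 100

# Ratio: one group compared with another group, reduced to lowest terms.
men_to_women = Fraction(men, women)

print(f"Women hold {pct_women:.0f}% of seats")
print(f"{men_to_women.numerator} men for every {men_to_women.denominator} woman")
```

Notice that the percentage divides by the total while the ratio divides one group by the other, which is why 25% of the whole works out to one woman for every three men rather than one for every four.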

  August 28, 2019

Data in the news: Inequality between the sexes

Despite the passage of the Civil Rights Act in the U.S. in 1964 that outlawed discrimination based on race, color, religion, sex or national origin, barely a day goes by without complaints that either men or women are being unfairly discriminated against. In fact, the debate around whether women are disadvantaged economically and socially has become one of the key debates in America’s culture wars these days.

In this blog post I want to show you how I used data to investigate this topic and different ways you can present and transform any data you collect to answer your research topic. I chose to examine male and female employment data in the finance industry because it’s an industry long stereotyped as a male-dominated industry with elements of “bro” culture.

To find data on male and female employment within the finance industry I headed to the Bureau of Labor Statistics (BLS) and the reports from the Current Population Survey (CPS). The CPS has published an annual report on the number of women working in the Finance and Insurance industry, e.g. this is the 2016 report. For the sake of simplicity, I will refer to Finance and Insurance as just “finance”. Now, the report doesn’t state outright how many men and women work in finance. It provides the percentage of women employed and the total number of people employed in finance in thousands. In 2016, 55.1% of 7,241 thousand finance employees were women. There are lots of statistics we can pull from this information. For instance, 7,241,000 multiplied by 0.551 gives you 3,989,791, the number of women working in finance. And if you know that 55.1% of people working in finance are women then you know that 44.9% of people working in finance in 2016 were men, and you can work out the exact number of men. I did this for every year from 1995 to 2018, recording my results in an Excel file to produce this graph.
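The same arithmetic can be done in a few lines of Python instead of a spreadsheet. This is only a sketch using the 2016 figures quoted above; `split_by_sex` is a name I made up for the example.

```python
def split_by_sex(total_thousands, pct_women):
    """Turn a total (in thousands) and a percentage of women into head counts."""
    total = total_thousands * 1000
    women = round(total * pct_women / 100)
    men = total - women  # the remainder of the total, i.e. the other 44.9% in 2016
    return women, men


women_2016, men_2016 = split_by_sex(7241, 55.1)
print(f"Women: {women_2016:,}  Men: {men_2016:,}")
```

To cover every year you would call the function once per annual report and collect the results, which is what I did by hand in Excel.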

Yes, it was time-consuming, but note that it didn’t require advanced math or even any special resources. All of this data was readily available for free on the BLS site.

The graph yielded some results that were surprising. Far from confirming that the finance and insurance industry is male-dominated, data from the BLS shows that the number of women working in the industry has been higher than the number of men, by roughly a million, since 1995. This graph, however, only charts the number of men and women working in the “finance and insurance” industry, which follows a specific definition and includes all types of financial and insurance occupations. Additionally, further research would need to be done on the average pay of men and women and the number of men and women in leadership positions to get a greater insight into whether men and women are treated equally in finance.

By complete accident the graph also reflected the state of the finance industry from 1995 to 2018. Remember that when I set about collecting this data, I wanted to investigate gender discrimination in the finance industry. In 2008 you see a steep decline in the number of men and women employed in the finance industry as the global financial crisis hits and huge numbers of people working in finance lose their jobs. Additionally, you see a small decrease in the number of men working in finance in 2000-2004 due to the dot-com bubble, but interestingly the number of female employees actually increased during that time.

To better communicate this to any viewers of this graph, I created a gif so that readers can better understand how historical events are reflected in the data.

  June 27, 2019

Data in the news: The Opioid Crisis

The opioid crisis has dominated the U.S. news cycle for the past decade, and in 2017 there were c. 50,000 opioid related deaths with no sign that the number of deaths will begin to decrease any time soon. If you have read any news on this issue you will no doubt have been inundated with graphs and stats on the victims of the opioid epidemic and the profits of pharmaceutical companies. For this blog post I decided to create an animated gif charting the number of opioid related deaths provided by the CDC from 1999 to 2017 to try and use data to tell the story of the opioid crisis in America.

The opioid crisis lends itself particularly well to an animated data visualization because of the shocking increase in the number of opioid deaths, from c. 8,000 in 1999 to c. 50,000 in 2017. Additionally, the opioid crisis has multiple causes, which can be seen in the data. The crisis begins with the over-prescription of opioid pain killers such as oxycodone (OxyContin is a brand name of the drug) and hydrocodone, which reaped huge profits for pharmaceutical companies. Then the proliferation of illegal markets for synthetic opioids further increased the number of overdoses. From 1999 to 2013 the number of opioid related deaths increased from c. 8,000 to c. 26,000. In 2013, the production of synthetic opioids became popular due to the crackdown on legal opioids, and the number of opioid related deaths nearly doubled from c. 26,000 in 2013 to c. 50,000 in 2017, accomplishing in four years what had previously taken 14.

Of course you could put this information in a graph and it would tell the same story. But the gif format lends itself to highlighting how different events are impacting the data.

  June 27, 2019

Data in the news: Organ donations and Data Visualizations

This week saw a number of articles published online about the need for organ donors, with stories of tragedy and of miracle cures thanks to the generosity of strangers. These news articles stress that the need for organ donations is urgent, and according to the Organ Procurement and Transplantation Network 124,472 people are waiting for organ donors. While this number is shocking, it provides little information on the history of this problem and its trends. Is the number of people waiting for organs going up or down? To answer that question we need to look at the data.

For this blog post I decided to create an animated gif charting the number of people waiting for a kidney transplant within the United States to better show you how the trend in the number of people waiting for kidney transplants in the U.S. has developed since the early 1990s.

Of course a line graph could convey the same information (see below); however, an animated gif lets viewers better appreciate the gradual but huge increase in demand for kidney donations. For instance, in 1990 the number of people waiting for kidney transplants was 14,349 and by 2015 that number had risen to 99,985, an increase of roughly 597 percent! In contrast, the U.S. population increased 27 percent between 1990 and 2015. When dealing with statistics on health it is important to check the population increase as well, since we would expect the waitlist for kidney transplants to grow as the number of people in the United States grows. However, the waitlist has grown far faster than the population, which suggests that larger forces may be driving the need for kidney transplants and that these merit further research. It would also be worth researching how other countries are attempting to reduce the wait time for organ transplants. The U.K., for example, is implementing an opt-out system in 2020 where all citizens are considered donors unless they explicitly choose to opt out.
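To check the waitlist growth against the population growth yourself, here is a short Python sketch. The waitlist counts and the 27 percent population figure are the ones quoted above; the variable names are mine.

```python
waitlist_1990 = 14_349   # people waiting for a kidney transplant in 1990
waitlist_2015 = 99_985   # people waiting for a kidney transplant in 2015
population_growth = 27   # percent growth in U.S. population, 1990 to 2015 (from the post)

# Percentage increase in the waitlist over the same period.
waitlist_growth = (waitlist_2015 - waitlist_1990) / waitlist_1990 * 100

# How many times faster the waitlist grew than the population.
times_faster = waitlist_growth / population_growth

print(f"Waitlist growth: {waitlist_growth:.1f}%")
print(f"That is about {times_faster:.0f} times the population growth rate")
```

Putting both growth rates side by side like this is a quick way to tell whether an increase is mostly explained by there simply being more people.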

  June 20, 2019

About Data Visualizations: Bar graphs

What is a bar graph?

Simple but powerful, bar graphs are one of the most common charts used to compare categorical data, which are data that can be grouped into categories like race and sex. Bar graphs are also unique in design because they can be displayed horizontally or vertically. Bar graphs are helpful for comparing changes that happen over time, such as years, or comparing differences by category.

Centers for Disease Control and Prevention (Department of Health and Human Services). (2019). CDC Multiple Cause of Death: African american drug overdose death rate (county) [dataset]. Washington, DC: SAGE Stats by SAGE Publishing. Available from

The above graph tracks the death rate of African Americans per 100,000 persons in Washington, D.C. due to drug overdose. The increasing length of the vertical bars clearly shows a rise in deaths between 2014 and 2017. In this way, bar graphs make it easy to process data and ask questions about it. For instance, this graph guides the reader to question why drug overdose deaths have steadily increased since 2014. This is something the bar graph can’t tell us, but we wouldn’t have come to this question without the bar graph revealing that trend to us.

Tips on creating a bar graph

Excel makes it easy to create a bar graph from scratch with its bar graph feature. However, the guidelines below will make it easy to create a bar graph no matter your platform!

  1. Have data prepared for your X and Y axes. If you’re using continuous data, which are data values within a certain range, it is visually better to put them on the X axis.
  2. Add labels and scales for each axis.
  3. Add rectangular bars to represent your data on the graph.
  4. Title your graph and make sure it quickly and concisely explains the bar graph.
  5. Include a legend if necessary.
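To show that these steps really are platform-independent, here is a rough sketch that follows them using nothing but plain Python text output. It is not how Excel or any charting library works; the function `text_bar_graph` and the values in `rates` are invented purely for illustration.

```python
def text_bar_graph(title, data, unit, width=40):
    """Render a horizontal bar graph as text.

    data maps each category label to its value. Bars are scaled so that the
    largest value is exactly `width` characters long.
    """
    largest = max(data.values())
    label_width = max(len(str(label)) for label in data)
    lines = [title, ""]  # step 4: a title that explains the graph
    for label, value in data.items():
        bar = "#" * round(value / largest * width)  # step 3: one bar per category
        # step 2: label each bar and show its value with the unit
        lines.append(f"{str(label):>{label_width}} | {bar} {value} {unit}")
    return "\n".join(lines)


# Step 1: prepare the data. These values are made up, not real statistics.
rates = {2014: 10, 2015: 20, 2016: 30, 2017: 40}

chart = text_bar_graph("Example rate by year (made-up values)", rates, "per 100,000")
print(chart)
```

With a single data series there is nothing to distinguish, so step 5 (the legend) is skipped here.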

  May 29, 2019

Data in the News: Department of Veterans Affairs

For the past couple of months the VA, or the U.S. Department of Veterans Affairs, has been in and out of the news for various bureaucratic failures such as delays to GI Bill payments and malpractice at Veterans Affairs hospitals. However, this week Robert Wilkie, the secretary of the Department of Veterans Affairs, published an op-ed claiming that the Department of Veterans Affairs is changing for the better.

One way to examine Robert Wilkie’s claim is to look at the data that the VA publishes and try to find ways to measure the efficacy of the Department of Veterans Affairs. For this blog post I looked at the total expenditure of the VA and number of veterans in Michigan as well as Alcona County within Michigan.

From the graph we see a clear decrease in the number of veterans residing in Michigan with a small increase in 2002 and 2003 due to the wars in Iraq and Afghanistan. However, when we examine the total amount of money spent by the VA, we see the opposite trend.

The amount of money spent by the VA has increased dramatically over the past twenty years despite the fact that the number of veterans in Michigan has decreased from around 950,000 to 570,000. As the number of veterans decreases, the natural supposition would be that VA spending would decrease as there were fewer people requiring education and health benefits. The increased spending could be explained by greater health costs as veterans age or by inflation - remember that as time passes the value of currency tends to decrease, hence more money has to be spent to acquire the same goods and services. However, the fourfold increase in VA spending could suggest inefficiencies within the department that warrant further investigation.

These two graphs present another visual challenge since the two data sets have large differences in scales for the Y axis. The first graph is plotting the number of veterans in Michigan by the hundreds of thousands while the second graph plots the VA’s expenditure in the hundreds of millions of dollars. In this instance you could plot the two line graphs on the same graph with different axis units on each side.

Finally, a further way to examine the spending of the VA is to calculate the amount of money the department is spending on each veteran. In order to do this you divide the total expenditure of the VA in Michigan by the number of veterans in the state. This calculation allows greater insight into the VA’s spending: from the graph we can see that spending per veteran has increased sevenfold, from $1,000 in 1996 to $7,000 in 2018.
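The division described above is simple enough to sketch in Python. The function name is my own, and the expenditure total is a placeholder chosen so that the result lands on the roughly $7,000 figure; it is not taken from the VA data. Only the 570,000 veteran count comes from this post.

```python
def spending_per_veteran(total_expenditure, veteran_count):
    """Divide total spending by the number of veterans it covers."""
    if veteran_count <= 0:
        raise ValueError("veteran_count must be positive")
    return total_expenditure / veteran_count


# Placeholder total (an assumption for illustration, not a VA figure).
example_total = 3_990_000_000
# Approximate number of Michigan veterans mentioned in this post.
example_veterans = 570_000

per_veteran = spending_per_veteran(example_total, example_veterans)
print(f"${per_veteran:,.0f} per veteran")
```

Running the same function on a county's total and veteran count is how you would check whether spending per veteran is even across the state.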

To examine the VA’s spending further it may help to look at a specific county in Michigan to see if the trend continues at the smallest granularity the data offers and to see if the VA is spending money evenly across the state of Michigan. For the purposes of this blog post I picked Alcona County, Michigan at random.

Both graphs display the same trend as the state-level graph: a decrease in the number of veterans, apart from in 2002-2003 due to the wars in Afghanistan and Iraq, but an increase in the VA’s spending. Of particular note is that the amount of money spent per veteran ($7,000) is the same at the state level and at the Alcona County level, suggesting that the VA’s expenditure is even across the state of Michigan.

  May 22, 2019

How your college library helps YOU.

I think I speak for most people when I say that when I think of the word “library”, a few things come to mind: books, silence, and studying. A few words that don’t pop up include support or databases. According to Amanda Izenstark, librarian from the University of Rhode Island, these misconceptions are not too far from how most students think of their university library: “Students view the library as a quiet and stodgy place and they don’t realize it’s a place for studying AND for finding information and research support!”

I have to admit that until I began working for SAGE, I hadn’t thought of the library as anything beyond the ideal physical place to work quietly and productively or a place to find an excellent book. I also didn’t give any thought to the librarians and what they dedicated themselves to – I always assumed it was to physically stamp or scan my books out! Over the years I have realized that I was completely wrong. There are many resources and services your library provides that support your college career without you even knowing it!

The breakdown

What’s usually the first thing you do when presented with a topic you recognize, but don’t know much about? Let’s be honest: Google, Wikipedia, maybe that suspicious looking web page that was listed on Google’s fourth search result page. Turns out there are a few more ways your library tries to help you access quality information in the most efficient way possible. For now, I will focus on key library resources and services that Amanda identified as ones students would greatly benefit from if they knew about them.

  • Consult your librarian.

    Librarians know a lot and can personally advise on what resources (other than books!) can help answer your research question and how to go about accessing those resources. Wondering what details you should bring to a librarian consult? Amanda has an answer:
    “It helps if the student can answer these two questions: What class is the assignment for? And what exactly is the assignment? Together, this helps me (the librarian) identify the most appropriate resource for the student’s level and understand what the professor is asking the student to do.”
  • There’s a subject database for that.

    “Knowing that the library has subject-specific databases would save students a lot of time in their research.” Interested in Latin American public opinion on a particular topic? Need more information on music from the 1930s? There are plenty of niche content providers who help answer these questions and you most likely have access to these resources through your library.
  • Something better than just Works Cited.

    Amanda explains: “Lots of students develop the habit of using EasyBib and similar free citation resources, but don't realize there are citation tools the library has access to that are meant to be used for them as scholars. Zotero, Mendeley, and others are more sophisticated citation managers that provide more flexibility like saving material, annotating, or sharing with other students.” Consider these options when you think about the numerous projects ahead of you in your college career and how revisiting resources you’ve used in the past would be easier with more advanced citation tools available at your library.

What else can the library and librarians help me with?

Loads more. When it comes to questions or challenges around finding and accessing information, your librarian has an answer or will work to find an answer with you. The key thing to remember when planning to talk with your librarian is to bring a clear question or explanation of your challenge as well as realistic expectations of the information that is available to use.

  May 1, 2019

Data in the news: The rising cost of prescription drugs

Scandals involving price increases for prescription drugs have become increasingly common in the past five years. In 2015 there was widespread outrage when Daraprim, a drug used to treat malaria and parasitic infections, went overnight from $13.50 a tablet to $750 a tablet. Similarly, two whistleblowers exposed the practices that Questcor Pharmaceuticals, now Mallinckrodt, used to increase sales of the drug H.P. Acthar Gel. The drug, used to treat rare infant disorders, has increased from $40 a vial in 2000 to nearly $39,000 in 2019.

These price hikes, although huge, tell us little about the average cost of pharmaceuticals for Americans, since Daraprim and H.P. Acthar Gel are not used to treat common ailments. In this blog post we shall explore the different data techniques used to examine the cost of pharmaceuticals for all Americans and their relative strengths and weaknesses.

The different techniques

For this post I used data from the OECD, an intergovernmental organization committed to promoting policies that will improve the economic and social well-being of people around the world. It is important to note that this data includes expenditures on prescription medication and self-medication or over-the-counter products, i.e. products that do not require a prescription from a licensed medical practitioner. If you were conducting a research project where you solely wanted to focus on the cost of prescription drugs, then you would need to take this into account in your research.

Percentages and inflation

From the data on the OECD website I created the line graph below (for more information on line graphs see the previous blog post), which plots the percentage of U.S. health spending on pharmaceuticals over time. One question that students often ask when confronted with percentages rather than a whole number is: why? Why use a percentage? Would it not be easier to see the trends in U.S. pharmaceutical spending if I plotted just the amount of money the U.S. spent on pharmaceuticals?

The problem with using the raw amount of money to examine the cost of drugs is inflation. Over time the average price of goods will increase, thus reducing the purchasing power of a particular currency. Therefore, we would expect to see pharmaceutical spending increase over time because inflation would cause the cost of goods (such as drugs) to increase. Using the amount of money to measure pharmaceutical spending can therefore misrepresent the trend because we would not be taking inflation into account. When we measure the percentage of U.S. health care spending that goes to pharmaceuticals, we can better compare data across multiple years, since the line graph shows the proportion of money spent on pharmaceuticals relative to the total money spent on health, effectively avoiding the issue of inflation.
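To see why a percentage sidesteps inflation, here is a small Python sketch. The amounts and the inflation factor are placeholders, and it assumes that inflation raises drug spending and total health spending by the same factor, which is a simplification.

```python
def share_of_total(part, total):
    """Return part as a percentage of total."""
    return 100 * part / total


# Placeholder amounts in billions of dollars (not OECD figures).
pharma_spending = 50.0
health_spending = 500.0
inflation_factor = 1.5  # assumed: prices rise 50 percent across the board

share_before = share_of_total(pharma_spending, health_spending)
share_after = share_of_total(
    pharma_spending * inflation_factor,
    health_spending * inflation_factor,
)

print(pharma_spending * inflation_factor)  # the dollar amount grows
print(share_before, share_after)           # the share stays the same
```

The dollar figure rises even though nothing about spending habits has changed, while the percentage only moves when pharmaceuticals take up a different slice of the health budget.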

Percentages and GDP

An alternative way to determine if the U.S. is spending more on pharmaceutical products is to chart the spending on pharmaceuticals as a percentage of GDP. GDP or Gross Domestic Product is a broad measure of a nation’s wealth and economic activity. If a nation’s wealth were to increase, then we would expect to see an increase in spending on drugs and other kinds of goods because the more money a country has, the more money it will spend. But provided that the proportion of spending in relation to GDP was similar to previous years we would see no growth on the line graph. Thus by comparing spending to GDP we can gain a more accurate insight into the trends in U.S. spending on pharmaceuticals.

Per Capita

A third way to examine the cost of U.S. pharmaceutical spending is on a per capita basis. Per capita looks complicated because it’s Latin, but it is actually very easy to understand. It simply means “per person” i.e. the graph below charts the average amount of money that every person in the U.S. spent on pharmaceutical products. This method gives us a clear idea of how much people spend on prescription and non-prescription drugs, but doesn’t factor in larger economic factors such as inflation or GDP.

International comparisons

One final way to examine U.S. pharmaceutical spending would be to compare it against other countries. A number that might seem large initially may turn out to be the same as other countries of similar wealth or population. For this blog post I also charted the per capita spending on pharmaceutical products in the United Kingdom.

The eagle-eyed amongst you may have noticed that there is a gap in the data between 1997 and 2013, leading to a misrepresentative steep increase on the line graph. When working with data you may come across data sets which are lacking values for a given year or series of years. In those instances, you must decide how best to get around this problem and most accurately represent the trends your data is showing. For this project I decided to fill the gap by increasing the per capita spending by the same multiple every year in order to allow for a clear comparison with the U.S.
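One way to read "the same multiple every year" is as a constant growth factor between the last known value before the gap and the first known value after it. The Python sketch below follows that reading. The function name and the start and end values are placeholders of my own, not the UK figures.

```python
def fill_gap_constant_multiple(start_year, start_value, end_year, end_value):
    """Fill the years between two known values using one yearly multiple."""
    years = end_year - start_year
    if years <= 0:
        raise ValueError("end_year must come after start_year")
    # The multiple that turns start_value into end_value over `years` steps.
    multiple = (end_value / start_value) ** (1 / years)
    return {
        start_year + i: start_value * multiple ** i
        for i in range(years + 1)
    }


# Placeholder values for illustration only.
filled = fill_gap_constant_multiple(1997, 100.0, 2013, 400.0)
for year in (1997, 2005, 2013):
    print(year, round(filled[year], 2))
```

A straight-line fill (adding the same amount every year) is another reasonable choice. Whichever you pick, say so next to the graph so readers know which points are estimates.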

The Takeaway

So what do these graphs tell us about the cost and spending habits of the U.S. on pharmaceuticals? It is particularly telling that all the different measures we have considered (per capita, percentage of health spending on pharmaceuticals, percentage of GDP) have demonstrated that spending on pharmaceuticals has gone up substantially. Additionally, we can see that this trend is not the norm among countries of similar wealth and demographics as the U.S. For example, the 2016 per capita spending on pharmaceuticals in the UK is $400 compared to $1,200 in the U.S. The data suggests that Americans are spending a disproportionately large amount of money on drugs compared to other countries and that Americans have seen rises in the price of drugs far above inflation.

  April 29, 2019

About Data Visualizations: Line Graph

What is a line graph?

From news articles to math class, line graphs show up everywhere, so it’s important to understand them.

A line graph is a data visualization type used to track how values change over time. Line graphs are particularly useful for tracking trends, which helps us more quickly and easily determine when data changed and consider what outside factor could have contributed to that change. For instance, in the personal bankruptcy chart below we see a noticeable drop and rise in the average personal bankruptcy rate between 2005 and 2010 – right around the onset of the Great Recession. Coincidence? I think not.

The above shows a line graph tracking the median average rate of personal bankruptcy in the United States.

Line graphs are used by people who work with data regularly like business owners, budgeters, and statisticians; however, the everyday person is equally likely to have created a line graph at some point in his or her life. They are extremely versatile, which is what makes them so popular and therefore important to learn how to create.

Tips on creating a line graph

Excel is where you are most likely to end up creating a line graph from scratch and thanks to the chart feature, this work is easier than ever. Generally, however, you should follow the guidelines below!

  1. Be sure to have data for your X axis (horizontal) and your Y axis (vertical). For line graphs, the X axis is usually a time variable like years. The Y axis typically measures the dependent variable, i.e. the data that you want to track.
  2. Add labels and scales for each axis.
  3. Next, plot your data points.
  4. Finally, title your chart. Your title should be succinct and convey the key takeaway of the graph.

For a sleek design, remove background gridding and apply engaging formats or coloring.

  April 4, 2019

Data in the News: The China-U.S. Trade War and Soybeans

In March 2018, the United States began implementing tariffs against Chinese imports of steel and aluminum in response to alleged unfair trade practices and intellectual property theft by the Chinese state. In response, China in April and June of 2018 announced tariffs on American products including a 25 percent tariff on soybeans, a key U.S. export to China. This led Bloomberg to predict that America’s Midwest farmers, a key demographic in Trump’s approaching 2020 election campaign, would suffer as the cost of importing U.S. soybeans into China increases.

Today, China announced a further escalation of the U.S.-China trade war including more tariffs on U.S.-grown soybeans. It remains to be seen how these new tariffs will affect U.S. soybean farmers, but with the dataset for the number of soybean acres planted in 2018 just updated, it is possible to examine how the U.S.-China trade war has affected U.S. farmers in 2018.

Source: National Agricultural Statistical Service (Department of Agriculture). (2019). USDA Crop Acreage Data: Acres planted: soybean fields (county) [dataset]. Washington, DC: SAGE Stats by SAGE Publishing. Available from

The graph shows a small decrease of over one million acres in the number of soybean acres planted in the U.S. between 2017 and 2018. However, there was a large increase of over seven million acres in soybeans from 2016 to 2017.

That’s data at the U.S. national level - what about those Midwestern states that Bloomberg predicted would be hit hardest by the Chinese tariffs? How can we see whether the Midwest follows this same national trend? If we isolate and aggregate the data for the 12 Midwestern states (Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota and Wisconsin) we get the following graph.

Source: National Agricultural Statistical Service (Department of Agriculture). (2019). USDA Crop Acreage Data: Acres planted: soybean fields (county) [dataset]. Washington, DC: SAGE Stats by SAGE Publishing. Available from

Very similar to the national trend chart, no? Again, we see a decline in the number of acres of soybeans being planted in the Midwest, but not as dramatic a decline as was suggested in the news coverage of this topic. The number of acres planted in the Midwest declines from 72 million in 2017 to 71 million in 2018, but remains far above the 2016 figure of 66 million acres of soybeans planted. This suggests that it’s too early to definitively say the slight decline is due to the Chinese tariff on soybeans and not another factor. Additionally, more data such as the number of U.S. soybeans being sold to China and the profits of U.S. soybean farms are needed in order to further investigate the effect of Chinese tariffs.
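To put the Midwest figures above in proportion, it can help to turn them into percentage changes. Here is a short Python sketch using the rounded numbers from this post (66, 72 and 71 million acres). The function and variable names are my own.

```python
def percent_change(old, new):
    """Percentage change from old to new (negative means a decline)."""
    return 100 * (new - old) / old


# Rounded Midwest soybean acreage from this post, in millions of acres.
midwest_acres = {2016: 66, 2017: 72, 2018: 71}

change_2017_2018 = percent_change(midwest_acres[2017], midwest_acres[2018])
change_2016_2018 = percent_change(midwest_acres[2016], midwest_acres[2018])

print(f"2017 to 2018: {change_2017_2018:.1f}%")
print(f"2016 to 2018: {change_2016_2018:.1f}%")
```

On these rounded figures the 2018 dip is about 1.4 percent, while 2018 still sits roughly 7.6 percent above 2016.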

When presenting this research we might present the data as shown below in order to best show and understand the relationship between the total number of acres planted in the U.S. as compared to the Midwest.

Source: National Agricultural Statistical Service (Department of Agriculture). (2019). USDA Crop Acreage Data: Acres planted: soybean fields (county) [dataset]. Washington, DC: SAGE Stats by SAGE Publishing. Available from

  March 20, 2019

Surveys vs. Polls

What are polls and surveys?

Both polls and surveys are data collection methods used to reach a large and varied audience. They both have the potential to gather controlled, insightful data to gain a better understanding of a community of people. Both surveys and polls are helpful tools in learning more about a population’s opinions and characteristics, such as with political polls.

What’s the difference?

Surveys allow you to ask multiple questions at one time. The results may take longer to get because there’s more data to process. Also, people may not be comfortable giving out their personal information to a surveyor with whom they are unfamiliar. However, if you are interested in a certain demographic, a survey is ideal because you can ensure that you only get submissions from relevant participants.

Polls are essentially quick surveys with only one question that can be interpreted instantaneously. They are ideal for making small decisions and getting immediate data on public opinion. Polls also prove to be the better option when you have a specific set of options you want people to choose from, rather than having open-ended questions for people to answer. You might recognize an election ballot as a type of poll people encounter quite regularly.

Practical Application

Surveys are implemented in infinite ways. For example, in business, customer satisfaction surveys are commonly used as a way to reach out to the customer base and get feedback on products or services in order to inform future improvements or pinpoint issues.

In the academic world, surveys serve as an important tool for undergraduates and researchers who need to collect human data for a project. They can easily be shared and marketed via online channels such as Facebook in order to get a large number of responses.

Polls are conducted on a more ad hoc basis and can be very casual. For instance, Twitter provides a polling option that you can tweet to followers to get their take on a topic. Political polls are another type of poll that constantly make the news, such as the following poll by NBC on President Trump’s job approval:


Before You Go

Consider: what are other popular means of gathering data?

For more in-depth, relevant content, check out this article about forms, surveys, and polls

  March 7, 2019

What do the graphs in the climate change debate mean?

The use of data as evidence of climate change has become an increasingly debated topic in the past decade. The current U.S. political climate has only further politicized the issue, which is likely to escalate as presidential campaigns for the 2020 election begin in 2019. President Donald Trump, who has already declared he will run for re-election, has exacerbated the debate by claiming that global warming was created by the Chinese to make their manufacturing competitive. Since then he has mocked Democratic Sen. Amy Klobuchar of Minnesota for her acceptance of climate change when she announced her candidacy for the presidency in a blizzard, and he has suggested that the "polar vortex" in January was evidence against global warming.

If you have been following the debate around climate change, you will most likely have seen this graph from NASA, which charts the change in global surface temperature relative to the average global temperature. You can see that the graph shows a steady increase in temperature anomaly over time. However, what does this graph actually mean? Moreover, how does NASA calculate the temperature anomaly?

The graph illustrates the change in global surface temperature relative to the 1951-1980 average temperatures. For 1951-1980 NASA calculated the global surface temperature using a network of weather stations to create a long-term global average temperature against which future annual global temperatures could be compared, otherwise known as a reference value. NASA then calculated the annual global surface temperature as usual and compared it to the long-term average global temperature or the reference value. This comparison allows NASA to assess by how much global temperature has deviated from the long-term average.

A positive value means that the average global temperature for the year was hotter than the reference value, while a negative anomaly indicates that the observed temperature was cooler than the reference value. As we can see from the chart and data file, the temperature anomaly has remained higher than in previous decades. So far, politicians such as Sen. Amy Klobuchar and others who acknowledge climate change have not found an effective way to use data like NASA’s to make the case for needed legislation.
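The comparison described above boils down to subtracting a reference value from each year's temperature. The Python sketch below is a simplified illustration of that idea only. NASA's actual processing is more involved, the function name is my own, and the temperatures are placeholders rather than real measurements.

```python
from statistics import mean


def temperature_anomalies(annual_temps, baseline_start=1951, baseline_end=1980):
    """Return each year's difference from the baseline-period average."""
    baseline = [
        temp for year, temp in annual_temps.items()
        if baseline_start <= year <= baseline_end
    ]
    if not baseline:
        raise ValueError("no data inside the baseline period")
    reference = mean(baseline)  # the reference value
    return {year: temp - reference for year, temp in annual_temps.items()}


# Placeholder temperatures in degrees Celsius, for illustration only.
example_temps = {1951: 14.0, 1980: 14.2, 2018: 14.9}
anomalies = temperature_anomalies(example_temps)

for year, anomaly in anomalies.items():
    print(year, round(anomaly, 2))
```

A year hotter than the reference value comes out positive and a cooler year comes out negative, which is exactly what the vertical axis of the graph shows.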

The black line is known as LOWESS smoothing, or Locally Weighted Scatterplot Smoothing, a tool that allows data analysts to create a smooth line through a timeplot or scatter diagram to help the viewer see the relationship between the variables.

Each point on the scatter graph, connected by the grey line, represents how much the global temperature for a given year differed from the reference value.
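For the curious, here is a stripped-down Python sketch of the idea behind the smoothed line: for every point, fit a straight line to its nearest neighbours, giving closer points more weight. It is a simplification of my own (one pass, tricube weights, no robustness step) and not the routine NASA uses. The function name, the `frac` parameter and the data are all placeholders.

```python
def lowess_sketch(xs, ys, frac=0.5):
    """Smooth ys over xs with locally weighted straight-line fits."""
    n = len(xs)
    if n < 2:
        return list(ys)
    k = max(2, min(n, round(frac * n)))  # points used in each local fit
    smoothed = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        if h == 0:
            h = 1.0  # all of the nearest points sit exactly on x0
        # Tricube weights: near points count most, far points not at all.
        weights = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
        total = sum(weights)
        mean_x = sum(w * x for w, x in zip(weights, xs)) / total
        mean_y = sum(w * y for w, y in zip(weights, ys)) / total
        sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
        sxy = sum(
            w * (x - mean_x) * (y - mean_y)
            for w, x, y in zip(weights, xs, ys)
        )
        slope = sxy / sxx if sxx > 0 else 0.0
        # Value of the local weighted line at x0.
        smoothed.append(mean_y + slope * (x0 - mean_x))
    return smoothed


# Placeholder points with a single spike, for illustration only.
years = list(range(10))
values = [0, 0, 0, 0, 10, 0, 0, 0, 0, 0]
smoothed = lowess_sketch(years, values)
print([round(v, 2) for v in smoothed])
```

A larger `frac` uses more neighbours in each fit and gives a smoother line. A smaller one follows the individual points more closely.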

A useful resource when trying to understand climate data is the National Centers for Environmental Information.

  February 13, 2019

Brainstorming Your Research Topic

Whether it is your first time with a research project or you are a pro, it is normal to face writer’s block. Here is a short guide on picking the best topic for your research project through prewriting:

Consider what you want to expand your knowledge on.

If you are looking at topics you are completely unfamiliar with, you might become discouraged when it comes time to research. Starting with no background information on a topic makes for a much more difficult and time-consuming project. It is best to plan on writing about a topic you have some prior knowledge of and interest in. This will give you the best chance of crafting a solid thesis (topic) for your research, and it will sustain your engagement throughout the project.


There are several ways to tackle brainstorming and organizing thoughts as you delve into your project. This becomes especially necessary for research because of the endless amounts of information you have to go through. To start, try writing three varied topics you are interested in learning more about.

Sample Research Topics:

  • Philosophy: What do children and parents owe each other?
  • Literature: What is the most pervasive narrative in YA fiction, and why is it so popular?
  • Psychology: What are the most effective means of treatment for eating disorders?
  • Public Health: What is the relationship between impact and cost effectiveness in treating the elderly?

As you can see by the sample topics, essentially any topic is up for grabs when in the brainstorming phase of the writing process. However, if you find this exercise difficult, try to think specifically about issues that affect your life and take a stand on them. This ensures you stay engaged in your research.

Is your topic researchable?

To ensure you pick a topic that has been previously researched, do a quick Google search of various keywords and phrases related to your topic. Do not be afraid to ask specific questions you have saved for research. This will also allow you to discover what other researchers have shared. Your topic is likely to change or refine itself as you do more research. It is good to refine your topic to be as concise and interesting as possible.


Be certain your topic reflects exactly what you want to say. This is the most essential part of your project. Your thesis will ground the reader and keep your writing focused.

  February 6, 2019

Data in the News: The Hawaiian smoking ban

This week saw Democrat Richard Creagan in the Hawaiian House of Representatives propose a law to raise the legal smoking age to 100 in the state of Hawaii. The bill will need to pass through the state legislature and faces potential backlash from the tobacco lobby but marks a continuation of Hawaiian policy to reduce smoking within the state. In 2016 Hawaii was the first state to raise the legal smoking age to 21 and to date is one of only seven states to do so.

Why has Hawaii been so ambitious in its attempts to limit the smoking of tobacco? One possible answer lies in the data. Hawaiian manufacturing has gained zero dollars from tobacco manufacturing from 2013 to 2016. California in comparison has seen a huge growth in value added in its manufacturing industry from tobacco manufacturing over that same period - five million dollars to over 40 million dollars, an eightfold increase in four years.

Value Added in Tobacco Manufacturing (State). (2017). SAGE stats (Web site). Washington, DC: CQ Press. Retrieved from:

Hawaii has also received significantly less money than California in tobacco settlements, meaning that the state has very little to lose and everything to gain by attempting to reduce tobacco consumption, which according to the CDC is responsible for 480,000 deaths per year in the US and 1,400 deaths in Hawaii. Although the tobacco lobby is likely to stand against this bill, the Hawaiian public may shrug its shoulders. According to the CDC in 2016, 61.6 percent of Hawaiian adults have never smoked - one of the highest percentages in the nation.

Estimated Tobacco Settlement Revenues in Fiscal Year (State). (2012). SAGE stats (Web site). Washington, DC: CQ Press. Retrieved from: h

  January 18, 2019

Data in the News: Immigration & the Government Shutdown

On the 22nd of December funding for nine of the fifteen U.S. federal departments ceased as the Democrats refused President Donald Trump’s demands for five billion dollars to fund the Mexico border wall. This shutdown officially became the longest U.S. federal government shutdown in U.S. history as of January 12th, 2019.

The border wall was a centerpiece of Trump’s campaign and a promise to get tough on immigration both illegal and legal. Since the start of his presidency, Donald Trump has introduced multiple policies to reduce and deter foreign immigration, including new requirements for H-1B visa applications and a suspension of fast track applications for the visa until at least February 2019. Trump also started and then discontinued a policy of separating illegal immigrants from their families at the border, which former chief of staff Kelly described as “a tough deterrent”. These policies have resulted in a heated political debate, which has culminated in the current government shutdown impasse between Congressional Democrats and Trump.

The difficulty around the immigration debate is that there are many types of immigration: legal vs illegal, immigrants who gain U.S. citizenship vs those who work in the U.S. on a visa and the ethnicity of those seeking entry to the United States. This variety can lead to many contradictory claims about immigration levels in the United States depending on what numbers are reported to the public. The graphs below are intended to get you thinking about immigration and the portrayal of immigration statistics in the media.

The number of newly naturalized U.S. citizens is in decline at 707,265 in 2017, significantly lower than the 2008 high of 1,046,539 and a 50,000 person decrease from 2016. This chart brings to mind multiple questions: Is a reduction from the previous year sufficient to ease checks on legal immigration? Does the current narrative about a fear of immigration overflow reflect what these figures tell us? Should the federal government seek to bring legal immigration to 2008 levels?

Much of the 2018 narrative revolved around fears of immigration from Latin America, with news of an “immigrant caravan” heading towards the United States border making immigration a central debate of the 2018 midterms.

According to releases from the Department of Homeland Security, the number of Latin American-born residents in the United States has increased from 2011 and is clustered in southern states, particularly Florida, Texas and California. Yet on average across the nation, there are 15 Latin Americans per U.S. ZIP Code compared to 2,658 U.S. native borns per ZIP Code. What does this say about the role that ethnicity or country of origin plays in the U.S. immigration debate?

  December 3, 2018

Data in the News: Californian Wildfires

Natural disasters in California have dominated the headlines for the past year. In January, mudslides tore through Southern California after an earlier forest fire left the hillsides in Montecito vulnerable to landslides. In November 2018 a wildfire in Paradise, Northern California killed 79 people and burned 151,272 acres according to CBS News. Meanwhile, the Woolsey fire in Ventura County, Southern California forced the evacuation of SAGE's head office in Thousand Oaks and has taken the lives of three people.

In the face of such deadly and prolonged natural disasters, many people are asking what the government is doing to prevent wildfires. President Trump has blamed the fires on poor forest management stating that if California followed the example of Finland, who "spend a lot of time raking and cleaning" the forest floors, the fire would have been less severe.

President Trump’s response to the Californian wildfires has been met with skepticism but does raise the question: what is FEMA doing to stop these fires? The Federal Emergency Management Agency is responsible for providing assistance after natural disasters and investing in infrastructure to ensure the prevention of natural disasters. FEMA's responsibilities are best summarized by its mission statement: "Helping people before, during, and after disasters". The last time that FEMA gave serious assistance to California was in 2008. From 2009 to 2016 (the last year that data is available) FEMA has granted less than a million dollars for fire disasters to the state of California.

Fema public assistance grant dollars for fires (county). (2017). SAGE stats (Web site). Washington, DC: CQ Press. Retrieved from:

In a press release, FEMA announced that it would be assisting state and local officials responding to the Hill and Woolsey wildfires and that “FEMA is bringing federal resources to bear to assist the state of California”. The Hill fire is now 70% contained and the Woolsey fire 94% contained but many, much like President Trump, will be wondering what can be done to stop future wildfires.

  September 27, 2018

Data in the News: CDC recommends injectable flu shots

As another flu season prepares to sweep the U.S., the question on many parents’ minds is one as old as time: to vaccinate my child or not?

Young children and the elderly are particularly susceptible to influenza, commonly known as the flu. The CDC reported a record of 180 pediatric flu deaths during the 2017 flu season. In terms of the entire population, final data for 2015 show that there were 15.2 flu and pneumonia deaths per 100,000 persons.

According to STAT News, the American Academy of Pediatrics (AAP) is recommending parents vaccinate their children with the flu shot this season rather than the nasal spray vaccine, FluMist. This contrasts slightly with the recommendation from the Centers for Disease Control and Prevention (CDC), which allows for choice among different vaccination methods like FluMist.

Parents are taking this choice seriously despite conflicting advice between the AAP and CDC, which may confuse them and physicians alike. “There’s no question that ideally we would like for the CDC and the AAP to be completely harmonized,” says Dr. Henry Bernstein, a pediatrician and ex officio member of the AAP’s committee on infectious diseases. “Both groups are harmonized in wanting as many children to receive flu vaccine as possible each and every year.” In alignment with the AAP, physicians are urging parents to vaccinate (via injections) their children despite an increasing trend to forgo vaccinations due to fears of unknown or harmful ingredients that many believe can cause worse diseases than the flu.

Despite these concerns, the percentage of children vaccinated over the past seven years has slightly increased at a slow, but stable pace as shown by the chart above provided by the CDC. Even with the common misconceptions regarding preventative care for the flu, and the fear of deadly side effects, this steady trend suggests concerns among parents and the population in general have not impacted vaccination rates so far.

  August 31, 2018

Excel Tips: Identifying and Removing duplicate data points

There may be more than a few data points to double-check as you review and clean a data file. These can include blank values, outlier data points, data label misspellings, and so on. Duplicate data points are probably one of the most difficult to spot unless you’re lucky. Duplicates are exactly what they sound like: exact copies of the same data point. For instance, if I am looking at a data set on the number of hamsters across the United States and I see that Wisconsin has two data points, both of which are 50,000 (totally fabricated!), then I can infer that the data set has mistakenly included two duplicate values for Wisconsin.

So why does this matter? It matters because duplicate data points may inadvertently lead to miscalculation or misunderstanding of the data. The appearance of duplicates does not necessarily mean the entire data set is completely wrong – only that the data set may require a closer eye and some additional clean-up work as do most data sets. Thankfully, Excel offers two handy features that simplify the identification and removal of duplicate data points from a file!

The breakdown

It usually takes finding one set of duplicate data points for me to determine that Conditional Formatting should be applied to identify if any additional duplicates are present in a data file. The Conditional Formatting feature programmatically identifies duplicates in an entire data set. Without this feature I would be forced to manually check each data point. That may not be a big deal for a data set with about 50 rows of data, but it can be an incredibly inefficient process for a data set that contains, say, over 50,000 rows of data.

Using Conditional Formatting:

  1. Select the entire data set. Actually, you don’t have to select the entire data set; you may want to identify duplicate values in a particular column or row. If you want to identify duplicates across the entire data set, then select the entire set.
  2. Navigate to the Home tab and select the Conditional Formatting button.
  3. In the Conditional Formatting menu, select Highlight Cells Rules.
  4. In the menu that pops up, select Duplicate Values.
  5. A window will appear detailing how Excel will highlight the duplicate values it identifies. The default setting is light red highlighting with red font, which works very well.
  6. Voilà. All duplicate values should now be highlighted in red!
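If your data lives outside Excel, the same duplicate check can be sketched in a few lines of Python. This is only an illustration of the idea behind the steps above, not how Excel does it internally; the function name find_duplicates and the list of states are made up for the example.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that appear more than once, in first-seen order."""
    counts = Counter(values)
    seen = set()
    duplicates = []
    for value in values:
        # Report each duplicated value once, the first time we meet it.
        if counts[value] > 1 and value not in seen:
            seen.add(value)
            duplicates.append(value)
    return duplicates


# Hypothetical column of state names; Wisconsin appears twice.
states = ["Iowa", "Wisconsin", "Ohio", "Wisconsin", "Maine"]
print(find_duplicates(states))
```

Like the red highlighting, this only flags the repeated values; deciding whether they are true mistakes is still up to you.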

After reviewing the highlighted duplicates, you can determine whether all the duplicates should be removed or not. To remove all duplicate values, you can use the Remove Duplicates feature to, well, remove the duplicates!

Using the Remove Duplicates feature:

  1. Select the data set that contains duplicates.
  2. Navigate to the Data tab in the tool bar.
  3. In the Data Tools section of the Data tab, select Remove Duplicates.
  4. One of two windows will appear:
    1. If you selected the entire data set, then an option will appear asking you to specify which columns you wish to delete duplicates from; if you want duplicates removed from the entire data set, then leave all the columns selected.
    2. If you selected a specific column, then a warning will appear to confirm that you want to limit removal to the column selected; if yes, then be sure to select “Continue with current selection”. If you decide to expand it to the entire data set, then choose “Expand the selection”.

All the duplicates should now be removed!
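For readers who prefer a script, here is a minimal Python sketch of the same clean-up. It assumes each row is a dictionary and keeps the first copy of every repeated row; the helper name remove_duplicates, the key_columns parameter, and the Ohio figure are invented for illustration (the Wisconsin figure is the fabricated one from earlier in this post).

```python
def remove_duplicates(rows, key_columns=None):
    """Keep the first occurrence of each row.

    rows is a list of dicts. key_columns lists the columns to compare;
    None means compare every column in the row.
    """
    seen = set()
    unique_rows = []
    for row in rows:
        columns = key_columns if key_columns is not None else sorted(row)
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Made-up hamster counts; the Wisconsin row is entered twice.
rows = [
    {"state": "Wisconsin", "hamsters": 50000},
    {"state": "Ohio", "hamsters": 42000},
    {"state": "Wisconsin", "hamsters": 50000},
]
cleaned = remove_duplicates(rows)
print(len(cleaned))
```

Passing key_columns=["state"] plays the role of ticking only one column in the Remove Duplicates window: rows are treated as duplicates whenever that column matches, even if other columns differ.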

  August 17, 2018

Data in the News: Fast Increase in Virtual School Enrollment

In today’s highly technological society, it’s no surprise that enrollment in virtual schooling is steadily rising among high school students in the US. Parents are turning towards at-home virtual classrooms as the safer, more convenient option as they face the harsh reality of questioning the safety of their children in traditional brick-and-mortar schools in light of recent mass school shootings and an increase in bullying. Bullying has always been an issue in any school setting, and as this issue becomes more prevalent in the US, more students are reporting forced physical confrontations or being verbally abused by their classmates while on school grounds. This is creating a situation where students are wary about attending school and where parents and guardians are unable to intervene or defend them in such situations.

In Arizona alone, the percentage of students who feel too unsafe to attend school has soared past the national average since 2004.

Line graph comparison of the percentage of Arizona high school students who feel too unsafe to attend school to the median national percentage.

The availability of virtual schooling is also trending upward among parents and students across all levels of schooling, including K-12. Thanks to online school curriculums like Connections Academy, the second largest virtual charter school company in the US, parents have more control over their child’s education. Elearning Inside News reported a 60% graduation rate from Connections Academy and an overall 5,300 student graduates as of June 2018. According to the National Education Policy Center, enrollments in virtual schools increased by 17,000 students between 2015-16 and 2016-17 and enrollments in blended learning schools increased by 80,000 during this same time period. In the last decade overall, Wired has reported on the vast increase of virtual school enrollments which has contributed to a national boom of more than 260,000 full-time students. This changing trend in technology and burgeoning need of educating children at home is reflected in the 29 states that now offer hybrid schools as of 2017 – a combination of virtual and classroom style teaching that allows parents to personalize their child’s education without isolating them from their peers completely. Especially when families spend a significant amount of time traveling or away from home, students are able to log on anywhere (with internet access of course) and complete their schooling in their own time at their own pace.

Another attractive aspect of virtual schooling is giving students the opportunity to focus their time on passion projects such as art, music, and technology. Additionally, it enables students to learn time management early on – an important skill to learn before progressing into the job market or college. For instance, an Arizona high school has adapted their class requirements to allow students to personalize their own schedule. With all these advancements and positive aspects, more traditional schools are integrating online learning into their own curriculum to fulfill the different learning needs of their students. Diane Douglas, an Arizona Public Instruction superintendent, has noted that “what may work best for one student may not work for another.”

For the future, this trend can only continue to rise in popularity. As more students are homeschooled or attending hybrid-online classes, the traditional school setting may soon be a thing of the past. With the advantage of preparing students for the ever-changing job market, technology and online communication skills are a core component of these virtual school curriculums.

  July 31, 2018

Tips on Interpreting Data Visualizations

Previously, I’ve discussed best practices in creating data visualizations and explained how a visual representation of data simplifies the information you want to convey. These are great concepts to keep in mind when creating data visualizations, but what about when you are on the receiving end of a data visualization? Your ability to interpret the visualization may vary depending on the data used, how well created the visualization is, and even your own familiarity with data or data visualizations.

As a reader, your goal is to understand, interpret, and reflect on the information represented in a data visualization and then infer new information based on that assessment. However, this can be difficult to accomplish if you are not familiar with data or statistics. To that end, below are some tips on how to interpret a data visualization including questions and information to consider.

The breakdown: Six tips on reading a data visualization

Data visualizations can take on multiple formats and can represent an infinite number of information types and combinations. Because of this wide variability of possibilities, my suggestions are broad enough to apply to any kind of scenario.

  1. Establish what idea or claim the data visualization is trying to reinforce. Visualizations are not created just for the fun of it (some enthusiasts might disagree); they are created to serve as evidence. For instance, one visualization might aim to demonstrate that homeless populations are decreasing instead of increasing.
  2. Make explicit observations of the visualization. Quite literally, what do you see? Do you see any highs or lows? Is the map or chart coloring darker in some places than others? Things like that.
  3. What patterns can you discern? Patterns can present themselves as clusters, steady increases/decreases, consistent coloring on parts of a map, and so on. Patterns like these are usually where the takeaway of the data visualization lies.
  4. Consider other factors that may have shaped the data and therefore the visualization. What factors not measured in the data set could have affected how the data is represented? For instance, comparing homeless populations across countries may be affected by different definitions of what constitutes a homeless person.
  5. Reflect and interpret. Based on these patterns and other factors, what is the takeaway of the visualization and how does it support or undermine the claim being made? For example, if a trend line on homeless populations is rising year-to-year, does that support the claim that homelessness is no longer an issue?
  6. Infer further. What other information can you reason based on this interpretation? If homelessness is rising, then I can probably infer that the economy and employment are not doing so well.

Should I follow this thinking every time I come across a data visualization?

I mean, it can’t hurt! Of course, not every data visualization will require a step-by-step thought process like this – some visualizations are self-explanatory and the best visualizations are often the simplest. However, it’s always helpful to have an idea of where to start if you’re not too familiar with data or statistics. Nowadays, data visualizations are everywhere and because of that the ability to thoughtfully interpret them has become a critical skill to learn.

  July 19, 2018

Data in the News: Gaining an Understanding of Gun Violence

The issue of gun control continues to build steam as the media reports more gun violence incidents across the United States. One recent incident is the gunman attack on the Capital Gazette newspaper office in Maryland this past June, which fueled renewed calls and protests for stronger gun control laws. As divisive as the current environment is about this particular issue, it is also a reminder to keep yourself informed on issues in which you are interested. If you are looking to use this as a topic of academic research, it is especially important to gain a basic understanding by reviewing and comparing the information you find.

You can begin understanding your topic of interest by writing out your current assumptions and reviewing the information to see if it supports those assumptions. One way to check your assumptions is by creating a simple data comparison. For example, let's assume that more murders are committed with firearms than with no weapon at all. You can compare murders committed with firearms to murders committed without a weapon. Since we mentioned Maryland earlier, let's focus on that state. The FBI reported that in 2016, 76.3 percent of Maryland murders were committed with firearms and 4.9 percent of murders were committed barehanded (i.e. with a person's hands, fists, or feet). Based on this comparison, we can confirm that a majority of reported murders in Maryland were committed with firearms, far more than without a weapon.

Line graph comparison of the percentage of Maryland murders by firearms and murders by hands, fists, and feet

Additionally, seeing this comparison visually can facilitate further research by inviting additional questions. Here are a few questions that arise when you see this data in action: Why are there more murders with firearms than without? How many of these firearm murders were mass shootings? Is this possible to identify? If there were any mass shootings included in these murders, what types of firearms were used? Why did the percentage of murders with firearms drop between 1997 and 1998? What was different about 1997 compared to 1998? Were there any unreported murders in Maryland? How do these percentages for Maryland compare to neighboring states or the U.S. average?

As you can see, a simple comparison has turned into an in-depth research topic that has raised numerous questions that allow for multiple avenues of investigation. When reviewing information gathered from your research, the best thing you can do is to ask questions about what you read or observe. This not only adds to your initial knowledge of an issue, but broadens your perspective by encouraging you to consider other factors that perhaps you hadn't thought of before and which may affect or be affected by the issue. Broadening your perspective of what factors or players are in play is imperative to building your understanding of a big issue like gun violence.

  June 29, 2018

The Different Angles of Gerrymandering

The recent news that Justice Anthony Kennedy will retire from the Supreme Court this summer has thrown everyone for a loop. As I write this, politicians, news pundits, and voters are debating the implications of Justice Kennedy’s retirement on divisive topics such as abortion and same-sex marriage as President Trump considers candidates for the court's vacancy. One divisive and unresolved issue that the president’s nominee is likely to influence with his or her vote is partisan gerrymandering and the extent to which it dilutes voting power and therefore impacts U.S. elections.

However, unless you’re a political science major it may be difficult to grasp why federal and state politicians are fighting over congressional gerrymandering. What is gerrymandering and how exactly do changing congressional boundaries affect who we elect? Gerrymandering is the practice of manipulating district boundaries to benefit one group over another. Historically, gerrymandering has been driven by racial and political motivations to control who is in power, and more importantly, who is not in power. Although reading about the history of gerrymandering is informative, visualizing the physical changes in congressional districts is a fantastic way to learn and understand the practice.

The breakdown: North Carolina as an example

North Carolina, for example, is a great case study for understanding the electoral impact of gerrymandering. By updating the selected year for the district map on the left, you can observe that prior to the mid-1990s North Carolina elected a majority of Democratic candidates before turning red in 1994. This change in power was due to Republican gerrymandering that went into effect in 1994. Since that time, North Carolina has been for the most part reliably red except for a brief number of years between 2008 and 2010.

Embedded visualization from SAGE U.S. Political Stats product.

Gerrymandering is also well known for creating odd-shaped districts. In fact, the 12th congressional district in North Carolina is usually cited as one of the most complex districts in the country. It has even been the subject of multiple legal challenges alleging racially-motivated gerrymandering. Using the pop-up map to the left, you can see how District 12 boundaries have changed over the past two decades. Observing this political struggle in action is a powerful way to understand how the Democratic and Republican parties have successfully used gerrymandering as a tool to achieve their own political interests.

  June 14, 2018

Data in the News: Cost of Hurricane Damages

Hurricane season has begun and the United States already has one significant storm under its belt, Subtropical Storm Alberto. Although Alberto initially decreased in magnitude as it approached the U.S. and made landfall, it caused substantial damage across the southeast. Since it takes about a year or two to gather all data about weather destruction, the extent of Alberto’s damages remains unclear. However, news reports indicate the storm caused significant property damage and some loss of life including the deaths of two news reporters in North Carolina, flash flooding, and more. To this day, more data is arriving about additional property damage and missing people. So one may wonder, what will the damages be for this 2018 hurricane season given that a subtropical storm like Alberto has caused not insignificant damage already? What portion of that will the federal government help with?

We can gain an idea of what 2018 costs will look like using historical data from a variety of sources. Alberto swept through a majority of the southeast region, but we can focus on a particular state to narrow this analysis. Let’s look at Florida in particular: According to FEMA data on assistance funding for hurricane damages, Florida counties were collectively issued approximately $4 million in 2016 and $1.6 million in 2017. These two figures give us some context of FEMA’s aid to Florida for hurricane damages and what to expect for the 2018 season. Forecasts for 2018 predict five to nine hurricanes, of which one to four are expected to be major hurricanes.

In this case, more research on 2016-2017 Florida hurricanes is needed to make a real comparison. However, if this year’s hurricane season is similar to the past two years we can estimate that FEMA aid for hurricane damages occurring in Florida will range between one and four million dollars, barring any major hurricane like Hurricane Sandy or Katrina. This of course does not account for insurance payouts, federal funding via other agencies, or other kinds of funding outside of FEMA. Identifying other actors that are not accounted for in the dataset you’re using for estimation purposes is a good practice to keep in mind, and communicating those gaps helps you avoid overstating your conclusion.
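The back-of-the-envelope estimate above can be written out as a short Python sketch. It uses only the two approximate FEMA figures quoted in this post and assumes, as the post does, that 2018 will resemble those two years; the variable names are my own.

```python
# Approximate FEMA hurricane aid to Florida counties, in millions of
# dollars, as quoted in this post.
fema_aid_millions = {2016: 4.0, 2017: 1.6}

low = min(fema_aid_millions.values())
high = max(fema_aid_millions.values())
average = sum(fema_aid_millions.values()) / len(fema_aid_millions)

print(f"Expected range: ${low:.1f}M to ${high:.1f}M (average ${average:.1f}M)")
```

Two data points make for a very rough range, which is one more reason to state the limits of the estimate when you share it.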

  May 31, 2018

How to Cite Data

If there is anything that school has ingrained in our minds, it’s that we should always always always cite our sources. A detailed citation is important not only to acknowledge how others’ ideas have contributed to your work, but also for readers to see and follow on their own time and for their own purposes. For example, a Public Health student may be interested in tracking down the original information cited in a news article she recently read, but will have an incredibly difficult time doing this if there is no citation provided or if the citation isn’t detailed enough.

Thankfully, we can easily generate accurate and detailed citations with the help of citation managers like EndNote and Zotero. However, some sources can be difficult to cite because of how different they are from traditional text sources like textbooks and journal articles. Citing data for instance can be a tricky business because it often comes in the form of an Excel download or is presented online in a table wizard of some sort. Because of these kinds of differences, you’ve probably found yourself asking several questions: I found this data online so do I cite it as a website? What do I use for the author name if there is no author mentioned?

The breakdown: Elements of a Data Citation

A lot of questions that come up when citing data sets are answered by the International Association for Social Science Information Services and Technology (IASSIST), which developed a guide to help researchers correctly and comprehensively cite datasets. Below are the fundamental elements you should always include in a citation for data sets.

  • Author: if the creator of the data is an organization, then insert the organization’s name here. E.g. U.S. Census Bureau.
  • Date of Publication: when was the data first published?
  • Title: the name of the dataset. If there is a specific table identification code, then I would include that as well!
  • Publisher/Distributor: if the publisher/distributor is the same as the author, then enter “Author” in place of the name.
  • URL: ideally the more direct the URL the better. Make sure it's a stable URL!
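To see how the five elements fit together, here is a small Python sketch that assembles them into one string. The ordering and the "[Data set]" label follow a common APA-like arrangement, which is my assumption rather than something the IASSIST guide prescribes, and every value passed in is a placeholder, not a real dataset.

```python
def format_data_citation(author, year, title, publisher, url):
    """Assemble the five citation elements into a single string."""
    # Per the tip above: when publisher and author match, write "Author".
    if publisher == author:
        publisher = "Author"
    return f"{author}. ({year}). {title} [Data set]. {publisher}. {url}"


# All values below are placeholders for illustration only.
citation = format_data_citation(
    author="Example Statistics Agency",
    year=2018,
    title="Example Table of Counts (Table X01)",
    publisher="Example Statistics Agency",
    url="https://example.org/data/x01",
)
print(citation)
```

Whatever style you end up using, the point is the same: keep all five pieces, then rearrange them to suit the format.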

The source I am interested in using provides a preferred citation, but it doesn’t follow the IASSIST template above. Is it best to honor the source’s request or use the IASSIST citation template?

According to Hailey Mooney, Psychology & Sociology Librarian for the University of Michigan Library, “You should honor the spirit of the preferred citation and include all of the relevant components. Verify that the preferred citation is complete and correct. It is likely that you may need to rearrange elements anyhow, in order to put it into a particular citation style format.”

Be sure to include all relevant information whether you use APA, MLA, Chicago, or another style. If in doubt, always include the elements outlined by IASSIST!

  May 9, 2018

Excel Tips: Navigating an Excel table

As simple as it may sound, navigating an Excel worksheet takes some practice. Scrolling your way through a table of say 10 records is no big deal, but this can be incredibly cumbersome when you are dealing with a dataset that contains hundreds or thousands of records. This is an issue I often came across when collecting and cleaning data for SAGE Stats, and while I strongly believe that Excel is better learned by practice than by watching, I’ll outline the quick shortcuts all Excel users should familiarize themselves with in order to quickly navigate their way around an Excel table.

The breakdown

This will hopefully not come as a shock to most of you, but an Excel worksheet is comprised of columns (represented by letters) and rows (represented by numbers). The last column you’ll see in Excel is column “XFD” and the last row in Excel is row 1,048,576. Imagine having a dataset that occupies a fraction of those limits – yep, it is no fun! Below are the best keyboard shortcuts to navigating up and down an Excel table instead of clicking and scrolling your way into a massive state of frustration.

Keystroke | Where does it take you?
Ctrl + End | The last cell of a data set
Ctrl + Up or Down Arrow Keys | The top or bottom of the data set
Ctrl + Left or Right Arrow Keys | The left-most or right-most cell of a data set
Ctrl + Shift + Arrow Keys | Selects cells in the same column/row as the active cell. A great shortcut when you want to quickly select and copy data.
Page Up and Down | Moves one Excel screen up or down in a worksheet
Alt + Page Up or Page Down | Moves one Excel screen to the left or right in a worksheet
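If you are curious how big a worksheet ending at column “XFD” really is, the column letters work like a base-26 number. The Python sketch below converts a column label to its position; the function name column_number is my own.

```python
def column_number(letters):
    """Convert an Excel column label such as "XFD" to its 1-based position."""
    number = 0
    for letter in letters.upper():
        # Each letter is one base-26 digit, with A=1 through Z=26.
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number


last_column = column_number("XFD")
last_row = 1_048_576  # the last row mentioned above

print(last_column)             # number of columns in a worksheet
print(last_column * last_row)  # total cells in a worksheet
```

Column “XFD” works out to column 16,384, so a single worksheet holds over 17 billion cells, which is exactly why shortcuts beat scrolling.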

I’ve never had an Excel dataset that I couldn’t quickly scroll through on my own.

Suit yourself, but these shortcuts are an excellent way to save yourself the time and effort it currently takes you to find the information you need. Save yourself some eye strain and practice these shortcuts! Once you’ve gotten into the habit of using them, you’ll wonder how you lived without them. Check out Microsoft's dedicated page for additional keyboard shortcut suggestions!

  April 19, 2018

Data in the News: Teacher Salary Protests

If you've visited CNN, NPR or the New York Times in the past few weeks, you may have heard about the current teacher strikes in certain states demanding higher salaries. Oklahoma, Kentucky, West Virginia, and Arizona are among the key states where teachers are protesting what they believe to be unfairly low salaries compared to their colleagues in other states. When considering teacher salary data, it is interesting to examine how these numbers have changed over time and how they vary by state.

Overall, average public school teacher salaries increased by nearly 60% between 1995 and 2017. However, by using the data set above to calculate this change by state, it’s clear that some states have experienced slower salary growth than others. For instance, teacher salaries in Oklahoma, Arizona, and West Virginia have increased by 38%, 48%, and 43%, respectively, whereas salaries in states such as New York have risen as much as 68% in the same time period.
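The growth figures above are percent changes, and the calculation is easy to reproduce for any state. The Python sketch below shows the formula; the two salary values are hypothetical round numbers chosen to make the arithmetic easy to follow, not figures from the data set.

```python
def percent_change(start, end):
    """Percent change from a starting value to an ending value."""
    return (end - start) / start * 100


# Hypothetical average salaries, for illustration only.
salary_1995 = 40_000
salary_2017 = 64_000

print(round(percent_change(salary_1995, salary_2017), 1))  # 60.0
```

Note that this measures growth in nominal dollars; adjusting for inflation or cost of living would be a separate step.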

As with any analysis, it is important to consider external factors that may influence the real-world implications we observe in data. When comparing data such as salaries among states, factors such as regional cost of living and state averages must be included. It’s unlikely that the average cost of living is the same between Oklahoma and Manhattan, for example, which may account for the differences in salary growth. At the same time, data can never tell the entire story, and news stories reporting teachers who work multiple jobs to pay rent illustrate that there is a problem beyond just differences in cost of living.

Therefore, this case illustrates the intersection between a data set, external factors, and real-world implications. While it may be easier to draw conclusions based on numbers alone, it is crucial to contextualize an analysis by considering underlying factors and then examining their impact on society. Working with both hard data and first-hand news articles is a good first step to getting closer to the full story for any data challenge.

  March 31, 2018

Evaluating a Data Source

Previously, I’ve discussed factors you should consider when evaluating a data set that meets your information needs. This included reading through the data documentation, noting any data outliers, and so on. However, like all other kinds of content, numbers can be just as easily manipulated to paint a rosier or different picture than actually exists. For this reason, it is equally important to evaluate the source organization that is responsible for collecting and distributing the data set you’ve found and want to use.

The breakdown

So what are some ways you can evaluate a data source? Like the evaluation of an actual data file, you should go into the evaluation of a data source with a few questions in mind.

  • What survey questions were used to collect this data? These are usually provided by the source and reading through these on your own can help you note any subtle wording that may have influenced the respondent’s answers or unclear wording that many respondents could have interpreted differently.
  • What was the sample size and is it appropriate for the population discussed? A sample size of 50 people for the analysis of a population of 50,000 is not quite reliable.
  • How and when was this data collection carried out? Is the data based on a telephone survey that was conducted five years ago? The application of that data to the present is not a judicious decision.
  • Why did the organization carry out the survey and share their results? This is key to understanding what motivations or incentives the organization may have in disseminating or even suppressing the information.
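To put a number on why a sample of 50 is shaky, you can estimate the margin of error. The Python sketch below uses the standard formula for a proportion and rests on several assumptions of mine: a simple random sample, a 95% confidence level (z = 1.96), and the worst-case proportion of 0.5. It also ignores the small correction for a finite population of 50,000.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Compare a sample of 50 with two larger, hypothetical sample sizes.
for n in (50, 500, 1000):
    print(n, round(margin_of_error(n) * 100, 1), "percentage points")
```

With 50 respondents the margin of error is roughly 14 percentage points either way, so a reported 40% could plausibly be anywhere from the mid-20s to the mid-50s.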

But the data I found comes from a major organization! It must be fine, right?

Thanks to the internet, we are presented now more than ever with an infinite amount of information from a myriad of sources that all claim authority. However, these claims, the brand name of the organization, or the size of the data should not by themselves validate a source's authority. As much as major organizations are perceived as reliable and trustworthy, all organizations have interests in mind that may influence what they included in the survey and how they carried that survey out. With that in mind, it’s always best to evaluate a data source you’ve come across with a healthy degree of skepticism.

  February 28, 2018

Data in the News: Flu Mortality Statistics

With spring upon us, it seems the current flu season may be slowly drawing to a close. Current reports from the Centers for Disease Control and Prevention (CDC) indicate that the hospitalization rate for flu diagnoses was 59.9 per 100,000 persons during the first week in February. The U.S. has not experienced a rate this high since the 2014-2015 flu season, which reached 50.9 per 100,000 that same week. [1] In reviewing and discussing these hospitalization rates, it is natural to wonder how they compare to flu death rates.

Death rate statistics for the current flu season have been widely reported; however, while reading these articles remember that these are estimates. Like most health statistics, final mortality data lags by a year or two, and so what we currently see in the news about the flu season is based on estimates of reported flu deaths. What does this mean? It means that these figures are based on preliminary evidence of cause of death, which may be revised once the CDC receives more complete data. That’s not to say the current CDC statistics are wrong, but that they are estimates until the reporting data is finalized, which will not happen until much later in 2018.

So what do the annual flu death rates look like? Based on the chart below, we can observe that the average U.S. flu and pneumonia death rate has gradually decreased between 1998 and 2014. Browsing from year to year in the map view, we can also see that Arkansas and West Virginia in particular have experienced consistently high death rates compared to the U.S. average.

Therefore, while the 2018 season had higher estimated rates of hospitalization and death than in recent years, the overall trends show that deaths are declining.

Overall, when reading about data in the news it’s important to examine the information the same way you might when collecting data to use in a class or other project. By acknowledging when data are estimates or preliminary, and seeking out additional information on overall trends, it will be easier to obtain a complete picture of what story the data is telling us.


[1] Centers for Disease Control and Prevention. (2018, February 9). CDC Update on Widespread Flu Activity. [press release]. Retrieved from

  January 30, 2017

So many data sources, so little time...

It’s the beginning of a new year, which means that hundreds of government agencies and bureaus are releasing 2017 data updates for their numerous data sets. And by “hundreds” I mean so many we actually do not know how many federal agencies exist.

This is why many people’s first instinct is to visit the Census Bureau to gather statistics on all sorts of topics. It’s a centralized resource that provides data sets such as the American Community Survey which cover an array of demographic and socioeconomic topics. However, as you advance in your research or if your information needs require a more specialized focus, you may need to turn to one of those “hundreds” of federal agencies for more detailed statistics.

The breakdown

Sifting through all the federal agencies for a data set that meets your needs can feel a lot like looking for a needle in a haystack. Thankfully, I’ve spent enough time looking for new data and updating our current SAGE Stats data to identify the federal agencies that will help your more focused research get started on the right foot.

Topic | Agency | Specific resource
Agriculture | U.S. Department of Agriculture | There are several options that range from food safety to agricultural trade.
Crime | Federal Bureau of Investigation (FBI) | The Uniform Crime Report is one of the first go-to resources for crime statistics.
Economy | Bureau of Economic Analysis (BEA) | You know all those GDP figures news outlets report? They get those from the U.S. Economic Accounts resource.
Education | The National Center for Education Statistics (NCES) | Multiple data tools are available here depending on your interest in academic levels.
Employment | Bureau of Labor Statistics (BLS) | Employment data can be sliced several ways and the BLS provides more than several options.
Health | Centers for Disease Control | Oh, boy where to start. CDC Wonder is a great resource for researchers who want to customize their data download files. For more summarized statistics, check out the Data & Statistics page.
Populations | The Census Bureau strikes again | Although the Bureau has population data up the wazoo, its Population resource focuses solely on population counts.
Transportation | Bureau of Transportation Statistics (BTS) | The BTS simplifies your research by providing information via multiple reports and tools.

These agencies provide ready-to-use data files, right?

Aw, bless you. I mentioned in previous posts that data cleaning is a necessary evil to get your data in shape for analysis and visualization – and federal agency data sets are no exception. The resources outlined in the section above provide data and statistics in all sorts of formats and sizes, so there is still work left for you to do. The clean-up work may be quick or extensive depending on the size of the file and the data’s complexity. For immediate results, check out our Advanced Search on SAGE Stats to find statistics from these same agencies in convenient Excel format!

  December 5, 2017

Understanding Different Census Geography Types

I received an excellent question on my previous blog post about the American Community Survey (ACS): Does the ACS, or Census Bureau more generally, provide statistics by urban area? The answer to that is a big fat YES. There are actually several different geography types that are specifically used to analyze urban areas and their surroundings. For now, I’ll focus on Core Based Statistical Areas (CBSAs) (I also call them metro areas more generally) because they fit a broader definition of what an urban area is.

Core Based Statistical Areas (CBSAs) consist of at least one core area with a population of 10,000 or more, plus surrounding counties that exhibit a high degree of social and economic integration with the core area based on work commutes. CBSAs are great units of analysis if you are studying areas that are influenced by the economic and social activity of one or more cities or urban areas.

Check out how the Washington, D.C. metro area has changed!

For instance, Washington, D.C. is a major employer hub for the surrounding counties in Maryland and Virginia – the proximity of the federal government makes these areas ideal for all kinds of companies, which require many employees, which then require more housing construction, which requires more public roads, which means more car buyers, which means more banking loans, and so on. Soon enough, it becomes difficult to distinguish where the domino effect of the city’s economic influence begins and ends.

The breakdown

Like counties, CBSAs have boundary definitions, all of which are outlined by the Office of Management and Budget (OMB) and updated approximately every decade. The last major definition update occurred in 2013, but the OMB is known to modify a CBSA between updates. Currently, there are two types of CBSAs that differ only in the population size of their core areas:

CBSA type | Core area population requirement | Geographic building blocks
Metropolitan Statistical Area (MSA) | At least 50,000 people | Counties
Micropolitan Statistical Area | At least 10,000 but less than 50,000 people | Counties
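
The two population thresholds can be written as a quick check. This is only a sketch of the rule as summarized in this post; the function name classify_cbsa is made up for illustration, and the OMB's full definitions also depend on commuting ties that are not modeled here.

```python
from typing import Optional


def classify_cbsa(core_population: int) -> Optional[str]:
    """Return the CBSA type implied by a core area's population, or None.

    Hypothetical helper: only the population thresholds from this post are
    applied, not the OMB's commuting-based integration tests.
    """
    if core_population >= 50_000:
        return "Metropolitan Statistical Area"
    if core_population >= 10_000:
        return "Micropolitan Statistical Area"
    # Below 10,000 the core area is too small to anchor a CBSA.
    return None


for population in (8_500, 10_000, 49_999, 50_000):
    print(population, classify_cbsa(population))
```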

It’s important to note that a decade is a significant period of time and that CBSA boundary definitions are likely to change during it. This makes analyzing one CBSA across more than 10 years tricky, so keep it in mind when researching and analyzing CBSA statistics. Good data sources will provide the specific definition year for the CBSA boundaries used in a data set to avoid confusion.

How exactly do CBSA boundaries change over time?

CBSAs can gain or lose counties; sometimes new CBSAs are born, and occasionally existing ones are eliminated. Additionally, their names can change year-to-year based on the relative population of the largest cities. Again, be cautious when comparing CBSAs across any span of time greater than 10 years. For more information, visit the Census Bureau or visit our SAGE Stats Methodology page!

  October 30, 2017

The American Community Survey: U.S. demographic characteristics at your fingertips

The Census Bureau is anyone’s go-to source when it comes to national and local U.S. statistics. It has anything and everything you can think of about the U.S. population. What U.S. county has the greatest number of people claiming Nepali ancestry? What is the average mortgage payment in my zip code? What is the average travel time to work in Wyoming?

You get the idea. The Census Bureau has a lot to offer, but there is one dataset in particular that is likely to provide answers to many of the different socioeconomic questions you have: the American Community Survey (ACS).

The breakdown

The ACS is an annual survey program that collects and provides key indicators about the American public. It covers a multitude of topics such as employment, housing costs, health insurance coverage, and so on. Think of it as an annual companion to the decennial census – only instead of collecting basic information like race and sex, the ACS collects more detailed characteristics like average rent paid, educational attainment, and much more. ACS statistics are released in batches, typically beginning in the fall following the year of reference. These batches are divided into the 1-year, 1-year supplemental, and 5-year estimates. But what do these mean exactly?

ACS estimates | Definition | How to use it
1-year estimates | Data collected over a 12-month period, e.g. January 1, 2016-December 31, 2016. | Best used when analyzing areas with populations of 65,000 or more and when currency is more important than precision.
1-year supplemental estimates | Data collected over a 12-month period. | Best used when analyzing areas with populations of 20,000 or more and when smaller geographies are not available in the regular 1-year estimates release.
5-year estimates | Data collected over a 60-month period, e.g. January 1, 2012-December 31, 2016. | Best used when you’re more concerned with precision than currency and when analyzing any size population. These are the best estimates to use when analyzing small population areas.
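
The guidance in the table above can be condensed into a small decision helper. This is a rough sketch, assuming the only inputs are the area's population and whether you care more about currency than precision; the function name and its parameters are hypothetical, and the Census Bureau's own guidance covers more cases.

```python
def suggest_acs_estimate(population: int, prefer_current: bool = True) -> str:
    """Suggest an ACS release using the population cutoffs from this post.

    Hypothetical helper for illustration, not an official Census Bureau rule.
    """
    if not prefer_current:
        # Precision matters more than currency: 5-year estimates fit any size.
        return "5-year estimates"
    if population >= 65_000:
        return "1-year estimates"
    if population >= 20_000:
        return "1-year supplemental estimates"
    # Small areas are only covered by the 5-year release.
    return "5-year estimates"


print(suggest_acs_estimate(250_000))
print(suggest_acs_estimate(30_000))
print(suggest_acs_estimate(250_000, prefer_current=False))
```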

For a complete breakdown, check out the Census Bureau.

So what's the best way to browse all the ACS statistics?

The best entry point to find the ACS statistics you want is American FactFinder, a warehouse of statistical information from surveys implemented by the Census Bureau (including the ACS). It provides multiple ways to get the information you need – from a simple location search to a mass download option. Be sure to select the American Community Survey as the specific Census program you would like to view. As I mentioned, the ACS collects information on a wide range of topics, whereas other Census surveys, like the American Housing Survey, typically focus on one topic. The ACS therefore gives you more bang for your buck! However, take note: ACS estimates are based on a smaller sample of the U.S. population than the traditional decennial census. Its estimates therefore carry a higher margin of error, but they are timelier than decennial census figures. For more information, visit the Census Bureau.

  October 10, 2017

Tips on Data Viz

The basics

For most of us, the fun bit of working with data and statistics is the visualization aspect. Many of us are visual learners, or at least understand information best when it’s simplified into a picture. This is especially true when dealing with data and statistics, which hide the meaning and significance of the information they carry behind numbers and variables. No one looks at the data set below and immediately thinks, “Got it, the construction industry’s contribution to U.S. GDP is rising again after the Great Recession. Easy.”

Just looking at a data set is not enough (to the inexperienced eye) to identify patterns in the numbers – we need a representation to more quickly and easily communicate the meaning and significance to our readers. For instance, the construction data in chart form relays the same information, but in a much more understandable way.

Thanks to Excel, almost everyone who has used a computer in school or for business in the past 15 years can create a simple chart like a bar or line graph. (And if you haven’t, the online resources available are infinite.)

However, even if you are a seasoned Excel user or chart builder, below are some practices you should keep in mind as you work on your next visualization.

The breakdown: Top Tips on Creating Data Visualizations

  1. Ask yourself, “What do I want to show in my chart?” Do you want to show a comparison, trend over time, or relationship among data sets? All of these are great options, but one chart type will usually represent the information better than others. Here’s a quick guide compiled by Dr. A. Abela to help you narrow your options to the best choice.
  2. Minimize the number of variables in your chart to avoid confusion. What does the reader need to know to understand the significance of the information? Limit the chart to those items.
  3. Keep the focus on the data, not the visual. It’s tempting to go all out and create a complex visualization using graphic design, but sometimes simplicity is best because you avoid distorting the information or leaving it open to misinterpretation.
  4. Provide context. Don’t assume the reader will immediately understand the information the chart is trying to impart – provide a brief title or subtitle as needed. Be sure to indicate what is being measured (people? U.S. dollars? Squirrels?) in the title or in any labels.
  5. Provide a detailed citation. “Census Bureau” is not going to cut it! Big sources like the Census Bureau release thousands of data sets on an annual basis – help the reader who wants to find and use the chart’s underlying data by providing the name of the report or data set and a direct URL.

For more information on how to create the best data visualization, check out Data Visualisation: A Handbook for Data Driven Design by Andy Kirk!

  August 14, 2017

The Data Analysis Process

#3 – Summarizing your data

The basics

So – we have found the data and we have cleaned the data. Great! But, now what do we do with it? The third and final stage of the data analysis process really gets to what you needed to begin with – information and supporting evidence.

Context: Read this CQ Researcher report to learn about energy development and its possible expansion into Native American territories!

As I mentioned in my first post, raw data oftentimes does not make sense at face value or it at least does not provide enough context for a person to understand its significance. This requires the user to “summarize” that micro-information into straightforward intelligence. “Summarizing” data into statistics is much less about creating new information than it is translating and contextualizing the data into meaningful information for everyone.

In the midst of the 2016 campaign debates about climate change, my editor came across statistics on U.S. electrical generation by state and recommended it as a great addition to SAGE Stats. I agreed and found the original data on the Energy Information Administration’s (EIA) website. The EIA regularly releases data on electricity generation by source across the U.S.; however, you’ll see below that electricity generation is measured in megawatt hours.

Not many people understand what a “megawatt hour” is – I certainly didn’t, I had to Google it! Is 1,000 megawatt hours a lot? Is it too little? How about one million?

Although I had no idea what a megawatt hour was, I understood it was measuring energy production across the U.S., which is valuable information for assessing which states are moving away from traditional energy sources like coal. But how could I translate megawatt hours into a statistic that everyone could understand? This required me to calculate statistics from the EIA data to neatly “summarize” the information it provided.

The breakdown: Summarizing data, an example.

In this scenario, I was specifically interested in electricity generation by source type and by U.S. state. The EIA provides this information as well as overall total electricity generation values. When facing raw values such as these, ask yourself, “What can I compare these values to in order to better understand their significance?” This is a great question to ask because your audience will understand information much more easily if it’s compared to other information.[1] Here are a number of statistics you can calculate to answer this question:

  • Totals: summing values to get a big picture perspective is often handy.
  • Percent of totals: excellent for comparing segmented data against overall totals.
  • Amount change: a good option to compare how much values have changed.
  • Percent change: a good way to compare the size by which values have changed.
  • Averages: these include mean and median averages.

Based on the EIA’s data, I decided that comparing electricity generated by source type to the overall total electricity generated was much more meaningful. So I calculated percentages for each source type against the total number of megawatt hours. That way I could gauge how much of each state’s total electricity was generated by coal, wind, natural gas, and so on. The results were much easier to understand and particularly enlightening!
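
To make the arithmetic concrete, here is a small sketch of each summary statistic from the list above. Every number in it is made up for one imaginary state and is not an EIA figure, and the helper names amount_change and percent_change are my own for illustration.

```python
from statistics import mean, median

# Hypothetical megawatt hours by source for one imaginary state (not EIA data).
generation_mwh = {
    "coal": 400_000,
    "natural gas": 350_000,
    "wind": 200_000,
    "solar": 50_000,
}

# Total: the big-picture value everything else is compared against.
total_mwh = sum(generation_mwh.values())

# Percent of total: each source's share of all electricity generated.
percent_of_total = {
    source: round(mwh / total_mwh * 100, 1)
    for source, mwh in generation_mwh.items()
}


def amount_change(old: float, new: float) -> float:
    """How much a value changed, in the original unit."""
    return new - old


def percent_change(old: float, new: float) -> float:
    """How large the change is relative to the starting value."""
    return (new - old) / old * 100


wind_last_year = 160_000  # hypothetical prior-year value
wind_amount = amount_change(wind_last_year, generation_mwh["wind"])
wind_percent = percent_change(wind_last_year, generation_mwh["wind"])

# Averages: the mean and the median can tell different stories.
mean_mwh = mean(generation_mwh.values())
median_mwh = median(generation_mwh.values())

print(total_mwh, percent_of_total)
print(wind_amount, wind_percent, mean_mwh, median_mwh)
```

A share like "40 percent coal" is far easier to read than 400,000 megawatt hours, which is the whole point of comparing values against a total.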

Embedded visualization from SAGE Stats product.

Once the data was mapped out, I saw that a large percentage of the Midwest’s electrical generation was due to wind energy – an interesting result considering that neighboring states have strongly adhered to coal and oil.

So I can just throw any raw data values together, right?

Yeah, that’s a big N-O. Use your best judgment when you calculate statistics. Any of the statistics in the section above should be calculated with values of the same unit of measure. So don’t go adding dollar values and percentages together because that makes no sense. Likewise, be careful of any missing or incorrect data values that can throw your calculations off (although your evaluation of the raw data, as reviewed in my first post of this series, should help you become aware of those!). If in doubt, ask for help from a trusted resource such as your instructor, librarian, or colleague.


[1] Herzog, David. Data Literacy. Thousand Oaks: SAGE Publishing, 2016. Print.

  July 28, 2017

The Three Stages of Data Analysis

#2 – Cleaning your data

The basics

The term “data cleaning,” the second stage of the data analysis process, is usually met with some confusion. I mentioned to a friend that the most recent SAGE Stats data update required a lot of cleaning, which was taking up a significant amount of time. She asked, “So what exactly is data cleaning?” An excellent question!

Data cleaning or “scrubbing” consists of taking disorganized, messy data and transforming it into a format that enables easier analysis and visualizations. Depending on your formatting or metadata requirements and how big the data file is, it can take days to clean a file into submission.

Since I began working on SAGE Stats, I’ve learned many Excel tricks that can be applied to any kind of data cleaning situation. To avoid information overload, I’ll stick to the tricks I’ve successfully used in the past two years.

The breakdown

Top 10 Tips on Cleaning Your Data

  1. Read the data documentation. This will tell you what each component of the data file represents and help you identify what data is most relevant to your research interests and what data you can avoid.
  2. Excel’s “Text-to-Columns” feature. Especially large data files are often stored in “csv” or “comma separated value” formats and can be imported into Excel using this handy feature.
  3. VLOOKUP formula. My holy grail of Excel formulas. Do you want to pull multiple values from a workbook into another workbook? VLOOKUP has your back.
  4. COUNTIF formula. Are you looking for duplicate values in a range or checking whether values in one workbook are present in another workbook? COUNTIF counts the number of times a value occurs in a range!
  5. LEFT and RIGHT formulas. These are very useful when you need to parse out specific characters from the beginning or end of a value. For instance if “092017” represents September 2017, but I only need the year, then I can use the RIGHT formula to collect the last four digits.
  6. TRIM formula. Frustrated by inexplicable extra spaces that follow the value you want? This formula “trims” those out for you.
  7. CONCATENATE formula = “&”. Concatenate is a fancy word for linking two values together – you can use the formula for this or insert an ampersand between the two cell references, e.g. =A1&B1.
  8. I don’t think Excel’s filters get enough credit. Are you looking for multiple misspellings of New York? The filters help you quickly identify and correct them.
  9. Nest your formulas. Find ways to combine formulas to reduce the number of steps you have to complete! For instance, do you need to look up values in Workbook 1 that are associated with a value’s last five characters in Workbook 2? Nest the RIGHT and VLOOKUP formulas to quickly get your answer.
  10. Work off a copy of the original data file. You don’t want to be in a situation where you have mistakenly deleted data values and then have to download the data file again. Keep the original version handy as a backup.
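
If you ever need to repeat these steps outside Excel, several of the formulas above have close Python analogues. The snippet below is only a sketch: the sample values and the lookup table are invented for illustration, and the Python versions are approximations rather than exact replicas of Excel's behavior.

```python
# Hypothetical sample values, invented for illustration only.
period = "092017"                  # MMYYYY, as in tip 5
year = period[-4:]                 # like RIGHT(A1, 4)
month = period[:2]                 # like LEFT(A1, 2)

messy = "  New   York  "
# Like TRIM: strips both ends and collapses inner runs of spaces to one.
cleaned = " ".join(messy.split())

label = "Albany" + ", " + "NY"     # like CONCATENATE or =A1&B1

states = ["NY", "FL", "NY", "ME", "NY"]
ny_count = states.count("NY")      # like COUNTIF(range, "NY")

# Like VLOOKUP with an exact match: a dictionary keyed on the lookup column.
name_by_code = {"00036": "New York", "00012": "Florida", "00023": "Maine"}
record_id = "ACS2017-00036"
# Tip 9 in action: the "RIGHT" slice is nested inside the lookup.
state_name = name_by_code.get(record_id[-5:], "#N/A")

print(year, month, cleaned, label, ny_count, state_name)
```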

This is a lot of work. Why do I need to clean the data file at all?

Sometimes a data set is so simple that it requires no cleaning at all; however, that’s not usually the case. These days you will typically encounter a file with all data merged into one column, which you then have to unmerge or parse out by yourself. Then you find that you need to concatenate some values back together. And then you realize that some values occur multiple times and you want to find out how many times each one occurs in the file. All this when you only want a snippet of that information! Data cleaning is a necessary evil at times in order to get your data in shape for easier visualizations and more accurate information.
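
The scenario just described (everything merged into one column, then re-joining values, then counting repeats) can also be sketched in a few lines of Python. The file contents below are made up, and using the standard csv module is one reasonable approach rather than the only one.

```python
import csv
import io
from collections import Counter

# Hypothetical file contents: each line arrives as one merged, comma-separated value.
raw_text = (
    "state,year,value\n"
    "New York,2016,10\n"
    "Florida,2016,7\n"
    "New York,2017,12\n"
)

# Step 1: parse the merged lines into named fields (what Text-to-Columns does).
rows = list(csv.DictReader(io.StringIO(raw_text)))

# Step 2: concatenate two fields back together into a single key.
keys = [f"{row['state']}-{row['year']}" for row in rows]

# Step 3: count how many times each state occurs in the file.
occurrences = Counter(row["state"] for row in rows)

print(keys)
print(occurrences.most_common())
```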

The best way to learn these tricks (and even more advanced tricks) is to dive in head first and try them out with a specific data set. In Excel’s case, doing is better than reading or listening. After all, no one starts out as an expert, and I am no exception! My tips above are suggestions and may not work with your specific needs, but they can be applied in almost every kind of data situation. If you use them often enough, then they practically become muscle memory.

  June 7, 2017

The Three Stages of Data Analysis

#1 – Evaluating raw data

The basics

Starting to analyze your data? Head to SAGE Research Methods' Which Stats Test for more guidance!

A friend I haven’t seen in a while asked me what I do for a living, and I talked about SAGE Stats and the work that goes into maintaining and building the collection. Instead of his eyes glazing over (like most people’s would) he asked me, “Ok. Not to seem like an idiot, but what is data analysis? Like what does it cover?” If you’ve had similar thoughts, never fear! I think I can safely say I’ve received multiple variations of this question before. My typical answer: what doesn’t it cover?

Data analysis covers everything from reading the source methodology behind a data collection to creating a data visualization of the statistic you have extracted. All the steps in-between include deciphering variable descriptions, performing data quality checks, correcting spelling irregularities, reformatting the file layout to fit your needs, figuring out which statistic is best to describe the data, and figuring out the best formulas and methods to calculate the statistic you want. Phew. Still with me?

These steps and many others fall into three stages of the data analysis process: evaluate, clean, and summarize.

Let’s take some time with Stage 1: Evaluate. We’ll get into Stages 2 and 3 in upcoming posts. Ready? Here we go…

The breakdown: Evaluate

Evaluating a data file is kind of like an episode of House Hunters: you need to explore a data file for structural or other flaws that would be a deal breaker for you. How old is this house? Is the construction structurally sound? Is there a blueprint that I can look at?

Similarly, when evaluating a raw data file you have collected, you should consider the following questions and tips:

  • Read through the data dictionary, codebook, or record layout, which should detail what each field represents. Try not to immediately start playing with the data until you know what you’re looking at. You wouldn’t start renovation in your new house without reading the blueprints, right? You gotta know if that wall is load-bearing!
  • What irregularities does the methodology documentation detail and how may they have affected the data? What are the methodology notes that I should make transparent to the reader?
  • Is the raw data complete? That is, are there missing values for any records? (Missing values in the raw data can distort your calculations.)
  • What outliers exist in the data set? Do they make sense in the context of the data? For instance, a house price of $1.8 million in a neighborhood where houses don’t exceed $200K is probably a red flag.
  • Spot check the raw data. If the data set provides totals, then sum the values and check that they match. If they don’t, then does the documentation explain why they may not add up to the totals?

When spot checking, it’s good to check a data point that you may be familiar with. E.g. for geographic data, checking the data for your home state and other states that you are more familiar with will enable you to spot something weird and off faster than if you check something random.
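
A few of the checks above lend themselves to quick scripts. The sketch below runs them on made-up house prices in the spirit of the example in the list; the helper names are hypothetical, and the outlier rule (anything more than five times the median) is an arbitrary assumption you would tune to your own data.

```python
from statistics import median

# Hypothetical house prices for one neighborhood (None marks a missing value).
records = [
    {"address": "1 Elm St", "price": 150_000},
    {"address": "2 Elm St", "price": 175_000},
    {"address": "3 Elm St", "price": None},
    {"address": "4 Elm St", "price": 190_000},
    {"address": "5 Elm St", "price": 1_800_000},
]
reported_total = 2_400_000  # hypothetical total printed in the source file


def find_missing(rows, field):
    """Completeness check: return the rows with no value for the field."""
    return [row for row in rows if row.get(field) is None]


def totals_match(rows, field, expected_total):
    """Spot check: does the sum of the present values equal the stated total?"""
    computed = sum(row[field] for row in rows if row[field] is not None)
    return computed == expected_total


def flag_outliers(rows, field, multiple=5.0):
    """Flag values above `multiple` times the median (an assumed rule of thumb)."""
    values = [row[field] for row in rows if row[field] is not None]
    cutoff = multiple * median(values)
    return [row for row in rows if row[field] is not None and row[field] > cutoff]


print([row["address"] for row in find_missing(records, "price")])
print(totals_match(records, "price", reported_total))
print([row["address"] for row in flag_outliers(records, "price")])
```

A failed total or a flagged value is not proof of an error, only a prompt to go back to the documentation and ask why.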

The Washington Post has compiled incident-level data on police shootings since 2015 with the help of crowdsourcing. This is an impressive feat, but as I evaluated the raw data they provide, I walked away with several questions:
  • Are missing values due to underreporting by police?
  • What are the original sources for each incident?
  • Do they distinguish between neighborhoods in cities or just use major cities?
Together, these questions helped me decide that the Post's data was not suitable for use in SAGE Stats quite yet.

So if the source is good, then the data must be good too. Right?

It’s a mistake to assume the data is authoritative or fine as is just because it’s a published government source or another source you consider just as reliable. Data reporting is susceptible to manipulation and simple mistakes despite the best efforts and intentions of the responsible organizations. Assume nothing and evaluate the data to ensure it checks out! The next stage of data analysis is how to clean raw data to fit your needs. Stay tuned for my next post, where I will review the most effective Excel tips and tricks I’ve learned to help you in your own work!

  May 1, 2017

Data and Statistics 101

The fundamental difference between data and statistics (because who knew!)

The basics

If you haven't seen David McCandless' TED Talk presentation, you need to!

Before I started working on SAGE Stats, the idea of working with a large data set was quite intimidating. Shout out to the USDA’s Food Access Research Atlas! In the two years since, working regularly with our platform has really opened my eyes to how empowering and beautiful data is once you understand how to pull usable information from it.

My experience has also taught me how overwhelming and confusing data can be. What is a data set and how is it different than a time series? How can I tell if data content is reliable or not? What the heck is a data dictionary and why do I need it? Unless you are consistently elbows deep in data, it can be difficult knowing where to even start. So let’s begin with the very basics: what is the difference between data and statistics?

The two terms are often used interchangeably – even within the same breath. I have even caught myself using both terms in explaining SAGE Stats to team members and close friends without a second thought. Although it is easy to synonymize the two, they are in fact very different.

The breakdown

Data are collected and organized information typically provided in massive files with detailed records and a data dictionary to decode the variable information. The records in those data files do not communicate significant meaning to the naked eye, so time and analysis are needed to read through the data collection methodology, decipher variable information, and determine which variables are of interest to you.

You'll recognize data as those ugly massive files that instantly cause your hard drive to whine when you try to open them or cause the much feared wheel-of-death to appear.

Embedded visualization from SAGE Stats product.

Statistics are clear and understandable explanations or summaries of data based on analysis. Statistics are generally available in tables and represented graphically. For example, the median state unemployment rate in the U.S. was 4.0% in 2016. This is a statistic derived from analysis of sample data collected by the U.S. federal government.

The best way to think about it is that the statistic is the big picture, which is created by individual pixels, the data. (Insert Monet joke and Clueless reference here.)

So statistics are better than data, right?

Not necessarily. Whether you need data or statistics really depends on your research question. Data is needed when your research question addresses a new issue that hasn’t been explained or thoroughly explored yet – this requires a deep dive into data where you must analyze and derive meaningful knowledge that can answer your question.

A more straightforward research question, however, can be more quickly answered with statistics because the question has been asked before and so the analysis to answer that question has also already been done. For instance, a student who needs information on unemployment across the Rust Belt states can easily find an answer because that information is frequently processed by the federal government for its own assessment of the economic climate.

The difference between data and statistics lies in the analysis. Data needs to be analyzed to be understood, but a statistic can be understood right away. The next question is: how do I begin to analyze data to get the statistics I need? Stay tuned for my next blog post for tips on just that!

Diana Aleman is an associate editor on SAGE Stats and U.S. Political Stats, which simplify the statistical research process by providing ready-to-use statistics on the social sciences to students and faculty. She enjoys metadata challenges and wrangling raw data files into workable formats.

Raphael Jackson is an Assistant Editor working on SAGE Stats and SAGE Business Cases, online repositories designed to teach students about business and management as well as data analysis. Raphael cleans and harvests raw data files to be published on SAGE Stats and to embed datasets in teaching cases to allow students to practice making data driven decisions.