Why you should convert categorical variables into multiple binary variables

Take the example of a variable recording whether someone is judged to be very poor, poor, moderately rich, or rich. This could be the outcome of a participatory wealth ranking (PWR) exercise like the one used by Village Enterprise.

In a PWR exercise, local community leaders can identify households that are most vulnerable. These rankings can then be used to target a development program (like VE’s graduation-out-of-poverty program that combines cash transfers with business training) to the community members that are most in need.

Let’s say that you want to include the PWR results in a regression analysis as a covariate. You have a dataset of all the relevant variables for each household, including a variable that records whether the household was ranked in the PWR exercise as very poor, poor, moderately rich, or rich.

You need to convert this string variable (text) into a numeric value. You could assign each option a value from 1 to 4, with 1 being “very poor” and 4 meaning “rich” … but you shouldn’t use this directly in your regression.

If you have a variable that moves from 1 to 2 to 3 to 4, you’re implying that the categories are evenly spaced and that the outcome changes by the same amount at each step. You’re saying that the effect on your outcome variable of going from being very poor (1) to poor (2) is the same as the effect of going from poor (2) to moderately rich (3). But you don’t know what the real relationship is between the different PWR levels, since the ranking only tells you the order of the categories, not the distance between them. You can’t make the linear assumption.
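To make the problem concrete, here is a minimal Python sketch (not Stata) using made-up numbers. The mean outcomes per PWR level are purely hypothetical. It compares the actual jumps between adjacent levels with the single, constant jump that a straight line through the 1 to 4 codes would impose.

```python
# Hypothetical mean outcome at each PWR level (illustrative numbers only).
levels = ["very poor", "poor", "moderately rich", "rich"]
codes = [1, 2, 3, 4]
mean_outcome = [10.0, 12.0, 20.0, 21.0]

# Actual change in the outcome between adjacent levels.
actual_steps = [b - a for a, b in zip(mean_outcome, mean_outcome[1:])]

# Least-squares slope of the outcome on the 1-4 code.
x_bar = sum(codes) / len(codes)
y_bar = sum(mean_outcome) / len(mean_outcome)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(codes, mean_outcome))
         / sum((x - x_bar) ** 2 for x in codes))

# A linear term forces every one-level step to equal the slope.
implied_steps = [slope] * (len(codes) - 1)

print("actual steps: ", actual_steps)
print("implied steps:", implied_steps)
```

In this invented example the real jumps are uneven, but the linear coding can only report one average step size for all of them.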

So instead, you should break the ranking into four different binary variables: Ranked “very poor” or not? “Poor” or not? “Moderately rich” or not? “Rich” or not? (As explained further down, only three of these actually enter the regression.)
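As a rough illustration of what those binary variables look like, here is a small Python sketch (not Stata). The function name `to_dummies` and the list of households are hypothetical, chosen only for this example.

```python
LEVELS = ["very poor", "poor", "moderately rich", "rich"]

def to_dummies(ranking):
    """Return one 0/1 indicator per PWR level for a single household."""
    if ranking not in LEVELS:
        raise ValueError(f"unknown ranking: {ranking!r}")
    return {level: int(ranking == level) for level in LEVELS}

# Hypothetical PWR results for five households.
households = ["poor", "rich", "very poor", "moderately rich", "poor"]
dummies = [to_dummies(r) for r in households]

for ranking, row in zip(households, dummies):
    print(ranking, row)
```

Each household gets a 1 in exactly one of the four indicators and a 0 in the others.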

This Stata support page does a great job of summarizing how to apply this in your regression code, or how to create binary variables from a categorical variable using easy shortcuts. I like:

reg y x i.pwr

But how do you interpret the results?

When you create dummies (binary variables) out of a categorical variable, one of the groups serves as the reference group, and its dummy is left out of the regression.

By default, Stata uses the lowest value of the variable as the reference group. In this case, that means “very poor” (coded 1). So in the regression, you’ll have three dummies, not four. Being “very poor” is the base condition against which the other rankings are compared.
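Why leave one dummy out? A short Python sketch (not Stata) may help. The helper names `all_dummies` and `design_row` are hypothetical. The sketch checks that the four dummies always sum to 1, which makes them redundant with the regression’s constant term, and then builds regression rows that skip the base level.

```python
LEVELS = ["very poor", "poor", "moderately rich", "rich"]

def all_dummies(ranking):
    """One 0/1 indicator for every level, with nothing dropped."""
    return [int(ranking == level) for level in LEVELS]

def design_row(ranking, base="very poor"):
    """Constant term plus a dummy for every level except the base."""
    return [1] + [int(ranking == level) for level in LEVELS if level != base]

households = ["very poor", "poor", "moderately rich", "rich"]

# The four dummies always add up to 1, which duplicates the constant term.
redundant = all(sum(all_dummies(r)) == 1 for r in households)

# Rows with "very poor" as the base, and with "rich" as the base instead.
rows_default = {r: design_row(r) for r in households}
rows_rich_base = {r: design_row(r, base="rich") for r in households}

print(redundant)
print(rows_default)
print(rows_rich_base)
```

A household in the base group shows up as the constant plus three zeros, so the constant term carries the base group and each remaining dummy measures a difference from it.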

Let’s say there is a statistically significant, positive coefficient on the “moderately rich” dummy in your regression results. That means that, compared to the base condition of being very poor, being moderately rich is associated with a higher value of your outcome variable, holding the other variables in the regression constant.
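To see the “compared to the base” reading in numbers, here is a Python sketch (not Stata) with invented data and no other covariates. It fits ordinary least squares by solving the normal equations exactly. In this simplified setup, each dummy coefficient comes out equal to that group’s mean outcome minus the mean for “very poor”. With extra covariates like x in the model, the coefficients would be adjusted differences instead.

```python
from fractions import Fraction

LEVELS = ["very poor", "poor", "moderately rich", "rich"]
BASE = "very poor"

# Hypothetical (ranking, outcome) pairs, two households per level.
data = [("very poor", 10), ("very poor", 12), ("poor", 13), ("poor", 15),
        ("moderately rich", 20), ("moderately rich", 22),
        ("rich", 21), ("rich", 25)]

def design_row(level):
    """Constant term plus a dummy for every level except the base."""
    return [Fraction(1)] + [Fraction(int(level == lv))
                            for lv in LEVELS if lv != BASE]

def solve(a, b):
    """Solve a @ beta = b by Gauss-Jordan elimination on exact fractions."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]

# Normal equations: (X'X) beta = X'y.
X = [design_row(level) for level, _ in data]
y = [Fraction(outcome) for _, outcome in data]
k = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
beta = solve(xtx, xty)

# Group means, for comparison with the coefficients.
means = {lv: Fraction(sum(o for l, o in data if l == lv),
                      sum(1 for l, _ in data if l == lv)) for lv in LEVELS}

names = ["constant"] + [lv for lv in LEVELS if lv != BASE]
for name, coef in zip(names, beta):
    print(name, coef)
```

Here the constant equals the mean outcome of the “very poor” group, and the “moderately rich” coefficient is the gap between the “moderately rich” mean and that base mean.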

Published by

Hannah Blackburn

Hannah Blackburn is a Research Associate at UCSD with JPAL's Payments and Governance Research Group, under Professors Paul Niehaus and Karthik Muralidharan.
