Ryan Estrellado

I work in public education and love data science.

Turning Dataset Codes to Words With R

Note: I include a lot of code in this post so my fellow data scientists can either learn from it or give me feedback about how to make it better. It’s totally ok to skip over all that and just check out the data. The data immediately follow the code and have ## marks in them.

Introduction: Labels Are Nicer Than Codes

I do a lot of work with the same public education datasets, like the CASEMIS dataset and the California School Dashboard datasets. These both have columns that use codes instead of labels. This means that you will never see the words “autism” in a CASEMIS dataset. Instead, you’ll see 320 among a sea of other disability codes. I’m a big advocate of sharing as much information and insight as we can about our school data, but that’s hard if your audience has to refer to their technical assistance guide to figure out what a code means. Consider the following plot, which takes a count of disabilities from a fake dataset:

I can just feel the insight slipping away as your audience flips through their manuals to figure out which disabilities 210 and 320 are!

So what can we do about this? In the past, I’ve tried one of three solutions in my spreadsheets:

  • Copy and paste a disability label over each disability code
  • Find and replace each disability code with a disability label
  • Just leave the disability codes as they are and deal with it

These solutions are ok, but they introduce a lot of human error and tend to be really slow, particularly if you’re working with datasets that have thousands and thousands of records. Once I started complimenting my spreadsheet work with programming in R, I learned how to swap those codes for labels pretty quickly. Here’s how that works!

1. Use a Named Vector to Transform Codes to Words

First we’ll make a pretend dataset of students, primary disability, and secondary disability. It’s the same pretend dataset we used to make the plot above. We’ll make a thousand records, since it’s a little easier to see the benefit of this technique over a larger dataset.

library(tidyverse)
# Names for disability vector
dis_codes <- c(seq(200, 280, by = 10), 281, seq(290, 330, by = 10))

# Make dataset
set.seed(200)

df <- tibble(
  disability_1 = as.character(sample(dis_codes, 1000, replace = TRUE)), 
  disability_2 = as.character(sample(dis_codes, 1000, replace = TRUE)))
df
## # A tibble: 1,000 x 2
##    disability_1 disability_2
##    <chr>        <chr>       
##  1 280          250         
##  2 280          260         
##  3 280          200         
##  4 290          270         
##  5 290          230         
##  6 310          260         
##  7 290          330         
##  8 210          281         
##  9 270          200         
## 10 230          240         
## # ... with 990 more rows

Next we’ll make a vector of our disability names. These are the names we’ll be replacing the codes with later. These labels are the exact ones you’ll see in the California Education Code and will correspond with the disability codes described in the CASEMIS technical assistance guide.

# Elements of disability label vector
dis_desc <- c(
  "none", "intellectual disability", "hard of hearing", "deafness", 
  "speech and language impairment", "visual impairment", 
  "emotional disturbance", "orthopedic impairment","other health impairment", 
  "established medical disability", "specific learning disability", 
  "deaf-blindness", "multiple disabilities", "autism", "traumatic brain injury"
)

Then we’ll use set_names from purrr to name each of the disability labels with their corresponding code. Now we have a named vector that looks like this:

# Create named vector
(dis_lookup <- set_names(dis_desc, nm = dis_codes))
##                              200                              210 
##                           "none"        "intellectual disability" 
##                              220                              230 
##                "hard of hearing"                       "deafness" 
##                              240                              250 
## "speech and language impairment"              "visual impairment" 
##                              260                              270 
##          "emotional disturbance"          "orthopedic impairment" 
##                              280                              281 
##        "other health impairment" "established medical disability" 
##                              290                              300 
##   "specific learning disability"                 "deaf-blindness" 
##                              310                              320 
##          "multiple disabilities"                         "autism" 
##                              330 
##         "traumatic brain injury"

This is the point where things start to feel like magic! We’ll take our new named vector and use mutate to transform the codes in disability_1 into actual disability labels.

df %>% 
  mutate(disability_1 = dis_lookup[disability_1])
## # A tibble: 1,000 x 2
##    disability_1                 disability_2
##    <chr>                        <chr>       
##  1 other health impairment      250         
##  2 other health impairment      260         
##  3 other health impairment      200         
##  4 specific learning disability 270         
##  5 specific learning disability 230         
##  6 multiple disabilities        260         
##  7 specific learning disability 330         
##  8 intellectual disability      281         
##  9 orthopedic impairment        200         
## 10 deafness                     240         
## # ... with 990 more rows

Now we can actually look at the disability_1 column and understand what it means. But what if you have more than one disability column that needs to be transformed?

2. Use a Function to Convert More Than One Column of Codes

The CASEMIS dataset, like other public education datasets, has more than one column that uses codes instead of labels. For example, there are two disability code columns: DISABILIT1 and DISABILIT2. One solution is to change each column separately, but let’s push for a more efficient solution. We can take our named vector and build a function out of it that takes a vector of disability codes and converts it into a column of disability labels.

convert_dis <- function(x) {
  # Names for disability vector
  dis_codes <- c(seq(200, 280, by = 10), 281, seq(290, 330, by = 10))

  # Elements of disability vector
  dis_desc <- c(
    "none", "intellectual disability", "hard of hearing", "deafness", 
    "speech and language impairment", "visual impairment", 
    "emotional disturbance", "orthopedic impairment","other health impairment", 
    "established medical disability", "specific learning disability", 
    "deaf-blindness", "multiple disabilities", "autism", "traumatic brain injury"
  )

  # Create named vector
  dis_lookup <- set_names(dis_desc, nm = dis_codes)

  # Create a new vector of disability descriptions
  as.vector(dis_lookup[as.character(x)])
}

Now let’s use mutate_at to call that function on both of the disability columns.

df %>% 
  mutate_at(vars(disability_1:disability_2), funs(convert_dis))
## # A tibble: 1,000 x 2
##    disability_1                 disability_2                  
##    <chr>                        <chr>                         
##  1 other health impairment      visual impairment             
##  2 other health impairment      emotional disturbance         
##  3 other health impairment      none                          
##  4 specific learning disability orthopedic impairment         
##  5 specific learning disability deafness                      
##  6 multiple disabilities        emotional disturbance         
##  7 specific learning disability traumatic brain injury        
##  8 intellectual disability      established medical disability
##  9 orthopedic impairment        none                          
## 10 deafness                     speech and language impairment
## # ... with 990 more rows

3. Collect Them All in A Package

I had so much fun making that work that I started taking notice anytime there were chances to create functions that convert codes to labels. As soon as I had a second function, I built an R package called spedtools to house them so I could just load the package at the top of any CASEMIS analysis I do. Now I don’t need to rebuild the functions over and over again and I can get my data transformed for analysis with some compact and readable code. Here’s what that looks like:

library(spedtools) 

df %>% 
  mutate_at(vars(disability_1:disability_2), funs(convert_dis))
## # A tibble: 1,000 x 2
##    disability_1                 disability_2                  
##    <chr>                        <chr>                         
##  1 other health impairment      visual impairment             
##  2 other health impairment      emotional disturbance         
##  3 other health impairment      none                          
##  4 specific learning disability orthopedic impairment         
##  5 specific learning disability deafness                      
##  6 multiple disabilities        emotional disturbance         
##  7 specific learning disability traumatic brain injury        
##  8 intellectual disability      established medical disability
##  9 orthopedic impairment        none                          
## 10 deafness                     speech and language impairment
## # ... with 990 more rows
Build an R package around all your renaming functions and load it up before every analysis

Build an R package around all your renaming functions and load it up before every analysis

Conclusion

Using programming as a compliment to my work in spreadsheets changed my mindset on tasks I find myself doing over and over again. If there’s something I know I’ll be doing many times, like converting disability codes to disability labels, I prefer to do a little more work up front if it helps me move faster and more accurately in the long run. That way I’m spending more time sharing useful information to data consumers and less time cleaning data.

PSA: Consider submitting your R related blog posts to rweekly.org