Monday, February 4, 2008

Drop unused factor levels

When creating a subset of a data frame, I often exclude rows based on the level of a factor. However, the factor's levels remain intact, including levels that no longer occur in the subset. This is the intended behavior of R, but it can cause problems in some cases. I finally discovered how to clean up levels in this post to R-Help. Here is an example:
> a <- factor(letters)
> a
[1] a b c d e f g h i j k l m n o p q r s t u v w x y z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

## Now, even though b only includes five letters,
## all 26 are listed in the levels
> b <- a[1:5]
> b
[1] a b c d e
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

## This behavior can be changed using the following syntax:
> b <- a[1:5,drop = TRUE]
> b
[1] a b c d e
Levels: a b c d e
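
To make the "problems in some cases" a bit more concrete, here is a minimal sketch of where leftover levels show up. It uses only base R (`table()` and `nlevels()`); the names `counts` and `b_clean` are made up for this illustration.

```r
a <- factor(letters)
b <- a[1:5]

## table() reports every level, so the unused letters show up with a count of zero
counts <- table(b)
length(counts)      # 26 entries, not 5
sum(counts == 0)    # 21 empty cells

## after dropping unused levels, only the observed letters remain
b_clean <- a[1:5, drop = TRUE]
nlevels(b)          # 26
nlevels(b_clean)    # 5
```

The same empty cells carry over into anything built on the levels, such as plots or summaries grouped by the factor.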

Another way to deal with this is to use the dropUnusedLevels() function in the Hmisc package. The only issue here is that it changes factor subsetting globally for the session, which may have undesired consequences (see the post listed above).

****UPDATE****
As Jeff Hollister mentions in the comments, there is another way to do this:

a <- factor(letters)
b <- factor(a[1:5])
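
A small sketch, using only base R, to check that re-creating the factor ends up with the same levels as `drop = TRUE`. It also tries `droplevels()`, a base R helper found in versions of R newer than this post, so treat its availability in your installation as an assumption. The names `b_drop`, `b_refac` and `b_dropped` are made up for the comparison.

```r
a <- factor(letters)

b_drop    <- a[1:5, drop = TRUE]   # subsetting with drop = TRUE
b_refac   <- factor(a[1:5])        # re-creating the factor from the subset
b_dropped <- droplevels(a[1:5])    # assumes an R version that provides droplevels()

## all three keep only the levels that actually occur
levels(b_drop)
levels(b_refac)
levels(b_dropped)
```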


Yet another way, if you are working with data frames that by default convert characters into factors, was suggested on r-sig-ecology by Hadley Wickham:

options(stringsAsFactors = FALSE)
a <- data.frame(alpha = letters)
b <- a[1:5, , drop = FALSE]  ## first five rows; a[1:5] would try to select five columns
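
If changing a global option feels too broad, the same choice can be made per data frame. The sketch below passes `stringsAsFactors` to `data.frame()` directly and compares a character column with a factor column after taking the first five rows. The object names (`a_chr`, `a_fac`, and so on) are made up for this illustration.

```r
## set the behaviour per call instead of through options()
a_chr <- data.frame(alpha = letters, stringsAsFactors = FALSE)
a_fac <- data.frame(alpha = letters, stringsAsFactors = TRUE)

## keep the first five rows; drop = FALSE here only keeps the result a data frame
b_chr <- a_chr[1:5, , drop = FALSE]
b_fac <- a_fac[1:5, , drop = FALSE]

is.character(b_chr$alpha)   # TRUE: a character column has no levels to clean up
nlevels(b_fac$alpha)        # 26: the factor column still carries every level

## tidy the factor column after subsetting and assign the result back
b_fac$alpha <- factor(b_fac$alpha)
nlevels(b_fac$alpha)        # 5
```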

4 comments:

Jeff Hollister said...

James,
Alternatively you can just reassign as a factor. For instance,

a<-factor(letters)
b<-factor(a[1:5])

also does the trick.

Cheers,
Jeff Hollister

Larus said...

drop.levels <- function(dat) {
  dat[] <- lapply(dat, function(x) x[, drop = TRUE])
  return(dat)
}

Sylvia said...

Hey everyone,

I tried to drop unused levels as described in the post, yet as soon as I check again, the levels reappear. I tried also other ways (droplevels(results)), but all of them were only temporary.

This is my code:

> results$condition
[1] 1 1 2 2 3 3
Levels: 1 2 3 Block Duration Gleich: Groesser: Kleiner:
> results$condition[,drop=T]
[1] 1 1 2 2 3 3
Levels: 1 2 3
> results$condition
[1] 1 1 2 2 3 3
Levels: 1 2 3 Block Duration Gleich: Groesser: Kleiner:

Any ideas why this happens?

Thanks a lot,

Sylvia Kreutzer

Forester said...

Hi Sylvia,

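You did not save your changes. Subsetting returns a new factor and leaves the original untouched, so you have to assign the result back. Try:

results <- data.frame(condition = c(1, 2, 3, "Block", "Duration", "otherfactors"), stringsAsFactors = TRUE)

## keep only the rows you want, then drop the unused levels and assign the result back
results <- results[results$condition %in% c(1, 2, 3), , drop = FALSE]
results$condition <- results$condition[, drop = TRUE]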


Best,

James