@DavisVaughan
Member

@DavisVaughan DavisVaughan commented Nov 4, 2025

Member

@jennybc jennybc left a comment

I like the proposal! Made a few comments as I reacted to a first reading.

@DavisVaughan DavisVaughan changed the title Tidyup 8 - Retaining and excluding rows Tidyup 8 - Expanding the filter() family Nov 6, 2025
@DavisVaughan DavisVaughan marked this pull request as ready for review November 6, 2025 15:26
@wurli

wurli commented Nov 7, 2025

- `exclude()`, as noted above, which would have been paired with

FWIW, as a user I would much prefer the name exclude() to filter_out(). IMO, to the uninitiated it would not be clear which of filter()/filter_out() retains and which excludes, but I think if filter() is paired with exclude() then the purpose of both becomes clearer. I also like retain() as an alias for filter(), but on balance I agree it's probably best not to add an alias since filter() is so well established.

Love the idea for this API btw!

@DavisVaughan
Member Author

DavisVaughan commented Nov 7, 2025

@wurli most of us felt that filter_out() was very clear that it's removing the indicated rows, which helps you intuit that filter() must keep them.

We also really appreciated how it feels like a "variant" of filter() rather than another core verb. On the home page of dplyr we'd still just list filter(), that's the core verb. It's only when you come to filter()s help docs that you'd also learn about filter_out() (or your teacher would tell you about it). Similar to slice() being the core verb and slice_*() being the variants. I think there is something pretty powerful to this idea, and it also helps with autocompletions, i.e. filt<tab> brings up both, which is quite nice.

With exclude(), I'd feel the need to say "filter() / exclude() to keep or drop cases based on their values" on the dplyr home page and that felt like a net negative in comparison https://dplyr.tidyverse.org/#overview

@jrosell

jrosell commented Nov 7, 2025

With filter_out(), would someone wonder whether filter_in() exists as an alias for filter()?

@joeycouse

Love this implementation. I do think the filter() / filter_out() naming is slightly unclear, but I don't see this as a major hurdle in practice. IMO keeping the filter() API should take priority over slightly clearer language that would introduce new core verbs and leave filter() stranded.

@davidhodge931

Awesome proposal!

My 2 cents - I think filter/filter_out is slightly unclear naming. I think filter_keep/filter_drop would be better with filter deprecated

@DavisVaughan
Member Author

DavisVaughan commented Nov 9, 2025

@davidhodge931 as stated in the tidyup at https://github.com/tidyverse/tidyups/blob/feature/008/008-dplyr-filter-family.md#alternate-names-for-filter, we are not considering renaming filter(), so we are working within the constraints of that. Renaming filter() is likely just too disruptive to the whole community to be worth it.

@krlmlr
Member

krlmlr commented Nov 10, 2025

Love the idea!

How do you teach that filter_out(x, y) is actually filter_out(x & y) and not filter_out(x | y)? I'd be confused about half the time. Would it be safer to allow just one predicate in filter_out()? Haven't followed the entire discussion, please disregard if redundant.

# Sequence with filter()
. |>
  filter(x) |>
  filter(y)

# Same as conjunction
. |>
  filter(x, y)

# Sequence with filter_out()
. |>
  filter_out(x) |>
  filter_out(y)

# Same as alternation (!?!)
. |>
  filter_out(x | y)

@DavisVaughan
Member Author

I think the best way to teach this is probably something like:

  • With filter(), target rows to keep
  • With filter_out(), target rows to drop
  • Both combine with & (consistent for both)
  • If you want |, use when_any() (consistent for both)

# Combining with `&`
df |> filter(x, y)
df |> filter_out(x, y)

# Combining with `|`
df |> filter(when_any(x, y))
df |> filter_out(when_any(x, y))

I think the fact that df |> filter_out(x | y) is equivalent to df |> filter_out(x) |> filter_out(y), and df |> filter(x & y) is equivalent to df |> filter(x) |> filter(y) is theoretically pleasing, but isn't something I would harp on while teaching. Instead I'd focus on when_any(), which is used the same way no matter which verb you use.
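The two equivalences mentioned here can be checked with a minimal base R sketch. It does not call dplyr. Plain logical subsetting stands in for filter() and filter_out(), under the assumption (taken from this thread) that both verbs act only on rows where the condition is TRUE. The toy data frame and all intermediate names are made up for illustration.

```r
# Toy data, including one NA, purely for illustration
df <- data.frame(
  x = c(TRUE, TRUE, FALSE, FALSE, NA),
  y = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Stand-in for filter(): keep rows where the condition is TRUE
keep_and <- df[which(df$x & df$y), ]
keep_x <- df[which(df$x), ]
keep_chain <- keep_x[which(keep_x$y), ]

# Stand-in for filter_out(): drop rows where the condition is TRUE
drop_or <- df[!((df$x | df$y) %in% TRUE), ]
drop_x <- df[!(df$x %in% TRUE), ]
drop_chain <- drop_x[!(drop_x$y %in% TRUE), ]

# Compare by original row labels
same_keep <- identical(rownames(keep_and), rownames(keep_chain))
same_drop <- identical(rownames(drop_or), rownames(drop_chain))
```

Both comparisons agree on this data: keeping with `&` matches the chained keeps, and dropping with `|` matches the chained drops.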

@krlmlr
Member

krlmlr commented Nov 20, 2025

To me, the antisymmetry is not only theoretically pleasing. I'm reading . |> filter_out(x, y) like:

  • I'm taking the input
  • I'm filtering out the entries that match x
  • Then, I'm filtering out the entries that match y

I'd never read it like:

  • I'm taking the input
  • I'm filtering out the entries that match x and also match y

Even stronger with

. |>
  filter_out(
    x,
    y
  )

To me, the , translates to a "then" much better than to an "and". Is it only me? I don't know, but I'd like us to think a bit longer about the ambiguity here and the options that we have. The option most appealing to me is to implement an initial draft that accepts only one argument; there's much less ambiguity in filter_out(when_any(...)) and filter(when_all()). Then we can play with it and decide if and how we extend to multiple arguments.

@joeycouse

Completely agree with @krlmlr here.

df |> filter(x) |> filter(y)

df |> filter(x, y)

I think this is a critical function of the api that makes learning the syntax much easier, especially for beginners. I would expect filter_out() to have the same behavior.

@DavisVaughan
Member Author

There are two competing worldviews at play here.

  • filter() and filter_out() as complements of one another.

  • filter(df, x, y) and filter_out(df, x, y) as equivalent to df |> filter(x) |> filter(y) and df |> filter_out(x) |> filter_out(y).

Both of these have their pros and cons. My theory is that the first of these is the most practically useful for dplyr users and is the easiest to learn.

As complements

If both filter() and filter_out() combine using &, then you get the following result table:

df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))

df |> filter_out(x, y)
df |> filter_out(x & y)
df |> filter_out(when_all(x, y))

# ---

df |> filter_out(x | y)
df |> filter_out(when_any(x, y))

df |> filter(x | y)
df |> filter(when_any(x, y))

Notice how everything above the line related to & works the exact same regardless of whether it is filter() or filter_out(). Similarly, everything below the line works the same with |.

I'd argue that an extremely important property of this table is that you only have to learn 1 rule - that , separated conditions are combined with &. This exactly matches what people have been doing with filter() since day 1 of dplyr. There are no mental gymnastics required when swapping between filter() and filter_out() if you remember this 1 rule you've been using the whole time.

As a nice side effect this means you only need to worry about when_any() - if you find yourself using | in either filter() or filter_out(), you can immediately switch to when_any(), no extra thought required. filter() and filter_out() users should never need when_all() because conditions combine with & already, and that's perfectly fine, one less thing to learn, and when_all() is still useful on its own in other contexts.

This all means that if you are translating from a filter() to a filter_out() to simplify your conditions, then doing so is very easy by design. For example:

Filter out rows where the patient is deceased and the year of death was before 2012.

patients <- tibble::tibble(
  name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
  deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
  date = c(2005, 2010, NA, 2020, 2010, NA, NA)
)

patients
# A tibble: 7 × 3
  name  deceased  date
  <chr> <lgl>    <dbl>
1 Anne  FALSE     2005
2 Mark  TRUE      2010
3 Sarah NA          NA
4 Davis TRUE      2020
5 Max   NA        2010
6 Derek FALSE       NA
7 Tina  TRUE        NA

With years of filter() muscle memory built up, you might start with this:

patients |>
  filter(!(deceased & date < 2012))
# A tibble: 3 × 3
  name  deceased  date
  <chr> <lgl>    <dbl>
1 Anne  FALSE     2005
2 Davis TRUE      2020
3 Derek FALSE       NA

But you immediately get frustrated when it drops your NAs, and then you remember filter_out()! It is intentionally designed so that you can very easily drop the ! and () to translate to:

patients |>
  filter_out(deceased & date < 2012)
# A tibble: 6 × 3
  name  deceased  date
  <chr> <lgl>    <dbl>
1 Anne  FALSE     2005
2 Sarah NA          NA
3 Davis TRUE      2020
4 Max   NA        2010
5 Derek FALSE       NA
6 Tina  TRUE        NA

And boom that works as expected.
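To see why the negated filter() loses the NA rows, here is a small base R sketch that evaluates the condition by hand. It does not call dplyr. It assumes, as described in this thread, that filter() keeps only rows where the condition is TRUE and filter_out() drops only rows where it is TRUE. The variable names are illustrative.

```r
patients <- data.frame(
  name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
  deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
  date = c(2005, 2010, NA, 2020, 2010, NA, NA)
)

# The condition is TRUE for Mark, FALSE for Anne, Davis and Derek, NA otherwise
cond <- patients$deceased & patients$date < 2012

# Negating leaves NA as NA, so a "keep where TRUE" rule loses those rows
kept_by_negation <- patients$name[which(!cond)]

# Dropping only where cond is TRUE leaves the NA rows in place
kept_by_dropping <- patients$name[!(cond %in% TRUE)]
```

The first result contains Anne, Davis and Derek, and the second contains everyone except Mark, matching the two outputs shown above.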

And since there is only 1 rule that applies for both filter() and filter_out() - that conditions are combined with &, you'll probably also remember that you can simplify further to:

patients |>
  filter_out(deceased, date < 2012)

You also get this nice result, i.e. they are complements of one another

# Equivalent up to row ordering
union(filter(df, x, y), filter_out(df, x, y)) ~= df
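A minimal base R sketch of the complement claim, covering every TRUE / FALSE / NA combination of two conditions. It uses logical subsetting rather than dplyr, under the assumption that both verbs act only where the combined condition is TRUE. The names `kept` and `remaining` are illustrative.

```r
# All nine combinations of TRUE / FALSE / NA for two conditions
df <- expand.grid(x = c(TRUE, FALSE, NA), y = c(TRUE, FALSE, NA))

hit <- df$x & df$y

kept <- df[which(hit), ]            # stand-in for filter(df, x, y)
remaining <- df[!(hit %in% TRUE), ] # stand-in for filter_out(df, x, y)

# Every row lands in exactly one of the two pieces
covers_all <- nrow(kept) + nrow(remaining) == nrow(df)
no_overlap <- length(intersect(rownames(kept), rownames(remaining))) == 0
```

Under that assumption the two pieces partition the rows even when conditions evaluate to NA, because a row is either a TRUE hit or it is not.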

It is true that you can't break df |> filter_out(x, y) into df |> filter_out(x) |> filter_out(y) like you can with filter():

df |> filter(x, y)
df |> filter(x & y)
df |> filter(x) |> filter(y)

df |> filter_out(x | y)
df |> filter_out(x) |> filter_out(y)

But I'd argue that was never a goal to begin with, and is not how I would teach them. For example, if I'm looking for "rows where cyl == 5 and disp > 20" then I'd write:

df |> filter(cyl == 5, disp > 20)

and it would not occur to me to write this, even though they are equivalent

df |> filter(cyl == 5) |> filter(disp > 20)

In other words, my problem statement of "rows where cyl == 5 and disp > 20" is made up of two coupled conditions and I would never separate them across two filter() statements.

This also means that I don't find Kirill's idea that , is treated like a "then" very convincing. I very much read the , like an "and" that translates directly from my real-life problem statement of "rows where cyl == 5 and disp > 20".

I think a more appropriate goal of filter_out() is ease of translation from a "negated filter", which ends up resulting in this complement worldview.

As chainable equivalents

If filter() combines conditions with & and filter_out() combines conditions with |, you end up with this table:

df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))

df |> filter_out(x, y)
df |> filter_out(x | y)
df |> filter_out(when_any(x, y))

# ---

df |> filter(x | y)
df |> filter(when_any(x, y))

df |> filter_out(x & y)
df |> filter_out(when_all(x, y))

My argument is that this is actually much harder for people to learn.

  • You must remember that filter() combines with &, but filter_out() combines with |.
  • You must remember to use when_any() in filter() but when_all() in filter_out().

And this is on top of having to think about NA handling! So that's 3 different aspects you have to think about all at once (filter for vs filter out, & vs |, and when_any vs when_all). With the complement approach I'd argue there is only 1 aspect to think about - filter for vs filter out, because everything else works the same.

But most importantly, you can no longer easily translate a filter() that you mistakenly started into a filter_out(). With the above example, when you realize that this is the wrong approach:

patients |>
  filter(!(deceased & date < 2012))

then you have to translate to this filter_out(),

patients |>
  filter_out(when_all(deceased, date < 2012))

and I'd argue that is an increase in mental burden to translate to over the "just drop the !" translation of filter_out(deceased & date < 2012).

In my ideal world both when_all() and when_any() are rarely required, and this holds true with the current "treat them as complements" worldview, where only when_any() is ever needed, which is also only in the rare case of needing to combine with |. This would not be the case if filter_out() combined conditions with |, because pretty much every time you'd reach for a filter_out() with >1 conditions, you'd also need when_all(), because combining conditions with & is the more common situation.

This approach does have the "chainable equivalence" property that has been discussed, but I'd again argue that this is not a design goal and is not the way I'd encourage teaching filter() or filter_out() because, as mentioned in the previous section, when you have a problem like finding "rows where cyl == 5 and disp > 20" you would not want to split that over two filter() calls.

df |> filter(x, y)
df |> filter(x) |> filter(y)

df |> filter_out(x, y)
df |> filter_out(x) |> filter_out(y)

So why do , separated conditions combine with & at all?

Good question!

I think this is the heart of the problem. Deciding whether to combine , separated conditions with & or | is inherently ambiguous. But back in the origins of dplyr it must have been decided that combining with & was the more common case, and I do think that has held true.

I think Kirill nailed it by mentioning that in an ideal world there is only 1 expr allowed. This would force the explicit usage of either & / | or when_all() / when_any() (where there is no ambiguity about how ... combine). That would have been a pretty elegant way to solve all of this!

In fact, this is exactly how Stata's keep if and drop if work, their specification is:

keep if expr
drop if expr

and you must use explicit & and | like drop if inlist(v1,88,99) | missing(v2). No ambiguity there!

But I think limiting filter_out() to just 1 condition would do the world a disservice and would just cause more confusion about why filter() and filter_out() aren't equivalent in this regard.

Instead, I'm arguing that we should just lean into the status quo. Rather than contribute to the ambiguity of how , separated conditions should be combined by changing its meaning between filter() and filter_out(), let's just have 1 consistent rule of "combine with &", which is already ambiguous enough but has years of muscle memory built up for most filter() users.

@t-kalinowski

In ordinary English, when we talk about removing things, “X and Y” is almost always understood as “anything that is X or Y,” i.e. a union of categories to exclude, not a logical “and” inside a single condition.

Examples:

  • “Filter out spam and promotional emails from my inbox.”
    → Remove any email that is spam or promotional.

  • “Filter out missing values and zeros before plotting.”
    → Remove any row that is missing or zero.

And without “filter” language at all:

  • “Exclude France and Germany from the analysis.”
    → Drop any row where the country is France or Germany.

  • “Ignore students who failed and students who dropped the course.”
    → Ignore a student if they failed or dropped.

  • “Remove late submissions and plagiarized submissions.”
    → Remove a submission if it is late or plagiarized.

In all these cases “X and Y” is just a list of things to get rid of: “get rid of X, and also get rid of Y,” which is logically “X or Y” on the exclusion side.

If you also think of filter_out() as “the complement of filter(),” basic logic points the same way: the complement of “keep A and B” is “drop A or B” (De Morgan’s law: !(A & B) ≡ (!A | !B)). So both everyday English and the usual “complement of filter” story support reading “filter out X and Y” as “filter out X or Y.”

@DavisVaughan
Member Author

@t-kalinowski that is a good example to think about, but I do not think it is as compelling as you think it is because it is the same for keeping things.

Examples:

  • “Filter out spam and promotional emails from my inbox.”
    → Remove any email that is spam or promotional.

  • "Filter for spam and promotional emails in my inbox"
    → Return any email that is spam or promotional.

  • “Filter out missing values and zeros before plotting.”
    → Remove any row that is missing or zero.

  • “Filter for missing values and zeros for analysis.”
    → Return any row that is missing or zero.

  • “Exclude France and Germany from the analysis.”
    → Drop any row where the country is France or Germany.

  • “Retain France and Germany in the analysis.”
    → Return any row where the country is France or Germany.

So IMO this cannot be used as an argument for "filter out combines with |", because you could just as easily have made the same argument about filter().

These examples are all somewhat interesting because they only involve a single variable, and the way they have been written actually translates to a single %in% statement.

  • “Filter out spam and promotional emails from my inbox.”
    emails %in% c("spam", "promotional")

  • “Filter out missing values and zeros before plotting.”
    values %in% c(NA, 0)

  • “Exclude France and Germany from the analysis.”
    country %in% c("France", "Germany")

Something about the way the English and works means that it can feel like union in some scenarios and intersection in others, and that makes this very hard! Because these all involve a single variable that can't be in two states at once (a value can't be both missing and zero at the same time), I think we are accustomed to thinking of this as a union instead.

You'd have to say it like this to mean intersection, and written this way it kind of implies you have separate spam and promotional logical columns in your dataset.

  • “Filter for emails that are both spam and promotional.”
    spam & promotional i.e. filter(emails, spam, promotional)

  • “Filter out emails that are both spam and promotional.”
    spam & promotional i.e. filter_out(emails, spam, promotional)


I also do not think we agree on what a complement means?

the complement of “keep A and B” is “drop A or B”

I don't think so? If you're filling in a Venn diagram and you start with

"keep where both A and B are TRUE" (what filter() does)

then that is where the A and B circles overlap. To get the complement, shade every part of the diagram except where A and B overlap, and that gives you

"keep where only A is TRUE, or only B is TRUE, or where neither A nor B are TRUE"

If you stare at that for a bit, you see that an equivalent way to say this is to drop the part where A and B overlap, which is:

"drop where both A and B are TRUE" (what filter_out() does)

I think the confusion can come in if you try to perform the complement in your head at the same time that you switch verbs from "keep" to "drop". This confused me numerous times while writing the tidyup until I wrote the Venn diagram down on paper.

Regardless, that means that "drop where either A or B are TRUE" is definitely not the complement of "keep where both A and B are TRUE".
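The Venn diagram argument can be enumerated in a few lines of base R. This is only a sketch over the four TRUE / FALSE combinations of A and B (no missing values). The row labels "1" to "4" are simply the automatic row names from expand.grid(), and the variable names are illustrative.

```r
# Rows: 1 = (A, B both TRUE), 2 = (only B), 3 = (only A), 4 = (neither)
cases <- expand.grid(A = c(TRUE, FALSE), B = c(TRUE, FALSE))

both <- cases$A & cases$B
either <- cases$A | cases$B

keep_both <- rownames(cases)[both]       # "keep where both A and B are TRUE"
drop_both <- rownames(cases)[!both]      # rows left after "drop where both are TRUE"
drop_either <- rownames(cases)[!either]  # rows left after "drop where either is TRUE"

# keep_both and drop_both together cover every case exactly once
is_complement <- setequal(c(keep_both, drop_both), rownames(cases))

# keep_both and drop_either together miss the "only A" and "only B" cases
missed <- setdiff(rownames(cases), c(keep_both, drop_either))
```

Here `missed` holds the two cases where exactly one of A and B is TRUE, which is why dropping on "A or B" does not give the complement of keeping on "A and B".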

@DavisVaughan
Member Author

DavisVaughan commented Nov 21, 2025

Allllll of this discussion really leads us back to a single key point:

For both filter() and filter_out(), whether the , separated conditions should combine with & or | is highly ambiguous and situation dependent

That's it. That's the whole confusion right there.

I think we have in our heads that filter() combining , separated conditions with & is somehow "right", but I don't think it is in 100% of cases. As mentioned above, if you wrote "filter for spam and promotional emails in my inbox" as df |> filter(email == "spam", email == "promotional") then you'd be disappointed. We've certainly had requests over the years for "filter combining , with |", which is how we ended up with when_any() in this PR.

Here's an "ideal world" thought. Take a page from Zen of Python and adopt the rule of "In the face of ambiguity, refuse the temptation to guess". Remove the ambiguity altogether by doing what Stata does and what Kirill said before - limit to only 1 expression.

retain(data, when, ..., by = NULL)     
exclude(data, when, ..., by = NULL)

when_all(...)
when_any(...)

if_all(cols, fn)
if_any(cols, fn)

I think everyone wins here.

  • retain() and exclude() are complements
  • exclude() is a great new way to drop rows
  • The ambiguity about how , separated conditions combine disappears - there's only 1 condition!
  • when_all() and when_any() let you chain multiple conditions combined by a , - but their name tells you if it's & or |, so it isn't ambiguous. These can greatly improve readability though, so are very useful.
  • The ambiguity about the name filter() itself disappears (is it filter for or filter out?)

So you can write things like

cars |> retain(class == "suv" & mpg < 15)
cars |> retain(when_all(
  class == "suv", 
  mpg < 15
))

cars |> retain(class == "suv" | mpg < 15)
cars |> retain(when_any(
  class == "suv", 
  mpg < 15
))

cars |> exclude(class == "suv" & mpg < 15)
cars |> exclude(when_all(
  class == "suv", 
  mpg < 15
))

cars |> exclude(class == "suv" | mpg < 15)
cars |> exclude(when_any(
  class == "suv", 
  mpg < 15
))

And still use if_any() like this too

cars |> exclude(if_any(c(x, y, z), is.na))

I think this is...beautiful?

It has a very nice symmetry to it, and all of the ambiguity we've been confused over has disappeared.

The main issue with it is that introducing a new name for filter() can cause a fracture in the R community. It still pretty much has all the problems outlined in the tidyup here https://github.com/tidyverse/tidyups/blob/feature/008/008-dplyr-filter-family.md#alternate-names-for-filter.

@t-kalinowski

Thanks @DavisVaughan! You convinced me that , (if we keep it) should reduce with & and not |.

@DavisVaughan
Member Author

DavisVaughan commented Nov 21, 2025

At this point I really see two options

filter(.data, ...) and filter_out(.data, ...)

filter(.data, ..., .by = NULL)
filter_out(.data, ..., .by = NULL)

This is the current proposal.

The ... are still combined with & for both. I still feel quite strongly that this is correct, as argued for above.

Pros:

  • We don't move away from filter()
    • Every single SO post, blog, and tutorial that mentions filter() is still completely valid
    • Teachers don't necessarily have to update their notes
    • We don't fracture the R community between filter() users and retain() users
    • Students that learn retain() would still have to learn filter() eventually when they discover legacy code or older tutorials
  • filter() acts like the "core verb", and filter_out() is a variant
    • Just like how slice() is the "core verb" and slice_*() are variants
    • The front page of dplyr doesn't change. It just mentions filter() as one of the 5 core verbs
  • It is very straightforward to translate from a complex negated filter() to a much simpler filter_out(). Less so with translating a filter() to an exclude(when_all()), or to a filter_out(when_all()) if it combined with |.
  • Some people really like that you can supply multiple conditions to filter() and filter_out() via ... and have them combined with &. If you've already got that mental model, forcing you to change to use retain(when_all()) feels like needless overhead at this point in dplyr's lifecycle.
  • With this approach you never need when_all() (it already does that), and rarely need when_any() (when you really do want to combine multiple conditions with |), so there is not much new to learn. With the 1 condition approach of retain() and exclude(), you'll be using both a lot.

retain(data, when) and exclude(data, when)

retain(data, when, ..., by = NULL)
exclude(data, when, ..., by = NULL)

Pros:

  • filter()'s ambiguity (is it filter for or filter out?) is gone due to having clearer names.
  • The ambiguity about how , separated conditions are combined is gone. There is only 1 condition.
  • When you do need to combine multiple conditions together, explicit usage of either & / | or when_all() / when_any() mean you'll never be confused about how they are reduced.

Summary

IMO I still prefer the current proposal of filter() and filter_out(). It still feels like it would be too disruptive to the whole community to move away from filter(). The newly added benefits of limiting retain() and exclude() to just 1 when condition still don't outweigh this in my mind.

In a perfect world if we restarted dplyr, I'd strongly consider retain(data, when) and exclude(data, when) as the "best" approach because there is no ambiguity about them, but sadly we don't live in that world 😄

@joeycouse
Copy link

After the further discussion, I don't think there is a viable way to create filter_out() in a way that is consistent with two opposing views (combining with & and composability via pipe).

I think one unintuitive result of combining filter_out() with & is that as a user adding more conditions to filter_out() actually decreases the numbers of cases you remove.

# Removes a row only if hp > 10 & cyl > 4
# Would remove FEWER cases

mtcars |>
  filter_out(hp > 10, cyl > 4)

# Would remove more cases

mtcars |>
  filter_out(hp > 10)

Maybe this is just me, but my intuition would tell me the first call removes more rows, when in fact the opposite is true. IMO @DavisVaughan's proposal for retain() and exclude() is compelling as it solves two key issues:

  1. Ambiguous naming
  2. Combining via & or |

I think the net benefits of solving both of these issues is worth the cost of potentially superseding filter(), and results in a significant improvement in usability and clarity that warrants the addition of these new foundational verbs. One question that remains is how a combination of when_any() and when_all() would work in the same function call. This would probably be most relevant in the case of missing values and zeros, e.g.

# Should we do this?
df |>
  exclude(when_all(when_any(is.na(x), x == 0), y > 5))

# or this?
df |>
  exclude(when_any(is.na(x), x == 0) & y > 5)

# Or that
df |>
  exclude((is.na(x) | x == 0) & y > 5)


I guess I don't see a compelling reason why when_any() and when_all() need to exist if users can just use & and |, which would result in more readable code. To me the existence of when_any() and when_all() implies that these functions should be preferred over & and |, but using them decreases the left-to-right readability of the code.

@DavisVaughan
Member Author

adding more conditions to filter_out() actually decreases the numbers of cases you remove.

That doesn't feel unintuitive to me. You are tightening the bounds on what to drop. Importantly, it works the same way as filter(). Adding more conditions == keeping fewer rows. So there is nothing new to learn.
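A rough base R illustration of this point using the built-in mtcars data, with the same conditions as the example being discussed. It counts matching rows with plain logical vectors instead of calling filter() or filter_out(), and assumes that , separated conditions combine with & in both verbs. The variable names are illustrative.

```r
one_condition <- mtcars$hp > 10
two_conditions <- mtcars$hp > 10 & mtcars$cyl > 4

# Rows a filter()-style call would keep
kept_one <- sum(one_condition)
kept_two <- sum(two_conditions)

# Rows a filter_out()-style call would drop: the very same matches
dropped_one <- sum(one_condition)
dropped_two <- sum(two_conditions)

# Adding a condition can only shrink (or preserve) the set of matching rows,
# whether those matches are then kept or dropped
tightened <- kept_two <= kept_one && dropped_two <= dropped_one
```

The set of matching rows is identical for both verbs; only the action taken on the matches differs.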


the net benefits of solving both of these issues is worth the cost of potentially superseding filter()

As mentioned in #30 (comment) (sent at roughly the same time as your message), I came to the opposite conclusion


when_any() and when_all() are useful when they decrease the number of parentheses required, which almost always increases readability.

countries |>
  filter(
    (name %in% c("US", "CA") & between(score, 200, 300)) |
      (name %in% c("PR", "RU") & between(score, 100, 200)) |
      (name %in% c("JP", "CH") & between(score, 400, 600))
  )

# VS

countries |>
  filter(when_any(
    name %in% c("US", "CA") & between(score, 200, 300),
    name %in% c("PR", "RU") & between(score, 100, 200),
    name %in% c("JP", "CH") & between(score, 400, 600)
  ))

They are also faster than repeated & or |.

They also provide a useful na_rm argument if you want to do something like when_all(x, y, na_rm = TRUE) where you want NA & TRUE to result in TRUE.

Outside of the context of filter() and filter_out(), they are useful for the same reasons that pmin() and pmax() are useful. https://github.com/tidyverse/tidyups/blob/feature/008/008-dplyr-filter-family.md#when_all

I'd write your example like this

df |>
  exclude(when_all(
    x == 0 | is.na(x), 
    y > 5
  ))

df |>
  exclude(when_all(
    x %in% c(0, NA), 
    y > 5
  ))

I don't think there is anything wrong with using when_all() and | together, and I think they can still improve readability when used together.
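The two spellings of the first condition above give the same logical vector in base R, because %in% is built on match(), which treats NA as matching NA. A quick sketch with a made-up vector:

```r
x <- c(0, 1, NA, 5)

explicit <- x == 0 | is.na(x)  # NA | TRUE is TRUE, so the NA element counts as a hit
compact <- x %in% c(0, NA)     # %in% never returns NA

same_result <- identical(explicit, compact)
```

Note that `x == 0` on its own would return NA for the missing element; it is the `| is.na(x)` part that turns it into TRUE.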

@krlmlr
Member

krlmlr commented Nov 24, 2025

I see appeal in the union_all() symmetry stated above:

# Equivalent up to row ordering
union(filter(df, x, y), filter_out(df, x, y)) ~= df

Does this still hold in the presence of missing values in the predicates? If not, can we create a similar invariant using when_any() or when_all() that also works in the presence of missings? If there is such an invariant, should that be the default setting in filter_out() then, so that the original claim holds? And if we do that, will filter_out() still be as useful, or is the (alleged) built-in asymmetry actually a good thing in your view?

I had to look up and manually test the semantics of na.rm for pmin(). They seem to align with my understanding of the proposed semantics of the na_rm argument to when_any() and when_all(). I wonder if an option like na = c("propagate", "ignore") would be clearer; the docs could mention that na = "propagate" means na.rm = FALSE, we're renaming the argument anyway.

It looks like any solution that we come up with here will be a tradeoff. Realistically, the only way a larger user base can play with it is by sending an experimental version to CRAN. I assume the new functions will be tagged "experimental", with some opportunity to adapt as needed?

@DavisVaughan
Member Author

If you'd like to try these yourself, I've pushed a WIP to

pak::pak("tidyverse/dplyr@feature/filter-out-2")

Just so we are all talking about the same thing, here's the output table for when_any() and when_all() with its options set

library(dplyr)

df <- tibble(
  x = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA),
  y = c(TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA)
)

df |>
  mutate(
    any_propagate = when_any(x, y, na_rm = FALSE),
    any_remove = when_any(x, y, na_rm = TRUE),
    all_propagate = when_all(x, y, na_rm = FALSE),
    all_remove = when_all(x, y, na_rm = TRUE)
  )
#> # A tibble: 9 × 6
#>   x     y     any_propagate any_remove all_propagate all_remove
#>   <lgl> <lgl> <lgl>         <lgl>      <lgl>         <lgl>     
#> 1 TRUE  TRUE  TRUE          TRUE       TRUE          TRUE      
#> 2 TRUE  FALSE TRUE          TRUE       FALSE         FALSE     
#> 3 TRUE  NA    TRUE          TRUE       NA            TRUE      
#> 4 FALSE TRUE  TRUE          TRUE       FALSE         FALSE     
#> 5 FALSE FALSE FALSE         FALSE      FALSE         FALSE     
#> 6 FALSE NA    NA            FALSE      FALSE         FALSE     
#> 7 NA    TRUE  TRUE          TRUE       NA            TRUE      
#> 8 NA    FALSE NA            FALSE      FALSE         FALSE     
#> 9 NA    NA    NA            FALSE      NA            TRUE

Regarding option renaming, we already have dplyr::last(na_rm =) / dplyr::first(na_rm =) / dplyr::slice_min(na_rm =) / dplyr::slice_max(na_rm =) so I liked that we had a decent bit of precedent for it and it felt connected to pmin(na.rm =) / pmax(na.rm =). Ultimately I'd guess that any alternate name we pick here would resonate less well with users than the strong precedent for an na.rm style argument.


Does this still hold in the presence of missing values in the predicates?

I'm assuming you are questioning whether the union claim holds in filter() and filter_out() themselves when when_any() and when_all() are involved in the conditions. The answer to that is yes. Regardless of what conditions are supplied in ..., the claim holds. when_any() and when_all() aren't special in that regard.

The important part of when_any() and when_all() to me is that they both propagate NAs by default, just like pmin() and pmax() (and just like what base::pany() and base::pall() would do if they ever implemented them). This makes when_all(x, y) equivalent to x & y, which is important to me. And then when_all(x, y, na_rm = TRUE) becomes a version of x & y that also "ignores" NA values entirely, so that NA & TRUE is still TRUE, which is a neat addition.
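For readers without the WIP branch installed, the four columns of the table above can be reproduced in base R. This is only a sketch of the semantics shown in that table, not dplyr's implementation. `fill_na()` is a hypothetical helper introduced here for illustration.

```r
x <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA)
y <- c(TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA)

# Hypothetical helper: replace NA with a neutral value before combining
fill_na <- function(v, value) {
  v[is.na(v)] <- value
  v
}

# Propagating NA is what & and | already do
all_propagate <- x & y
any_propagate <- x | y

# "Removing" NA behaves like substituting the neutral element of the operator
all_remove <- fill_na(x, TRUE) & fill_na(y, TRUE)    # TRUE is neutral for &
any_remove <- fill_na(x, FALSE) | fill_na(y, FALSE)  # FALSE is neutral for |
```

Read this way, na_rm = TRUE amounts to ignoring the missing inputs, so a row where every input is NA falls back to TRUE for "all" and FALSE for "any", as in the last row of the table.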


It looks like any solution that we come up with here will be a tradeoff

Yea, totally

the only way a larger user base can play with it is by sending an experimental version to CRAN

Yea, we will mark it as experimental for a release or so. Note that we have already gotten lots of great feedback about this idea via bluesky and linkedin.

@DavisVaughan
Member Author

Here's another example I think will be quite common.

Interactively I will probably want to confirm that I'm about to drop the rows I think I'm going to drop, so interactively I'll do

df |> filter(these_rows, those_rows)

then I'll stare at the output and make sure those are the rows I want to drop. Once I'm happy with that, all I currently have to do is change to

df |> filter_out(these_rows, those_rows)

and boom now those rows are dropped. That's pretty nice! That doesn't hold if we combine with | in filter_out(), which would be a big blow to its usefulness.

(I'm trying to document examples like these here, as they will eventually make their way into the tidyup under a new section of something like ## Why combine with `&` over `|`? )

@krlmlr
Member

krlmlr commented Nov 26, 2025

I was surprised to see, with the PR:

pkgload::load_all()
#> ℹ Loading dplyr

df <- tidyr::expand_grid(a = c(TRUE, FALSE, NA), b = c(TRUE, FALSE, NA))

df |>
  filter(a, b)
#> # A tibble: 1 × 2
#>   a     b    
#>   <lgl> <lgl>
#> 1 TRUE  TRUE

df |>
  filter_out(a, b)
#> # A tibble: 8 × 2
#>   a     b    
#>   <lgl> <lgl>
#> 1 TRUE  FALSE
#> 2 TRUE  NA   
#> 3 FALSE TRUE 
#> 4 FALSE FALSE
#> 5 FALSE NA   
#> 6 NA    TRUE 
#> 7 NA    FALSE
#> 8 NA    NA

Created on 2025-11-26 with reprex v2.1.1

This means that filter_out() is a true complement of filter(). For some reason I expected filter_out() to remove a row if one of the predicates evaluates as "missing", just like filter() removes if any predicate evaluates as missing. My conclusion here is that both filter() and filter_out() only act on predicates that return TRUE. The logic is the same, just the action changes (retain vs. exclude).

It makes much more sense now, thanks for your patience!

@DavisVaughan DavisVaughan merged commit c8c6d4b into main Dec 15, 2025
@DavisVaughan DavisVaughan deleted the feature/008 branch December 15, 2025 19:59