Tidyup 8 - Expanding the filter() family
#30
jennybc
left a comment
I like the proposal! Made a few comments as I reacted to a first reading.
filter() family
|
tidyups/008-dplyr-filter-family.md Line 948 in 5b76b43
FWIW, as a user I would much prefer the name Love the idea for this API btw! |
|
@wurli most of us felt that We also really appreciated how it feels like a "variant" of With |
|
With |
|
Love this implementation, I do think the |
|
Awesome proposal! My 2 cents - I think filter/filter_out is slightly unclear naming. I think filter_keep/filter_drop would be better with filter deprecated |
|
@davidhodge931 as stated in the tidyup at https://github.com/tidyverse/tidyups/blob/feature/008/008-dplyr-filter-family.md#alternate-names-for-filter, we are not considering renaming |
|
Love the idea! How do you teach that
# Sequence with filter()
. |>
filter(x) |>
filter(y)
# Same as conjunction
. |>
filter(x, y)
# Sequence with filter_out()
. |>
filter_out(x) |>
filter_out(y)
# Same as alternation (!?!)
. |>
filter_out(x | y) |
|
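The chaining identities in the comment above can be checked without the proposed API. This is a minimal base R sketch, not dplyr's implementation: `keep_rows()` and `drop_rows()` are hypothetical stand-ins for `filter()` and `filter_out()`, assuming that `filter()` keeps rows where the condition is `TRUE` and `filter_out()` drops rows where it is `TRUE` (so `NA` rows are dropped by the former and kept by the latter).

```r
# Hypothetical stand-ins for filter() / filter_out(); assumptions, not dplyr code.
keep_rows <- function(df, cond) {
  hit <- eval(substitute(cond), df, parent.frame())
  df[hit %in% TRUE, , drop = FALSE]
}
drop_rows <- function(df, cond) {
  hit <- eval(substitute(cond), df, parent.frame())
  df[!(hit %in% TRUE), , drop = FALSE]
}

# All nine combinations of TRUE / FALSE / NA
df <- data.frame(
  x = rep(c(TRUE, FALSE, NA), each = 3),
  y = rep(c(TRUE, FALSE, NA), times = 3)
)

# A sequence of keeps matches a single keep combined with `&`
seq_keep  <- keep_rows(keep_rows(df, x), y)
conj_keep <- keep_rows(df, x & y)

# A sequence of drops matches a single drop combined with `|`
seq_drop  <- drop_rows(drop_rows(df, x), y)
disj_drop <- drop_rows(df, x | y)

# A single drop combined with `&` removes far fewer rows
conj_drop <- drop_rows(df, x & y)

identical(rownames(seq_keep), rownames(conj_keep))
identical(rownames(seq_drop), rownames(disj_drop))
nrow(conj_drop)
```

Under these assumptions the sequence of drops lines up with `|`, while combining with `&` inside one drop removes only the row where both conditions are `TRUE`.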
I think the best way to teach this is probably something like:
# Combining with `&`
df |> filter(x, y)
df |> filter_out(x, y)
# Combining with `|`
df |> filter(when_any(x, y))
df |> filter_out(when_any(x, y))
I think the fact that |
|
To me, the antisymmetry is not only theoretically pleasing. I'm reading
I'd never read it like:
Even stronger with
. |>
filter_out(
x,
y
)
To me, the |
|
Completely agree with @krlmlr here. I think this is a critical function of the api that makes learning the syntax much easier, especially for beginners. I would expect |
|
There are two competing worldviews at play here.
Both of these have their pros and cons. My theory is that the first of these is the most practically useful for dplyr users and is the easiest to learn.
As complements
If both
df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))
df |> filter_out(x, y)
df |> filter_out(x & y)
df |> filter_out(when_all(x, y))
# ---
df |> filter_out(x | y)
df |> filter_out(when_any(x, y))
df |> filter(x | y)
df |> filter(when_any(x, y))
Notice how everything above the line related to
I'd argue that an extremely important property of this table is that you only have to learn 1 rule - that
As a nice side effect this means you only need to worry about
This all means that if you are translating from a
patients <- tibble::tibble(
name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
date = c(2005, 2010, NA, 2020, 2010, NA, NA)
)
patients
With years of
patients |>
  filter(!(deceased & date < 2012))
But immediately get frustrated when it drops your
patients |>
  filter_out(deceased & date < 2012)
And boom that works as expected. And since there is only 1 rule that applies for both
patients |>
  filter_out(deceased, date < 2012)
You also get this nice result, i.e. they are complements of one another
# Equivalent up to row ordering
union(filter(df, x, y), filter_out(df, x, y)) ~= df
It is true that you can't break
df |> filter(x, y)
df |> filter(x & y)
df |> filter(x) |> filter(y)
df |> filter_out(x | y)
df |> filter_out(x) |> filter_out(y)
But I'd argue that was never a goal to begin with, and is not how I would teach them. For example, if I'm looking for "rows where
df |> filter(cyl == 5, disp > 20)
and it would not occur to me to write this, even though they are equivalent
df |> filter(cyl == 5) |> filter(disp > 20)
In other words, my problem statement of "rows where
This also means that I don't find Kirill's idea that
I think a more appropriate goal of
As chainable equivalents
If
df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))
df |> filter_out(x, y)
df |> filter_out(x | y)
df |> filter_out(when_any(x, y))
# ---
df |> filter(x | y)
df |> filter(when_any(x, y))
df |> filter_out(x & y)
df |> filter_out(when_all(x, y))
My argument is that this is actually much harder for people to learn.
And this is on top of having to think about
But most importantly, you can no longer easily translate a
patients |>
  filter(!(deceased & date < 2012))
then you have to translate to this
patients |>
  filter_out(when_all(deceased, date < 2012))
and I'd argue that is an increase in mental burden to translate to over the "just drop the
In my ideal world both
This approach does have this "chainable equivalence" property that has been discussed, but I'd again argue that this is not a design goal, and is not the way I'd encourage teaching
df |> filter(x, y)
df |> filter(x) |> filter(y)
df |> filter_out(x, y)
df |> filter_out(x) |> filter_out(y)
So why do
|
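The `patients` example from the comment above can be reproduced in base R to show where the missing values go. This is only a sketch under an assumption: keeping rows is modelled as "condition is `TRUE`" and dropping rows as "condition is not `TRUE`", which is how the thread describes `filter()` and `filter_out()`. The variable names are illustrative.

```r
patients <- data.frame(
  name     = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
  deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
  date     = c(2005, 2010, NA, 2020, 2010, NA, NA)
)

# The condition describing the rows we want to get rid of
cond <- with(patients, deceased & date < 2012)

# Modelled on filter(!(cond)): keep rows where the negation is TRUE.
# Rows where cond is NA are lost as well.
negated_keep <- patients[(!cond) %in% TRUE, ]

# Modelled on filter_out(cond): drop only rows where cond is TRUE.
dropped_out <- patients[!(cond %in% TRUE), ]

# Modelled on filter(cond): keep only rows where cond is TRUE.
kept <- patients[cond %in% TRUE, ]

negated_keep$name
dropped_out$name

# Keeping and dropping on the same condition partition the data
nrow(kept) + nrow(dropped_out) == nrow(patients)
```

Negating the condition keeps three patients, while dropping on the condition removes only the one row where it is `TRUE` and leaves the rows with missing values in place.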
|
In ordinary English, when we talk about removing things, “X and Y” is almost always understood as “anything that is X or Y,” i.e. a union of categories to exclude, not a logical “and” inside a single condition. Examples:
And without “filter” language at all:
In all these cases “X and Y” is just a list of things to get rid of: “get rid of X, and also get rid of Y,” which is logically “X or Y” on the exclusion side. If you also think of |
|
@t-kalinowski that is a good example to think about, but I do not think it is as compelling as you think it is because it is the same for keeping things. Examples:
So IMO this cannot be used as an argument for "filter out combines with
These examples are all somewhat interesting because they only involve a single variable, and the way they have been written actually translates to a single
Something about the way the English
You'd have to say it like this to mean intersection, and written this way it kind of implies you have separate
I also do not think we agree on what a complement means?
I don't think so? If you're filling in a set venn diagram and you start with
then that is where the A and B circles overlap. To get the complement, shade every part of the diagram except where A and B overlap, and that gives you
If you stare at that for a bit, you see that an equivalent way to say this is to drop the part where A and B overlap, which is:
I think the confusion can come in if you try and perform the complement in your head at the same time that you switch verbs from "keep" to "drop". This confused me numerous times while writing the tidyup until I wrote the venn diagram down on paper. Regardless, that means that "drop where either A or B are TRUE" is definitely not the complement of "keep where both A and B are TRUE". |
|
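The Venn diagram argument above can be written out on the four `TRUE`/`FALSE` combinations. This is a small base R sketch with illustrative names, leaving missing values aside.

```r
# The four regions of a two-set Venn diagram
venn <- data.frame(
  A = c(TRUE, TRUE, FALSE, FALSE),
  B = c(TRUE, FALSE, TRUE, FALSE)
)

# "Keep where both A and B are TRUE": the overlap
keep_both <- with(venn, A & B)

# Its complement is everything except the overlap,
# i.e. what is left after dropping where both are TRUE
complement_of_keep <- !keep_both

# What is left after "drop where either A or B is TRUE"
left_after_drop_either <- with(venn, !(A | B))

which(keep_both)
which(complement_of_keep)
which(left_after_drop_either)

# De Morgan: not (A and B) equals (not A) or (not B)
identical(complement_of_keep, with(venn, !A | !B))
```

Dropping where either is `TRUE` leaves only the region outside both circles, which is smaller than the complement of the overlap.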
Allllll of this discussion really leads us back to a single key point: For both
That's it. That's the whole confusion right there. I think we have in our heads that
Here's an "ideal world" thought. Take a page from Zen of Python and adopt the rule of "In the face of ambiguity, refuse the temptation to guess". Remove the ambiguity altogether by doing what Stata does and what Kirill said before - limit to only 1 expression.
retain(data, when, ..., by = NULL)
exclude(data, when, ..., by = NULL)
when_all(...)
when_any(...)
if_all(cols, fn)
if_any(cols, fn)
I think everyone wins here.
So you can write things like
cars |> retain(class == "suv" & mpg < 15)
cars |> retain(when_all(
class == "suv",
mpg < 15
))
cars |> retain(class == "suv" | mpg < 15)
cars |> retain(when_any(
class == "suv",
mpg < 15
))
cars |> exclude(class == "suv" & mpg < 15)
cars |> exclude(when_all(
class == "suv",
mpg < 15
))
cars |> exclude(class == "suv" | mpg < 15)
cars |> exclude(when_any(
class == "suv",
mpg < 15
))
And still use
cars |> exclude(if_any(c(x, y, z), is.na))
I think this is...beautiful? It has a very nice symmetry to it, and all of the ambiguity we've been confused over has disappeared. The main issue with it is that introducing a new name for |
|
Thanks @DavisVaughan! You convinced me that |
|
At this point I really see two options
|
|
After the further discussion, I don't think there is a viable way to create
I think one unintuitive result of combining
Maybe this is just me, but my intuition would tell me the first call removes more rows, but the opposite is true. IMO @DavisVaughan's proposal for
I think the net benefits of solving both of these issues are worth the cost of potentially superseding
I guess I don't see a compelling reason why |
That doesn't feel unintuitive to me. You are tightening the bounds on what to drop. Importantly, it works the same way as
As mentioned in #30 (comment) (sent at roughly the same time as your message), I came to the opposite conclusion
countries |>
filter(
(name %in% c("US", "CA") & between(score, 200, 300)) |
(name %in% c("PR", "RU") & between(score, 100, 200)) |
(name %in% c("JP", "CH") & between(score, 400, 600))
)
# VS
countries |>
filter(when_any(
name %in% c("US", "CA") & between(score, 200, 300),
name %in% c("PR", "RU") & between(score, 100, 200),
name %in% c("JP", "CH") & between(score, 400, 600)
))
They are also faster than repeated
They also provide a useful
Outside of the context of
I'd write your example like this
df |>
exclude(when_all(
x == 0 | is.na(x),
y > 5
))
df |>
exclude(when_all(
x %in% c(0, NA),
y > 5
))
I don't think there is anything wrong with using |
|
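The two spellings of the first condition in the comment above, `x == 0 | is.na(x)` and `x %in% c(0, NA)`, select the same elements, because `%in%` matches `NA` against `NA` and never returns `NA` itself. A quick base R check on an illustrative vector:

```r
x <- c(0, 1, NA, 0, 2)

# Explicit form: equality test plus a separate missing-value test
explicit <- x == 0 | is.na(x)

# Compact form: %in% treats NA as a matchable value
compact <- x %in% c(0, NA)

explicit
compact
identical(explicit, compact)

# For comparison, the bare equality test propagates the NA
x == 0
```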
I see appeal in the
Does this still hold in the presence of missing values in the predicates? If not, can we create a similar invariant using
I had to look up and manually test the semantics of
It looks like any solution that we come up with here will be a tradeoff. Realistically, the only way a larger user base can play with it is by sending an experimental version to CRAN. I assume the new functions will be tagged "experimental", with some opportunity to adapt as needed? |
|
If you'd like to try these yourself, I've pushed a WIP to
pak::pak("tidyverse/dplyr@feature/filter-out-2")
Just so we are all talking about the same thing, here's the output table for
library(dplyr)
df <- tibble(
x = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA),
y = c(TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA)
)
df |>
mutate(
any_propagate = when_any(x, y, na_rm = FALSE),
any_remove = when_any(x, y, na_rm = TRUE),
all_propagate = when_all(x, y, na_rm = FALSE),
all_remove = when_all(x, y, na_rm = TRUE)
)
#> # A tibble: 9 × 6
#> x y any_propagate any_remove all_propagate all_remove
#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE TRUE TRUE TRUE TRUE TRUE
#> 2 TRUE FALSE TRUE TRUE FALSE FALSE
#> 3 TRUE NA TRUE TRUE NA TRUE
#> 4 FALSE TRUE TRUE TRUE FALSE FALSE
#> 5 FALSE FALSE FALSE FALSE FALSE FALSE
#> 6 FALSE NA NA FALSE FALSE FALSE
#> 7 NA TRUE TRUE TRUE NA TRUE
#> 8 NA FALSE NA FALSE FALSE FALSE
#> 9 NA    NA    NA            FALSE      NA            TRUE
Regarding option renaming, we already have
I'm assuming you are questioning whether the union claim holds in
The important part of
Yea, totally
Yea, we will mark it as experimental for a release or so. Note that we have already gotten lots of great feedback about this idea via bluesky and linkedin. |
|
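For readers without the WIP branch installed, the two-input table in the comment above can be reproduced in base R. `any_sketch()` and `all_sketch()` are hypothetical helpers written only to mirror that table; they assume that removing missing values means treating `NA` as `FALSE` for "any" and as `TRUE` for "all", and they say nothing about how the real `when_any()` and `when_all()` are implemented.

```r
# Hypothetical two-input emulations; not the dplyr functions.
any_sketch <- function(x, y, na_rm = FALSE) {
  if (na_rm) {
    x[is.na(x)] <- FALSE
    y[is.na(y)] <- FALSE
  }
  x | y
}

all_sketch <- function(x, y, na_rm = FALSE) {
  if (na_rm) {
    x[is.na(x)] <- TRUE
    y[is.na(y)] <- TRUE
  }
  x & y
}

x <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA)
y <- c(TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA)

truth <- data.frame(
  x = x,
  y = y,
  any_propagate = any_sketch(x, y, na_rm = FALSE),
  any_remove    = any_sketch(x, y, na_rm = TRUE),
  all_propagate = all_sketch(x, y, na_rm = FALSE),
  all_remove    = all_sketch(x, y, na_rm = TRUE)
)
truth
```

The resulting columns can be compared row by row with the printed tibble.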
Here's another example I think will be quite common. Interactively I will probably want to confirm that I'm about to drop the rows I think I'm going to drop, so I'll do
df |> filter(these_rows, those_rows)
then I'll stare at the output and make sure those are the rows I want to drop. Once I'm happy with that, all I currently have to do is change to
df |> filter_out(these_rows, those_rows)
and boom now those rows are dropped. That's pretty nice! That doesn't hold if we combine with
(I'm trying to document examples like these here, as they will eventually make their way into the tidyup under a new section of something like |
|
I was surprised to see, with the PR:
pkgload::load_all()
#> ℹ Loading dplyr
df <- tidyr::expand_grid(a = c(TRUE, FALSE, NA), b = c(TRUE, FALSE, NA))
df |>
filter(a, b)
#> # A tibble: 1 × 2
#> a b
#> <lgl> <lgl>
#> 1 TRUE TRUE
df |>
filter_out(a, b)
#> # A tibble: 8 × 2
#> a b
#> <lgl> <lgl>
#> 1 TRUE FALSE
#> 2 TRUE NA
#> 3 FALSE TRUE
#> 4 FALSE FALSE
#> 5 FALSE NA
#> 6 NA TRUE
#> 7 NA FALSE
#> 8 NA    NA
Created on 2025-11-26 with reprex v2.1.1
This means that
It makes much more sense now, thanks for your patience! |
Readable link
Most relevant issues
filter(.missing = ) option to optionally retain missing values dplyr#6560
filter(.missing = NULL, .how = c("keep", "drop")) dplyr#6891
We are open to feedback until Monday, November 24th.