@Lextuga007
Created December 2, 2020 09:47
How-to-remove-similar-kind-of-data-from-a-csv-file
The post was removed from RStudio Community: https://community.rstudio.com/t/how-to-remove-similar-kind-of-data-from-a-csv-file/89706
I would create a lookup table that lists every category, add a flag column to it, and then use that flag to filter the data. Keeping the unique categories in a separate table lets you check your assumptions, and it may highlight data quality issues with these names in the original data.
```{r reprex}
library(dplyr)
library(stringr)
# Create reprex
df <- tibble::tribble(
  ~start, ~end, ~weight,
  "A1", "A2", 1,
  "A2", "A5", 0.5,
  "Falcarinone", "A5", 1,
  "Leucodelphinidin", "A10", 1,
  "(+)-1(10),4-Cadinadiene", "Falcarinone", 0.09,
  "Leucodelphinidin", "(+)-1(10),4-Cadinadiene", 0.876,
  "Falcarinone", "Lignin", 1,
  "A1", "(+)-1(10),4-Cadinadiene", 0.5,
  "A2", "Lignin", 1,
  "A3", "(2E,7R,11R)-2-Phyten-1-ol", 0.896
)
```
Create the lookup table with the unique categories. You could save it for future use or just run the code each time. Because it is built in code, it updates automatically if your csv files change, which is useful.
```{r lookup}
lookup <- df %>%
  select(category = start) %>% # rename to a common column name
  # union() drops duplicate rows, so each category appears only once
  # (duplicates in the lookup would multiply rows in the joins later)
  union(df %>% select(category = end)) %>%
  mutate(format_A = case_when(
    str_length(category) <= 3 & # assumes A.. has at most 3 characters (e.g. A10); raise the limit for codes like A100 (4)
      substr(category, 1, 1) == "A" ~ 1, # first character is an A
    TRUE ~ 0
  ))
```
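The length-and-first-letter test relies on knowing the longest A code in advance. As a possible alternative, here is a minimal sketch that flags names with a regular expression instead, assuming the A-style names are always the letter A followed only by digits. The `categories` vector is hypothetical sample data, not taken from the original csv.

```{r regex_flag}
library(stringr)

# Hypothetical category names, including a longer code (A100) and a
# compound name that merely starts with "A" (Apigenin)
categories <- c("A1", "A10", "A100", "Lignin", "Falcarinone", "Apigenin")

# TRUE only when the whole string is "A" followed by one or more digits
is_format_A <- str_detect(categories, "^A[0-9]+$")

# Lay the flag out next to the names so the assumption can be checked by eye
flag_table <- data.frame(category = categories, format_A = as.integer(is_format_A))
flag_table
```

Anchoring the pattern with `^` and `$` keeps a short compound name that happens to begin with A from being flagged by mistake.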
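Join the lookup table to the data twice, once on start and once on end, so that each row carries a flag for both columns. Then filter on the two flags: the code below keeps the rows where start and end are of different kinds (one matches the A.. format and the other does not) and drops the rows where both are the same kind.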
```{r Solution}
df_result <- df %>%
  left_join(lookup, by = c("start" = "category")) %>% # flag for start, becomes format_A.x
  left_join(lookup, by = c("end" = "category")) %>% # flag for end, becomes format_A.y
  filter(format_A.x != format_A.y)
```
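If you do not need the lookup table as a separate object, the same filter can be sketched without any joins. This assumes, as above, that the aim is to drop rows where start and end are the same kind of name. `is_A()` is a hypothetical helper and `df_small` is a cut-down copy of the example data.

```{r no_join_sketch}
library(dplyr)
library(stringr)

# Cut-down copy of the example data
df_small <- tibble::tribble(
  ~start, ~end, ~weight,
  "A1", "A2", 1,
  "Falcarinone", "A5", 1,
  "Falcarinone", "Lignin", 1,
  "A3", "(2E,7R,11R)-2-Phyten-1-ol", 0.896
)

# Hypothetical helper: TRUE when a name is "A" followed only by digits
is_A <- function(x) str_detect(x, "^A[0-9]+$")

# Keep rows where exactly one of start/end is an A-style name
df_small_result <- df_small %>%
  filter(is_A(start) != is_A(end))

df_small_result
```

The lookup table is still the better choice when you want to review the list of categories, since this version never shows you which names were flagged.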