| Title: | Inspect and Clean Subject-Generated ID Codes and Related Data |
|---|---|
| Description: | Makes data wrangling with ID-related aspects more comfortable. Provides functions that make it easy to inspect various subject-generated ID codes (SGIC) for plausibility. Also helps with inspecting other common identifiers, ensuring that your data stays clean and reliable. |
| Authors: | Annemarie Pläschke [aut, cre, cph] (ORCID: <https://orcid.org/0009-0005-7115-8790>), Tobias Brändle [aut] (ORCID: <https://orcid.org/0000-0001-8872-9872>) |
| Maintainer: | Annemarie Pläschke <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.0.9000 |
| Built: | 2026-05-15 10:15:50 UTC |
| Source: | https://github.com/kuuuwe/trustmebro |
Identify duplicate cases in a data frame or tibble based on specific variables. A logical column 'has_dupes' is added that indicates whether a row shares its values in the provided variables with at least one other row.
find_dupes(data, ...)
data |
A data frame or tibble |
... |
Variable names to check for duplicates |
The original data frame or tibble with an additional logical column 'has_dupes' which is 'TRUE' for rows that have duplicates based on the specified variables and 'FALSE' otherwise.
    # Example data
    print(sailor_students)

    # Find duplicate cases based on 'sgic', 'school' and 'class'
    sailor_students_dupes <- find_dupes(sailor_students, sgic, school, class)

    # Rows where 'has_dupes' is `TRUE` indicate duplicates based on the provided columns
    print(sailor_students_dupes)
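The flagging logic described above can be approximated in base R. This is a minimal sketch, not the package's implementation: `flag_dupes` and the toy data frame are hypothetical names introduced for illustration, and the sketch assumes that every row of a duplicated group is flagged, including the first occurrence.

```r
# Hypothetical helper: flag rows whose key columns occur more than once.
flag_dupes <- function(data, cols) {
  keys <- data[, cols, drop = FALSE]
  # duplicated() only marks later repeats, so scan from both ends
  data$has_dupes <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  data
}

# Toy data (made up for this sketch): rows 1 and 3 share 'sgic' and 'school'
toy <- data.frame(
  sgic   = c("MUC0308", "HAM2701", "MUC0308"),
  school = c("10", "10", "10"),
  score  = c(12, 15, 9)
)

flagged <- flag_dupes(toy, c("sgic", "school"))
print(flagged)
```

Scanning from both ends matters because `duplicated()` alone would leave the first row of each duplicated group unflagged.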
Check whether a given string contains exactly one two-digit number that represents a valid day of the month (between 01 and 31). The string is assumed to be a code (e.g., a SGIC), which may include letters and digits.
inspect_birthday(code)
code |
A character string containing a SGIC or similar code that may include a numeric birthday-component. |
A logical value: 'TRUE' if the string contains exactly one valid birthday-component (between 01 and 31), otherwise 'FALSE'.
    inspect_birthday("DEF66")      # FALSE - 66 is not a valid day
    inspect_birthday("GHI02")      # TRUE - 02 is a valid day
    inspect_birthday("ABC12DEF34") # FALSE - Multiple numeric components
    inspect_birthday("XYZ")        # FALSE - No numeric component
    inspect_birthday("JKL31")      # TRUE - 31 is a valid day
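A check of this kind can be sketched in base R with a regular expression that extracts the runs of digits in the code. The helper name `looks_like_birthday` is hypothetical, and the sketch rests on an assumption about the rule: "exactly one two-digit number" is read as "exactly one run of digits, and that run is two digits long". The package itself may treat edge cases differently.

```r
# Hypothetical helper: TRUE if `code` holds exactly one run of digits,
# that run is two digits long, and its value lies between 1 and 31.
looks_like_birthday <- function(code) {
  runs <- regmatches(code, gregexpr("[0-9]+", code))[[1]]
  if (length(runs) != 1 || nchar(runs) != 2) {
    return(FALSE)
  }
  day <- as.integer(runs)
  day >= 1 && day <= 31
}

looks_like_birthday("GHI02")      # one run, "02", in range
looks_like_birthday("ABC12DEF34") # two runs, so rejected
```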
Checks whether a given string contains exactly one four-digit number representing a valid combination of a day (birthday) and a month (birth month). Numeric components can be interpreted in either "DDMM" (day-month) or "MMDD" (month-day) format, depending on the specified format. The string is assumed to be a code (e.g., a SGIC), which may include letters and digits.
inspect_birthdaymonth(code, format = "DDMM")
code |
A character string containing a SGIC or similar code that may include a numeric component representing a birthday and birth month. |
format |
A string specifying the format of the date of birth components in code. Use "DDMM" for day-month format and "MMDD" for month-day format. Default is "DDMM". |
A logical value: 'TRUE' if the string contains exactly one valid numeric component that forms a valid birthday (day and month), otherwise 'FALSE'.
    inspect_birthdaymonth("DEF2802")                 # TRUE - 28th of February is a valid date
    inspect_birthdaymonth("GHI3002")                 # FALSE - 30th of February is invalid
    inspect_birthdaymonth("XYZ3112")                 # TRUE - 31st of December is valid
    inspect_birthdaymonth("18DEF02")                 # FALSE - Multiple numeric components
    inspect_birthdaymonth("XYZ")                     # FALSE - No numeric components
    inspect_birthdaymonth("ABC1231", format = "MMDD") # TRUE - December 31st is valid
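The day-and-month validation can be illustrated with a lookup of the maximum day per month. This is a sketch under stated assumptions rather than the package's code: `looks_like_birthdaymonth` is a hypothetical name, the code must contain exactly one run of digits that is four digits long, and February is allowed 29 days because the code carries no year.

```r
# Hypothetical helper: validate a four-digit day/month component.
looks_like_birthdaymonth <- function(code, format = "DDMM") {
  runs <- regmatches(code, gregexpr("[0-9]+", code))[[1]]
  if (length(runs) != 1 || nchar(runs) != 4) {
    return(FALSE)
  }
  first  <- as.integer(substr(runs, 1, 2))
  second <- as.integer(substr(runs, 3, 4))
  day   <- if (format == "DDMM") first else second
  month <- if (format == "DDMM") second else first
  # Assumption: February may have 29 days, since no year is available
  days_in_month <- c(31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  # The month range is checked first, so the lookup index is always valid
  month >= 1 && month <= 12 && day >= 1 && day <= days_in_month[month]
}

looks_like_birthdaymonth("GHI3002")                  # 30 February is rejected
looks_like_birthdaymonth("ABC1231", format = "MMDD") # 31 December is accepted
```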
Check whether a given string contains exactly one two-digit number that represents a valid month of the year (between 01 and 12). The string is assumed to be a code (e.g., a SGIC), which may include letters and digits.
inspect_birthmonth(code)
code |
A character string containing a SGIC or similar code that may include a numeric birth month-component. |
A logical value: 'TRUE' if the string contains exactly one valid birth month-component (between 01 and 12), otherwise 'FALSE'.
    inspect_birthmonth("DEF66")      # FALSE - 66 is not a valid month
    inspect_birthmonth("GHI02")      # TRUE - 02 (February) is a valid month
    inspect_birthmonth("ABC12DEF10") # FALSE - Multiple numeric components
    inspect_birthmonth("XYZ")        # FALSE - No numeric component
    inspect_birthmonth("JKL11")      # TRUE - 11 (November) is a valid month
Check whether a given string matches a specified pattern using regular expressions (regex). The string is assumed to be a code (e.g., a SGIC), which should follow a predefined format.
inspect_characterid(code, pattern)
code |
A character string containing a SGIC or similar code that should follow a predefined format. |
pattern |
A character string specifying the expected pattern using regular expressions (regex). The pattern defines the format 'code' should match. |
A logical value: 'TRUE' if the code matches the expected pattern, otherwise 'FALSE'.
    inspect_characterid("ABC1234", "^[A-Za-z]{3}[0-9]{4}$")   # TRUE - Matches the pattern
    inspect_characterid("12DBG45FG", "^[A-Za-z]{3}[0-9]{4}$") # FALSE - Does not match the pattern
Check whether a given numeric value has the expected number of digits.
inspect_numberid(number, expected_length)
number |
A numeric value. |
expected_length |
An integer specifying the expected number of digits. |
A logical value: 'TRUE' if 'number' has the expected length and consists only of digits, otherwise 'FALSE'.
    inspect_numberid(12345, 5) # TRUE - 5 digits
    inspect_numberid(1234, 5)  # FALSE - 4 digits
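Counting the digits of a numeric value can be sketched in base R by formatting the number without scientific notation and checking the resulting string. The helper `has_n_digits` is a hypothetical name, and this is one possible approach rather than the package's own logic.

```r
# Hypothetical helper: TRUE if `number` prints as exactly
# `expected_length` digits and nothing else (no sign, no decimal point).
has_n_digits <- function(number, expected_length) {
  # scientific = FALSE keeps large round numbers written out in full
  digits <- format(number, scientific = FALSE, trim = TRUE)
  grepl("^[0-9]+$", digits) && nchar(digits) == expected_length
}

has_n_digits(12345, 5)  # five digits
has_n_digits(12.5, 3)   # the decimal point is not a digit, so rejected
```

Because the input is numeric, leading zeros cannot be checked this way: a value such as 00123 is stored as 123.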
Check whether a given value is present as a key in a specified recode map. Inputs can be validated against a set of predefined categories or labels.
inspect_valinvec(value, recode_map)
value |
A single value to inspect, which is checked against the keys of a recode map. |
recode_map |
A named vector where the names represent the keys to check against. The values of the vector are ignored. |
A logical value: 'TRUE' if the 'value' is a key in the 'recode_map', otherwise 'FALSE'.
    recode_map <- c(male = "M", female = "F")
    inspect_valinvec("female", recode_map) # TRUE - "female" is a key in the recode map
    inspect_valinvec("other", recode_map)  # FALSE - "other" is not a key in the recode map
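Since only the names of the recode map matter, the membership test can be sketched in base R with `%in%` and `names()`. The helper `is_key` is a hypothetical name, and applying it over a vector with `vapply()` is an illustration, not a documented feature of the package.

```r
# Hypothetical helper: TRUE if `value` is one of the names of `recode_map`.
is_key <- function(value, recode_map) {
  value %in% names(recode_map)
}

recode_map <- c(male = "M", female = "F")
answers <- c("female", "other", "M")

# Check several single values in turn; "M" is a value of the map, not a key
checked <- vapply(answers, is_key, logical(1),
                  recode_map = recode_map, USE.NAMES = FALSE)
print(checked)
```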
Clean specified character columns in a data frame or tibble by replacing non-alphanumeric characters with a specified character (default is "#"). NA values are replaced as well, and additional characters can be listed that should be kept in the cleaned strings. The resulting strings are converted to uppercase.
purge_string(data, ..., replacement = "#", keep = "")
data |
A data frame or tibble containing columns to be cleaned. |
... |
Variables to clean. If none are provided, all character columns will be processed. |
replacement |
A character string used to replace unwanted characters and empty strings. Default is "#". |
keep |
A character string containing any additional characters that should be retained in the cleaned strings. |
A data frame or tibble with the specified character columns cleaned and modified as per the given parameters.
    # Example data
    print(sailor_students)

    # Clean the 'sgic', 'school', 'class' and 'gender' columns,
    # replacing unwanted characters with "#" and retaining "-"
    sailor_students_cleaned <- purge_string(sailor_students, sgic, school, class, gender, keep = "-")

    # Tibble with cleaned 'sgic', 'school', 'class' and 'gender' columns
    print(sailor_students_cleaned)
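The cleaning steps described above can be sketched for a single character vector in base R. The helper `purge_chr` is a hypothetical name, and the sketch assumes that `keep` is pasted verbatim into a regular-expression character class, so it only handles characters that are safe in that position, such as "-". The package may build its pattern differently.

```r
# Hypothetical helper: clean one character vector.
purge_chr <- function(x, replacement = "#", keep = "") {
  # Negated character class: anything that is not a letter, a digit,
  # or one of the characters listed in `keep`
  pattern <- paste0("[^[:alnum:]", keep, "]")
  # Replace missing values and empty strings first
  x[is.na(x) | x == ""] <- replacement
  toupper(gsub(pattern, replacement, x))
}

cleaned <- purge_chr(c("muc-03 08", NA, "ab_c"), keep = "-")
print(cleaned)
```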
Recode a specified variable in a data frame or tibble based on a provided recode map. If the recode map is empty, the original variable is retained under a new name.
recode_valinvec(data, var, recode_map, new_var)
data |
A data frame or tibble. |
var |
A variable to be recoded. |
recode_map |
A named vector specifying the recode map. |
new_var |
Name of the new variable holding the recoded values. |
A data frame or tibble with the new variable added.
    # Example data
    print(sailor_students)

    # Define a recode map for gender
    recode_map_gender <- c("Female" = "F", "Male" = "M", "Other" = "X")

    # Recode gender
    sailor_students_recoded <- recode_valinvec(sailor_students, gender, recode_map_gender, recode_gender)

    # A tibble with a recoded gender variable
    print(sailor_students_recoded)
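Recoding through a named vector can be illustrated in base R by indexing the map with the values to recode. This is a sketch under assumptions, not the package's implementation: `recode_by_map` is a hypothetical name, the names of the map are taken as the old values and its elements as the new ones, and values without a matching key become `NA` here, which may differ from what `recode_valinvec()` does.

```r
# Hypothetical helper: recode a vector by looking up each value by name.
recode_by_map <- function(x, recode_map) {
  # An empty map leaves the values unchanged
  if (length(recode_map) == 0) {
    return(x)
  }
  # Values without a matching name come back as NA in this sketch
  unname(recode_map[as.character(x)])
}

recode_map_gender <- c("Female" = "F", "Male" = "M", "Other" = "X")
gender <- c("Male", "Female", "Unknown")

recoded <- recode_by_map(gender, recode_map_gender)
print(recoded)
```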
A fictional key data set.
sailor_keys
'sailor_keys' A tibble with 12 rows and 6 columns:
schoolyear
hexadecimal ID number
student information
school information
subject generated ID
A fictional assessment data set.
sailor_students
'sailor_students' A tibble with 12 rows and 6 columns:
Subject generated ID
schoolnumber
class designation
gender
testscores