---
title: "Inspect SGICs"
output: rmarkdown::html_vignette
description: >
  This vignette shows you how to use the package to check SGICs and related survey data for plausibility.
vignette: >
  %\VignetteIndexEntry{inspect_sgics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
Linking survey data with SGICs (subject-generated identification codes)? Awesome! Just remember that you need to validate those IDs first. Validation gives you clean data and helps the linkage go smoothly.

This vignette shows you:

- How to perform plausibility checks on different SGIC components.
- How to perform plausibility checks on non-SGIC variables that may serve as additional identifiers.
- How to detect duplicate cases using a combination of variables as unique identifiers.

To check the plausibility of ID-related variables in a dataset, `trustmebro` provides several functions whose names begin with the prefix `inspect_`. Every `inspect_` function returns a logical value (`TRUE` or `FALSE`) indicating whether a value has passed or failed the plausibility check.
We'll start by loading `trustmebro` and `dplyr`:
```{r setup, message=FALSE}
library(trustmebro)
library(dplyr)
```
# Data: sailor_students
The survey data we use is the `trustmebro::sailor_students` dataset. It contains fictional assessment data for students in the Sailor Moon universe.
```{r}
sailor_students
```
# SGIC Plausibility
The variable `sgic` stores SGICs created by students. Each SGIC is a seven-character string created according to the following instructions:

Characters 1-3 (letters):

- First letter of given name (1st character)
- Last letter of given name (2nd character)
- First letter of family name (3rd character)

Characters 4-7 (digits):

- Day of birth (4th and 5th characters)
- Month of birth (6th and 7th characters)
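
To make the construction rule concrete, here is a minimal base R sketch that assembles an SGIC from its ingredients. The helper `make_sgic()` and the example values are hypothetical and not part of `trustmebro`. Converting the letters to uppercase is an assumption, since the instructions above do not specify case.

```r
# Hypothetical helper (not part of trustmebro): build an SGIC from its parts.
make_sgic <- function(given_name, family_name, birth_day, birth_month) {
  n <- nchar(given_name)
  letters_part <- paste0(
    substr(given_name, 1, 1),   # first letter of given name
    substr(given_name, n, n),   # last letter of given name
    substr(family_name, 1, 1)   # first letter of family name
  )
  # Zero-pad day and month to two digits each
  digits_part <- sprintf("%02d%02d", as.integer(birth_day), as.integer(birth_month))
  # Assumption: letters are stored in uppercase
  paste0(toupper(letters_part), digits_part)
}

# Illustrative values only
make_sgic("Usagi", "Tsukino", birth_day = 30, birth_month = 6)
#> [1] "UIT3006"
```
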
## Check Character IDs
We can use `trustmebro::inspect_characterid` to check whether the provided SGICs follow the expected pattern of three letters followed by four digits. This structure can be described by the regular expression `"^[A-Za-z]{3}[0-9]{4}$"`, which we pass to the function via the `pattern` argument. To fit into your data workflow, the function can be combined with `dplyr::mutate`:
```{r}
sailor_students %>%
  mutate(
    structure_check = inspect_characterid(sgic, pattern = "^[A-Za-z]{3}[0-9]{4}$")
  ) %>%
  select(sgic, structure_check)
```
We created `trustmebro::inspect_characterid` with SGICs in mind, but any other string can be checked in the same way by supplying a suitable regular expression.
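
If you want to try out a regular expression before using it in a check, base R's `grepl()` is a convenient testing ground. The sketch below only illustrates what the pattern accepts and rejects. It is not necessarily how `inspect_characterid` is implemented, and the example codes are made up.

```r
# Three letters followed by four digits, nothing before or after
pattern <- "^[A-Za-z]{3}[0-9]{4}$"

# Made-up examples: valid, too few letters, too many digits, digits instead of letters
example_ids <- c("UIT3006", "UI3006", "UIT30066", "1233006")

structure_ok <- grepl(pattern, example_ids)
structure_ok
#> [1]  TRUE FALSE FALSE FALSE
```
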
## Check Birthdate Components
Since the SGIC should end with a day and month of birth, you can verify the plausibility of this component using `trustmebro::inspect_birthdaymonth`. This function checks whether a string contains exactly four digits that represent a valid day and month of birth. As before, you can combine `trustmebro::inspect_birthdaymonth` with `dplyr::mutate` to generate a plausibility check variable:
```{r}
sailor_students %>%
  mutate(birthdate_check = inspect_birthdaymonth(sgic)) %>%
  select(sgic, birthdate_check)
```
Some SGICs contain only the day or only the month in which a person was born. In this case, you can use `trustmebro::inspect_birthday` or `trustmebro::inspect_birthmonth` accordingly.
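
To illustrate the idea behind a day-month check, here is a rough base R sketch. The helper `is_valid_daymonth()` is hypothetical and may differ from what the package functions do internally. It assumes that non-digit characters are dropped first, that the digits are ordered day then month, and that 29 February counts as valid because no year is known.

```r
# Hypothetical helper (not part of trustmebro): is "DDMM" a plausible day and month?
is_valid_daymonth <- function(x) {
  # Keep digits only; anything other than exactly four digits fails
  digits <- gsub("[^0-9]", "", x)
  ok_length <- !is.na(x) & nchar(digits) == 4

  day <- suppressWarnings(as.integer(substr(digits, 1, 2)))
  month <- suppressWarnings(as.integer(substr(digits, 3, 4)))

  # Maximum day per month; February allows 29 because the year is unknown
  days_in_month <- c(31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  ok_month <- ok_length & month >= 1 & month <= 12
  max_day <- days_in_month[ifelse(ok_month, month, 1)]

  ok_month & day >= 1 & day <= max_day
}

# Made-up examples: 30 June, 31 February, day 00 in month 13, 29 February, too short
daymonth_ok <- is_valid_daymonth(c("UIT3006", "ABC3102", "ABC0013", "ABC2902", "AB12"))
daymonth_ok
#> [1]  TRUE FALSE FALSE  TRUE FALSE
```
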
# Non-SGIC variables' plausibility
Besides an SGIC, other variables in a dataset might be used to identify cases. As mentioned above, `trustmebro::inspect_characterid` can be used for any string that should follow a specific pattern. The package also provides functions for checking data types other than strings.
## Check Numbers
We can use `trustmebro::inspect_numberid` to check whether a number has the expected length. In our dataset, `school` should be a five-digit number. Combined with `dplyr::mutate`, we can add a plausibility variable for the school number, just as we did before:
```{r}
sailor_students %>%
  mutate(school_check = inspect_numberid(school, 5)) %>%
  select(school, school_check)
```
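
For intuition, a length check on a positive whole number can be expressed as a range check: a five-digit number lies between 10000 and 99999. The helper `has_n_digits()` below is a hypothetical base R sketch of that idea and is not taken from the package. How `inspect_numberid` treats decimals, negative numbers, or missing values is not covered here.

```r
# Hypothetical helper (not part of trustmebro): does x have exactly n digits?
has_n_digits <- function(x, n) {
  # Only positive whole numbers can pass; NA fails
  is_whole <- !is.na(x) & x == round(x) & x > 0
  is_whole & x >= 10^(n - 1) & x < 10^n
}

# Made-up examples: five digits, four digits, six digits, missing, not whole
digits_ok <- has_n_digits(c(12345, 1234, 123456, NA, 12345.5), 5)
digits_ok
#> [1]  TRUE FALSE FALSE FALSE FALSE
```
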
## Check the presence of a value within the recode map
When non-SGIC variables are used as identifiers, categorical data is often recoded to ensure consistency within a workflow. We can use `trustmebro::inspect_valinvec` to check whether a value exists in a recode map. The recode map should be a named vector, where the names represent the keys. In our dataset, we want to check whether all values in `gender` conform to this recode map:
```{r}
recode_gender <- c(Male = "M", Female = "F")
```
The function checks whether a value is present as a key. Combine it with `dplyr::mutate` to add a variable that contains the check results:
```{r}
sailor_students %>%
  mutate(gender_check = inspect_valinvec(gender, recode_gender)) %>%
  select(gender, gender_check)
```
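
Conceptually, checking a value against the keys of a named vector comes down to a membership test on its names. The following base R sketch shows that idea with made-up values. It is an illustration only, and the package function may handle edge cases such as missing values differently.

```r
recode_gender <- c(Male = "M", Female = "F")

# Made-up values: two valid keys, a lowercase variant, and a missing value
example_values <- c("Male", "Female", "male", NA)

# A value passes if it appears among the names (keys) of the recode map
is_key <- example_values %in% names(recode_gender)
is_key
#> [1]  TRUE  TRUE FALSE FALSE

# Values that pass can then be recoded by name lookup
recoded <- unname(recode_gender[example_values[is_key]])
recoded
#> [1] "M" "F"
```
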
# Identify Duplicate Cases
So far, we've checked whether `sgic`, `school` and `gender` contain plausible values. Finally, we want to make sure that these variables, when used together as identifiers, uniquely identify a single case, so that no duplicate entries exist for any combination of them. `trustmebro::find_dupes` adds a `has_dupes` variable to the dataset that indicates whether a combination of identifiers occurs more than once. To find duplicates in your data, use it like this:
```{r}
sailor_students %>%
  find_dupes(school, sgic, gender) %>%
  select(school, sgic, gender, has_dupes)
```
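
To see what flagging duplicates on a combination of variables means, here is a small base R sketch on a made-up data frame. It marks every row whose identifier combination occurs more than once. Whether `find_dupes` flags all rows of a duplicated group or only the repeated ones is an assumption here, so compare against the output above.

```r
# Made-up toy data: rows 1 and 2 share the same school, SGIC and gender
toy <- data.frame(
  school = c(12345, 12345, 12345, 67890),
  sgic   = c("UIT3006", "UIT3006", "AIM1706", "UIT3006"),
  gender = c("Female", "Female", "Female", "Female")
)

id_cols <- toy[c("school", "sgic", "gender")]

# duplicated() marks later repeats; scanning from both ends marks the whole group
toy$has_dupes <- duplicated(id_cols) | duplicated(id_cols, fromLast = TRUE)
toy$has_dupes
#> [1]  TRUE  TRUE FALSE FALSE
```
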