| Title: | A Comprehensive Hit or Miss Probabilistic Entity Resolution Model |
|---|---|
| Description: | Provides Bayesian probabilistic methods for record linkage and entity resolution across multiple datasets using the Comprehensive Hit Or Miss Probabilistic Entity Resolution (CHOMPER) model. The package implements three inference approaches: (1) Evolutionary Variational Inference for record Linkage (EVIL), (2) Coordinate Ascent Variational Inference (CAVI), and (3) Markov chain Monte Carlo (MCMC) with a split-and-merge process. The model supports both discrete and continuous fields and applies a locally varying hit mechanism to attributes with multiple truths. It also provides tools for performance evaluation based on either approximated variational factors or posterior samples. EVIL supports multi-threaded parallel computation to estimate the linkage structure faster. |
| Authors: | Hyungjoon Kim [aut, cre], Andee Kaplan [aut], Matthew Koslovsky [aut] |
| Maintainer: | Hyungjoon Kim <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.3 |
| Built: | 2026-05-17 09:33:54 UTC |
| Source: | https://github.com/hjkim8987/chomper |
Maintainer: Hyungjoon Kim [email protected]
Authors:
Andee Kaplan [email protected]
Matthew Koslovsky [email protected]
Fit the CHOMPER model with a single Coordinate Ascent Variational Inference (CAVI) to estimate the linkage structure across multiple datasets. It returns the approximate variational factors of the linkage structure that maximize the evidence lower bound (ELBO) and other parameters of the CHOMPER model.
chomperCAVI(
  x, k, n, N, p, M,
  discrete_fields, continuous_fields,
  hyper_beta,
  hyper_phi = c(), hyper_tau = c(),
  hyper_epsilon_discrete = c(), hyper_epsilon_continuous = c(),
  hyper_sigma = matrix(nrow = 0, ncol = 2),
  overlap_prob = 0.5,
  tol_cavi = 1e-05, max_iter_cavi = 100,
  max_time = 86400,
  custom_initializer = FALSE, use_checkpoint = FALSE,
  initial_values = NULL, checkpoint_values = NULL,
  verbose_internal = TRUE
)
| Argument | Description |
|---|---|
| x | A list of data frames, each representing a dataset. |
| k | The number of datasets to be linked. |
| n | The number of rows in each dataset (vector of length k). |
| N | The total number of records across all datasets (the example sets N <- sum(n)). |
| p | The number of fields in each dataset. |
| M | The number of categories for each discrete field (vector with one element per discrete field). |
| discrete_fields | The indexes of the discrete fields (1-based index). |
| continuous_fields | The indexes of the continuous fields (1-based index). |
| hyper_beta | The hyperparameters for the beta distribution (matrix of size p x 2). |
| hyper_phi | The hyperparameters for the softmax representation (vector with one element per discrete field). |
| hyper_tau | The temperature parameter (vector with one element per discrete field). |
| hyper_epsilon_discrete | The range parameter for the comprehensive hit of discrete fields (vector with one element per discrete field). |
| hyper_epsilon_continuous | The range parameter for the comprehensive hit of continuous fields (vector with one element per continuous field). |
| hyper_sigma | The hyperparameters for the inverse gamma distribution (matrix with one row per continuous field and 2 columns). |
| overlap_prob | The presumed probability of overlap across the datasets. |
| tol_cavi | The convergence tolerance for the coordinate ascent variational inference. |
| max_iter_cavi | The maximum number of iterations for the coordinate ascent variational inference. |
| max_time | The maximum time limit for the execution in seconds. |
| custom_initializer | Whether to use a custom initializer for the initial values. |
| use_checkpoint | Whether to use a checkpoint. |
| initial_values | The initial values for the parameters (optional). |
| checkpoint_values | The checkpoint values for the parameters (optional). |
| verbose_internal | Whether to print the internal C++ messages (TRUE: print, FALSE: do not print). |
A list of the approximated parameters of the variational factors and other information containing:
nu: A list of parameter matrices for the approximate multinomial posterior of the linkage structure.
omega: A matrix of parameter vectors for the approximate beta posterior of the distortion ratio.
rho: A list of parameter matrices for the approximate Bernoulli posterior of the distortion indicators.
gamma: A list of parameter matrices for the approximate multinomial posterior of discrete true latent values.
alpha: A list of parameter vectors for the approximate Dirichlet posterior (theta) of the discrete true latent values.
eta_tilde: A matrix of parameter vectors for the mean of the approximate normal posterior of continuous true latent values.
eta_mean: A vector of mean parameters for the approximate normal posterior (eta_tilde) of the continuous true latent values.
eta_var: A vector of variance parameters for the approximate normal posterior (eta_tilde) of the continuous true latent values.
sigma_tilde: A matrix of parameter vectors for the variance of the approximate normal posterior of continuous true latent values.
sigma_shape: A vector of shape parameters for the approximate inverse gamma posterior (sigma_tilde) of the continuous true latent values.
sigma_scale: A vector of scale parameters for the approximate inverse gamma posterior (sigma_tilde) of the continuous true latent values.
ELBO: The final maximum ELBO from a single CAVI.
niter: The number of iterations for a single CAVI.
interruption: Whether the CHOMPER-CAVI is interrupted. The fitting is interrupted if the elapsed time reaches the maximum time limit.
cavi_elapsed_time: The maximum elapsed time of a single CAVI iteration.
elapsed_time: The elapsed time of the entire CAVI process in seconds.
# 1. Generate sample data for testing
sample_data <- generate_sample_data(
  n_entities = 10,
  n_files = 3,
  overlap_ratio = 0.7,
  discrete_columns = c(1, 2),
  discrete_levels = c(3, 3),
  continuous_columns = c(3, 4),
  continuous_params = matrix(c(0, 0, 1, 1), ncol = 2),
  distortion_ratio = c(0.1, 0.1, 0.1, 0.1)
)
# 2. Get file information and remove `id` from the original data
n <- numeric(3)
x <- list()
for (i in 1:3) {
  n[i] <- nrow(sample_data[[i]])
  x[[i]] <- sample_data[[i]][, -1]
}
N <- sum(n)
# 3. Set hyperparameters
hyper_beta <- matrix(
  rep(c(N * 0.1 * 0.01, N * 0.1), 4),
  ncol = 2, byrow = TRUE
)
hyper_sigma <- matrix(
  rep(c(0.01, 0.01), 2),
  ncol = 2, byrow = TRUE
)
# 4. Fit CHOMPER-CAVI
result <- chomperCAVI(
  x = x,
  k = 3, # number of datasets
  n = n, # rows per dataset
  N = N, # total number of records (sum of n)
  p = 4, # fields per dataset
  M = c(3, 3), # categories for discrete fields
  discrete_fields = c(1, 2),
  continuous_fields = c(3, 4),
  hyper_beta = hyper_beta, # hyperparameter for distortion rate
  hyper_sigma = hyper_sigma, # hyperparameter for continuous fields
  hyper_phi = c(2.0, 2.0),
  hyper_tau = c(0.01, 0.01),
  hyper_epsilon_discrete = c(0, 0),
  hyper_epsilon_continuous = c(0.001, 0.001)
)
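The hyper_beta values in the example above can be read as a prior on each field's distortion ratio. The sketch below assumes that each row holds the two shape parameters (a, b) of a Beta distribution; that parameterization is an assumption made for illustration, and `N_example` is a hypothetical record count, not a value from the package.

```r
# Assumption: each row of hyper_beta holds Beta shape parameters (a, b).
N_example <- 21 # hypothetical total number of records
hyper_beta <- matrix(
  rep(c(N_example * 0.1 * 0.01, N_example * 0.1), 4),
  ncol = 2, byrow = TRUE
)
# The mean of Beta(a, b) is a / (a + b); N cancels out of this ratio.
prior_mean <- hyper_beta[, 1] / (hyper_beta[, 1] + hyper_beta[, 2])
# The sum a + b grows with N, so the prior becomes more concentrated.
prior_strength <- rowSums(hyper_beta)
print(round(prior_mean, 4))
print(prior_strength)
```

Under this reading, the example encodes a prior belief that roughly 1% of the values in each field are distorted, with a strength that scales with the number of records.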
Fit the CHOMPER model with Evolutionary Variational Inference for record linkage (EVIL) to estimate the linkage structure across multiple datasets. It returns the approximate variational factors of the linkage structure that maximize the evidence lower bound (ELBO) and other parameters of the CHOMPER model.
chomperEVIL(
  x, k, n, N, p, M,
  discrete_fields, continuous_fields,
  hyper_beta,
  hyper_phi = c(), hyper_tau = c(),
  hyper_epsilon_discrete = c(), hyper_epsilon_continuous = c(),
  hyper_sigma = matrix(nrow = 0, ncol = 2),
  overlap_prob = 0.5,
  n_parents = 5, n_children = 10,
  tol_cavi = 1e-05, max_iter_cavi = 100,
  tol_evi = 1e-05, max_iter_evi = 50,
  n_threads = 1,
  max_time = 86400,
  custom_initializer = FALSE, use_checkpoint = FALSE,
  initial_values = NULL, checkpoint_values = NULL,
  verbose_internal = TRUE
)
| Argument | Description |
|---|---|
| x | A list of data frames, each representing a dataset. |
| k | The number of datasets to be linked. |
| n | The number of rows in each dataset (vector of length k). |
| N | The total number of records across all datasets (the example sets N <- sum(n)). |
| p | The number of fields in each dataset. |
| M | The number of categories for each discrete field (vector with one element per discrete field). |
| discrete_fields | The indexes of the discrete fields (1-based index). |
| continuous_fields | The indexes of the continuous fields (1-based index). |
| hyper_beta | The hyperparameters for the beta distribution (matrix of size p x 2). |
| hyper_phi | The hyperparameters for the softmax representation (vector with one element per discrete field). |
| hyper_tau | The temperature parameter (vector with one element per discrete field). |
| hyper_epsilon_discrete | The range parameter for the comprehensive hit of discrete fields (vector with one element per discrete field). |
| hyper_epsilon_continuous | The range parameter for the comprehensive hit of continuous fields (vector with one element per continuous field). |
| hyper_sigma | The hyperparameters for the inverse gamma distribution (matrix with one row per continuous field and 2 columns). |
| overlap_prob | The presumed probability of overlap across the datasets. |
| n_parents | The number of parents for a generation. |
| n_children | The number of children for the next generation. |
| tol_cavi | The convergence tolerance for the coordinate ascent variational inference. |
| max_iter_cavi | The maximum number of iterations for the coordinate ascent variational inference. |
| tol_evi | The convergence tolerance for the evolutionary variational inference. |
| max_iter_evi | The maximum number of iterations for the evolutionary variational inference. |
| n_threads | The number of threads for parallel computation. |
| max_time | The maximum time limit for the execution in seconds. |
| custom_initializer | Whether to use a custom initializer for the initial values. |
| use_checkpoint | Whether to use a checkpoint. |
| initial_values | The initial values for the parameters (optional). |
| checkpoint_values | The checkpoint values for the parameters (optional). |
| verbose_internal | Whether to print the internal C++ messages (TRUE: print, FALSE: do not print). |
A list of the approximated parameters of the variational factors and other information containing:
nu: A list of parameter matrices for the approximate multinomial posterior of the linkage structure.
omega: A matrix of parameter vectors for the approximate beta posterior of the distortion ratio.
rho: A list of parameter matrices for the approximate Bernoulli posterior of the distortion indicators.
gamma: A list of parameter matrices for the approximate multinomial posterior of discrete true latent values.
alpha: A list of parameter vectors for the approximate Dirichlet posterior (theta) of the discrete true latent values.
eta_tilde: A matrix of parameter vectors for the mean of the approximate normal posterior of continuous true latent values.
eta_mean: A vector of mean parameters for the approximate normal posterior (eta_tilde) of the continuous true latent values.
eta_var: A vector of variance parameters for the approximate normal posterior (eta_tilde) of the continuous true latent values.
sigma_tilde: A matrix of parameter vectors for the variance of the approximate normal posterior of continuous true latent values.
sigma_shape: A vector of shape parameters for the approximate inverse gamma posterior (sigma_tilde) of the continuous true latent values.
sigma_scale: A vector of scale parameters for the approximate inverse gamma posterior (sigma_tilde) of the continuous true latent values.
ELBO: A vector of maximum ELBO at each generation.
niter: The number of generations EVIL created.
interruption: Whether the CHOMPER-EVIL is interrupted. The fitting is interrupted if the elapsed time reaches the maximum time limit.
maximum_elapsed_time: The maximum elapsed time of a single CAVI iteration throughout the entire EVIL process.
elapsed_time: The elapsed time of the entire EVIL process in seconds.
# 1. Generate sample data for testing
sample_data <- generate_sample_data(
  n_entities = 10,
  n_files = 3,
  overlap_ratio = 0.7,
  discrete_columns = c(1, 2),
  discrete_levels = c(3, 3),
  continuous_columns = c(3, 4),
  continuous_params = matrix(c(0, 0, 1, 1), ncol = 2),
  distortion_ratio = c(0.1, 0.1, 0.1, 0.1)
)
# 2. Get file information and remove `id` from the original data
n <- numeric(3)
x <- list()
for (i in 1:3) {
  n[i] <- nrow(sample_data[[i]])
  x[[i]] <- sample_data[[i]][, -1]
}
N <- sum(n)
# 3. Set hyperparameters
hyper_beta <- matrix(
  rep(c(N * 0.1 * 0.01, N * 0.1), 4),
  ncol = 2, byrow = TRUE
)
hyper_sigma <- matrix(
  rep(c(0.01, 0.01), 2),
  ncol = 2, byrow = TRUE
)
# 4. Fit CHOMPER-EVIL
result <- chomperEVIL(
  x = x,
  k = 3, # number of datasets
  n = n, # rows per dataset
  N = N, # total number of records (sum of n)
  p = 4, # fields per dataset
  M = c(3, 3), # categories for discrete fields
  discrete_fields = c(1, 2),
  continuous_fields = c(3, 4),
  hyper_beta = hyper_beta, # hyperparameter for distortion rate
  hyper_sigma = hyper_sigma, # hyperparameter for continuous fields
  hyper_phi = c(2.0, 2.0),
  hyper_tau = c(0.01, 0.01),
  hyper_epsilon_discrete = c(0, 0),
  hyper_epsilon_continuous = c(0.001, 0.001),
  n_threads = 1
)
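Since the returned ELBO is a vector with one value per generation, it can be inspected after fitting to see when the search flattened out. The sketch below is only an illustration: `elbo_trace` contains made-up numbers, `first_converged` is a hypothetical helper, and the stopping rule (absolute change between consecutive generations below a tolerance) is an assumption; the criterion the package applies to tol_evi may differ.

```r
# Hypothetical ELBO values, one per generation (not real package output).
elbo_trace <- c(-1520.4, -1498.7, -1490.2, -1490.19, -1490.19)

# Assumed rule: report the first generation whose change from the previous
# generation is below `tol`. Returns NA if that never happens.
first_converged <- function(elbo, tol = 1e-05) {
  gains <- abs(diff(elbo))
  hit <- which(gains < tol)
  if (length(hit) == 0) NA_integer_ else hit[1] + 1L
}

print(first_converged(elbo_trace, tol = 0.05))
print(first_converged(elbo_trace))
```

A looser tolerance flags an earlier generation, which is the trade-off between run time and how close the final ELBO is to its plateau.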
Fit the CHOMPER model using Markov chain Monte Carlo (MCMC) with a split-and-merge process to estimate the linkage structure across multiple datasets. It returns the posterior samples of the linkage structure and other parameters of the CHOMPER model.
chomperMCMC(
  x, k, n, N, p, M,
  discrete_fields, continuous_fields,
  hyper_beta,
  hyper_phi = c(), hyper_tau = c(),
  hyper_epsilon_discrete = c(), hyper_epsilon_continuous = c(),
  hyper_sigma = matrix(nrow = 0, ncol = 2),
  n_burnin = 1000, n_gibbs = 1000, n_split_merge = 10,
  max_time = 86400,
  custom_initializer = FALSE, use_checkpoint = FALSE,
  initial_values = NULL, checkpoint_values = NULL,
  verbose_internal = TRUE
)
| Argument | Description |
|---|---|
| x | A list of data frames, each representing a dataset. |
| k | The number of datasets to be linked. |
| n | The number of rows in each dataset (vector of length k). |
| N | The total number of records across all datasets (the example sets N <- sum(n)). |
| p | The number of fields in each dataset. |
| M | The number of categories for each discrete field (vector with one element per discrete field). |
| discrete_fields | The indexes of the discrete fields (1-based index). |
| continuous_fields | The indexes of the continuous fields (1-based index). |
| hyper_beta | The hyperparameters for the beta distribution (matrix of size p x 2). |
| hyper_phi | The hyperparameters for the softmax representation (vector with one element per discrete field). |
| hyper_tau | The temperature parameter (vector with one element per discrete field). |
| hyper_epsilon_discrete | The range parameter for the comprehensive hit of discrete fields (vector with one element per discrete field). |
| hyper_epsilon_continuous | The range parameter for the comprehensive hit of continuous fields (vector with one element per continuous field). |
| hyper_sigma | The hyperparameters for the inverse gamma distribution (matrix with one row per continuous field and 2 columns). |
| n_burnin | The number of burn-in iterations for the MCMC. |
| n_gibbs | The number of Gibbs sampling iterations for the MCMC. |
| n_split_merge | The number of split and merge iterations for the MCMC. |
| max_time | The maximum time limit for the execution in seconds. |
| custom_initializer | Whether to use a custom initializer for the initial values. |
| use_checkpoint | Whether to use a checkpoint. |
| initial_values | The initial values for the parameters (optional). |
| checkpoint_values | The checkpoint values for the parameters (optional). |
| verbose_internal | Whether to print the internal C++ messages (TRUE: print, FALSE: do not print). |
A list of the posterior samples and other information containing:
lambda: A list of posterior samples (integer vectors) of the linkage structure.
z: A list of posterior samples (binary matrices) of the distortion indicators.
y: A list of posterior samples (matrices) of the true latent records.
beta: A list of posterior samples (numeric vectors) of the distortion ratio.
theta: A list of posterior samples (numeric vectors) of the probabilities of discrete true latent values.
eta: A list of posterior samples (numeric vectors) of the mean of continuous true latent values.
sigma: A list of posterior samples (numeric vectors) of the variance of continuous true latent values.
n_sample: Total number of posterior samples after burn-in.
n_shift: Total number of accepted split and merge results after burn-in.
elapsed_time: The elapsed time of the entire MCMC process in seconds.
interruption: Whether the CHOMPER-MCMC is interrupted. The fitting is interrupted if the elapsed time reaches the maximum time limit.
# 1. Generate sample data for testing
sample_data <- generate_sample_data(
  n_entities = 10,
  n_files = 3,
  overlap_ratio = 0.7,
  discrete_columns = c(1, 2),
  discrete_levels = c(3, 3),
  continuous_columns = c(3, 4),
  continuous_params = matrix(c(0, 0, 1, 1), ncol = 2),
  distortion_ratio = c(0.1, 0.1, 0.1, 0.1)
)
# 2. Get file information and remove `id` from the original data
n <- numeric(3)
x <- list()
for (i in 1:3) {
  n[i] <- nrow(sample_data[[i]])
  x[[i]] <- sample_data[[i]][, -1]
}
N <- sum(n)
# 3. Set hyperparameters
hyper_beta <- matrix(
  rep(c(N * 0.1 * 0.01, N * 0.1), 4),
  ncol = 2, byrow = TRUE
)
hyper_sigma <- matrix(
  rep(c(0.01, 0.01), 2),
  ncol = 2, byrow = TRUE
)
# 4. Fit CHOMPER-MCMC
result <- chomperMCMC(
  x = x,
  k = 3, # number of datasets
  n = n, # rows per dataset
  N = N, # total number of records (sum of n)
  p = 4, # fields per dataset
  M = c(3, 3), # categories for discrete fields
  discrete_fields = c(1, 2),
  continuous_fields = c(3, 4),
  hyper_beta = hyper_beta, # hyperparameter for distortion rate
  hyper_sigma = hyper_sigma, # hyperparameter for continuous fields
  hyper_phi = c(2.0, 2.0),
  hyper_tau = c(0.01, 0.01),
  hyper_epsilon_discrete = c(0, 0),
  hyper_epsilon_continuous = c(0.001, 0.001),
  n_burnin = 0,
  n_gibbs = 100,
  n_split_merge = 10
)
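One simple summary of the lambda samples is the posterior distribution of the number of distinct entities. The sketch below assumes that each posterior sample is a list with one integer vector of entity labels per file, which is the structure used in the flatten_posterior_samples example; the `lambda` object here is a randomly generated stand-in, not output from chomperMCMC.

```r
set.seed(1)
# Hypothetical stand-in for result$lambda: 10 samples, 2 files (2 + 3 records).
lambda <- lapply(1:10, function(i) {
  list(sample(1:5, 2, TRUE), sample(1:5, 3, TRUE))
})
# Number of distinct entity labels in each posterior sample.
n_entities <- vapply(
  lambda,
  function(s) length(unique(unlist(s))),
  integer(1)
)
# Empirical distribution of the number of entities across samples.
print(table(n_entities) / length(n_entities))
```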
This function converts a list of posterior samples of lambda into a matrix.
Before calculating the posterior similarity matrix using psm_mcmc, it is necessary to flatten the posterior samples into a matrix.
flatten_posterior_samples(samples, k, N)
| Argument | Description |
|---|---|
| samples | A list of MCMC samples. |
| k | The number of files to be linked. |
| N | The total number of records. |
A matrix with N rows and one column per MCMC sample.
# 1. Create a list of posterior samples of linkage structure
number_of_files <- 2
n_file1 <- 2
n_file2 <- 3
number_of_records <- n_file1 + n_file2
number_of_samples <- 10
lambda <- list()
for (i in 1:number_of_samples) {
  lambda[[i]] <- list(
    sample(1:number_of_records, n_file1, TRUE),
    sample(1:number_of_records, n_file2, TRUE)
  )
}
# 2. Convert the list of posterior samples of lambda into a matrix
flatten_posterior_samples(lambda, number_of_files, number_of_records)
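To make the shape of the result concrete, the following base R sketch builds an N by number-of-samples matrix from the same kind of nested list. It is an illustration of the idea only, assuming the per-file vectors are simply concatenated in file order; it is not the package's implementation, and `n_files`, `N_total`, and `flat` are names chosen for this example.

```r
set.seed(2)
n_files <- c(2, 3) # records per file
N_total <- sum(n_files)
# Hypothetical posterior samples: one integer vector of labels per file.
lambda <- lapply(1:10, function(i) {
  lapply(n_files, function(m) sample(1:N_total, m, TRUE))
})
# Concatenate the per-file vectors of each sample into one column.
flat <- vapply(
  lambda,
  function(s) as.integer(unlist(s)),
  integer(N_total)
)
print(dim(flat))
```

Each column of `flat` is one posterior sample, and each row follows a single record across samples.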
Generate synthetic data for record linkage with a given number of entities, number of files, and overlap ratio. Each variable can follow either a multinomial or a Gaussian distribution. Users can specify whether each variable has multiple truths.
generate_sample_data(
  n_entities, n_files, overlap_ratio,
  discrete_columns, discrete_levels,
  continuous_columns, continuous_params,
  distortion_ratio,
  discrete_fuzziness = NULL, continuous_fuzziness = NULL
)
| Argument | Description |
|---|---|
| n_entities | The number of entities. |
| n_files | The number of files. |
| overlap_ratio | The ratio of overlapping entities across the files. |
| discrete_columns | The indices of the discrete columns (1-based index). |
| discrete_levels | The number of levels of each discrete column (vector with one element per discrete column). |
| continuous_columns | The indices of the continuous columns (1-based index). |
| continuous_params | The parameters of the continuous columns (matrix with one row per continuous column and 2 columns). |
| distortion_ratio | The distortion ratio of each column (vector with one element per column). |
| discrete_fuzziness | The configuration of the multiple truths of the discrete columns (optional). |
| continuous_fuzziness | The configuration of the multiple truths of the continuous columns (optional). |
A list of matrices containing the noisy synthetic data. Each matrix represents a file.
# 1. Set number of entities, files, and overlap ratio
n_entities <- 25
n_files <- 2
overlap_ratio <- 0.9
# 2. Set attributes information
discrete_columns <- 1:4
discrete_levels <- rep(5, 4)
continuous_columns <- 5:6
continuous_params <- matrix(c(0, 10, 10, 10),
  ncol = 2, byrow = TRUE # means and variances
)
# 3. Set distortion ratio and fuzziness information
distortion_ratio <- rep(0.01, 6)
discrete_fuzziness <- matrix(c(4, 1),
  ncol = 2, byrow = TRUE
)
continuous_fuzziness <- matrix(c(5, 0.5^2, 6, 0.5^2),
  ncol = 2, byrow = TRUE
)
# 4. Generate synthetic data
simulation_data <- generate_sample_data(
  n_entities, n_files, overlap_ratio,
  discrete_columns, discrete_levels,
  continuous_columns, continuous_params,
  distortion_ratio,
  discrete_fuzziness, continuous_fuzziness
)
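For intuition about what a distortion ratio means for a discrete field, the sketch below simulates a generic hit-or-miss corruption in base R: with probability `distortion_ratio` a value is a miss and is redrawn from the category distribution, otherwise the true value is copied. This is an assumed, simplified mechanism for illustration; how generate_sample_data distorts values internally is not described here, and all names in the block are hypothetical.

```r
set.seed(3)
truth <- sample(1:5, 1000, TRUE) # latent true categories
distortion_ratio <- 0.1 # probability that a value is a miss
is_miss <- runif(length(truth)) < distortion_ratio
observed <- truth
# A miss is redrawn uniformly, so it can still coincide with the truth.
observed[is_miss] <- sample(1:5, sum(is_miss), TRUE)
print(mean(is_miss)) # share of values flagged as misses
print(mean(observed != truth)) # share of values that actually changed
```

The second share is at most the first, because a redrawn value sometimes lands on the true category.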
Two data files collected in the 2020 and 2022 surveys. In total, 38,255 records from 28,000 participants exist across the two surveys. Categorical variables are recorded as integers, and continuous variables are standardized to have a mean of 0 and a variance of 1.
italy
A list of matrices with 9 variables. The 2020 and 2022 surveys contain 15,198 and 23,057 rows, respectively.
Unique ID of each entity
sex
year of birth
nationality (whether or not Italian)
administrative region of birth
education level
administrative region of current residence
net income
estimated price of residence
Given the true linkage structure and an estimate, this function calculates several metrics: true positives, true negatives, false positives, false negatives, the false positive rate, and the false negative rate.
We recommend passing the output of the links function from the blink package as an argument to the performance function.
performance(estimation, truth, N, return_matrix = FALSE)
| Argument | Description |
|---|---|
| estimation | The estimated linkage structure. |
| truth | The true linkage structure. |
| N | The total number of records. |
| return_matrix | If TRUE, the function also returns the matrix of the linkage structure. |
a list with performance metrics. If return_matrix is true, it also returns the matrix of linkage structure used for the evaluation.
# 1. True linkage structure
total_number_of_records <- 6
truth <- matrix(
  list(c(1), c(2, 4), c(3), c(2, 4), c(5, 6), c(5, 6))
)
# 2. Estimated linkage structure
estimation <- matrix(
  list(c(1), c(2, 4), c(3), c(2, 4), c(5), c(6))
)
# 3. Calculate performance metrics
performance(estimation, truth, total_number_of_records, FALSE)
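The pairwise counts behind such metrics can be worked out by hand for the linkage structures in the example above. The sketch below uses plain lists instead of list matrices, counts each unordered pair of records once, and uses the common definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP). These definitions and the helper `to_matrix` are assumptions for illustration; the package's performance function may define its rates differently.

```r
# Element i lists the records linked with record i (including i itself).
truth <- list(c(1), c(2, 4), c(3), c(2, 4), c(5, 6), c(5, 6))
estimation <- list(c(1), c(2, 4), c(3), c(2, 4), c(5), c(6))
N_records <- 6

# Turn a linkage structure into an N x N logical co-reference matrix.
to_matrix <- function(links, N) {
  m <- matrix(FALSE, N, N)
  for (i in seq_len(N)) m[i, links[[i]]] <- TRUE
  m
}
# Upper triangle: each unordered pair of records is counted once.
pairs <- upper.tri(matrix(NA, N_records, N_records))
t_link <- to_matrix(truth, N_records)[pairs]
e_link <- to_matrix(estimation, N_records)[pairs]

tp <- sum(e_link & t_link)
fp <- sum(e_link & !t_link)
fn <- sum(!e_link & t_link)
tn <- sum(!e_link & !t_link)
print(c(TP = tp, FP = fp, FN = fn, TN = tn))
print(c(FPR = fp / (fp + tn), FNR = fn / (fn + tp)))
```

Here the estimate recovers the link between records 2 and 4 but misses the link between records 5 and 6, so one of the two true links is a false negative.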
This function returns a posterior similarity matrix based on the MCMC samples of lambda, obtained from chomperMCMC.
psm_mcmc(samples)
| Argument | Description |
|---|---|
| samples | A matrix of MCMC samples with one row per record and one column per MCMC sample. |
a posterior similarity matrix of all possible pairs
# 1. Create a matrix with posterior samples of linkage structure
n_file1 <- 2
n_file2 <- 3
number_of_records <- n_file1 + n_file2
number_of_samples <- 10
lambda_matrix <- matrix(nrow = number_of_samples, ncol = number_of_records)
for (i in 1:number_of_samples) {
  lambda_matrix[i, ] <- sample(1:number_of_records, number_of_records, TRUE)
}
# 2. Calculate a posterior similarity matrix
psm_mcmc(lambda_matrix)
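A posterior similarity matrix is, in the usual definition, the share of posterior samples in which two records receive the same entity label. The base R sketch below computes that quantity directly, assuming rows are records and columns are MCMC samples as in the argument description. It is a plain illustration of the definition rather than the package's implementation, and `samples` is randomly generated.

```r
set.seed(4)
n_records <- 5
n_samples <- 200
# Hypothetical samples: rows are records, columns are MCMC draws of the label.
samples <- matrix(
  sample(1:n_records, n_records * n_samples, TRUE),
  nrow = n_records
)
# Entry (i, j) is the share of draws in which records i and j share a label.
psm <- matrix(0, n_records, n_records)
for (i in seq_len(n_records)) {
  for (j in seq_len(n_records)) {
    psm[i, j] <- mean(samples[i, ] == samples[j, ])
  }
}
print(round(psm, 2))
```

The result is symmetric with ones on the diagonal, since every record always shares a label with itself.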
This function returns a posterior similarity matrix based on the parameters of the approximated variational factor, nu, obtained from either chomperEVIL or chomperCAVI.
psm_vi(probs_field)
| Argument | Description |
|---|---|
| probs_field | A list of matrices with posterior probabilities. |
a posterior similarity matrix of all possible pairs
# 1. Create an approximate posterior distribution of linkage structure
n_file1 <- 2
n_file2 <- 3
nu1 <- matrix(runif(n_file1^2 * n_file2), nrow = n_file1)
for (i in 1:n_file1) {
  nu1[i, ] <- nu1[i, ] / sum(nu1[i, ])
}
nu2 <- matrix(runif(n_file1 * n_file2^2), nrow = n_file2)
for (i in 1:n_file2) {
  nu2[i, ] <- nu2[i, ] / sum(nu2[i, ])
}
# 2. Convert into the appropriate type to run a function
approximate_posterior <- list(nu1, nu2)
# 3. Calculate a posterior similarity matrix
psm_vi(approximate_posterior)
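One way to think about a similarity matrix derived from variational factors is as a co-assignment probability under independent categorical factors. The sketch below assumes that each row of a nu matrix is a probability vector over a shared set of latent entity labels and that records are independent under the approximation, so the probability that records i and j share a label is the inner product of their rows. Both assumptions are made for illustration; the exact computation in psm_vi is not specified here.

```r
set.seed(5)
# Hypothetical nu: each row is a probability vector over 5 latent labels.
normalize_rows <- function(m) m / rowSums(m)
nu1 <- normalize_rows(matrix(runif(2 * 5), nrow = 2)) # file 1, 2 records
nu2 <- normalize_rows(matrix(runif(3 * 5), nrow = 3)) # file 2, 3 records
nu_all <- rbind(nu1, nu2)
# Under independence, P(i and j share a label) = sum_l nu[i, l] * nu[j, l].
psm <- tcrossprod(nu_all)
diag(psm) <- 1 # a record always matches itself
print(round(psm, 3))
```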
A synthetic dataset that illustrates the functionality of the CHOMPER model.
A list of 2 matrices for entity resolution, with records generated from 750 unique entities.
Each file contains 8 variables, including a unique ID; the files contain 646 and 629 records, respectively.
Categorical variables are recorded as integers, and continuous variables are standardized to have a mean of 0 and a variance of 1.
simulation.high
A list of matrices with 8 variables. The two files contain 646 and 629 rows, respectively.
Unique ID of each entity
categorical variable with a single truth of 8 levels
categorical variable with a single truth of 8 levels
categorical variable with a single truth of 8 levels
categorical variable with 1-adjacent multiple truths of 8 levels
categorical variable with 1-adjacent multiple truths of 8 levels
continuous variable with multiple truths
continuous variable with multiple truths
Generated by the package authors using the generate_sample_data function in R/generate_sample_data.R.
A synthetic dataset that illustrates the functionality of the CHOMPER model.
A list of 2 matrices for entity resolution, with records generated from 750 unique entities.
Each file contains 8 variables, including a unique ID; the files contain 478 and 497 records, respectively.
Categorical variables are recorded as integers, and continuous variables are standardized to have a mean of 0 and a variance of 1.
simulation.low
A list of matrices with 8 variables. The two files contain 478 and 497 rows, respectively.
Unique ID of each entity
categorical variable with a single truth of 8 levels
categorical variable with a single truth of 8 levels
categorical variable with a single truth of 8 levels
categorical variable with 1-adjacent multiple truths of 8 levels
categorical variable with 1-adjacent multiple truths of 8 levels
continuous variable with multiple truths
continuous variable with multiple truths
Generated by the package authors using the generate_sample_data function in R/generate_sample_data.R.
A synthetic dataset that illustrates the functionality of the CHOMPER model.
A list of 2 matrices for entity resolution, with records generated from 750 unique entities.
Each file contains 8 variables, including a unique ID; the files contain 569 and 556 records, respectively.
Categorical variables are recorded as integers, and continuous variables are standardized to have a mean of 0 and a variance of 1.
simulation.medium
A list of matrices with 8 variables. The two files contain 569 and 556 rows, respectively.
Unique ID of each entity
categorical variable with a single truth of 8 levels
categorical variable with a single truth of 8 levels
categorical variable with a single truth of 8 levels
categorical variable with 1-adjacent multiple truths of 8 levels
categorical variable with 1-adjacent multiple truths of 8 levels
continuous variable with multiple truths
continuous variable with multiple truths
Generated by the package authors using the generate_sample_data function in R/generate_sample_data.R.