# Data Quality Process

**Data Glitches Examples:**

- missing data
- inconsistent data
- anomalies and outliers
- duplicates
- undocumented data

**Factors that complicate detection of glitches:**

- relevance
    - severity varies depending on the domain
    - focus on the most relevant glitches

- ambiguity
    - the boundary between good and bad data is unclear
    - depends on the company

- complex dependencies
    - one glitch can mask another

- dynamics
    - not only do new glitches appear, the types of glitches also change over time

**Classification of glitch complexes**

- multi-type glitch
    - a single value has multiple glitch types associated with it

- concomitant glitches
    - two or more values in the same record have glitches

- multi-occurrent glitches
    - the same glitch type occurs in different records

## Detecting Glitches

- Missing data (values, records, attributes)
    - missing values: easy; missing records and attributes: hard

- Inconsistent or erroneous data (values, references)
    - detected with (soft) constraints
    - typical errors: spelling mistakes

- Anomalies and outliers (values, records)
    - statistical methods, ML-based methods

- Semantic duplicate records
    - detected by matching

- Undocumented data

### Missing Data Mechanisms

Categories:

**MCAR**: missing completely at random: no correlation between missingness and any values, observed or missing.
- The missing entries form a random subset of the data.

**MAR**: missing at random: missingness correlates with the observed data, but not with the missing values themselves.
- Example: only men respond to the weight question on a questionnaire (weight is only filled in for men).

**MNAR**: missing not at random: missingness depends on the (unobserved) missing values themselves.
- Example: not everyone responds to a questionnaire, and the non-responders are a specific type of people.

Deleting rows with missing values requires MCAR; if missingness is not completely at random, deletion biases the model.

Imputation replaces missing data with substituted values. Most imputation methods assume the data is not MNAR.
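As a concrete illustration, a minimal mean-imputation sketch (valid under the MCAR/MAR assumptions above; the data is hypothetical):

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values.
    Reasonable under MCAR/MAR; under MNAR this distorts the distribution."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

weights = [70.0, None, 82.0, None, 76.0]
print(impute_mean(weights))  # [70.0, 76.0, 82.0, 76.0, 76.0]
```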

### Quantifying the number of glitches

Counting the number of glitches:

- Per value
    - glitch signature: a boolean vector indicating which glitch types apply
    - determine a weight per glitch type
    - glitch score of a value: weighted sum of its signature

- Global
    - sum of all glitch scores
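The per-value and global scores can be sketched as follows (the glitch types and weights here are hypothetical):

```python
# Glitch scoring sketch: signatures are boolean vectors over glitch types,
# weights reflect how severe each type is considered (assumed values).
GLITCH_TYPES = ["missing", "inconsistent", "outlier", "duplicate"]
WEIGHTS = [1.0, 2.0, 1.5, 0.5]

def glitch_score(signature):
    """Weighted sum of a value's glitch signature."""
    return sum(w for w, flagged in zip(WEIGHTS, signature) if flagged)

def global_score(signatures):
    """Global glitch score: sum of all per-value scores."""
    return sum(glitch_score(s) for s in signatures)

sigs = [
    [True, False, False, False],   # a missing value
    [False, True, True, False],    # inconsistent and an outlier
]
print(glitch_score(sigs[1]))  # 3.5
print(global_score(sigs))     # 4.5
```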

## Constraints

🚧 **Constraint**

Any rule that should hold in your data.

Data is a simplified abstraction of the real world, so any rule which holds in the real world could and should hold for your data.

- SSNs are unique per person

Also called integrity constraints or semantic constraints.

A soft constraint is a rule that almost always holds, but there can be exceptions.

- name of employee is unique within org
- person enrolled in course implies person is student.

### Functional dependencies

Notation: X → Y

Attributes X determine exactly one value for attributes Y: any two records that agree on X must also agree on Y.

Example: for Students(snum, name) with the FD snum → name, think of name as a function of snum:

`f(11) = "mark"`

and f(11) cannot be any other value.
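An FD like snum → name can be checked mechanically by grouping records on the left-hand side; a minimal sketch with a hypothetical Students table:

```python
# Check a functional dependency lhs -> rhs by grouping on the lhs.
def fd_violations(rows, lhs, rhs):
    """Return LHS values that map to more than one RHS value."""
    seen = {}
    for row in rows:
        seen.setdefault(row[lhs], set()).add(row[rhs])
    return {k: v for k, v in seen.items() if len(v) > 1}

students = [
    {"snum": 11, "name": "mark"},
    {"snum": 12, "name": "anna"},
    {"snum": 11, "name": "marc"},  # violates snum -> name
]
print(fd_violations(students, "snum", "name"))  # snum 11 maps to two names
```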

**Referential Integrity Constraints:**

Example: *"every enrollment is done by a student"*

Formal notation in predicate logic (one common rendering): ∀e ∈ Enrollments ∃s ∈ Students: e.snum = s.snum

**Consistency**

Database D is consistent iff it satisfies all constraints.

To quantify:

- count the number of violations of each constraint
- Sum over all constraints (possibly weighted)
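A minimal sketch of this quantification, with hypothetical per-record constraints and weights:

```python
# Quantify inconsistency: count violations per constraint, then take a
# weighted sum over all constraints.
def count_violations(db, constraint):
    """Number of records in db that violate a per-record constraint."""
    return sum(0 if constraint(rec) else 1 for rec in db)

def inconsistency_score(db, constraints):
    """Weighted sum of violation counts over all constraints."""
    return sum(w * count_violations(db, c) for c, w in constraints)

people = [{"age": 34}, {"age": -2}, {"age": 310}]
constraints = [
    (lambda r: r["age"] >= 0, 1.0),    # hard constraint, weight 1.0
    (lambda r: r["age"] <= 120, 0.5),  # soft constraint, weight 0.5
]
print(inconsistency_score(people, constraints))  # 1.0 + 0.5 = 1.5
```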

### Conditional Dependencies

⏩ **Pattern Tableau**

A row template; if a record matches it, a certain functional dependency must hold for that record.

⚙ **Conditional Functional Dependency**

Defined by the pair (FD, pattern tableau): the FD only has to hold on the rows that match the tableau.

Example: with the tableau T1 below, an FD such as Zip → Street only needs to hold for records with Country = UK (`_` is a wildcard matching any value):

T1:

Country | Zip | Street | City |
---|---|---|---|
UK | _ | _ | _ |
➕ **Conditional Inclusion Dependencies**

Defined by the pair (inclusion dependency, pattern tableau).

There are algorithms that can automatically discover CFDs and CINDs from a set of tables.

- Some algorithms can even discover dependencies that do not hold perfectly in the data but are valid up to a certain threshold of violations.
- Trusting these discovered rules as true constraints, we can use them to flag possibly violating records and values.

## Finding Duplicates

Components of entity resolution:

- entity reference extraction
- entity reference preparation
- entity reference resolution (comparison of pairs: *same real-world entity or not?*)
- entity identity information management
- entity analytics

### Entity Reference Preparation

- Data profiling
- Parsing
- Standardization
- Encoding
- Conversion (datatypes)
- Enhancement: adding info

### Data Profiling

Often used: pattern frequency analysis

- replace digits with 9, capital letters with A, and lowercase letters with a
- then count the occurrences of all resulting patterns

⇒ finds values of frequently occurring types: phone numbers, email addresses, credit card numbers.
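A minimal sketch of pattern frequency analysis (the example values are hypothetical):

```python
import re
from collections import Counter

# Map every value to a pattern (digits -> 9, capitals -> A, lowercase -> a)
# and count how often each pattern occurs.
def to_pattern(value):
    value = re.sub(r"[0-9]", "9", value)
    value = re.sub(r"[A-Z]", "A", value)
    return re.sub(r"[a-z]", "a", value)

values = ["0612345678", "0698765432", "john@doe.com", "0487-11-22"]
patterns = Counter(to_pattern(v) for v in values)
print(patterns.most_common(1))  # [('9999999999', 2)] -> looks like phone numbers
```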

### Entity Reference Resolution

Exact matching vs. probabilistic or fuzzy matching

Fuzzy matching with similarity functions:

- numerical
- syntactic
- semantic
- phonetic
- hybrid

### Jaccard Coefficient

Fraction of overlap between P and Q: |P ∩ Q| / |P ∪ Q|.

Example (character sets):

P = {"J","o","h","n","D","o","e"} → {"e","h","n","o","D","J"}

Q = {"S","e","a","n","D","o","e"} → {"a","e","n","o","D","S"}

Jaccard(P,Q) = 4/8 = 0.5

For binary vectors (e.g. comparing images), count the positions where both vectors are 1 (tp) and where only one is 1 (fp, fn):

A | B |
---|---|
0 | 1 |
1 | 0 |
1 | 1 |

jaccard = tp / (tp + fp + fn) = 1 / (1 + 1 + 1) = 1/3

A | B |
---|---|
1 | 1 |
0 | 0 |
1 | 1 |

jaccard = 2 / (2 + 0 + 0) = 1

Insensitive to word order.
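A minimal sketch of the character-set Jaccard coefficient from the example above:

```python
# Jaccard coefficient on character sets: |P ∩ Q| / |P ∪ Q|.
def jaccard(a, b):
    p, q = set(a), set(b)
    return len(p & q) / len(p | q)

print(jaccard("JohnDoe", "SeanDoe"))  # 4 shared / 8 total = 0.5
```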

### Levenshtein Distance

L(A,B): the minimum number of edit operations (insert, delete, substitute) needed to turn A into B.

Example: JIM → JAMES needs 3 operations:

substitute I → A, insert E, insert S

So, normalized Levenshtein similarity = 1 − L/max(|A|, |B|) = 1 − 3/5 = 0.4
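A standard dynamic-programming implementation, sketched minimally:

```python
# Levenshtein distance via dynamic programming (one row at a time),
# plus the normalized similarity 1 - distance / max length.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a, b):
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("JIM", "JAMES"))             # 3
print(normalized_levenshtein("JIM", "JAMES"))  # 0.4
```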

### Jaro-Winkler Distance

Jaro(A,B) = W1·C/|A| + W2·C/|B| + W3·(C − T)/C, with W1 + W2 + W3 = 1 (typically 1/3 each)

C: number of common characters (matched within a sliding window)

T: number of transpositions (LE → EL)

Jaro-Winkler is a modification that rewards a shared prefix:

- N: number of the first four characters that agree in position

JW = Jaro + 0.1 · N · (1 − Jaro)

Example: 0.897 + 0.1·4·(1.0 − 0.897) = 0.938
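A sketch of Jaro and the Winkler prefix boost (equal weights 1/3, scaling factor p = 0.1; the MARTHA/MARHTA pair is a classic test case, not from these notes):

```python
# Jaro similarity with equal weights, plus the Winkler prefix boost.
def jaro(a, b):
    if a == b:
        return 1.0
    window = max(len(a), len(b)) // 2 - 1
    a_match = [False] * len(a)
    b_match = [False] * len(b)
    c = 0
    for i, ca in enumerate(a):  # find common characters within the window
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_match[j] and b[j] == ca:
                a_match[i] = b_match[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    # transpositions: matched characters that are out of order, halved
    a_chars = [ch for ch, m in zip(a, a_match) if m]
    b_chars = [ch for ch, m in zip(b, b_match) if m]
    t = sum(x != y for x, y in zip(a_chars, b_chars)) / 2
    return (c / len(a) + c / len(b) + (c - t) / c) / 3

def jaro_winkler(a, b, p=0.1):
    j = jaro(a, b)
    n = 0  # length of the common prefix, capped at 4
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        n += 1
    return j + p * n * (1 - j)

print(round(jaro("MARTHA", "MARHTA"), 3))          # 0.944
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```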

**Other distance metrics:**

- Smith-Waterman distance
    - based on local sequence alignment (related to the longest common subsequence)

- Q-gram distance (aka n-gram)
    - create all substrings of length Q
    - "John Doe" → ["Joh", "ohn", "hn ", "n D", " Do", "Doe"]
    - then compare the gram sets with Jaccard or Jaro-Winkler
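A minimal sketch of Q-gram decomposition (Q = 3) combined with Jaccard on the gram sets:

```python
# Decompose strings into Q-grams and compare the gram sets with Jaccard.
def qgrams(s, q=3):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_jaccard(a, b, q=3):
    p, r = qgrams(a, q), qgrams(b, q)
    return len(p & r) / len(p | r)

print(sorted(qgrams("John Doe")))  # [' Do', 'Doe', 'Joh', 'hn ', 'n D', 'ohn']
```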

## Converting scores into probabilities

Common is to normalize, which turns the score into a value between 0 and 1.

Better is to use a probability: a `sim2p-mapping` maps similarity scores to match probabilities.

## Efficiency of entity matching

Naive entity matching has `O(n^2)` complexity because it compares every record with every other record.

- Improvement: sorted neighbourhood method (SNM)
    - avoid comparing pairs that are unlikely to match
    - sort on a sorting key, then match only within a sliding window
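A minimal sketch of the sorted neighbourhood idea (window size and sorting key are hypothetical choices):

```python
# Sorted neighbourhood method: sort records on a key and only compare
# records that fall within a sliding window of size w.
def snm_pairs(records, key, w=3):
    """Yield candidate pairs from a sliding window over the sorted records."""
    ordered = sorted(records, key=key)
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + w, len(ordered))):
            yield ordered[i], ordered[j]

names = ["Doe, John", "Doe, Jon", "Smith, Anna", "Smyth, Anna"]
pairs = list(snm_pairs(names, key=lambda s: s.replace(",", "").lower()))
print(len(pairs))  # 5 candidate pairs instead of all 6
```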

The result of matching a set of records S is a **partitioning**:

- a partition of S is a set of non-empty subsets such that
    - the union of the subsets is S
    - the intersection of each pair of subsets is empty
- the subsets are often called clusters

## Measuring and analyzing matching results

Measuring:

- T is the partition obtained by correct linking (known for a sample)
- X is the partition obtained by our matching algorithm
- V is the partition formed by all non-empty intersections between the true clusters (T) and the process clusters (X)

**Talburt-Wang Index**

TWI(T, X) = √(|T| · |X|) / |V|

- E: set of record pairs that share a cluster in T
- L: set of record pairs that share a cluster in X
- ~E: pairs in S not in E
- ~L: pairs in S not in L

False Positive Rate: |L ∩ ~E| / |~E|

False Negative Rate: |~L ∩ E| / |E|
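These pair-based quantities can be sketched on a toy example (the clusters below are hypothetical):

```python
from itertools import combinations
from math import sqrt

# Pair-based evaluation of a matching result against known truth.
def pairs(partition):
    """Set of record pairs that share a cluster in the partition."""
    return {frozenset(p) for cluster in partition
            for p in combinations(sorted(cluster), 2)}

def overlap_partition(t, x):
    """All non-empty intersections between clusters of T and of X."""
    return [c & d for c in t for d in x if c & d]

truth = [{1, 2}, {3}, {4, 5}]   # T: correct linking
found = [{1, 2}, {3, 4}, {5}]   # X: our matching result

v = overlap_partition(truth, found)
twi = sqrt(len(truth) * len(found)) / len(v)  # Talburt-Wang index
e, l = pairs(truth), pairs(found)
print(round(twi, 2))  # sqrt(3*3)/4 = 0.75
print(l - e)          # false-positive pair: {3, 4}
print(e - l)          # false-negative pair: {4, 5}
```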

### Entity Information Management

We strive for:

- each real-world entity having exactly one occurrence in the data
- distinct real-world entities having distinct occurrences

Given a cluster of duplicate records:

- Survivor-record EIS: keep the most complete record
- Attribute-based EIS: create a surrogate record with the most representative value for each attribute
- Record-based EIS: keep all records

## Anomaly detection

Define or obtain a model of normal behavior:

- measure the likelihood of a record or value being generated by this model
- outliers are rare or inconsistent with the rest of the data

Categorization of approaches:

- supervised
- semi-supervised
- unsupervised

**z-scores**

- z = (x − μ) / σ: the number of standard deviations a value lies from the mean; a large |z| suggests an outlier
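A minimal z-score outlier sketch (the data and threshold are hypothetical; 2.5 rather than the common 3 is used because one extreme value in a small sample inflates the standard deviation):

```python
from statistics import mean, stdev

# Flag values whose z-score exceeds a threshold.
def zscore_outliers(values, threshold=3.0):
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs((x - mu) / sigma) > threshold]

data = [10, 11, 9, 10, 12, 11, 10, 9, 10, 95]
print(zscore_outliers(data, threshold=2.5))  # [95]
```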

**k-nearest neighbours (k-NN)**

- based on a distance measure
- the larger the distance to the k-th nearest neighbour, the lower the local density and the more likely the point is an outlier
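A minimal one-dimensional k-NN outlier sketch (the data and k are hypothetical):

```python
# k-NN outlier scoring: the score is the distance to the k-th nearest
# neighbour; a large score means low local density.
def knn_score(i, data, k=2):
    dists = sorted(abs(data[i] - data[j]) for j in range(len(data)) if j != i)
    return dists[k - 1]

data = [1.0, 1.1, 0.9, 1.2, 8.0]
scores = [knn_score(i, data) for i in range(len(data))]
print(data[scores.index(max(scores))])  # 8.0 is the most outlying point
```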

**autoencoders**

- a neural network that learns to reconstruct its input
- through a middle bottleneck layer it has to compress the input, so it can only represent the most salient features

Train the model on normal records of the table; any value whose reconstruction differs substantially from the input is suspicious.