# Theory

## Background

`visions`

primary goal is to develop a robust mechanism for defining and using
*semantic* data types. These represent conceptual or logical notions of meaning
and is distinct from the machine representation of data on disk. For example,
we might say sequence \(S\) is of semantic type `probability`

when
\(S \in [0, 1]\) or `date`

when each element is stored on the computer as a
`datetime.datetime`

object with all time components identically `0`

.

Most data analytics libraries users are already familiar with have their own data types builtin, usually tightly coupled with the machine representation of the data used by the library. We wanted to give users as much flexibility to design and work in the representations specific to their problems and domain and in order to do that we had to give users the freedom to make their own types. This left a few key challenges to solve.

## Open Challenges

Finding a minimal representation of semantic types

Detecting the semantic type of a sequence

`visions`

is interested in the semantic meaning of data and therefore should be able to infer the “intended” type of a sequence regardless of it’s machine representation (e.g. the string`'1.0'`

should be recognized as the number`1`

).

We want to do all of this while keeping types easy to use, performant, and deterministic.

Since users are free to imagine any possible type, different problem domains might require contradictory notions of the same type. Where a data scientist might be interested in probability as a sequence of values bounded such that \(x \in [0, 1]\), a business analyst might instead be interested in a definition where \(x \in [0, 100]\).

## The `visions`

Solution

We solve all of these problems by introducing three conceptual ideas, `visions`

types, typesets, and relations.

A *type* at minimum requires only a single validation function which takes as its
argument a sequence and tests whether the input is of its type or not, returning
boolean. It optionally can contain relationships which we will describe in a moment.

A *typeset* is a collection of types. Behind the scenes visions uses the relationships
defined on each type in the typeset to construct a relationship graph. When properly
constructed this graph can be used to deterministically detect the current semantic types
of a sequence (or dataset) or to infer a more representative type for the data.

A *relation* object is responsible for mapping sequences between `visions`

types.
Each relation is composed of two functions, the first validates whether a
mapping can be performed without loss of precision (i.e. ‘1.0’ can be cast to
integer while 1.1 cannot), the second is a surjective function responsible for
actually performing the mapping.

In practice, we distinguish `relations`

into two categories as well, the first
called `Identity Relations`

require no transformation to the underlying machine
type of the data (float(1.1) -> probability(1.1) where the second, `Inference Relations`

,
have to coerce the sequence between machine types (‘1.1’ -> 1.1).

## Why it works

We will be using the language of trees and sets to understand how this all comes together and start by defining a semantic type as the set of all sequences with some consistent semantic meaning. A typeset is then a directed rooted tree whose nodes are types with the root defined as the generic type associated with the universal set.

Relations are directed edges between two nodes (types) in a relation graph (typeset).
They are also defined on types such that a relation between types `A`

and `B`

,
`Relation(A -> B)`

, would be defined as an attribute of B. In order words, they
are mappings *to* a type, not *from*.

Following this, we can construct a relation graph from *any* collection of provided
types and associated relations. We define our dual objective of type detection and
type inference as the task of determining the most unique possible type
specification available to the typeset either with coercion of machine types (inference),
or without (detection).

Both tasks are akin to simple traversal of the relation graph. In order to guarantee
all sequences map to only a single type we require the graph be *decidable*.
This is equivalent to saying sets of any two pairs of data types with the same
parent must be disjoint, except for the missing value indicator. Additionally,
relations are not permitted to introduce cycles into the tree.