Structured Version Control via Combinatorial Data Structures

A sketch of a plan, to be presented at the 2023 ARIA workshop

Owen Lynch

Topos Institute

December 7, 2023

Overview

  1. What are combinatorial data structures?
  2. Current approach: acsets. Why are acsets not enough?
  3. Type theory for combinatorial data structures
  4. Version control

Inspirations

  • Davidad’s chit (especially references in README)
  • Problems encountered in the practice of AlgebraicJulia and Semagrams

Combinatorial data structures

Definition 1 A combinatorial data structure is a data structure where parts of the data structure refer to other parts of the data structure, possibly cyclically.

Example: Wikis

Example: SQL

Example: Doubly Linked List

Example: Abstract binding tree

b

Example: Nested wiring diagram

Why is it important to formalize this?

Combinatorial data structures…

  • are everywhere, especially data structures for scientific models
  • are not adequately expressed by ADTs, i.e. sum+product types
  • are tricky to serialize correctly
  • come with a natural notion of “symmetries”; relabeling the parts shouldn’t change the interpretation
  • come with a natural notion of incremental changes: adding/removing parts

Existing formalisms (an incomplete list):

Current approach in AlgebraicJulia: acsets

From an implementation standpoint, acsets are in-memory databases.

Category theoretically, acsets are “functors with some extra stuff” from finitely presented categories to \mathsf{Set}

Advantage: higher operations like limits/colimits/rewriting/subobject classifier implemented generically.

Where does our current approach fall short?

  • No top-level attributes (e.g., name of the scientific model)

  • Higher-level structures like morphisms between acsets,

    or diagrams of such morphisms,

    are acsets, but have a natural hierarchical structure not expressed by a flat acset schema.

  • Recursive structures not supported

  • Things like “a vector of edge ids or strings” not expressible

Solution: type theory

Add two things to your favorite type theory that extends System F:

  1. A kind of finite types
  2. Existential quantification over finite types, e.g.
Graph = finexists (Vertex, Edge). {
    src : Edge -> Vertex,
    tgt : Edge -> Vertex
}

Connection to module systems/1ML

In (Rossberg 2018), module systems are embedded via existential types in System Fω.

The trick for 1ML: make a distinction between “large” and “small” types

  • small type: products and sums of elementary types
  • large type: the type of modules of a given signature

Then make sure large types only exist at compile time.

The trick for us: make a distinction between “small” and “finite” types, moreover for function types:

  • small → small = large
  • finite → small = small

Then all small types are countable/serializable, and the type of modules which only have finite associated types is small instead of large!

Why should this work?

  1. Existentially quantifying over types is part of System F, and implemented in existing non-dependently typed languages,
    (forall a. (t a -> r)) ≅ ((exists a. t a) -> r)
  2. (With work) it can be composed with other features in “the lambda cube”, like recursive data structures or subtyping
  3. Membership checking for finite sets can be done dynamically, so a type definition can turn into a dynamically checked data structure in an existing language

Useful extensions

Path dependent types

record GraphHom {
    dom: Graph
    codom: Graph

    apply(e: dom.Edge): codom.Edge
    apply(v: dom.Vertex): -> codom.Vertex
}

Refinement types

record GraphCospan {
    a: Graph
    b: Graph
    x: Graph
    f: GraphHom {
        dom = a
        codom = x
    }
    g: GraphHom {
        dom = b
        codom = x
    }
}

Version control

Definition 2 A version control system is a category, where we interpret the objects as states and the morphisms as patches, along with some form of merge operation.

Found, with variations in: (Mimram and Di Giusto 2013), (Edwards and Petricek 2021)

Let \mathsf{VCS} be some 2-category of version control systems (depending on the laws we choose for merge).

Formal theory should encompass both “git-like” and “live-editing” systems

Trivial version control

Example 1 For any set D, consider the codiscrete category on D + \bot, with 3-way merge

We call this the trivial VCS system on D.

Version control for finite sets

Example 2 Let \mathsf{UUIDPatch} be the category where

  • objects are D \subset A \subset_0 \mathsf{UUID}. (A = all, D = dead)
  • morphisms are

3-way merge is pushout, which is just union.

Structured Version Control

Big idea: build denotational semantics for our type theory in \mathsf{VCS}.

Interpret a type as a version control system, i.e. an object in the 2-category \mathsf{VCS}, given by structural induction on the type definition.

  • Interpret primitives like Int and String as trivial version control systems
  • Interpret type formers like product and via categorical operations on \mathsf{VCS}
  • Interpret a type with free fintype variables X_1,\ldots,X_n as a functor \mathsf{UUIDPatch}^n \to \mathsf{VCS}.
  • Existential types are given by Grothendieck construction.

What are morphisms?

  • Functors between version control systems can be seen as incremental functions
    • incremental computing
    • functional reactive programming
  • Delta lenses are another choice
    • Useful for editing views or substructure
  • Porque no los dos: double category?

Version controlling schemas too?

  • Serialize types themselves as ADT of typedef
  • Give denotational semantics as the category of types and migrations (what is a migration?)
  • Existential quantification over general types = version-control data along with its schema, again via Grothendieck construction

Work for ARIA

  • Work out the category theory/type theory for a basic variant.
  • Implement this like protobuffers. From a custom type-definition language, code-generate types in existing PLs (Rust, Python, Julia, Scala, Typescript, etc.) along with bindings to libraries for serialization, networking, and version control written in Rust.
  • Think about other type theory features like recursive types, subtyping.
  • Think about diffs for the type definitions, and how to migrate data along them.
  • Extend to not just finite types, but finitely-presented types, i.e. generators and relations in a Lawvere theory. Then “finfunctions” are serialized as their definition on generators.

Summary

  • Combinatorial data structures should be handled with type theory via finite types+existential types
  • That type theory should be given denotational semantics in version control systems
  • There is a lot of research to do, but we are on solid ground with existential types

Bibliography

Amin, Nada, Karl Samuel Grütter, Martin Odersky, Tiark Rompf, and Sandro Stucki, eds. 2016. “The Essence of Dependent Object Types.” A List of Successes That Can Change the World: Essays Dedicated to Philip Wadler on the Occasion of His 60th Birthday, Lecture Notes in Computer Science, no. 9600. https://doi.org/10.1007/978-3-319-30936-1_14.
Contributors, Haskell Wiki. 2023. “Existential Type - HaskellWiki.” https://wiki.haskell.org/Existential_type.
Edwards, Jonathan, and Tomas Petricek. 2021. “Typed Image-based Programming with Structure Editing.” arXiv. https://arxiv.org/abs/2110.08993.
Fiore, M., G. Plotkin, and D. Turi. 1999. “Abstract Syntax and Variable Binding.” In Proceedings. 14th Symposium on Logic in Computer Science (Cat. No. PR00158), 193–202. https://doi.org/10.1109/LICS.1999.782615.
Huey, Jack. 2022. “Generic Associated Types to Be Stable in Rust 1.65.” Rust Blog. https://blog.rust-lang.org/2022/10/28/gats-stabilization.html.
Kock, Joachim. 2016. “Notes on Polynomial Functors.”
Leroy, Xavier. 2000. “A Modular Module System.” Journal of Functional Programming 10 (3): 269–303. https://doi.org/10.1017/S0956796800003683.
Mimram, Samuel, and Cinzia Di Giusto. 2013. “A Categorical Theory of Patches.” Electronic Notes in Theoretical Computer Science 298 (November): 283–307. https://doi.org/10.1016/j.entcs.2013.09.018.
Perry, Nigel. 1991. “The Implementation of Practical Functional Programming Languages.” PhD thesis, University of London.
Rossberg, Andreas. 2018. 1ML Core and Modules United.” Journal of Functional Programming 28 (January): e22. https://doi.org/10.1017/S0956796818000205.
Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217 (August): 31–51. https://doi.org/10.1016/j.ic.2012.05.001.
Wilf, Herbert S. 2005. Generatingfunctionology. 3rd edition. Wellesley, Mass: AK Peters/CRC.