Categorical Sum and Product types using struct tagged union in C

Implementation details

alex_ber
9 min readMar 4, 2024

Introduction

Let’s first of all talk about struct and tagged untion in C.

Arrays allow to define type of variables that can hold several data items of the same kind. Similarly structure (or struct)is another user defined data type available in C that allows to combine data items of different kinds.

Let’s see some example:

https://www.guru99.com/cpp-structures.html (slightly modified).

The easiest way to model struct was class with public data-members only. The main drawback of this approach is luck of the encapsulation.

In C aunion s a special data type available that allows to store different data types in the same memory location. You can define a union with many members, but only one member can contain a value at any given time. Unions provide an efficient way of using the same memory location for multiple-purpose.

For example:

https://medium.com/@almtechhub/c-c-self-referential-recursive-unions-22b334493eaa

In C a tagged union or discriminated union is union that have associated with them a piece of data that tracks which of the potential union properties is currently set.

Quote from Wikipedia:

In computer science, a tagged union, also called… discriminated union, disjoint union, sum type or coproduct, is a data structure used to hold a value that could take on several different, but fixed, types. Only one of the types can be in use at any one time, and a tag field explicitly indicates which one is in use. It can be thought of as a type that has several “cases”, each of which should be handled correctly when that type is manipulated. This is critical in defining recursive datatypes, in which some component of a value may have the same type as the value itself, for example in defining a type for representing trees, where it is necessary to distinguish multi-node subtrees and leafs. Like ordinary unions, tagged unions can save storage by overlapping storage areas for each type, since only one is in use at a time.

https://en.wikipedia.org/wiki/Tagged_union

For example:

In this example we have a simple union that holds a char and an int. This union is wrapped in a struct that has a char variable and an instance of the union. We use a single letter char to represent each type, either i or c. In main we initialize the tagged union and then loop. In each iteration we check which property is set, log the value, flip the tag and set the associated property. The output from running it is:

21
A
1
A
3

https://medium.com/@almtechhub/c-c-tagged-discriminated-union-ecd5907610bf

Another example (self-referential tagged union or recursive datatype):

https://medium.com/@almtechhub/c-c-tagged-discriminated-union-ecd5907610bf

Note: enum that is in implementation above is essentially int. It can be replace to anything we want, it can be enum class, char,int, char*, etc.

Also, this is not the only possible implementation of tagged union. For example, it can be alternatively implemented with class and char for tag. Another possible implementation is to use hasthtable — key in hasthtable will be tag and value will be struct.

The categorical productis, informally, of two objects A and B is the most common object in this category, for which there are projections on A and B. In many categories (sets, groups, graphs, programming types…) the product of the objects is their cartesian product. It is C struct or C++ std:tuple.

Let’s look on another example. All quotes in this subsections are taken from https://www.embedded.com/discriminated-unions/

Let’s try to implement following inheritance hierarchy in C using tagged union.

https://www.embedded.com/virtual-functions-in-c-2/

Suppose you have an application that employs two-dimensional geometric shapes, such as circles, rectangles, and triangles. At a minimum, each shape object contains some linear or angular distances sufficient to characterize the physical extent of the shape. For example, a circle has a radius, a rectangle has a height and a width, and a triangle has two sides and an angle. The shapes may have common attributes as well, such a position (planar coordinates), or outline and fill colors.

I want to re-iterate. We know all derived classes of shape in advance, the author of shape class knows every derived class that will be used in the application.

Possible C implementation for a shape is as following:

The value of the shape’s kind member indicates which union member is currently in use. The shape_kind type enumerates the possible values:

For example, setting a shape’s kind member to sk_rectangle indicates that it's now OK to access (write to or read from) the union's rectangle member and its height and width members.

Let’s continue:

You can write a small assortment of initialization functions to properly initialize each kind of shape. For example:

initializes a shape as a rectangle with a particular height and width. You can write similar initialization functions for circle and triangles.

How we can implement C function that will calculate area of any shape (all shapes here are Jordan measurable)? One possible implementation is as following:

Virtual functions provide an alternative to discriminated unions that are safe, efficient, and often more maintainable.

This way more powerful. The writer of the Shape can be unaware of all of it’s descenders and yet in every place where Shape reference appear it can by substituted to derived class and virtual functions machinery (typically provided by C++ compiler, but you can write you own in C if you wish) will ensure that correct method will be called in inheritance graph.

Quote from Wikipedia:

In a typical class hierarchy in object-oriented programming, each subclass can encapsulate data unique to that class. The metadata used to perform virtual method lookup identifies the subclass and so effectively acts as a tag identifying the particular data stored by the instance.

Nevertheless, a class hierarchy involves true subtype polymorphism; it can be extended by creating further subclasses of the same base type, which could not be handled correctly under a tag/dispatch model. Hence, it is usually not possible to do case analysis or dispatch on a subobject’s ‘tag’ as one would for tagged unions. Some languages such as Scala allow base classes to be “sealed”, and unify tagged unions with sealed base classes.

https://en.wikipedia.org/wiki/Tagged_union#Class_hierarchies_as_tagged_unions

Also,

One of the problems with discriminated unions is that it requires a lot of discipline to ensure that you don’t access a union member unless it’s the currently active member as indicated by the discriminator. In most cases, this means you should be scrupulous about wrapping the accesses to union members inside if- or switch-statements that test the discriminator.

Unfortunately, nothing in C prevents accidents such as:

This code will compile and then produce unpredictable, possibly erroneous, run-time results.

Another quote from Wikipedia:

The main disadvantage of tagged unions is that the tag occupies space. Since there are usually a small number of alternatives, the tag can often be squeezed into 2 or 3 bits wherever space can be found, but sometimes even these bits are not available. In this case, a helpful alternative may be …encoded tags, where the tag value is dynamically computed from the contents of the union field. Common examples of this are the use of reserved values, where, for example, a function returning a positive number may return -1 to indicate failure, and sentinel values, most often used in tagged pointers.

https://en.wikipedia.org/wiki/Tagged_union#Advantages_and_disadvantages

Let’s look on another example of tagged union, but this time in ML. For example, how we can create binary tree of integers?

This is a tagged union with two cases: one, the leaf, is used to terminate a path of the tree, and functions much like a null value would in imperative languages. The other branch holds a node, which contains an integer and a left and right subtree. Leaf and Node are the [type] constructors, which enable us to actually produce a particular tree, such as:

which corresponds to this tree:

Now we can easily write a typesafe function that, say, counts the number of nodes in the tree:

https://en.wikipedia.org/wiki/Tagged_union#Examples

Using tagged union and switch is better from performance point-of-view than polymorphism

Horrible performance

Enum as tagged Union

C struct can be viewed as categorical product type.

C Enum type can be viewed as degenerate case of categorical sum type— discriminated union of unit types. In more simple words:

Quote from Wikipedia:

An enumerated type [C enum] can be seen as a degenerate case: a tagged union of unit types. It corresponds to a set of nullary constructors and may be implemented as a simple tag variable, since it holds no additional data besides the value of the tag.

https://en.wikipedia.org/wiki/Tagged_union#Description

If we have language such as Haskell or ML, where tagged union is first-class citizen, we could implement enumusing it.

Let’s look on 2 extreme cases of such Cenum implementation first, one constant C enum and two constant C enum.

  1. C enum with one constant is actually stateless Singleton.

Something like:

In Haskell it can be expressed as

This expression is also equivalent to singleton type().

2. C enum with two constants

Something like:

In Haskell it can be expressed as

This is essentially primitive C bool.

Tagged union is rarely used in practice in C

Even storing char* or std::string in C union is problematic.

Here are some reasons why the use of a union as a sum of types in C is limited:

  • Lack of Type Safety: One of the biggest limitations of unions is the lack of type safety. Since all members of a union share the same memory space, accessing one member as another can lead to undefined behavior or data corruption. This lack of type safety makes it error-prone, especially in larger codebases where it can be challenging to track down such issues. For example, storing std::string directly in a union compromises type safety. Unions in C don't enforce type checking when accessing their members, which means you could inadvertently access the wrong member of the union as a std::string, leading to undefined behavior or data corruption.
  • No Built-in Discriminant: Unlike tagged unions in languages like Rust or Haskell, C unions do not have a built-in mechanism to determine which member is currently valid (also known as a discriminant). This means that you need to implement your own way to keep track of which member is being used, which can add complexity to the code and increase the chances of errors. For example, std::string has a non-trivial constructor and destructor, which are automatically invoked when objects are created and destroyed. Unions have restrictions on what types they can contain due to their memory-sharing nature. Since std::string has non-trivial special member functions, including a constructor, destructor, copy constructor, and assignment operator, using it directly in a union can lead to issues with object lifetime management. When you store a pointer in a union, for example, char*, you’re responsible for managing the memory it points to. If you dynamically allocate memory using malloc or a similar function, you need to ensure proper memory deallocation to avoid memory leaks.
  • Limited Expressiveness: Unions in C can only represent a fixed set of types defined at compile-time. This limits their expressiveness compared to languages with more advanced sum types, where you can define unions of arbitrary types, including user-defined types and recursive types.
  • Pointer Semantics: When working with unions in C, it’s common to use pointers to access members safely. However, managing pointers correctly adds overhead and increases the complexity of the code. It also introduces the potential for pointer-related bugs such as null pointer dereferences or memory leaks.
  • No Pattern Matching: Pattern matching is a powerful feature commonly found in languages with sum types, allowing for concise and expressive code when working with algebraic data types. In C, you don’t have built-in support for pattern matching, which makes working with unions less ergonomic and often leads to more verbose code.

Overall, while unions can be useful in certain low-level programming scenarios where strict memory layout control is necessary, their limitations make them less suitable for representing a sum of types compared to other languages with more advanced type systems.

Instead of directly storing an std::string in a union, consider alternative approaches such as using a pointer to std::string or using a std::variant if you're working in C++17 or later. These approaches allow for more flexible handling of multiple types while ensuring proper memory management and type safety. If you must use a union, consider using a C-style string (char*) or a fixed-size character array (char[]) to store string data within the union, along with a separate field to indicate the type of data stored.

While storing a char* in a union is possible, consider whether using other constructs such as structs or dynamic data structures like linked lists or arrays might be more appropriate for your specific use case, especially if you need to handle more complex data or multiple types simultaneously.

--

--