Categorical Sum and Product types using struct tagged union in C
Introduction
Let’s first of all talk about struct
and tagged untion
in C.
Arrays allow to define type of variables that can hold several data items of the same kind. Similarly structure (or struct)
is another user defined data type available in C that allows to combine data items of different kinds.
Let’s see some example:
https://www.guru99.com/cpp-structures.html (slightly modified).
The easiest way to model struct was class
with public data-members only. The main drawback of this approach is luck of the encapsulation.
In C aunion
s a special data type available that allows to store different data types in the same memory location. You can define a union with many members, but only one member can contain a value at any given time. Unions provide an efficient way of using the same memory location for multiple-purpose.
For example:
https://medium.com/@almtechhub/c-c-self-referential-recursive-unions-22b334493eaa
In C a tagged union
or discriminated union
is union that have associated with them a piece of data that tracks which of the potential union properties is currently set.
Quote from Wikipedia:
In computer science, a
tagged union
, also called…discriminated union
,disjoint union
,sum type
orcoproduct
, is a data structure used to hold a value that could take on several different, but fixed, types. Only one of the types can be in use at any one time, and a tag field explicitly indicates which one is in use. It can be thought of as a type that has several “cases”, each of which should be handled correctly when that type is manipulated. This is critical in defining recursive datatypes, in which some component of a value may have the same type as the value itself, for example in defining a type for representing trees, where it is necessary to distinguish multi-node subtrees and leafs. Like ordinary unions, tagged unions can save storage by overlapping storage areas for each type, since only one is in use at a time.
https://en.wikipedia.org/wiki/Tagged_union
For example:
In this example we have a simple union that holds a char and an int. This union is wrapped in a struct that has a char variable and an instance of the union. We use a single letter char to represent each type, either i or c. In main we initialize the tagged union and then loop. In each iteration we check which property is set, log the value, flip the tag and set the associated property. The output from running it is:
21
A
1
A
3
https://medium.com/@almtechhub/c-c-tagged-discriminated-union-ecd5907610bf
Another example (self-referential tagged union
or recursive datatype
):
https://medium.com/@almtechhub/c-c-tagged-discriminated-union-ecd5907610bf
Note: enum
that is in implementation above is essentially int
. It can be replace to anything we want, it can be enum class
, char
,int
, char*,
etc.
Also, this is not the only possible implementation of tagged union
. For example, it can be alternatively implemented with class
and char
for tag. Another possible implementation is to use hasthtable
— key in hasthtable
will be tag
and value will be struct
.
The categorical product
is, informally, of two objects A and B is the most common object in this category, for which there are projections on A and B. In many categories (sets, groups, graphs, programming types…) the product of the objects is their cartesian product. It is C struct
or C++ std:tuple.
Let’s look on another example. All quotes in this subsections are taken from https://www.embedded.com/discriminated-unions/
Let’s try to implement following inheritance hierarchy in C using tagged union.
https://www.embedded.com/virtual-functions-in-c-2/
Suppose you have an application that employs two-dimensional geometric shapes, such as circles, rectangles, and triangles. At a minimum, each shape object contains some linear or angular distances sufficient to characterize the physical extent of the shape. For example, a circle has a radius, a rectangle has a height and a width, and a triangle has two sides and an angle. The shapes may have common attributes as well, such a position (planar coordinates), or outline and fill colors.
I want to re-iterate. We know all derived classes of shape in advance, the author of shape
class knows every derived class that will be used in the application.
Possible C implementation for a shape is as following:
The value of the shape’s
kind
member indicates which union member is currently in use. Theshape_kind
type enumerates the possible values:
For example, setting a shape’s
kind
member tosk_rectangle
indicates that it's now OK to access (write to or read from) the union'srectangle
member and itsheight
andwidth
members.
Let’s continue:
You can write a small assortment of initialization functions to properly initialize each kind of shape. For example:
initializes a shape as a rectangle with a particular height and width. You can write similar initialization functions for circle and triangles.
How we can implement C function that will calculate area of any shape (all shapes here are Jordan measurable)? One possible implementation is as following:
Virtual functions
provide an alternative todiscriminated unions
that are safe, efficient, and often more maintainable.
This way more powerful. The writer of the Shape
can be unaware of all of it’s descenders and yet in every place where Shape
reference appear it can by substituted to derived class and virtual functions
machinery (typically provided by C++ compiler, but you can write you own in C if you wish) will ensure that correct method will be called in inheritance graph.
Quote from Wikipedia:
In a typical class hierarchy in object-oriented programming, each subclass can encapsulate data unique to that class. The metadata used to perform virtual method lookup identifies the subclass and so effectively acts as a tag identifying the particular data stored by the instance.
Nevertheless, a class hierarchy involves true subtype polymorphism; it can be extended by creating further subclasses of the same base type, which could not be handled correctly under a tag/dispatch model. Hence, it is usually not possible to do case analysis or dispatch on a subobject’s ‘tag’ as one would for tagged unions. Some languages such as Scala allow base classes to be
“sealed”
, and unify tagged unions with sealed base classes.
https://en.wikipedia.org/wiki/Tagged_union#Class_hierarchies_as_tagged_unions
Also,
One of the problems with discriminated unions is that it requires a lot of discipline to ensure that you don’t access a union member unless it’s the currently active member as indicated by the discriminator. In most cases, this means you should be scrupulous about wrapping the accesses to union members inside if- or switch-statements that test the discriminator.
Unfortunately, nothing in C prevents accidents such as:
This code will compile and then produce unpredictable, possibly erroneous, run-time results.
Another quote from Wikipedia:
The main disadvantage of tagged unions is that the tag occupies space. Since there are usually a small number of alternatives, the tag can often be squeezed into 2 or 3 bits wherever space can be found, but sometimes even these bits are not available. In this case, a helpful alternative may be …encoded tags, where the tag value is dynamically computed from the contents of the union field. Common examples of this are the use of
reserved values
, where, for example, a function returning a positive number may return -1 to indicate failure, andsentinel values
, most often used in tagged pointers.
https://en.wikipedia.org/wiki/Tagged_union#Advantages_and_disadvantages
Let’s look on another example of tagged union
, but this time in ML. For example, how we can create binary tree of integers?
This is a
tagged union
with two cases: one, the leaf, is used to terminate a path of the tree, and functions much like a null value would in imperative languages. The other branch holds a node, which contains an integer and a left and right subtree. Leaf and Node are the [type] constructors, which enable us to actually produce a particular tree, such as:
which corresponds to this tree:
Now we can easily write a typesafe function that, say, counts the number of nodes in the tree:
Using tagged union
and switch is better from performance point-of-view than polymorphism
Enum as tagged Union
C struct
can be viewed as categorical product
type.
C Enum
type can be viewed as degenerate case of categorical sum
type— discriminated union
of unit
types. In more simple words:
Quote from Wikipedia:
An
enumerated
type [Cenum
] can be seen as a degenerate case: atagged union
of unit types. It corresponds to a set of nullary constructors and may be implemented as asimple tag variable
, since it holds no additional data besides the value of the tag.
https://en.wikipedia.org/wiki/Tagged_union#Description
If we have language such as Haskell or ML, where tagged union
is first-class citizen, we could implement enum
using it.
Let’s look on 2 extreme cases of such Cenum
implementation first, one constant C enum
and two constant C enum.
- C
enum
withone constant
is actuallystateless Singleton
.
Something like:
In Haskell it can be expressed as
This expression is also equivalent to singleton type()
.
2. C enum
with two constants
Something like:
In Haskell it can be expressed as
This is essentially primitive C bool
.
Tagged union is rarely used in practice in C
Even storing char*
or std::string
in C union is problematic.
Here are some reasons why the use of a union
as a sum of types in C is limited:
- Lack of Type Safety: One of the biggest limitations of unions is the lack of type safety. Since all members of a union share the same memory space, accessing one member as another can lead to undefined behavior or data corruption. This lack of type safety makes it error-prone, especially in larger codebases where it can be challenging to track down such issues. For example, storing
std::string
directly in a union compromises type safety. Unions in C don't enforce type checking when accessing their members, which means you could inadvertently access the wrong member of the union as astd::string
, leading to undefined behavior or data corruption. - No Built-in Discriminant: Unlike tagged unions in languages like Rust or Haskell, C unions do not have a built-in mechanism to determine which member is currently valid (also known as a discriminant). This means that you need to implement your own way to keep track of which member is being used, which can add complexity to the code and increase the chances of errors. For example,
std::string
has a non-trivial constructor and destructor, which are automatically invoked when objects are created and destroyed. Unions have restrictions on what types they can contain due to their memory-sharing nature. Sincestd::string
has non-trivial special member functions, including a constructor, destructor, copy constructor, and assignment operator, using it directly in a union can lead to issues with object lifetime management. When you store a pointer in a union, for example,char*
, you’re responsible for managing the memory it points to. If you dynamically allocate memory usingmalloc
or a similar function, you need to ensure proper memory deallocation to avoid memory leaks. - Limited Expressiveness: Unions in C can only represent a fixed set of types defined at compile-time. This limits their expressiveness compared to languages with more advanced sum types, where you can define unions of arbitrary types, including user-defined types and recursive types.
- Pointer Semantics: When working with unions in C, it’s common to use pointers to access members safely. However, managing pointers correctly adds overhead and increases the complexity of the code. It also introduces the potential for pointer-related bugs such as null pointer dereferences or memory leaks.
- No Pattern Matching: Pattern matching is a powerful feature commonly found in languages with sum types, allowing for concise and expressive code when working with algebraic data types. In C, you don’t have built-in support for pattern matching, which makes working with unions less ergonomic and often leads to more verbose code.
Overall, while unions can be useful in certain low-level programming scenarios where strict memory layout control is necessary, their limitations make them less suitable for representing a sum of types compared to other languages with more advanced type systems.
Instead of directly storing an std::string
in a union, consider alternative approaches such as using a pointer to std::string
or using a std::variant
if you're working in C++17 or later. These approaches allow for more flexible handling of multiple types while ensuring proper memory management and type safety. If you must use a union, consider using a C-style string (char*
) or a fixed-size character array (char[]
) to store string data within the union, along with a separate field to indicate the type of data stored.
While storing a char*
in a union is possible, consider whether using other constructs such as structs or dynamic data structures like linked lists or arrays might be more appropriate for your specific use case, especially if you need to handle more complex data or multiple types simultaneously.