Real-life data for object-centric processing mining
View the project on GitHub LienBosmans/pystackt
Watch the demo on Youtube PyStack't Demo BPM 2025
PyStack’t has a modular design, enabled by the use of a fixed relational schema to store OCED data. Since the tables and columns are always the same, independent of object and event types, it’s easy to plug in new data sources and other functionality. Any new functionality is automatically compatible with existing code, which is a big win for maintainability.
PyStack’t uses the Stack’t relational schema to store object-centric event data. This schema was created specifically to support the data preparation stage, taking into account data engineering best practices. While any relational database can be used to store data in the Stack’t relational schema, PyStack’t uses DuckDB because it’s open-source, fast and simple to use. (Think SQLite but for analytical workloads.)
The Stack’t relational schema describes how to store object-centric event data in a relational database using a fixed set of tables and table columns. This absence of any schema changes makes the format well-suited to act as a central data hub, enabling the modular design of PyStack’t.
An overview of the tables and columns is included below. For more information on the design choices and the proof-of-concept implementation Stack’t, we recommend reading the paper Dynamic and Scalable Data Preparation for Object-Centric Process Mining.
Event-related tables. To maintain flexibility and support dynamic changes, event types and their attribute definitions are stored in rows rather than being defined by table and column names. This approach enables the use of the exact same tables across all processes, reducing the impact of schema modifications. Changing an event type involves updating foreign keys rather than moving data to different tables, and attributes can be added or removed without altering the schema.
event_types
contains an entry for each unique event type. id
is the primary key.description
should be human-readable.event_attributes
stores entries for each unique event attribute. id
is the primary key.event_type_id
is a foreign key referencing table event_types
.description
should be human-readable.datatype
of the attribute (integer, varchar, timestamp, …) of the attribute.events
records details for each event. id
is the primary key.event_type_id
is a foreign key referencing table event_types
.timestamp
, preferably using UTC time zone.description
should be human-readable.event_attribute_values
stores all attribute values for different events. This setup decouples events and their attributes by storing each attribute value in a new row, facilitating support for late-arriving data points. id
is the primary key.event_id
is a foreign key referencing table events
.event_attribute_id
is a foreign key referencing table event_attributes
.attribute_value
is the value of the attribute. This value should match the datatype of the attribute.Object-related tables also leverage row-based storage to manage attributes independently. This approach reduces the number of duplicate or NULL values significantly when attributes are updated asynchronously and frequently.
object_types
records entries for each unique object type.id
is the primary key.description
should be human-readableobject_attributes
contains entries for each unique object attribute. id
is the primary key.object_type_id
is a foreign key referencing table object_types
.description
should be human-readable.datatype
(integer, varchar, timestamp, …) of the attribute.object
stores details for each object.id
is the primary key.object_type_id
is a foreign key referencing table object_types
.description
should be human-readable.object_attribute_values
records attribute values for objects.id
is the primary key.object_id
is a foreign key referencing table objects
.object_attribute_id
is a foreign key referencing table object_attributes
.timestamp
indicates when the attribute was updated. Timestamps are preferably stored using the UTC time zone.attribute_value
is the updated value of the attribute. This value should match the datatype of the attribute.Relation-related tables serve as bridging tables to manage the different many-to-many relations between events and objects. The qualifier definitions are stored separately to minimize the impact of renaming them in case of changing business requirements
relation_qualifiers
stores qualifier definitions. In cases where relation qualifiers are not available in the source data, a dummy qualifier can be introduced.id
is the primary key.description
should be human-readable.datatype
(integer, varchar, timestamp, …) of the attribute.object_to_object
stores (dynamic) relations between objects.id
is the primary key.source_object_id
is a foreign key referencing table objects
.target_object_id
is a foreign key referencing table objects
.timestamp
indicates when the relationship became active. To signify the end of an object-to-object relationship, a NULL value is used for the qualifier value, rather than an end timestamp. This design choice facilitates append-only data ingestion. Timestamps are preferably stored using the UTC time zone.qualifier_id
is a foreign key referencing table qualifiers
.qualifier_value
provides additional relationship details. This value should match the datatype of the qualifier.event_to_object
stores relations between events and objects.id
is the primary key.event_id
is a foreign key referencing table events
.object_id
is a foreign key referencing table objects
.qualifier_id
is a foreign key referencing table qualifiers
.qualifier_value
provides additional relationship details. This value should match the datatype of the qualifier.event_to_object_attribute_value
stores relations between events and changes to object attributes.id
is the primary key.event_id
is a foreign key referencing table events
.object_attribute_value_id
is a foreign key referencing table object_attribute_values
.qualifier_id
is a foreign key referencing table qualifiers
.qualifier_value
provides additional relationship details. This value should match the datatype of the qualifier.Extracting data from different systems is an important part of data preparation. While PyStack’t does not include all functionality that a data stack offers (incremental ingests, scheduling refreshes, monitoring data pipelines…), it aims to provide simple-to-use methods to get real-life data for your object-centric process mining adventures.
The Stack’t relational schema is intended as an intermediate storage hub. PyStack’t provides export functionality to export the data to specific OCED formats that can be used by process mining applications and algorithms. This decoupled set-up has as main advantage that any future data source can be exported to all supported data formats, and any future OCED format can be combined with existing data extraction functionality.
Dispersing process data across multiple tables makes exploring object-centric event data less straightforward compared to traditional process mining. PyStack’t aims to bridge this gap by providing dedicated data exploration functionality. Notably, it includes an interactive data exploration app that runs locally and works out-of-the-box with any OCED data structured in the Stack’t relational schema.