Custom ML model for BPMN similarity using Spektral and Keras.

Overview

This project aims to create embeddings for BPMN files to facilitate tasks like search, similarity, and clustering. The embeddings capture both the semantics and structure of BPMN files, allowing for effective retrieval and comparison of similar BPMN diagrams.

Important Note

The current model uses embeddings created by the paraphrase-multilingual-MiniLM-L12-v2 sentence-transformer, which has an embedding dimension of 384. Using a different sentence-transformer will result in unexpected behavior. Make sure to use the correct sentence-transformer with the kerasEmbedder and adjust the 'dims' parameter in mu-search accordingly.
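A small guard can catch a mismatched sentence-transformer before the vectors reach the model. The following is a minimal sketch in plain Python, not part of the project: the name check_embedding_dims and the constant EXPECTED_DIMS are hypothetical, and only the value 384 comes from the note above.

```python
# Hypothetical guard; only the value 384 is taken from the note above.
EXPECTED_DIMS = 384  # embedding dimension of paraphrase-multilingual-MiniLM-L12-v2


def check_embedding_dims(vectors, expected_dims=EXPECTED_DIMS):
    """Raise ValueError if any vector does not have the expected length."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dims:
            raise ValueError(
                f"embedding {index} has {len(vector)} dims, expected {expected_dims}"
            )
    return True
```

Calling such a check right after the text embeddings are computed would surface a wrong model choice as an explicit error instead of as silently degraded results.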

Motivation

The goal is to:

Capture the semantics of BPMN files, so that semantically similar BPMN files have similar embeddings.
Capture the structure of BPMN files, so that structurally similar BPMN files have similar embeddings.
Enable measuring the similarity between two BPMN diagrams by estimating the number of changes needed to transform one into another (trained on Minimum Edit Distance).

Design Choices:

Preprocessing BPMN Files: Adjust the input size to fit the fixed input size of embedding models or handle large inputs by splitting them into smaller parts.
Encoding Structure: Use graph embedding techniques (e.g., GNNs, GCNs) to encode the structure of BPMN diagrams.
Graph Representation: Convert BPMN diagrams into graph representations using NetworkX.
Node and Edge Information: Extract labels and documentation fields from nodes and edges, converting them into numerical vectors using pre-trained embeddings or custom-trained embeddings.

Current Approach:

Convert BPMN Files to Graphs: Use NetworkX to represent BPMN files as graphs.
Node and Edge Embeddings: Use pre-trained embeddings to create vector representations of nodes and edges.
Graph Embedding: Use these embeddings as features for a GNN or GCN model (e.g., built with Spektral) to create a single embedding for each BPMN file.
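The first step, turning a BPMN file into a graph, can be sketched as follows. This is an illustration under stated assumptions, not the project's converter: it uses only the Python standard library so that it stays self-contained, the function name bpmn_to_graph and the sample diagram are hypothetical, and it reads only the direct children of the first process element (nested subprocesses, lanes, and documentation fields are not handled). The resulting node and edge lists could be loaded into a NetworkX graph, and the adjacency matrix is the kind of structure a GNN layer expects alongside the node features.

```python
import xml.etree.ElementTree as ET

# Namespace of the BPMN 2.0 model schema.
BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"


def bpmn_to_graph(xml_text):
    """Return (nodes, edges, adjacency) for the first process in a BPMN document.

    nodes: dict mapping element id -> {"type": tag name, "label": name attribute}
    edges: list of (source id, target id, label) taken from sequenceFlow elements
    adjacency: square 0/1 matrix (list of lists), rows and columns in node order
    """
    root = ET.fromstring(xml_text)
    process = root.find(f"{{{BPMN_NS}}}process")
    if process is None:
        raise ValueError("no BPMN process element found")

    nodes, edges = {}, []
    for element in process:
        tag = element.tag.split("}")[-1]  # strip the namespace prefix
        if tag == "sequenceFlow":
            edges.append(
                (element.get("sourceRef"), element.get("targetRef"), element.get("name", ""))
            )
        elif element.get("id") is not None:
            nodes[element.get("id")] = {"type": tag, "label": element.get("name", "")}

    # Build the adjacency matrix once all nodes are known.
    index = {node_id: i for i, node_id in enumerate(nodes)}
    adjacency = [[0] * len(nodes) for _ in nodes]
    for source, target, _ in edges:
        if source in index and target in index:
            adjacency[index[source]][index[target]] = 1
    return nodes, edges, adjacency


# Hypothetical three-node diagram used only for illustration.
SAMPLE_BPMN = """<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL">
  <process id="p1">
    <startEvent id="start" name="Request received"/>
    <task id="review" name="Review request"/>
    <endEvent id="end" name="Done"/>
    <sequenceFlow id="f1" sourceRef="start" targetRef="review"/>
    <sequenceFlow id="f2" sourceRef="review" targetRef="end"/>
  </process>
</definitions>"""

nodes, edges, adjacency = bpmn_to_graph(SAMPLE_BPMN)
```

The label of each node (and edge) is the text that would then be passed to the sentence-transformer to obtain its feature vector.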

Similarity Model Specifics:

Input: Two BPMN files, each represented by a node-feature tensor of shape (batch_size, nodes, node_features) and an adjacency matrix.
Similarity Calculation: Uses precomputed embeddings and cosine similarity to rank BPMN files by their similarity to the query.
Efficiency: Designed to handle large volumes of BPMN files and queries efficiently.
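Ranking by cosine similarity over precomputed embeddings can be sketched as below. This is a plain-Python illustration, not the project's code: the names cosine_similarity and rank_by_similarity, the top_k parameter, and the dict-of-vectors layout are assumptions made for the example. In a deployment the same comparison would typically be done by the search backend rather than in application code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, embeddings, top_k=5):
    """Return the top_k (name, score) pairs, most similar to the query first.

    embeddings: dict mapping a BPMN file name to its precomputed embedding.
    """
    scored = [(name, cosine_similarity(query, vector)) for name, vector in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Because the file embeddings are computed once and stored, a query only costs one embedding computation plus the comparisons.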

Shortcomings:

Data Availability: Because there were not enough BPMN files available within ABB, we trained on the data from camunda/bpmn-for-research, converting the BPMN files to NetworkX graphs and measuring the Minimum Edit Distance between random pairs.
Minimum Edit Distance: Does this translate to perceived similarity? It might require a different approach. Training data in the form of (BPMN, BPMN, similarity) is needed. If user interactions can be captured, this might be a better approach for gauging perceived similarity or general 'useful' similarity.
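The pair-generation step described under Data Availability could look roughly like the sketch below, which produces (BPMN, BPMN, distance) triples from random pairs. It is an illustration only: the function names and the data layout are hypothetical, and proxy_edit_distance is a crude stand-in that assumes nodes are matched by label and simply counts the nodes and edges present in only one of the two graphs. It is not a true minimum edit distance; NetworkX offers graph_edit_distance for that, at a much higher computational cost.

```python
import random


def proxy_edit_distance(graph_a, graph_b):
    """Crude stand-in for graph edit distance (assumption, not the project's metric).

    Each graph is a (node_labels, edges) tuple of sets. Nodes are matched by
    label, and the result counts nodes and edges found in only one graph.
    """
    nodes_a, edges_a = graph_a
    nodes_b, edges_b = graph_b
    return len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)


def sample_training_pairs(graphs, n_pairs, seed=0):
    """Draw random pairs of distinct graphs and label each pair with its distance.

    graphs: dict mapping a file name to a (node_labels, edges) tuple of sets.
    Returns a list of (name_a, name_b, distance) triples.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    names = sorted(graphs)
    pairs = []
    for _ in range(n_pairs):
        first, second = rng.sample(names, 2)
        pairs.append((first, second, proxy_edit_distance(graphs[first], graphs[second])))
    return pairs
```

Replacing the distance column with a similarity signal derived from user interactions would keep the same triple format while addressing the perceived-similarity concern.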

Future Steps:

Gather more BPMN files from the correct domain (OPH).
Train custom text embeddings for nodes and edges (e.g., using robbert-2023-dutch-base-abb).
Validate and refine the current model with new data.
Potentially merge the graph and text models into a unified architecture.

Suggestions:

Data Collection: Store search results and user interactions anonymously to gather diverse query data.
User Interaction Analysis: Use interaction data to train models for better search result ranking.

Requirements for future steps:

A large and varied dataset of BPMN files to ensure the model generalizes well.
Real user interactions (e.g., from a BPMN-to-BPMN recommendation interface: a user uploads a file, and the suggestion they interact with is recorded) to validate and improve the model's effectiveness.