Pre-processing, Models, and Post-processing in Video and AI Development: When One Plus One Equals One

As we push forward in our efforts to extend the functionality of the Nx AI manager and establish a community dedicated to fostering a more open ecosystem to move AI models to the edge, we are routinely engaged in captivating discussions with AI application developers and manufacturers of novel AI-specific hardware.

In this blog post, I want to provide a few insights regarding the surrounding jargon (pre- and post-processing, models, acceleration, etc.) and some discussion of the various ways in which AI models can be accelerated. This excerpt offers a brief glimpse into our ongoing discourse, but if you're eager to participate in the conversation, make sure to drop by one of our upcoming events or join the movement by participating in the 2024 Nx Hackathon.


Vision and AI pipelines: The High Level

At a high level, we are often concerned with vision applications (i.e. cameras as the “input” sensor) that use Artificial Intelligence (i.e. machine-learned models) to provide valuable insights into complex business processes without the need for human visual observation. These applications usually implement a so-called “pipeline” consisting of the following parts (Note: Each part may consist of multiple similar subparts and the delineation between the parts is not always exact):


  1. The sensor(s). Camera(s) in our case, although we often integrate additional sensors (lidar, vibration, etc.).
  2. Pre-processing. An (encoded) data stream coming from the sensor (e.g., H.264/H.265) needs to be decoded, split up into frames, resized, color-converted, etc.
  3. The model. The AI (or “machine-learned”) part of the pipeline. This is where frames (often one-by-one), each represented as a numerical tensor, are “converted” into the sought-after information, for example a tensor containing the coordinates of the bounding box surrounding the object of interest.
  4. Post-processing. Frame-by-frame numerical lists of bounding box coordinates are impressive, but often not highly beneficial to humans. We might, for example, be interested in the number of people entering a store. To do so, we track objects over multiple frames and increase a counter when the trajectory intersects a (virtual) line on the original image.
  5. Visualizing, dashboarding, and automating. Tracking the number of visitors on a camera-by-camera basis may not necessarily drive business growth. We may want to analyze visitor patterns across numerous stores to effectively optimize staffing levels (something that would affect the bottom line). The actual applications built on the pipeline are what create value.
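
To make the middle three parts of the list above concrete, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in chosen for illustration: the toy grayscale frame format, the function names, and the brightness threshold that plays the role of a trained detector are assumptions, not part of the Nx Toolkit.

```python
from typing import List, Tuple

Frame = List[List[int]]          # toy grayscale frame: rows of pixel values (0-255)
Box = Tuple[int, int, int, int]  # x1, y1, x2, y2


def pre_process(raw: Frame, step: int = 2) -> Frame:
    """Pre-processing: downsample by keeping every `step`-th row and column."""
    return [row[::step] for row in raw[::step]]


def model(frame: Frame, threshold: int = 128) -> List[Box]:
    """Stand-in for a trained detector: one box around all bright pixels."""
    bright = [(x, y) for y, row in enumerate(frame)
              for x, value in enumerate(row) if value >= threshold]
    if not bright:
        return []
    xs = [x for x, _ in bright]
    ys = [y for _, y in bright]
    return [(min(xs), min(ys), max(xs), max(ys))]


def post_process(boxes: List[Box]) -> int:
    """Post-processing: reduce the boxes to a number a human cares about."""
    return len(boxes)


def run_pipeline(raw: Frame) -> int:
    """Pre-processing, model, and post-processing chained together."""
    return post_process(model(pre_process(raw)))
```

Calling `run_pipeline` on a decoded frame returns the object count for that frame. A real pipeline would replace each function with something far heavier, but the shape (data in, data out, three chained steps) stays the same.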

The above overview often suffices in more commercially oriented conversations. However, when thinking more technically about optimizing pipelines it becomes necessary to delve into the intricate details and gain a clearer understanding of the specific components that make up the pipeline. Perhaps surprisingly, separating out the individual parts proves to be more of an art than a science. Let’s focus on the middle part – the model – and discuss some of the common quirks.


Models: 1+1 = 1.

As introduced in the canonical vision and AI pipeline, the model consumes, on a frame-by-frame basis, a numerical representation of the frame (for example, an h × w × 3 tensor holding the RGB values of each pixel). The model is simply a function (or algorithm, or bit-of-code, or mapping) that takes a tensor as input and returns another tensor as output, for example a list of x1,y1,x2,y2 coordinates representing bounding boxes surrounding the objects of interest that the “model” is trained to recognize. However, note that we can easily stretch definitions: Does a resize of an image happen “inside” the model? Or is this a pre-processing operation? And, if I would like to have a count of the number of objects, is the counting of the bounding boxes part of the model (which now outputs just a single integer), or is it “post-processing”?

It is interesting, especially when thinking more about the inner workings of a model, to represent it as a Directed Acyclic Graph (DAG), in which the nodes are operations (add, multiply, convolution, etc.) and the edges are tensors. Neural networks are canonically represented as such DAGs, and elements of both pre- and post-processing can be represented inside the same DAG, which blurs the distinction between them. In practice, we often treat resizing and color conversion as pre-processing; the step from input tensor (often at a much lower resolution than the camera source) to output tensor (bounding boxes) as the “model”; and any line-crossing, aggregation, etc. as post-processing.
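
The DAG view can be sketched in a few lines of Python. This is an illustrative toy, not how any particular framework stores its graphs: the `Graph` layout, the `evaluate` helper, and the three operations are assumptions made up for this example. The only library call used is `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

# node name -> (operation, names of the nodes whose outputs it consumes)
Graph = Dict[str, Tuple[Callable[..., Any], List[str]]]


def evaluate(graph: Graph, inputs: Dict[str, Any], output: str) -> Any:
    """Run every operation after its predecessors and return one node's value."""
    predecessors = {name: set(sources) for name, (_, sources) in graph.items()}
    values = dict(inputs)
    for name in TopologicalSorter(predecessors).static_order():
        if name in values:  # a graph input, nothing to compute
            continue
        operation, sources = graph[name]
        values[name] = operation(*(values[s] for s in sources))
    return values[output]


def normalize(pixels: List[int]) -> List[float]:
    return [p / 255 for p in pixels]


def double(xs: List[float]) -> List[float]:
    return [2 * x for x in xs]


def count_above_one(xs: List[float]) -> int:
    return sum(1 for x in xs if x > 1.0)


graph: Graph = {
    "normalized": (normalize, ["image"]),         # usually called pre-processing
    "activations": (double, ["normalized"]),      # usually called the model
    "count": (count_above_one, ["activations"]),  # usually called post-processing
}

count = evaluate(graph, {"image": [0, 255, 200, 100]}, "count")
```

Nothing in the graph itself marks where "pre-processing" ends or "post-processing" begins; those labels live only in the comments.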


Merging and splitting models

Once you treat models as DAGs, it becomes clear that the functionality contained within a single model is rather arbitrary. Suppose we have:




We could just as well make it 2 models:





Or the other way around… Let’s do an applied case this time: Suppose we want to read the license plates of passing cars captured by a camera. We would often first use an object detection model to locate the car in the image. Next, a second model locates the plate on the car, and finally a third model turns the numerical RGB values inside the plate region into the characters on the license plate. That makes three models…


Model 1(Image) -> Location of the car

Model 2(Image of car) -> Location of the license plate

Model 3(Image of plate) -> Characters on the plate.


or one, if we merge them:


Model 1(Image) -> Characters on the plate.


Great. With models, one plus one plus one could easily equal one.
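
In code, merging and splitting is just function composition. The sketch below is a deliberately simplified illustration: the `merge` helper, the three stand-in functions, and the nested-dict "image" are all hypothetical and only mimic the license plate example above.

```python
from functools import reduce
from typing import Any, Callable


def merge(*stages: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Chain stages left to right into a single callable "model"."""
    def merged(x: Any) -> Any:
        return reduce(lambda value, stage: stage(value), stages, x)
    return merged


# Toy stand-ins: the "image" is a nested dict, so each "model" is a lookup.
def locate_car(image: dict) -> dict:
    return image["car"]


def locate_plate(car: dict) -> str:
    return car["plate_pixels"]


def read_characters(plate_pixels: str) -> str:
    return plate_pixels.replace(" ", "").upper()


image = {"car": {"plate_pixels": "ab 123"}}

three_models = read_characters(locate_plate(locate_car(image)))
one_model = merge(locate_car, locate_plate, read_characters)(image)
two_models = merge(merge(locate_car, locate_plate), read_characters)(image)
```

All three variants produce the same result; how many "models" there are depends only on where we choose to draw the boundaries.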



Things get even more complicated when considering post-processing options. Let's go back to the example of counting the number of people who enter a store: If the “model” puts out the location of all the people in a single frame, we can, if the frames are captured sufficiently quickly, use a “post-processor” to create a trajectory of an object over multiple frames. That trajectory can subsequently be used to see if the (hypothetical) line was crossed and someone entered the store. This is a post-processor that we provide out-of-the-box in the Nx Toolkit. Abstractly, in this case, we have something like this:


Single frame -> model -> bounding box -> post-processor (aggregate over frames) -> line crossed {0,1}.


However, if we stretch our definitions of pre-processing, model, and post-processing, we could just as well have the following:


Multiple frames -> model -> line crossed {0,1}


In this variant, pre-processing gathers a series of frames, and the "model" accepts that sequence of frames (and the line) as input and outputs whether or not the line was crossed.
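
The two framings can be put side by side in a small sketch. It rests on strong simplifying assumptions that are ours, not a description of the Nx Toolkit post-processor: a single tracked object, a horizontal virtual line at `y = line_y`, and a `detect` stand-in that simply reads the box stored in a toy frame.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2
Frame = Dict[str, Box]                   # toy frame that already "contains" its box


def detect(frame: Frame) -> Box:
    """Stand-in for a per-frame detector."""
    return frame["box"]


def crosses_line(boxes: Sequence[Box], line_y: float) -> int:
    """Return 1 if the box centre moves from one side of y = line_y to the other."""
    centres = [(y1 + y2) / 2 for _, y1, _, y2 in boxes]
    for previous, current in zip(centres, centres[1:]):
        if (previous < line_y) != (current < line_y):
            return 1
    return 0


def per_frame_model_plus_post_processor(frames: List[Frame], line_y: float) -> int:
    boxes = [detect(frame) for frame in frames]  # "model", frame by frame
    return crosses_line(boxes, line_y)           # "post-processor"


def multi_frame_model(frames: List[Frame], line_y: float) -> int:
    """The same computation, packaged as a single 'model' over a frame sequence."""
    return crosses_line([detect(frame) for frame in frames], line_y)


frames = [{"box": (0, 0, 2, 2)}, {"box": (0, 3, 2, 5)}, {"box": (0, 6, 2, 8)}]
```

Both functions run identical computations; only the label attached to the aggregation step differs.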


It's important to note that this blog post is not at all intended to solve the potential jargon issues highlighted earlier. Rather, our goal is to emphasize how crucial it is to clarify the meaning behind seemingly simple terms like "pre-processing" or "model" when seeking a deeper understanding of the processes at play.


Optimization and Acceleration

After laying the groundwork, let's get to the meatier part of this discussion: What if you're looking to optimize your pipeline? I.e., you want it to run faster? Often, on the software side of things, there are faster, lower-level, or smarter ways to implement the pre-processing, model, and post-processing steps. Edge optimization is a craft in its own right and, within our common pipelines, we have spent a lot of effort to maximize the speed of each operation.

However, in the last few years, specialized hardware designed to accelerate specific parts of pipelines has entered the market. GPUs, TPUs, NPUs, etc. (we call them “XPUs”) are specialized pieces of hardware geared toward accelerating the computations performed within (various parts of) the pipeline, either through parallelism or through optimization for specific types of data operations. Let's discuss a few ways in which one can use these XPUs to optimize pipelines:


Pipeline Acceleration

Many parts of the pipeline can be offloaded to an XPU. Camera stream decoding, for example, is commonly done using “dedicated” hardware, and we implement this when it is feasible and beneficial (which is not always the case). The same could go for post-processing: if, for example, bounding boxes are superimposed on the camera stream and the result requires encoding before being passed on to the visualization or dashboard, leveraging the XPU could be beneficial. Finally, optimizing the model for acceleration is a key focus for a number of new accelerators entering the market (and a process we are actively working to make easier for everyone involved by contributing to the OAX Standard).

Nevertheless, full pipeline optimization and acceleration, in its most general form (i.e., where anything can be moved freely between pre- and post-processing, and models can be any submodel or collection of models) is a challenge. For the canonical vision and AI applications, however, Nx has you covered with the Nx Toolkit.


Model Acceleration

As many new XPUs focus on accelerating the AI model specifically (glossing over the definition issues introduced in the previous section), it is interesting to highlight some of the patterns that XPU manufacturers use to make this possible. Here, again, the DAG perspective on the models is useful. Suppose we have a model consisting of three operations:


INPUT -> OP1 -> OP2 -> OP3 -> OUTPUT.


Now, most XPUs only accelerate a select few operations within a model, rather than all of them. So we end up having to do something like:


CPU: INPUT -> OP1 ->
XPU: OP2 ->
CPU: OP3 -> OUTPUT.



Some XPU manufacturers make this simple, from a user point of view, by providing a “runtime” (i.e., a process that can be used to evaluate the model) which manages the operations across the CPU and XPU. Others make it more difficult, albeit potentially more performant, by only allowing supported operators to run on the XPU. In this case, the developer must dissect the model and write all the logic to move data from one component to another.
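
The bookkeeping involved in dissecting a model can be sketched as follows. This is a toy for a strictly linear chain of operations; the `partition` function and the idea of a plain set of supported operator names are assumptions for illustration and do not reflect any vendor's actual runtime or API.

```python
from itertools import groupby
from typing import List, Set, Tuple


def partition(ops: List[str], xpu_supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive operations by the device that will run them."""
    def device_for(op: str) -> str:
        return "XPU" if op in xpu_supported else "CPU"
    return [(device, list(group)) for device, group in groupby(ops, key=device_for)]


segments = partition(["OP1", "OP2", "OP3"], xpu_supported={"OP2"})
transfers = len(segments) - 1  # each boundary means moving a tensor between devices
```

Every segment boundary is a point where a tensor has to travel between CPU and XPU memory, which is one reason why accelerating a single operation in the middle of a model does not automatically make the whole model faster.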

Navigating the variations outlined above proves to be quite a challenge and, currently, needs to be sorted out on an XPU-by-XPU basis. Moreover, consider the added complexity where certain XPUs thrive on concurrency, some on memory management, and others on the "model" processing only a single data type (such as integers). Understand these nuances and you will understand some of the intricacies of crafting highly performant edge AI pipelines.
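
As one illustration of the single-data-type point: a common way to obtain an integer-only model is quantization. The sketch below shows symmetric 8-bit quantization of a list of weights. It is our choice of example technique, not a statement about which scheme any specific XPU requires, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    largest = max((abs(v) for v in values), default=0.0)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.0, 0.3, -1.0, 0.77]
q_weights, scale = quantize(weights)
restored = dequantize(q_weights, scale)
```

The restored values differ from the originals by at most about half a quantization step (`scale / 2`), which is the precision traded away for integer arithmetic.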

The Nx Toolkit, specifically the Nx AI Manager tool, streamlines this complexity for developers.

Developers have the flexibility to make changes at the level of the model, i.e. change a model that is trained to recognize people to one that recognizes bees. Regardless of the hardware you choose to deploy on, as long as it is supported, we will make sure that we implement the acceleration of the model - and the surrounding pipeline - as optimally as we can.



This blog post contained some, admittedly loose, insights and discussion points that we have run into developing edge AI vision applications at scale. Things are not easy, but that’s why we do them.

With that said, we believe that the entire edge AI and computer vision industry could greatly benefit from a more organized framework in terms of jargon and methods to accelerate (parts of) pipelines. So, we are assembling a group of like-minded enthusiasts, partners, and companies willing to contribute to an open, standardized, way of making XPUs accessible. Interested? Join us during the general session at the Embedded Vision Summit to learn more.


And stay tuned for the upcoming launch of Gen 6 later this year to be among the first to try the latest addition to the Nx Toolkit - the Nx AI Manager, designed to add edge AI functionality to video solutions built using the Nx Enterprise Video Platform. 

Prof. Dr. Maurits Kaptein
Prof. Dr. Maurits Kaptein, a full professor of statistics at the Technical University of Eindhoven, the Netherlands, also serves as the Chief Data Scientist at Network Optix Inc., based in California. With over two decades of experience, he has pioneered the practical application of machine learning and AI across diverse sectors such as e-commerce, healthcare and computer vision. Notably, he played a pivotal role in popularizing the multi-armed bandit (MAB) approach for online content optimization, developing innovative and scalable solutions for web-based MAB problems. He co-founded Scailable BV, which was dedicated to hardware-agnostic edge AI deployment, and led the company until its acquisition by Network Optix in early 2024. He now spearheads the creation of an open ecosystem at Network Optix, aimed at democratizing vision-based edge AI solutions. Maurits has authored or co-authored four books and over 100 papers in prestigious academic journals.