# Real-time Object Detection: Lessons from Building a Low-Latency Pipeline
Practical experiences in building a low-latency real-time Object Detection pipeline, from handling initial blocking issues to a stable streaming architecture.
Recently, I took on what looks like the "Hello World" of the computer vision industry, a problem that appears simple but is brutally hard in a real-world environment: real-time object detection.
My hardware setup was quite basic:
- Sensor: Consumer Tapo Camera (RTSP Stream, max 15 FPS, 720p resolution).
- Compute: A high-spec PC (NVIDIA GPU) for dev and an NVIDIA Jetson Nano for deployment.
- Goal: Detect objects and stream the results with the lowest possible latency.
My thinking was simple: "Just `cv2.VideoCapture()`, throw the frame into YOLO, then `cv2.imshow()` and it's done." Reality quickly shattered that illusion: the latency reached 3 seconds. In the world of automation, 3 seconds is enough to turn a system from "usable" into completely unfeasible for any application requiring fast feedback.
This article is a technical log of how I tore down and rebuilt this pipeline to achieve the lowest possible latency.
## 1. Naive Approach and Common Misconceptions About "Real-time"
First, I approached the problem in the simplest way (Naive approach):
```python
import cv2

cap = cv2.VideoCapture("rtsp://admin:pass@ip:554/stream", cv2.CAP_FFMPEG)

while True:
    ret, frame = cap.read()              # Blocking I/O
    if not ret:
        break
    # ... AI processing ...
    cv2.imshow("Camera", frame)          # Render UI
    if cv2.waitKey(1) == ord('q'):
        break
```

The result: the displayed image lagged about 3 seconds behind reality. The cause lies in the blocking I/O mechanism and the internal buffer:
- **Speed discrepancy (producer–consumer gap):** the camera pushes 15 FPS (producer), but if the `while` loop only runs at 10 FPS (consumer) due to the rendering burden of `imshow` or the AI, where do the 5 excess frames per second go?
- **Buffer:** they get stuffed into the internal buffer of the OpenCV/FFmpeg backend, which keeps filling up.
- **Consequence:** when calling `cap.read()`, I don't get the frame that just arrived; I get the frame at the front of the queue (the oldest one).
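A common quick fix, before restructuring anything, is to drain the stale frames and decode only the newest one. This is a hedged sketch: `read_latest` and the `flush` count are my own names, not part of the OpenCV API, and how much it helps depends on the backend's buffer size.

```python
def read_latest(cap, flush: int = 4):
    """Discard up to `flush` buffered frames, then decode and return the newest one.

    `cap` is any object with the cv2.VideoCapture interface:
    grab() advances to the next frame without decoding it,
    retrieve() decodes the last grabbed frame.
    """
    for _ in range(flush):
        if not cap.grab():      # grab() is cheap: it skips the decode step
            break
    return cap.retrieve()       # decode only the frame we will actually use
```

With a real capture you would call `read_latest(cap)` inside the loop instead of `cap.read()`. This reduces, but does not eliminate, the backlog, which is why the rest of the article moves to a different architecture.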
### Throughput vs. Latency
To optimize the system, it is necessary to clearly distinguish between these two concepts:
- **Throughput (FPS):** how many frames the system processes per second (important for server/batch processing).
- **Latency:** how long it takes for a real-world event to appear on the processing screen (important for robotics, safety, and interactive systems).
You can have 60 FPS (smooth video) but 3 seconds of latency (video from the past). This is a state of High Throughput, High Latency.
### The Problem with OpenCV's Blocking I/O
OpenCV's default pipeline operates on a Blocking mechanism:
- `cap.read()`: the program stops, waiting to decode the next frame in the queue.
- `model.predict()`: the program stops, waiting for the AI to finish running.
- `cv2.imshow()`: the program stops, waiting to render the UI.
If the camera sends 30 FPS but the loop as a whole only manages 10 FPS, the remaining 20 frames per second are stuffed into OpenCV's internal buffer. The buffer keeps growing, and the frame you see on screen was actually sent from... 3 seconds ago.
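The arithmetic of that backlog is easy to sketch. This is a back-of-the-envelope model (my simplification: it assumes constant rates and an unbounded buffer, whereas real buffers eventually drop frames or stall):

```python
def backlog_after(seconds: float, camera_fps: float, loop_fps: float) -> float:
    """Frames accumulated in the buffer after `seconds` of running."""
    return max(0.0, (camera_fps - loop_fps) * seconds)

def perceived_delay(backlog_frames: float, camera_fps: float) -> float:
    """Seconds of lag: the oldest buffered frame was captured this long ago."""
    return backlog_frames / camera_fps

# Camera pushes 30 FPS, loop consumes 10 FPS:
frames = backlog_after(4.5, camera_fps=30, loop_fps=10)   # 90 buffered frames
print(perceived_delay(frames, camera_fps=30))             # 3.0 seconds behind reality
```

After just 4.5 seconds of running, you are already watching video from 3 seconds in the past.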
FPS Standards in Video Processing Systems:
- **< 10 FPS:** suitable for passive surveillance where objects move slowly and immediate response is not required, such as smart parking or behavior analysis in retail.
- **15–24 FPS:** the minimum threshold for the human eye to perceive motion as continuous. Below this level, video appears choppy or laggy (common in cheap CCTV systems).
- **30–60 FPS:** considered the gold standard for applications requiring direct user interaction (UI/UX) or basic driver-assistance systems.
- **> 60 FPS:** at this level the system serves not only human vision but also machine reaction times. A few milliseconds of delay can mean accidents or product defects (typical in autonomous vehicles and industrial inspection).
Currently, a real-time object detection system considered "good" in a commercial setting usually needs a stable 25–30 FPS on mid-to-high-end edge devices and on end-user interfaces (mobile/web). The 10–15 FPS level is only acceptable in the ultra-cheap segment or in applications that are extremely insensitive to timing.
## 2. Solutions and Lessons Learned

### Multithreading
The first idea was to separate Image Reading and Image Processing into two separate threads.
- **Thread A (Reader):** continuously reads the latest frame from the camera and pushes it into a fixed-size queue (size = 1). If the queue is full, the old frame is discarded and overwritten with the new one.
- **Thread B (Processor):** takes frames from the queue and processes them.
Result: Latency decreased significantly because we actively discard old frames (Frame Dropping). However, the code became complex (Race conditions, Thread safety). Furthermore, the buffer could still fill up at the driver/FFmpeg layer before reaching our Python thread.
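The reader thread described above can be sketched with the standard library alone. This is a minimal version under my own naming (`FrameReader` is not from any library), using a size-1 `queue.Queue` where the reader evicts the stale item before inserting the new one:

```python
import queue
import threading

class FrameReader(threading.Thread):
    """Thread A: keep only the newest frame available for the consumer."""

    def __init__(self, source):
        super().__init__(daemon=True)
        self.source = source                 # any iterable of frames
        self.latest = queue.Queue(maxsize=1)

    def run(self):
        for frame in self.source:
            try:
                self.latest.get_nowait()     # drop the stale frame, if any
            except queue.Empty:
                pass
            self.latest.put(frame)           # slot is free now, so this never blocks

# Thread B simply calls reader.latest.get() and always receives a recent frame.
```

Because the reader is the only thread that puts items, emptying the slot right before `put` guarantees the insert never blocks, which is exactly the frame-dropping behavior we want.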
### FFmpeg CLI vs. GStreamer
I decided to bypass the default OpenCV wrapper to intervene deeper into the protocol.
**Testing FFmpeg (Subprocess):**
- Used `subprocess` to call `ffmpeg`, forcing UDP transport and the `nobuffer` flag.
- **Result:** extremely low latency (almost real time).
- **Problem:** FFmpeg's philosophy is optimized for throughput (process as much as possible with a large buffer). Forcing it to run low-latency requires configuring many parameters, and piping raw data into Python is quite manual.
```python
cmd = [
    "ffmpeg",
    "-rtsp_transport", "udp",            # Use UDP to reduce delay
    "-fflags", "nobuffer",               # Disable internal buffering
    "-flags", "low_delay",               # Enable low-latency mode
    "-reorder_queue_size", "0",          # Do not reorder packets
    "-use_wallclock_as_timestamps", "1",
    "-i", RTSP_URL,
    "-f", "rawvideo",                    # Output raw frames
    "-pix_fmt", "bgr24",                 # Pixel format OpenCV expects
    "-",
]
```

**Switching to GStreamer (Industry Standard):**
- GStreamer operates as a modular pipeline, allowing deep intervention at each link (element): `source → demux → parse → decode → sink`. Notably, it supports zero-copy (limiting data copies between CPU/GPU) better than FFmpeg.
- Major camera manufacturers (like Hikvision) and NVIDIA (DeepStream) build their SDKs on pipeline platforms similar to GStreamer to leverage hardware decoding.
- GStreamer is the optimal choice for low latency, but the learning curve is quite steep.
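Before wiring GStreamer into Python, it is worth validating the element chain from the command line. A hedged example (the URL is a placeholder, and available elements depend on your installed plugins):

```bash
gst-launch-1.0 rtspsrc location="rtsp://admin:pass@ip:554/stream" latency=0 ! \
    rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! autovideosink sync=false
```

If this command shows live video, the same chain should work inside `cv2.VideoCapture`, with `appsink` replacing `autovideosink`.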
## 3. Dependency Hell When Combining GStreamer and OpenCV
This is the part that took me the most time. Combining GStreamer + Python + OpenCV is a nightmare for beginners.
- **Conda:** I tried installing GStreamer inside Conda. The result was a disaster: library packages conflicted constantly.
- **OpenCV-Python issue:** the `pip install opencv-python` package on PyPI does NOT support the GStreamer backend by default. With this package the code will run, but the pipeline cannot be opened.
- **Ultralytics (YOLO):** installing `ultralytics` automatically pulls in the pre-built `opencv-python`, overwriting and breaking the GStreamer environment I had set up.
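You can check which backends your `cv2` build supports by inspecting `cv2.getBuildInformation()`, a plain-text report containing a line like `GStreamer: YES (1.20.3)` or `GStreamer: NO`. A small parser (my own helper, demonstrated on sample report snippets so it is self-contained):

```python
def has_gstreamer(build_info: str) -> bool:
    """Return True if an OpenCV build report lists GStreamer as enabled."""
    for line in build_info.splitlines():
        if "GStreamer" in line:
            return "YES" in line
    return False

# In practice you would pass cv2.getBuildInformation();
# here we feed it sample snippets:
print(has_gstreamer("  Video I/O:\n    GStreamer:  YES (1.20.3)"))   # True
print(has_gstreamer("  Video I/O:\n    GStreamer:  NO"))             # False
```

Running this check right after every `pip install` saved me from silently losing the GStreamer backend again.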
Solution:
Instead of using Conda or Docker (debugging Docker image builds takes a long time), the fastest and most stable solution on Linux is to use `venv` combined with system packages:
```bash
# 1. Install GStreamer and OpenCV system-wide (Ubuntu)
sudo apt-get install python3-opencv libgstreamer1.0-dev ...

# 2. Create the venv with --system-site-packages to inherit system libraries
python3 -m venv myenv --system-site-packages
source myenv/bin/activate

# 3. Note: if installing ultralytics, check carefully that it does not overwrite opencv
```

## 4. Optimal Pipeline Architecture (Stream-to-Stream)
To achieve the highest performance, I changed the architecture:
- **Input:** GStreamer pipeline (RTSP → appsink).
- **Process:** OpenCV + YOLO.
- **Output:** instead of `imshow` (which blocks the main thread), I stream the results to a local RTSP server.
I used MediaMTX as the RTSP server (running via Docker).
### Speed Limit Analysis of Each Component (FPS Hierarchy)
| Component | Actual Speed / Limit | Notes |
|---|---|---|
| Camera (Sensor) | 15 FPS | Physical limit of the Tapo camera (720p, RTSP stream) |
| Inference (YOLO11n) | ≈550–700 FPS (on RTX A4000, 640×640) | Latency per frame ≈ 1.4–1.8 ms (measured by Ultralytics benchmark) |
| Encoding + Network | > 200 FPS | x264enc preset ultrafast + zerolatency easily handles 60 FPS |
| Display / Screen | 60 FPS (60 Hz) | Common refresh rate limit of screens and human eyes |
Conclusions from the table above:
- The slowest component (bottleneck) is the 15 FPS camera: the entire system cannot run faster than 15 FPS, no matter how fast the rest is.
- Inference takes only ≈1.5 ms/frame, while a new frame arrives only every ≈66 ms, so the AI is idle about 98% of the time.
- Therefore, in this case complex multithreading/multiprocessing is not needed, yet real-time behavior and very low latency (< 100 ms end-to-end) are still achieved.
- Only when upgrading to a 60 FPS camera or a heavier model (YOLO11m/x, RVT, etc.) does it become necessary to separate processes or use a queue + worker to avoid frame drops.
Important Note: The final system FPS is always equal to the lowest FPS in the pipeline. Here, it is the sensor's 15 FPS.
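The two conclusions above reduce to simple arithmetic. A quick sketch using the figures from the table (the helper names are mine):

```python
def system_fps(stage_fps: dict) -> float:
    """The pipeline can never run faster than its slowest stage."""
    return min(stage_fps.values())

def idle_fraction(inference_ms: float, frame_interval_ms: float) -> float:
    """Fraction of each frame period the GPU spends waiting for the next frame."""
    return 1 - inference_ms / frame_interval_ms

stages = {"camera": 15, "inference": 650, "encode+network": 200, "display": 60}
print(system_fps(stages))                    # 15 -> the camera is the bottleneck
print(idle_fraction(1.5, 1000 / 15))         # ≈ 0.9775 -> GPU waits ~98% of each period
```

This is why heroic parallelism buys nothing here: the GPU is starved for input, not overloaded.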
### The Final Code
Here is the code combining the OpenCV wrapper with the GStreamer pipeline, eliminating the latency:
```python
import cv2
from ultralytics import YOLO


def main():
    # Camera input and local RTSP server output
    rtsp_url = "rtsp://admin:pass@192.168.1.50:554/stream"
    output_url = "rtsp://localhost:8554/stream"

    model = YOLO("checkpoints/yolo11n.pt")

    # --- INPUT PIPELINE ---
    # latency=0: try to reduce network buffer delay
    # appsink sync=false drop=true max-buffers=1:
    #   -> This is the key! Keep only the very latest frame, discard all old frames.
    input_pipeline = (
        f"rtspsrc location={rtsp_url} latency=0 ! queue ! "
        f"rtph264depay ! h264parse ! avdec_h264 ! "
        f"videoconvert ! appsink sync=false drop=true max-buffers=1"
    )
    cap = cv2.VideoCapture(input_pipeline, cv2.CAP_GSTREAMER)
    if not cap.isOpened():
        print("Error: Cannot open input pipeline")
        return

    # Get stream parameters
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 25

    # --- OUTPUT PIPELINE ---
    # appsrc -> x264enc -> rtspclientsink
    # tune=zerolatency: optimize the encoder for real-time
    # speed-preset=ultrafast: sacrifice compression for speed
    output_pipeline = (
        "appsrc is-live=true ! "
        "videoconvert ! video/x-raw,format=I420 ! "
        "x264enc tune=zerolatency bitrate=2000 speed-preset=ultrafast key-int-max=30 ! "
        f"rtspclientsink location={output_url}"
    )
    writer = cv2.VideoWriter(
        output_pipeline,
        cv2.CAP_GSTREAMER,
        0, fps, (width, height), True
    )
    if not writer.isOpened():
        print("Error: Cannot open output pipeline")
        return

    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break

            # YOLO inference
            results = model(frame, verbose=False)
            annotated_frame = results[0].plot()

            # Stream the output (no GUI rendering)
            writer.write(annotated_frame)
    except KeyboardInterrupt:
        print("Stopping...")
    finally:
        cap.release()
        writer.release()


if __name__ == "__main__":
    main()
```

The result is very promising: latency dropped below 100 ms, with almost no significant overhead left.
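To see where the remaining milliseconds go, it helps to time each stage of the loop. A minimal harness (my own sketch, shown here with a dummy workload standing in for capture/inference/encode):

```python
import time
from collections import defaultdict

class StageTimer:
    """Accumulate wall-clock time per pipeline stage across loop iterations."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def measure(self, stage, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        self.totals[stage] += time.perf_counter() - start
        self.counts[stage] += 1
        return result

    def report(self):
        """Mean milliseconds per call, per stage."""
        return {s: 1000 * self.totals[s] / self.counts[s] for s in self.totals}

# Example with a dummy stage; in the real loop you would wrap
# cap.read(), model(frame), and writer.write(frame) the same way:
timer = StageTimer()
for _ in range(5):
    timer.measure("inference", lambda: sum(range(10_000)))
print(timer.report())   # mean milliseconds per stage
```

In the real loop, `frame = timer.measure("capture", lambda: cap.read())` and friends give a per-stage breakdown without a profiler.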
### A Few Risks to Note
The pipeline above works very well in a LAN environment because the network is clean, jitter is low, and there is almost no packet loss. However, if deployed later to weak WiFi, multi-hop networks, or the Internet, some issues will appear that need consideration:
- **`latency=0`:** not all cameras can handle this. GStreamer disables the entire jitter buffer, which gives extremely low latency, but on an unstable network it easily causes stuttering or frame loss.
- **`max-buffers=1 drop=true`:** gives extremely low latency, but in return critical events might be missed if inference occasionally slows down.
- **UDP:** fast but prone to packet loss. It's fine on a LAN, but over WAN/WiFi with high interference the output will stutter or skip frames.
- **`ultrafast` + `zerolatency`:** this x264 combination produces a very high bitrate, demanding a strong network. On a weak Internet connection it can cause congestion and actually increase latency.
## 5. Conclusion
Building a real-time Object Detection pipeline with low latency is not as simple as initially imagined. Through this process, I learned many valuable lessons about:
- Understanding buffering mechanisms and latency in video streaming.
- Using GStreamer to control the pipeline in detail.
- Setting up a suitable development environment to avoid library conflicts.
- Designing a system architecture that fits the hardware limits and practical requirements.
I hope these notes are helpful to anyone struggling with building real-time video processing systems. Good luck!
Nguyen Xuan Hoa
nguyenxuanhoakhtn@gmail.com