Research

My work at GTI-UPM, Universidad Politécnica de Madrid.

Research Lines

Egocentric Video Understanding
active

First-person video perception and action recognition from wearable cameras, focused on understanding human activity from the camera wearer's perspective.

VLM Optimization for Edge Devices
active

Quantization and efficient inference of vision-language models on constrained hardware such as Raspberry Pi boards, embedded systems, and other resource-limited devices.

Embodied Intelligence
active

Connecting perception models to physical-world interaction and action understanding, bridging the gap between what a model sees and what it can do.

Publications

CV4Animals Workshop, CVPR 2024 · 2024

AnimalMotionCLIP: Embedding Motion in CLIP for Animal Behavior Analysis

Enmin Zhong, Carlos R. Del-Blanco, Daniel Berjón, Fernando Jaureguizar, Narciso García

We extend CLIP for animal behavior recognition by interleaving video frames with optical flow, adding motion awareness to a model designed for static images. Multiple temporal aggregation strategies (dense, semi-dense, sparse) are compared, achieving state-of-the-art results on the Animal Kingdom dataset.
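
The interleaving idea can be illustrated with a minimal sketch. It is not the paper's implementation: the helper names (`sample_indices`, `interleave`), the string placeholders standing in for RGB frames and optical-flow fields, and the stride values assigned to the dense, semi-dense, and sparse settings are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def sample_indices(num_frames: int, stride: int) -> List[int]:
    """Pick every `stride`-th frame index; stride=1 keeps all frames."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, num_frames, stride))


def interleave(frames: Sequence[T], flows: Sequence[T]) -> List[T]:
    """Alternate items: [frame_0, flow_0, frame_1, flow_1, ...]."""
    if len(frames) != len(flows):
        raise ValueError("expected one flow field per frame")
    mixed: List[T] = []
    for frame, flow in zip(frames, flows):
        mixed.append(frame)
        mixed.append(flow)
    return mixed


# Hypothetical strides standing in for the three sampling densities;
# the paper's actual definitions may differ.
STRIDES = {"dense": 1, "semi_dense": 2, "sparse": 4}

# Placeholder clip: strings stand in for image tensors and flow fields.
frames = [f"rgb_{i}" for i in range(8)]
flows = [f"flow_{i}" for i in range(8)]

sequences: Dict[str, List[str]] = {}
for name, stride in STRIDES.items():
    idx = sample_indices(len(frames), stride)
    sequences[name] = interleave([frames[i] for i in idx], [flows[i] for i in idx])
```

Under these assumed strides, each entry of `sequences` is the token order an image encoder would see for that setting, with appearance and motion inputs alternating along the temporal axis.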

PDF · Poster · Project

MDPI Sensors · 2023

Real-time monocular skeleton-based hand gesture recognition using 3D-Jointsformer

Enmin Zhong, Carlos R. Del-Blanco, Daniel Berjón, Fernando Jaureguizar, Narciso García

Automatic hand gesture recognition in video sequences has widespread applications, ranging from home automation to sign language interpretation and clinical operations. A hybrid approach combining 3D Convolutional Neural Networks (3D-CNNs) and Transformers is proposed: a 3D-CNN computes high-level semantic skeleton embeddings capturing local spatial and temporal characteristics, while a Transformer with self-attention efficiently captures long-range temporal dependencies. Evaluation on the Briareo and Multimodal Hand Gesture datasets achieved accuracy scores of 95.49% and 97.25% respectively, with real-time performance on a standard CPU.
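
To show why self-attention suits long-range temporal dependencies, here is a minimal sketch of scaled dot-product self-attention over a short sequence of embeddings. It is a simplification rather than the 3D-Jointsformer code: it uses a single head, identity query/key/value projections (no learned weights), no positional encoding, and made-up toy vectors in place of the 3D-CNN skeleton embeddings.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(embeddings: List[Vector]) -> List[Vector]:
    """Row t holds how strongly time step t attends to every time step."""
    scale = math.sqrt(len(embeddings[0]))
    weights = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in embeddings
        ]
        weights.append(softmax(scores))
    return weights


def self_attention(embeddings: List[Vector]) -> List[Vector]:
    """Replace each time step with an attention-weighted mix of all steps."""
    weights = attention_weights(embeddings)
    dim = len(embeddings[0])
    return [
        [sum(w * value[d] for w, value in zip(row, embeddings)) for d in range(dim)]
        for row in weights
    ]


# Toy embeddings (assumed values): steps 0 and 2 share one pattern,
# steps 1 and 3 share another.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(embeddings)
attended = self_attention(embeddings)
```

In this toy sequence, step 0 attends to step 2 as strongly as to itself and more strongly than to its immediate neighbour, step 1, because the weights depend on content similarity rather than temporal distance.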

PDF · Code