Enmin Zhong

Computer Vision Researcher

GTI · Universidad Politécnica de Madrid · Madrid, Spain

Researcher by day, builder by curiosity.

Egocentric video · Embodied intelligence · VLM for edge devices · 3D printing & NFC

Currently Building

EgoAgent

Building a wearable egocentric AI agent on a Raspberry Pi 5 with the Hailo AI HAT+ 2, from hardware assembly to real-time first-person vision-language interaction.

Latest update

Assembly: Raspberry Pi 5 + AI HAT+ 2

Follow the build

Publications

  • 📄

    What's next?

    Work in progress — stay tuned.

  • CV4Animals Workshop, CVPR 2024 · 2024

    AnimalMotionCLIP: Embedding Motion in CLIP for Animal Behavior Analysis

    Enmin Zhong, Carlos R. Del-Blanco, Daniel Berjón, Fernando Jaureguizar, Narciso García

    We extend CLIP for animal behavior recognition by interleaving video frames with optical flow, adding motion awareness to a model designed for static images. We compare multiple temporal aggregation strategies (dense, semi-dense, sparse) and achieve state-of-the-art results on the Animal Kingdom dataset.

  • MDPI Sensors · 2023

    Real-Time Monocular Skeleton-Based Hand Gesture Recognition Using 3D-Jointsformer

    Enmin Zhong, Carlos R. Del-Blanco, Daniel Berjón, Fernando Jaureguizar, Narciso García

    Automatic hand gesture recognition in video sequences has widespread applications, ranging from home automation to sign language interpretation and clinical operations. We propose a hybrid approach combining 3D Convolutional Neural Networks (3D-CNNs) and Transformers: a 3D-CNN computes high-level semantic skeleton embeddings that capture local spatial and temporal characteristics, while a Transformer with self-attention efficiently captures long-range temporal dependencies. On the Briareo and Multimodal Hand Gesture datasets, the model achieves accuracies of 95.49% and 97.25%, respectively, while running in real time on a standard CPU.

Featured Projects

Side projects at the intersection of AI and the physical world.