Hirokatsu KATAOKA, Ph.D.
National Institute of Advanced Industrial Science and Technology (AIST), Japan


Human Action Recognition without Human

The objective of this work is to evaluate "human action recognition without human". Motion representation is frequently discussed in human action recognition, and we have examined several sophisticated options, such as dense trajectories (DT) and the two-stream convolutional neural network (CNN). However, some features from the background can be too strong, as shown in recent studies on human action recognition. Therefore, we considered whether a background sequence alone can classify human actions in current large-scale action datasets (e.g., UCF101). In this paper, we propose a novel concept for human action analysis named "human action recognition without human". An experiment clearly shows the effect of the background sequence on understanding an action label. The paper was accepted to the ECCV 2016 BNMW workshop.

Motion Representation with Acceleration Images

Temporal differentiation provides an extremely important cue for motion representation. We have applied first-order differentiation of positional information, namely velocity; moreover, we believe that second-order differentiation, namely acceleration, is also a significant feature for motion representation. However, an acceleration image based on a typical optical flow contains motion noise, and such images have not previously been employed because the noise is too strong to capture an effective motion feature in an image sequence. On the other hand, recent convolutional neural networks (CNNs) are robust against input noise. In this paper, we employ an acceleration stream in addition to the spatial and temporal streams of the two-stream CNN. We clearly show the effectiveness of adding the acceleration stream to the two-stream CNN. The paper was accepted to the ECCV 2016 BNMW workshop.
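As a minimal sketch of the idea (not the paper's exact pipeline), an acceleration image can be approximated as the temporal difference of consecutive optical-flow fields; the function name and toy inputs below are illustrative assumptions:

```python
import numpy as np

def acceleration_image(flow_t, flow_t1):
    """Second-order motion cue: difference of consecutive optical-flow fields.

    flow_t, flow_t1: (H, W, 2) arrays of per-pixel (dx, dy) flow vectors
    (velocity). Their temporal difference approximates per-pixel acceleration.
    """
    accel = flow_t1 - flow_t                   # per-pixel acceleration vectors
    magnitude = np.linalg.norm(accel, axis=2)  # scalar map fed to the stream
    return accel, magnitude

# toy example: constant flow between frames -> zero acceleration
f0 = np.ones((4, 4, 2))
f1 = np.ones((4, 4, 2))
_, mag = acceleration_image(f0, f1)
```

In practice the flow fields would come from an optical-flow estimator, and the resulting acceleration maps would be stacked and fed to the acceleration stream alongside the spatial and temporal streams.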

Transitional Action Recognition

We address transitional actions as a class between actions. Transitional actions should be useful for producing short-term action predictions while an action is in transition. However, transitional action recognition is difficult because actions and transitional actions partially overlap each other. To deal with this issue, we propose a subtle motion descriptor (SMD) that identifies the subtle differences between actions and transitional actions. The two primary contributions of this paper are as follows: (i) defining transitional actions for short-term action predictions that permit earlier predictions than early action recognition (see Figure below), and (ii) utilizing a convolutional neural network (CNN) based SMD to present a clear distinction between actions and transitional actions. The paper was accepted to BMVC 2016.

Dominant Codewords Selection

In this paper, we propose a framework for recognizing human activities that uses only in-topic dominant codewords and a mixture of intertopic vectors. Latent Dirichlet allocation (LDA) is used to develop approximations of human motion primitives; these are mid-level representations, and they adaptively integrate dominant vectors when classifying human activities. In LDA topic modeling, action videos (documents) are represented by a bag-of-words (input from a dictionary) based on improved dense trajectories [Wang+, ICCV2013]. The output topics correspond to human motion primitives, such as finger movement or subtle leg motion. We eliminate impurities, such as missed tracking or changing lighting conditions, in each motion primitive. The assembled vector of motion primitives is an improved representation of the action. The paper appeared at the CVPR 2016 LAP workshop.
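A simplified sketch of the dominant-codeword idea, under the assumption that LDA has already produced a topic-codeword distribution (the function name and toy numbers are illustrative, not the paper's implementation):

```python
import numpy as np

def dominant_codeword_mask(topic_word, top_m):
    """Keep only the top-m dominant codewords of each LDA topic.

    topic_word: (K, V) topic-codeword distribution estimated by LDA
                (K topics / motion primitives, V dictionary codewords).
    Returns a boolean (V,) mask that is True for codewords dominant in at
    least one topic; the remaining "impure" codewords are suppressed.
    """
    K, V = topic_word.shape
    mask = np.zeros(V, dtype=bool)
    for k in range(K):
        top = np.argsort(topic_word[k])[::-1][:top_m]  # strongest codewords
        mask[top] = True
    return mask

# toy example: 2 topics over a 4-codeword dictionary
topic_word = np.array([[0.5, 0.3, 0.1, 0.1],
                       [0.1, 0.1, 0.2, 0.6]])
mask = dominant_codeword_mask(topic_word, top_m=1)
bow = np.array([3., 1., 2., 5.])  # bag-of-words histogram of one video
filtered = bow * mask             # impure codewords zeroed out
```

The filtered bag-of-words vector would then be passed to the activity classifier in place of the raw histogram.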

Semantic Change Detection

This research proposes the concept of semantic change detection, which involves intuitively inserting semantic meaning into detected change areas. The problem to be solved consists of two parts: semantic segmentation and change detection. In order to solve this problem and obtain a high level of performance, we propose an improvement of the hypercolumns representation, hereafter called hypermaps, which effectively uses convolutional maps obtained from convolutional neural networks (CNNs). We applied our method to the TSUNAMI Panoramic Change Detection dataset and re-annotated the changed areas of the dataset with semantic classes. The results show that our multi-scale hypermaps provide outstanding performance on the re-annotated TSUNAMI dataset. The paper is available as an arXiv preprint.
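A hypercolumn-style representation can be sketched as upsampling convolutional maps from several layers to a common resolution and concatenating them along channels. The following is a minimal illustration (nearest-neighbor upsampling, integer scale factors assumed), not the paper's hypermaps implementation:

```python
import numpy as np

def hypermap(conv_maps, target_hw):
    """Combine multi-scale convolutional maps into one per-pixel feature map.

    conv_maps: list of (h, w, c) activation maps from different CNN layers;
               target_hw must be an integer multiple of each (h, w).
    Returns an (H, W, sum_of_channels) map: each pixel gets a stacked
    feature vector drawn from all layers (a hypercolumn).
    """
    Ht, Wt = target_hw
    upsampled = []
    for m in conv_maps:
        h, w, c = m.shape
        # nearest-neighbor upsampling by pixel repetition
        up = np.repeat(np.repeat(m, Ht // h, axis=0), Wt // w, axis=1)
        upsampled.append(up)
    return np.concatenate(upsampled, axis=2)

# toy example: a coarse 2x2x3 map and a fine 4x4x2 map -> 4x4x5 hypermap
m1 = np.arange(12.0).reshape(2, 2, 3)
m2 = np.arange(32.0).reshape(4, 4, 2)
combined = hypermap([m1, m2], target_hw=(4, 4))
```

The stacked per-pixel vectors can then be classified into semantic change classes.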

CNN Feature Evaluation

In this paper, we evaluate convolutional neural network (CNN) features using the AlexNet architecture developed by [9] and the very deep convolutional network (VGGNet) architecture developed by [16]. To date, most CNN researchers have employed the last layers before the output, which are extracted from the fully connected feature layers. However, since the most effective feature representation is likely to depend on the problem, this study also evaluates the convolutional layers adjacent to the fully connected layers, in addition to executing simple tuning for feature concatenation (e.g., layer 3 + layer 5 + layer 7) and transformation using tools such as principal component analysis. In our experiments, we carried out detection and classification tasks using the Caltech 101 and Daimler Pedestrian Benchmark datasets. The paper is available as an arXiv preprint.
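The concatenation-plus-PCA step can be sketched as follows, assuming the per-layer activations have already been extracted and flattened into matrices (the function name and random placeholder features are illustrative):

```python
import numpy as np

def concat_and_pca(features, n_components):
    """Concatenate per-layer CNN features and reduce them with PCA.

    features: list of (N, D_i) arrays, e.g. activations of layers 3, 5, 7
              for N images. PCA is computed via SVD on the centered matrix.
    Returns an (N, n_components) reduced feature matrix.
    """
    X = np.concatenate(features, axis=1)   # e.g. layer3 + layer5 + layer7
    X = X - X.mean(axis=0, keepdims=True)  # center before PCA
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T         # project onto top components

# placeholder activations standing in for real CNN layer outputs
rng = np.random.default_rng(0)
layer3 = rng.standard_normal((10, 6))
layer5 = rng.standard_normal((10, 4))
layer7 = rng.standard_normal((10, 5))
Z = concat_and_pca([layer3, layer5, layer7], n_components=3)
```

The reduced vectors would then be fed to a linear classifier such as an SVM for the detection and classification experiments.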

cvpaper.challenge in 2015

cvpaper.challenge [Twitter] [SlideShare] focuses on reading top-conference papers in the fields of computer vision, image processing, pattern recognition, and machine learning. In this challenge, we simultaneously read papers and create documents that make top-conference papers easy to understand. The first challenge was to completely read the CVPR2015 papers. The conference includes 602 papers on main topics such as CNN architectures, 3D, video processing, action recognition, and photography. More details are on the Twitter and SlideShare pages, where we regularly update paper information. The summary paper has recently been published as an arXiv preprint.

Fine-grained Walking Activity Recognition

This paper presents fine-grained walking activity recognition aimed at inferring pedestrian intention, which is an important topic for predicting and avoiding dangerous pedestrian activity. Fine-grained activity recognition distinguishes activities that differ only by subtle changes, such as walking in different directions. We believe a change in a pedestrian's activity is significant for grasping pedestrian intention. The dense trajectories (DT) method is employed for high-level recognition to capture such detailed differences. We evaluated our proposed approach on a self-collected dataset and a near-miss driving recorder (DR) dataset, covering several activities: crossing, walking straight, turning, standing, and riding a bicycle. Our proposal achieved 93.7% on the self-collected NTSEL traffic dataset and 77.9% on the near-miss DR dataset. The paper was presented at the IEEE Conference on Intelligent Transportation Systems (ITSC2015).

Activity Prediction

Over the years, human sensing techniques have been studied in the field of computer vision. Human tracking, posture estimation, activity recognition, and face recognition are examples, and these topics are being applied in the real world. However, previous computer vision techniques perform "post-event analysis". If we can predict the next activity, we can improve computer vision applications, for example by avoiding abnormal or dangerous behaviors and recommending the next activity. We therefore need to consider "pre-event analysis", which we believe is a current challenge for "future prediction" in computer vision. Our framework integrates computer vision and data analysis. The paper was presented at VISAPP2016.

Fine-grained Activity Recognition

We propose a novel feature descriptor, Extended Co-occurrence HOG (ECoHOG), and integrate it with dense point trajectories, demonstrating its usefulness in fine-grained activity recognition. This feature is inspired by the original Co-occurrence HOG (CoHOG), which is based on histograms of occurrences of pairs of image gradients. Instead of relying only on pure histograms, we accumulate the sum of gradient magnitudes of co-occurring pairs of image gradients. This gives importance to object boundaries and strengthens the difference between the moving foreground and the static background. We also couple ECoHOG with dense point trajectories extracted using optical flow from video sequences and demonstrate that they are extremely well suited for fine-grained activity recognition. Using our feature, we outperform state-of-the-art methods on this task and provide an extensive quantitative evaluation. The paper was presented at the Asian Conference on Computer Vision (ACCV2014).
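A simplified single-offset version of the idea can be sketched as follows; where CoHOG would add a count of 1 per co-occurring orientation pair, this ECoHOG-style variant adds the pair's gradient magnitudes instead (the function name and toy inputs are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def ecohog(mag, ori_bin, offset=(0, 1), n_bins=8):
    """Magnitude-weighted co-occurrence histogram of gradient orientations.

    mag:     (H, W) gradient magnitudes
    ori_bin: (H, W) quantized gradient orientations in [0, n_bins)
    offset:  (dy, dx) co-occurrence displacement between pixel pairs
    Each co-occurring orientation pair contributes the sum of its two
    gradient magnitudes, emphasizing strong object boundaries.
    """
    dy, dx = offset
    H, W = mag.shape
    hist = np.zeros((n_bins, n_bins))
    for y in range(max(0, -dy), min(H, H - dy)):
        for x in range(max(0, -dx), min(W, W - dx)):
            i, j = ori_bin[y, x], ori_bin[y + dy, x + dx]
            hist[i, j] += mag[y, x] + mag[y + dy, x + dx]
    return hist

# toy example: 2x2 patch, unit magnitudes, all gradients in bin 0
hist = ecohog(np.ones((2, 2)), np.zeros((2, 2), dtype=int))
```

In the full descriptor, histograms over many offsets would be computed per cell and concatenated, then pooled along the dense point trajectories.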

Human Action Recognition with Feature Integration

This paper presents an approach for real-time human activity recognition. Three different kinds of features (flow, shape, and a keypoint-based feature) are applied to activity recognition. We use random forests for feature integration and activity classification: a forest is created for each feature and performs as a weak classifier. The International Classification of Functioning, Disability and Health (ICF) proposed by the WHO is applied in order to set a novel class definition for activity recognition. Experiments on human activity recognition using the proposed framework show 99.2% (Weizmann action dataset), 95.5% (KTH human actions dataset), and 54.6% (UCF50 dataset) recognition accuracy at real-time processing speed. The feature integration and activity-class definition allow us to accomplish high-accuracy recognition that matches the state of the art in real time.
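The per-feature weak-classifier integration can be sketched as a late-fusion vote. This is a hard-voting simplification under assumed names; the paper's forests could equally average class probabilities rather than cast single votes:

```python
from collections import Counter

def integrate_votes(per_feature_predictions):
    """Late fusion: each per-feature forest acts as a weak classifier
    and votes for an activity class; the majority vote wins.

    per_feature_predictions: dict mapping a feature name to the class label
    predicted by the random forest trained on that feature
    (flow, shape, or keypoint-based).
    """
    votes = Counter(per_feature_predictions.values())
    return votes.most_common(1)[0][0]

# toy example: two of three feature-specific forests agree
label = integrate_votes({"flow": "walk", "shape": "walk", "keypoint": "run"})
```

Keeping one forest per feature lets a weak but fast classifier run on each cue independently, which suits the real-time constraint.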

Trajectory Analysis

We are accumulating trajectory data for many services (e.g., crowd analysis and tendency understanding). We now have more than 50,000,000 trajectories, and these data are becoming "big trajectory data". It is necessary to analyze trajectory data and extract information from it. In this research, we propose a method for creating a main trajectory map and clustering using large-scale trajectory data. We use both small and large amounts of data in order to show the effectiveness of big-data analysis. The paper was presented at the IAPR Conference on Machine Vision Applications (MVA2013).

Pedestrian Active Safety

In order to reduce traffic deaths, we are currently developing a pedestrian active safety (collision avoidance) system, which is able to detect pedestrians by means of an in-vehicle sensor and apply automatic braking. We are pursuing high-speed, highly accurate pedestrian detection using a monocular camera. In our study, symmetry judgment for selecting pedestrian candidate areas and an improved CoHOG for accurate pedestrian detection are proposed for an effective active safety system. The paper was published in the IEICE Transactions on Information and Systems.

Soccer Video Analysis

We are studying player tracking and locating techniques to put "soccer video analysis" into practice. Soccer players are localized using a particle filter and a classifier: the particle filter tracks each player, and the classifier detects and resamples the players' centers of gravity under occlusion. We use a homography to project the input video to a bird's-eye view. With these methods, we can capture each player's position, speed, and defense/offense area in order to analyze soccer scenes.
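The bird's-eye-view projection step can be sketched as applying a 3x3 homography to player positions in the image (the function name is an assumption; in practice the homography would be estimated from pitch-line correspondences):

```python
import numpy as np

def project_to_pitch(H, points):
    """Project image points onto the bird's-eye pitch plane via homography.

    H:      (3, 3) homography mapping image coordinates to pitch coordinates
    points: (N, 2) image positions (e.g. the players' foot points)
    """
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    proj = pts @ H.T
    return proj[:, :2] / proj[:, 2:3]                     # perspective divide

# toy example: the identity homography leaves points unchanged
out = project_to_pitch(np.eye(3), np.array([[100.0, 200.0]]))
```

Tracking in pitch coordinates makes speeds and defense/offense areas directly measurable in metres rather than pixels.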

Human Tracking using Main-parts Link Model

Tracking has been studied in the field of computer vision for many kinds of applications, such as visual surveillance, intelligent rooms, and sports video. Human tracking in real environments is a challenging problem due to various factors, including illumination variation, occlusion, and posture change. In our view, tracking divided body parts (we call this model the "Main-parts Link Model") is an effective method for real scenes. This model can efficiently represent postural changes under occlusion and a wide variation of postures. We use silhouette and edge features for optimization. Moreover, we can divide human situations into several classes (e.g., standing, walking, bending) from the linking information.

Copyright Hirokatsu Kataoka