Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data
Abstract
Activity recognition has shown impressive progress in recent years. However, the challenges of detecting fine-grained activities and understanding how they are combined into composite activities have been largely overlooked. In this work we approach both tasks and present a dataset which provides detailed annotations to address them. The first challenge is to detect fine-grained activities, which are defined by low inter-class variability and are typically characterized by fine-grained body motions. We explore how human pose and hands can help to approach this challenge by comparing two pose-based and two hand-centric features with state-of-the-art holistic features. To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts. We show the benefits of our hand-centric approach for fine-grained activity classification and detection. For composite activity recognition we find that decomposition into attributes allows sharing information across composites and is essential to attack this hard task. Using script data we can recognize novel composites without having training data for them.
Keywords
Activity recognition · Fine-grained recognition · Script data · Hand detection
1 Introduction
While impressive progress has been made, we argue that most works address only a part of the overall activity recognition challenge. Many application scenarios, such as human–robot interaction or elderly care, require understanding complex activities (e.g. does the person prepare food?), consisting of multiple fine-grained activities and object manipulations (e.g. is it fried and what is in it?). Frequently it is important to recognize both the individual steps and the high-level composite activities, e.g. as we have shown for the task of video description (Rohrbach et al. 2014). Consequently we approach both problems in this work: recognizing fine-grained activities and recognizing composite activities. Fine-grained activities are defined as a set of activities which are visually very similar, i.e. have a low inter-class variability. Composite activities are activities which can be temporally decomposed into multiple shorter activities, i.e. they consist of multiple steps. We note that the two terms are not exclusive, i.e. composite activities can also be fine-grained; in fact some of our composites are very similar. However, in this work we consider composite activities which consist of fine-grained activities.
When surveying the field we also noticed a lack of datasets that allow pursuing the challenges of fine-grained and composite activity recognition. Specifically, this is reflected in the following limiting factors of current benchmark databases. First, while datasets with large numbers of activities exist, the typical inter-class variability is high. This seems rather unrealistic for many domains such as surveillance or elderly care, where we need to differentiate between consequentially different but visually similar activities, e.g. hug someone versus hold someone, or throw in garbage versus put in drawer. Second, the activities considered so far are full-body activities, e.g. jumping or running. This appears rather untypical for many applications where we want to differentiate between activities with smaller motions that are frequently hand-centric. Consider e.g. the cutting activity in domains such as cooking (see Fig. 1), handicraft work, or surgery, as well as different repairing activities in the domain of housekeeping or machine maintenance, with subtle differences in motion and low inter-class variability. As a third limitation we found that many available databases contain videos of only a few seconds in length and focus on simple basic-level activities such as walking or drinking. In contrast, the recognition of longer-term, complex, and composite activities such as assembling furniture, food preparation, or surgery has rarely been addressed in computer vision. Notable exceptions exist (see Sect. 2), even though these have other limiting factors such as a small number of classes.
In this work, which is an extension of our original publications (Rohrbach et al. 2012a, b), we recorded, annotated, and publicly released a large-scale dataset in a kitchen scenario which addresses the discussed limitations. This allows us to work on the challenges of fine-grained and composite activity recognition as follows.
Table 1 Overview of activity recognition datasets
Dataset | cls, det | Classes | Clips/videos | Subjects | # Frames | Resolution |
---|---|---|---|---|---|---|
Full body pose datasets | ||||||
KTH (Schuldt et al. 2004) | cls | 6 | 2391 | 25 | \(\approx \)200,000 | 160 \(\times \) 120 |
USC gestures (Natarajan and Nevatia 2008) | cls | 6 | 400 | 4 | | 740 \(\times \) 480 |
MSR action (Yuan et al. 2009) | cls, det | 3 | 63 | 10 | | 320 \(\times \) 240 |
Movie and web video datasets | ||||||
Hollywood2 (Marszalek et al. 2009) | cls | 12 | 1707/69 | |||
UCF 101 (Soomro et al. 2012) | cls | 101 | 13,320 | | \(\approx \)2,400,000 | 320 \(\times \) 240 |
Sports-1M (Karpathy et al. 2014) | cls | 487 | 1.1 mil | |||
HMDB51 (Kuehne et al. 2011) | cls | 51 | 6766 | | | Height: 240 |
ASLAN (Kliper-Gross et al. 2012) | cls | 432 | 3631/1571 | |||
Coffee and Cigarettes (Laptev and Pérez 2007) | det | 2 | 264/11 | |||
High Five (Patron-Perez et al. 2010) | cls, det | 4 | 300/23 | |||
MPII Movie Description (Rohrbach et al. 2015) | cls, det | | 68,327/94 | | | 1920 \(\times \) 1080 |
Surveillance datasets | ||||||
PETS 2007 (Ferryman 2007) | det | 3 | 10 | | 32,107 | 768 \(\times \) 576 |
UT interaction (Ryoo and Aggarwal 2009) | cls, det | 6 | 120 | 6 | ||
VIRAT (Oh et al. 2011) | det | 23 | | 17 | | 1920 \(\times \) 1080 |
Assisted daily living datasets | ||||||
TUM Kitchen (Tenorth et al. 2009) | det | 10 | 20/4 | | 36,666 | 384 \(\times \) 288 |
CMU-MMAC (De la Torre et al. 2009) | cls, det | \(>\)130 | | 26 | | 1024 \(\times \) 768 |
URADL (Messing et al. 2009) | cls | 17 | 150/30 | 5 | \(\le \) 50,000 | 1280 \(\times \) 720 |
MPII Cooking 2 (our dataset) | cls, det | 67/59 | 14,105/273 | 30 | 2,881,616 | 1624 \(\times \) 1224 |
For recognizing composite activities, state-of-the-art methods, which build on discriminative learning from low-level activity features, experience scalability issues due to the typically highly diverse composite activities and little training data. A promising approach towards scaling activity recognition methods to a large number of complex activities is to use intermediate representations that are shared and transferred across activities by exploiting their compositional nature. We exploit this technique and propose building on an attribute-based representation, with attributes denoting the fine-grained activities and the participating objects. For example in Fig. 1 the composite activity preparing scrambled egg shares the attributes stir and spatula with the composite activity preparing onion and the attributes open and egg with the composite activity separating egg. Instead of learning a holistic model for each composite activity we learn models for a large set of attributes shared across composite activity classes. Such approaches have been shown effective to recognize previously unseen object categories (Lampert et al. 2013) and have also been applied to activity recognition (Liu et al. 2011). A major challenge to recognize everyday activities is that these composite activities can often be performed in a wide variety of ways, and it is practically infeasible to create a visually annotated training set with all possible alternatives. Instead, we collect a large number of textual descriptions (scripts) for a composite activity to compute the association strength between attributes and composite activities. Using this script data we can not only handle the inherent variation of composites but also recognize unseen composite activities. As illustrated in Fig. 1, the attributes in red are determined to be important for preparing scrambled eggs using script data and can be transferred from known composites such as separating egg and preparing onion.
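The attribute-based transfer described above can be sketched as follows. All composite names, attributes, and association weights in this toy example are invented for illustration, and the linear weighting is a simplified stand-in for the script-derived statistics used in the paper; it only shows how attribute classifiers can be shared across composites and reused for unseen ones.

```python
# Toy sketch of attribute-based composite recognition. A composite is
# scored as the association-weighted sum of per-attribute classifier
# scores, so attribute models are shared across composites and an
# unseen composite only needs (textual) attribute associations,
# not visual training data. All values below are hypothetical.

associations = {  # composite -> {attribute: script-derived weight}
    "preparing scrambled eggs": {"stir": 0.9, "spatula": 0.8, "open": 0.7, "egg": 0.9},
    "separating egg": {"open": 0.8, "egg": 0.9},
    "preparing onion": {"stir": 0.6, "spatula": 0.5, "cut": 0.9},
}

def score_composites(attribute_scores, associations):
    """Score every composite from per-attribute classifier outputs."""
    return {
        composite: sum(w * attribute_scores.get(a, 0.0) for a, w in weights.items())
        for composite, weights in associations.items()
    }

# Hypothetical attribute classifier outputs for one test video.
video = {"stir": 0.7, "spatula": 0.6, "open": 0.2, "egg": 0.3, "cut": 0.1}
scores = score_composites(video, associations)
best = max(scores, key=scores.get)
print(best)  # -> preparing scrambled eggs
```

Because the weights come from text alone, replacing the `associations` entry for a composite with weights mined from its scripts is enough to score that composite without any visual training examples.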
Our main contributions are as follows. First, we propose several hand- and pose-based activity recognition approaches to recognize fine-grained activities and their object participants. We benchmark them together with state-of-the-art activity recognition features on our dataset. Second, we contribute an attribute-based approach which shares knowledge across composite activities, exploits textual script data to handle their large variability, and allows transfer to unseen composite activities. Third, we recorded and annotated a video dataset called MPII Cooking 2. It provides challenges for classification and detection of fine-grained activities and their participants, human pose estimation, and composite activity recognition (optionally) using script data. In addition to activity recognition, which is the focus of this work, the dataset is also being used for 3D human pose estimation (Amin et al. 2013), multi-frame pose estimation (Cherian et al. 2014), grounding semantic similarities of natural language sentences in video (Regneri et al. 2013), and for generating natural language descriptions (Rohrbach et al. 2013b, 2014).
The remainder of this article is structured as follows. We first give an extensive review of related datasets, activity recognition approaches, and the use of text data for visual recognition in Sect. 2. Then we introduce our MPII Cooking 2 dataset in Sect. 3, which we benchmark in the subsequent sections. In Sect. 4 we quantitatively compare our pose estimation and hand detection with related work on the pose challenge of our dataset. Using the pose estimates and hand detections we define several visual features and discuss fine-grained activity detection in Sect. 5. In Sect. 6 we present our approach to combining the fine-grained activities into composite activities and integrating script data. In Sect. 7 we evaluate fine-grained and composite activity recognition, and we conclude with the most important findings and directions for future work in Sect. 8.
2 Related Work
We first present an overview of the different video activity recognition datasets (Sect. 2.1) and then review recent approaches to activity recognition (Sect. 2.2), putting a focus on works which use human pose as a cue. Next we discuss works which use textual information for improved recognition of activities (Sect. 2.3). We conclude by relating them to our work (Sect. 2.4).
2.1 Activity Datasets
Even when excluding single image action datasets such as the Stanford-40 Action Dataset (Yao et al. 2011b) or the Pascal Action Classification Challenge (Everingham et al. 2011), the number of proposed activity datasets is quite large (Chaquet et al. (2013) survey 68 datasets). Here, we focus on the most important ones with respect to database size, usage, and similarity to our proposed dataset (see Table 1). We distinguish four broad categories of datasets: full body pose, movie and web, surveillance, and assisted daily living datasets—our dataset falls in the last category.
The full body pose datasets are defined by actors performing full body actions. KTH (Schuldt et al. 2004), USC gestures (Natarajan and Nevatia 2008), and similar datasets (Singh and Nevatia 2011) require classifying simple, mainly repetitive full body activities. The MSR actions dataset (Yuan et al. 2009) poses a detection challenge limited to three classes. In contrast to these full body pose datasets, our dataset contains more, and in particular fine-grained, activities.
The second category consists of movie clips or web videos with challenges such as partial occlusions, camera motion, and diverse subjects. UCF50 and similar datasets (Liu et al. 2009; Niebles et al. 2010; Rodriguez et al. 2008) focus on sport activities. Kuehne et al.’s evaluation suggests that these activities can already be discriminated by static joint locations alone (Kuehne et al. 2011). UCF50 has been extended to UCF 101 (Soomro et al. 2012), significantly increasing the number of categories to 101 and including 2.4 million frames at a rather low resolution of 320 \(\times \) 240. The Sports-1M dataset (Karpathy et al. 2014) exceeds all other datasets with respect to the number of clips (1.1 million) and categories (487 different sports), which are, however, only weakly labeled. Hollywood2 (Marszalek et al. 2009), HMDB51 (Kuehne et al. 2011), and ASLAN (Kliper-Gross et al. 2012) have very diverse activities. Especially HMDB51 (Kuehne et al. 2011) is an effort to provide a large scale database of 51 activities while reducing the database bias. Although it includes similar, fine-grained activities, such as shoot bow and shoot gun or smile and laugh, most classes have a high inter-class variability and the videos are low-resolution. ASLAN (Kliper-Gross et al. 2012) focuses on a larger number of activities but with little training data per category; the task is to identify similar videos rather than categorizing them. A significantly larger video collection is evaluated during the TRECVID challenge (Over et al. 2012). The 2012 challenge consisted of 291 h of short videos from the Internet Archive (archive.org) and more than 4000 h of multi-media (audio and video) data. The challenge covers different tasks, including semantic indexing and multi-media event recognition of 20 different event categories such as making a sandwich and renovating a home. Large parts of the data are, however, only available to the participants during the challenge.
Although our dataset is easier with respect to camera motion and background, it is challenging with respect to its smaller inter-class variability.
The datasets Coffee and Cigarettes (Laptev and Pérez 2007) and High Five (Patron-Perez et al. 2010) differ from the other movie datasets by promoting activity detection rather than classification. This is clearly a more challenging problem, as one not only has to classify a pre-segmented video but also to detect (or localize) an activity in a continuous video. As these datasets have a maximum of four classes, our dataset goes beyond them by distinguishing a large number of classes. The recent MPII Movie Description dataset (Rohrbach et al. 2015) does not annotate clips with category labels but with natural sentences, which are sourced from movie scripts and audio descriptions for the blind.
The third category of datasets is targeted towards surveillance. The PETS (Ferryman 2007) or SDHA2010 workshop datasets contain real world situations from surveillance cameras in shops, subway stations, or airports. They are challenging as they contain multiple people with high partial occlusion. The UT interaction dataset (Ryoo and Aggarwal 2009) requires distinguishing six different two-people interaction activities, such as punch or shake hands. The VIRAT dataset (Oh et al. 2011) is a recent attempt to provide a large scale dataset with 23 activities on nearly 30 h of video. Although the video is high-resolution, people are only 20 to 180 pixels high. Overall the surveillance activities are very different from ours, which are challenging with respect to fine-grained hand motion.
Next we discuss the domain of assisted daily living (ADL) datasets, which also includes our dataset. The University of Rochester Activities of Daily Living Dataset (URADL) (Messing et al. 2009) provides high-resolution videos of 10 different activities such as answer phone, chop banana, or peel banana. Although some activities are very similar, the videos are produced with a clear script and contain only one activity each. In the TUM Kitchen dataset (Tenorth et al. 2009) all subjects perform the same composite activity (setting a table) and rather similar actions with limited variation. Roggen et al. (2010) and De la Torre et al. (2009) present recent attempts to provide several hours of multi-modal sensor data (e.g. body worn acceleration and object location). Unfortunately, people and objects are (visually) instrumented, making the videos visually unrealistic. In the CMU-MMAC dataset (De la Torre et al. 2009) all subjects prepare the identical five dishes with very similar ingredients and tools. In contrast, our dataset contains 59 diverse dishes, where each subject uses different ingredients and tools in each dish. The authors also record an egocentric view. Similar to Farhadi et al. (2010), Fathi et al. (2011), and Stein and McKenna (2013), this camera view mainly shows hands and manipulated cooking ingredients. Also recorded in an egocentric view, Pirsiavash and Ramanan (2012) propose a dataset of 18 diverse daily living activities, not restricted to the cooking domain, recorded in different houses in non-scripted fashion.
Overall, our dataset fills the gap of a large database offering, on the one hand, a detection challenge for fine-grained activities and, on the other hand, a recognition challenge for highly variable composite activities.
2.2 Advances in Activity Recognition
Activity recognition for still images has been advanced e.g. by jointly modeling people and objects (Yao and Li 2012) or scenes and objects (Li and Li 2007). In the following we focus on recognizing activities in video, distinguishing three aspects: holistic features for activity recognition, exploiting body pose, and modelling the temporal structure of activities.
To create a discriminative feature representation of a video, many approaches first detect space-time interest points (Chakraborty et al. 2011; Laptev 2005) or sample them densely (Wang et al. 2009a) and then extract diverse descriptors in the image-time volume, such as histograms of oriented gradients (HOG) and histograms of oriented flow (HOF) (Laptev et al. 2008) or local trinary patterns (Yeffet and Wolf 2009). Messing et al. (2009) found improved performance by tracking Harris3D interest points (Laptev 2005). The state-of-the-art Dense Trajectories approach from Wang et al. (2013a) uses this idea: it tracks dense feature points and extracts strong video features around these tracks, namely HOG, HOF, and Motion Boundary Histograms (MBH, Dalal et al. 2006). They report state-of-the-art results on several datasets including KTH (Schuldt et al. 2004), UCF YouTube (Liu et al. 2009), Hollywood2 (Marszalek et al. 2009), and HMDB51 (Kuehne et al. 2011). Recently, Wang and Schmid (2013) improved their approach by removing background flow and by ensuring that detected humans do not contribute to the background motion estimation. Additionally they replace the bag-of-words (BoW) encoding with Fisher vectors. The computational effort of this approach can be significantly reduced by replacing dense flow with motion information from video compression (Kantorov and Laptev 2014). As an alternative to manually defined activity features, Taylor et al. (2010), Baccouche et al. (2011), Le et al. (2011), and Ji et al. (2013) use deep learning with convolutional neural networks to learn an activity feature representation. So far these approaches cannot match the manually defined Dense Trajectories, even when learning on a database of over 1 million videos (Karpathy et al. 2014).
Human body poses and their motion frequently characterize human activities and interactions. This has been exploited in Microsoft’s Kinect, which uses human pose as a game controller but relies on a depth sensor to recognize human pose (Shotton et al. 2011). Earlier work in pose-based activity recognition employed motion capture systems using physical on-body markers to reliably capture human poses, e.g. (Campbell and Bobick 1995). Such an approach is impractical for recording realistic data. Recently a number of hand- and pose-centric approaches have been proposed for activity recognition in more realistic video recordings (Fathi et al. 2011; Packer et al. 2012; Yao et al. 2011a; Sung et al. 2011; Raptis and Sigal 2013; Jhuang et al. 2013) as well as in static images (Yang et al. 2011; Yao and Li 2012). Packer et al. demonstrate impressive results in recognition of kitchen activities using body poses recovered from depth images. Fathi et al. (2011) propose a hand-centric approach for learning effective models of activities from egocentric video by observing regularities in hand-object interactions. Hand poses have been shown to facilitate extraction of appearance features for activity recognition in static images (Karlinsky et al. 2010). Pose-based models are effective for activity recognition when body poses can be estimated reliably, as e.g. in depth images (Packer et al. 2012; Sung et al. 2011). Mittal et al. (2011) and Gkioxari et al. (2013) aim for specialized representations for hands, but do not apply them to pose estimation or activity recognition. Jhuang et al. (2013) study the benefits of pose estimation for activity recognition on a subset of the HMDB dataset (Kuehne et al. 2011). They show that ground truth pose estimated over time can significantly outperform the holistic Dense Trajectories features (Wang et al. 2013a); this also holds for pose estimated using (Yang and Ramanan 2013), but only on a subset where the full body is visible.
Although several interesting techniques have been proposed to model the temporal structure of videos, they typically perform only below or on par with bag-of-words based approaches. A simple temporal structure is encoded in the template-based Action MACH from Rodriguez et al. (2008); Brendel and Todorovic (2011) model temporal and spatial structure by segmenting the space-time volume; and Niebles et al. (2010) model activities as a temporal composition of primitive actions and discriminatively learn such models. While Niebles et al. fix anchor points and the length of the temporal segments before training, Tang et al. (2012) learn all parameters from data using a variable-duration hidden Markov model. An AND/OR graph structure can be used to combine different features at its nodes (Tang et al. 2013) or to model co-occurring and consecutive actions (Gupta et al. 2009). Recently Pirsiavash and Ramanan (2014) have shown how to efficiently parse activity videos with segmental grammars.
2.3 Natural Language Text for Activity Recognition
Natural language descriptions have been shown to be beneficial for image segmentation (Socher and Fei-Fei 2010) and for recognizing object categories (Wang et al. 2009b; Elhoseiny et al. 2013). Similar to our work, Elhoseiny et al. use classifiers trained on the known classes. Representing the text descriptions with tf\(*\)idf (term frequency times inverse document frequency) vectors for relevant encyclopedic entries, they compare a regression, a domain adaptation, and a newly proposed constrained optimization formulation to learn a function from the textual vector space to the visual classifier space. On two fine-grained visual recognition datasets, CU200 Birds (Welinder et al. 2010) and Oxford Flower-102 (Nilsback and Zisserman 2008), they show the benefit of their constrained optimization approach. Semantic similarity from linguistic resources has also been used to allow zero-shot recognition in images via attributes and direct similarity (Rohrbach et al. 2010) and by learning an embedding into a linguistic word vector space (Socher et al. 2013; Frome et al. 2013). In addition to transferring knowledge one can exploit the unlabeled instances to improve recognition, assuming a transductive setting. For this, Fu et al. (2013) exploit the test-data distribution by performing a single round of self-training, averaging over the k-nearest neighbors.
Teo et al. (2012) improve activity recognition by adding object detectors, which are selected based on linguistic co-occurrence statistics in the newswire Gigaword Corpus. A similar idea is pursued by Motwani and Mooney (2012), who mine and cluster verbs from descriptions of the video snippets in the MSVD dataset (Chen and Dolan 2011). Zhang et al. (2011) show that tf\(*\)idf can identify the most relevant terms in text descriptions collected for seven video scenes, allowing close to perfect (98 %) recognition accuracy on their dataset. Ramanathan et al. (2013) jointly recognize actions and roles in YouTube videos using their captions. They mine a large number of YouTube descriptions and use a topic model to estimate the semantic relatedness between an action/role and a description.
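As a concrete illustration of how tf\(*\)idf singles out discriminative terms, the following sketch computes tf\(*\)idf weights over a toy script corpus. The composite names and term lists are invented for illustration, and the normalized-frequency formulation is one common variant, not necessarily the exact one used in the cited works.

```python
import math
from collections import Counter

def tf_idf(docs):
    """tf*idf weights for {label: list of terms}; tf is the normalized
    term frequency within a document, idf is log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))  # each document counts a term once
    return {
        label: {t: (c / len(terms)) * math.log(n / df[t])
                for t, c in Counter(terms).items()}
        for label, terms in docs.items()
    }

# Invented mini corpus: terms mined from scripts of three composites.
docs = {
    "preparing onion": ["cut", "onion", "peel", "cut"],
    "separating egg": ["open", "egg", "bowl"],
    "zesting lemon": ["grate", "lemon", "bowl"],
}
weights = tf_idf(docs)
# "cut" is frequent in and exclusive to the onion scripts, so it scores
# higher there than the widely shared term "bowl" does elsewhere.
```

Terms occurring in many composites (here "bowl") receive a low idf and thus a low weight, which is exactly the discriminative behavior exploited when selecting relevant concepts per scene or dish.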
Another line of work focuses on describing videos with natural language. Recently Guadarrama et al. (2013) generated simple sentences for the Microsoft Video Description corpus (Chen and Dolan 2011), which contains challenging web videos. Das et al. (2013) compose descriptions for the kitchen videos of their YouCook dataset of YouTube cooking videos. Finally, we have shown how to learn a translation model for generating natural sentences on our dataset (Rohrbach et al. 2013b).
2.4 Relations to Our Work
Most activity recognition approaches and datasets have been evaluated on full-body motion or on challenging web or movie datasets, but not on fine-grained motions with low inter-class variability. We therefore evaluate the holistic Dense Trajectories approach from Wang et al. (2013a) as well as two pose-based and two hand-centric approaches on our MPII Cooking 2 dataset. Our pose-based approach encodes trajectories of body joints using features motivated by the sensor-based activity recognition community (Zinnen et al. 2009). The features are also similar to the relational and distance features defined on joints by Jhuang et al.: similar to their work, we define relational and distance metrics between joints per frame and over time. However, our activities contain very subtle motions and the people have a very similar pose for most activities, which reduces the benefits of this feature representation. Jhuang et al. examine the advantages of focusing Dense Trajectories (Wang et al. 2013a) on body joints. In our static scene, (holistic) Dense Trajectories are already restricted to the human body, as the features are only extracted on moving points. However, in this work we propose to focus on hands, as they are the main cue for recognizing our fine-grained activities and participating objects.
In Amin et al. (2013) we improve the hand localization by leveraging multiple cameras to handle self-occlusion. In this work we remain monocular and propose to use a specialized hand detector to improve pose estimation and activity recognition.
To improve recognition of fine-grained activities and their participating objects, we train a classifier on stacked classifier scores from co-occurring activities/objects as well as from temporal context after max pooling. Classifier stacking has previously been explored, e.g. in (Ting and Witten 1997; Liu et al. 2012; Sill et al. 2009). Most relevant to our work, Liu et al. (2012) try to optimize the usage of training data and avoid over-fitting when learning stacked video classifiers. This could be beneficial when applied to our approach.
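The stacked representation can be sketched as follows; the segment scores below are invented, and the element-wise max pooling over neighbouring segments is a simplified reading of the temporal-context stacking described above, not the exact implementation.

```python
# Sketch of building second-level (stacked) features: a segment's own
# first-level classifier scores are concatenated with the element-wise
# max-pooled scores of its temporal context, so the second-level
# classifier can exploit co-occurring and neighbouring activities.

def stacked_features(scores, i, context=2):
    """scores: per-segment first-level score vectors; returns segment
    i's scores plus the element-wise max over up to `context`
    neighbouring segments on each side (hypothetical pooling window)."""
    lo, hi = max(0, i - context), min(len(scores), i + context + 1)
    neighbours = [scores[j] for j in range(lo, hi) if j != i]
    pooled = [max(col) for col in zip(*neighbours)] if neighbours else []
    return list(scores[i]) + pooled

# Three segments, two hypothetical first-level classifiers (cut, stir).
scores = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.7]]
print(stacked_features(scores, 1))  # -> [0.2, 0.8, 0.9, 0.7]
```

The second-level classifier is then trained on these concatenated vectors; a strong "cut" score in a neighbouring segment can, for example, raise the evidence for "peel" in the current one.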
In this work we exploit cooking instructions (script data) to extract which activities, tools, and ingredients are relevant for a certain dish (composite activity). For this we compare co-occurrence statistics with tf\(*\)idf, which has also been used by Zhang et al. (2011) and Elhoseiny et al. (2013) to extract relevant concepts for video scene and object recognition. We find that tf\(*\)idf better discriminates different dishes and improves performance in most cases. Script data allows for zero-shot recognition, which has mainly been used for object recognition, but also for multi-media data by Fu et al. (2013). Fu et al. learn a latent attribute representation on the known classes, but then use manually defined attribute associations for transfer.
Table 2 Composite activities (dishes) of the MPII Cooking 2 dataset; composites marked in bold are part of the test split
MPII Cooking | Sandwich, salad, fried potatoes, potato pancake, omelet, soup, pizza, casserole, mashed potato, snack plate, cake, fruit salad, cold drink, and hot drink |
MPII Composites | Cooking pasta, juicing {lime, orange}, making {coffee, hot dog, tea}, pouring beer, preparing {asparagus, avocado, broad beans, broccoli and cauliflower, broccoli, carrots and potatoes, carrots, cauliflower, chilli, cucumber, figs, garlic, ginger, herbs, kiwi, leeks, mango, onion, orange, peach, peas, pepper, pineapple, plum, pomegranate, potatoes, scrambled eggs, spinach, spinach and leeks}, separating egg, sharpening knives, slicing loaf of bread, using {microplane grater, pestle and mortar, speed peeler, toaster, tongs}, zesting lemon |
Table 3 Dataset statistics
Dataset | Videos | Subjects | Composites | Attributes | Ground truth time intervals | Attribute instances | Video duration (min) |
---|---|---|---|---|---|---|---|
MPII Cooking (Rohrbach et al. 2012a) | 44 | 12 | 14 | 218 | 3824 | 15,382 | 3–41 |
MPII Composites (Rohrbach et al. 2012b) | 212 | 22 | 41 | 218 | 8818 | 33,876 | 1–23 |
Combined | 256 | 30 | 55 | 218 | 12,642 | 49,258 | 1–41 |
MPII Cooking 2 | 273 | 30 | 59 | 222 | 14,105 | 54,774 | 1–41 |
- Training set | 201 | 24 | 58 | 222 | 10,931 | 42,619 | 1–41 |
- Validation set | 17 | 1 | 17 | 107 | 445 | 1662 | 1–8 |
- Test set | 42 | 5 | 31 | 169 | 2102 | 8023 | 1–13 |
Finally, we briefly summarize how this work extends our original publications (Rohrbach et al. 2012a, b). First, we updated the dataset by correcting and unifying some of the annotations and adding a few more videos. We refer to this new version as MPII Cooking 2; it supersedes both previous datasets, see Table 3. Second, we present hand-centric approaches for fine-grained recognition, namely an integration of pose estimation and hand detection, and hand-centric features for activity recognition (arXiv: Senina et al. 2014). Third, we integrated our Propagated Semantic Transfer (PST) from Rohrbach et al. (2013b) for composite recognition. Fourth, we extended the qualitative and quantitative results. Fifth, we extended the discussion of related work. Sixth, we reran the experiments with an updated version of Dense Trajectories (Wang and Schmid 2013). Last, we will release the updated version of the dataset, new intermediate features, as well as the script data.
3 Dataset “MPII Cooking 2”
For our dataset we video-recorded human subjects cooking a diverse set of dishes, e.g. making pizza or preparing cucumber. The dishes form the composite activities and the individual steps taken are the fine-grained activities, e.g. cut, pour, or spice. All videos have a composite label and are annotated with time intervals. Each time interval has a fine-grained activity and the participating objects as labels. A subset of frames was annotated with human pose and hands. In the following we provide details and statistics of the dataset, Figs. 1 and 2 show example frames of the dataset.
3.1 Dataset Statistics and Versions
We recorded 30 subjects in 273 videos with a total length of more than 27 h or 2,881,616 frames. Each video contains a single subject preparing a certain dish.
The dataset was recorded in two batches. The first part contains few, but very diverse and complex dishes (see upper part of Table 2) and was presented in Rohrbach et al. (2012a). The second part, presented in Rohrbach et al. (2012b), focuses on composite activities and thus contains significantly more dishes/composites, which are slightly shorter and simpler, see lower part of Table 2. The second set of composite activities was selected according to our script corpus, which we describe below in Sect. 3.4. We excluded some of them which were either too elementary to form a composite activity (e.g. how to secure a chopping board), were duplicates with slightly different titles, or required ingredients of limited availability (e.g. butternut squash).
For this work we corrected and unified some of the annotations and added a few more videos. We refer to this new dataset version as MPII Cooking 2. It supersedes both previous datasets. Table 3 compares the different versions and shows statistics for each. The table also shows the proposed training/validation/test split, which is selected such that all 31 composite activities in the test set have at least 3 training/validation videos and there is no overlap between training, validation, and test subjects. In contrast to the earlier versions we avoid multiple test splits, for simpler evaluation and to reduce the computational burden for other researchers evaluating on the dataset.
3.2 Dataset Recording and Annotation Protocol
To record realistic behavior we neither asked subjects to perform certain activities nor to follow a certain recipe; we told them only which dish they should prepare. This resulted in a large variety in how subjects prepared things: subjects used different tools for preparation (knife or peeler for peeling), took different steps (e.g. some people cooked the vegetables, some did not), and did things in different temporal orders for the same dish (e.g. washed the vegetable before or after peeling it). Before the recording the subjects were shown our kitchen and the places of tools and ingredients to make them feel at home. During the recording subjects could ask questions in case of problems, and some listened to music. We always started the recording with an empty and clean kitchen, prior to the subject entering the kitchen, and ended it once the subject declared to be finished, i.e. we did not include the final cleaning process. Most subjects were university students from different disciplines, recruited by e-mail and publicly posted flyers. Subjects were paid per hour, and cooking experience ranged from beginner cooks to amateur chefs.
Composite activities are annotated on the level of each video. Fine-grained activities were annotated with start and end frame in a two-stage revision phase using the annotation tool Advene (Aubert and Prié 2007). In addition to the activity category, each annotation consists of used tools, ingredients, and locations (we refer to them as participants). Composite activities were chosen as described in Sects. 3.1 and 3.4. Activity, tool, ingredient, and location categories were chosen to describe all activities the human subjects were performing. This decision was made after the recording, on the basis of what the human subjects did. With respect to the level of detail, we do not annotate the specific motions (e.g. move arm up or down) but their effect or semantics (e.g. open versus close). See Table 7 for the chosen granularity.
We recorded in our kitchen (see Fig. 2a) with a 4D View Solutions system using a Point Grey Grasshopper camera with 1624 \(\times \) 1224 pixel resolution at 29.4 fps and global shutter. The camera is attached to the ceiling, recording a person working at the counter from the front. We provide the sequences as single frames (jpg with compression set to 75) and as video streams (compressed weakly with mpeg4v2 at a bit-rate of 2500). For most videos we recorded 7 additional camera views of the kitchen, a subset of which was used and released by Amin et al. (2013). Although they are not used in this work, we will make the remaining 7 views available upon publication. All fine-grained and composite activity annotations are also valid for the other cameras, as each frame was synchronized across all 8 cameras.
Three example scripts for the composite activity preparing cucumber

| Script 1 | Script 2 | Script 3 |
|---|---|---|
| 1. Get a large sharp knife | 1. Gather your cutting board and knife. | 1. Wash the cucumber |
| 2. Get a cutting board | 2. Wash the cucumber. | 2. Peel the cucumber |
| 3. Put the cucumber on the board | 3. Place the cucumber flat on the cutting board. | 3. Place cucumber on a cutting board. |
| 4. Hold the cucumber in your weak hand | 4. Slice the cucumber horizontally into round slices. | 4. Take a knife and rock it back and forth on the cucumber |
| 5. Chop it into slices with your strong hand | 5. Make a clean thin slice each time. | |
The dataset furthermore provides human body pose annotations (see Sect. 3.3) and script data (see Sect. 3.4), and there exist textual descriptions in the TACoS (Regneri et al. 2013) and TACoS multi-level corpus (Rohrbach et al. 2014). The descriptions in TACoS describe what happens in a specific video and are temporally aligned to it, i.e. they provide a textual annotation. In contrast, the scripts used in this work are collected independently of the videos and thus contain domain or script knowledge, i.e. which activities and objects are likely used for a certain dish. As they are not specific to the training videos, they allow transferring and generalizing to novel test scenarios.
3.3 Pose Challenge
A subset of frames has articulated human pose and hand annotations to learn and evaluate pose estimation approaches and hand detectors. For human pose we annotated the frames with right and left shoulder, elbow, wrist, and hand joints as well as head and torso. We have 2994 frames of 10 subjects with pose annotations for training and an additional 4250 training images with hand points used for training the hand detector. For testing we sample 1277 frames from all activities of 7 subjects as the test set for the pose challenge. All training and test frames are from MPII Cooking (Rohrbach et al. 2012a) and thus avoid an overlap with the test subjects and test composites in MPII Cooking 2.
3.4 Mining Script Data for Composite Activities
The linguistics and psychology literature knows prototypical sequences of certain activities as so-called scripts (Schank and Abelson 1977; Barr and Feigenbaum 1981). Scripts describe a certain scenario, which corresponds to a composite activity in our case. Scenarios (e.g. eating in a restaurant) consist of temporally ordered events (the patron enters the restaurant, he takes a seat, he reads the menu, ...) and participants (patron, waiter, food, menu, ...). Written event sequences for a scenario can be collected on a large scale using crowd-sourcing (Regneri et al. 2010). We make use of this method to collect scripts for our composite activities, assembling a large number of written sequences for each of them.
We collect natural language sequences similar to Regneri et al. (2010) using Amazon’s Mechanical Turk3. For each composite activity, we asked the subjects to give tutorial-like sequential instructions for executing the respective kitchen task. The instructions had to be divided into sequential steps with at most 15 steps per sequence. We select 53 relevant kitchen tasks as composite activities by mining the tutorials for basic kitchen tasks on the webpage “Jamie’s Home Cooking Skills”4. All those tasks/scenarios are about processing ingredients or using certain kitchen tools. In addition to the data we collected in this experiment, we use data from the OMICS corpus (Singh et al. 2002) and Regneri et al. (2010) for 6 kitchen-related composite activities. This results in a corpus with 59 composite activities and 2124 sequences in total, comprising 12,958 individual event descriptions. Note that for practical reasons we only recorded videos for 35 of these composite activities, as discussed in Sect. 3.1. They are listed in Table 2 under “MPII Composites”.
This script corpus provides much more variation than the limited number of video training examples can capture. Of course this also poses a challenge, because we need to overcome the problem of different wordings and coordinated events: Table 4 shows three examples we collected for the composite activity preparing cucumber. They differ in verbalization (e.g. slice, chop, and make a slice) and granularity (getting something is often left out). Further, the sequences reflect different ways of preparing the vegetable: some include peeling it, some do not wash it, and so on. Some sentences contain coordinated events (take a knife and rock it...). While we clean the data to a certain degree by fixing spelling mistakes and resolving pronouns with the method from Bloem et al. (2012), we end up with both the challenges and blessings of a noisy but big script corpus.
In Sect. 6.4 we will describe how we extract semantic relatedness from this data.
4 Hand Detection and Pose Estimation
In the following we introduce our hand detector (Sect. 4.1) and pose estimation method (Sect. 4.2) as well as how we combine them (Sect. 4.3). In Sect. 4.4 we evaluate our proposed approaches as well as state-of-the-art pose estimation methods on our dataset.
4.1 Hand Detection Based on Local Appearance
As a basis for our hand detector we rely on deformable part models (DPM, Felzenszwalb et al. 2010). We discuss several design choices made in order to achieve the best performance.
4.1.1 Detection of Left and Right Hands
We aim for a hand detector that can correctly distinguish the left and right hand of a person. The rationale behind this is that for many activities left and right hands have different roles (e.g. for a cutting activity the dominant hand is typically holding a knife while the supporting hand is holding the object that is being cut). Further, we would like to avoid situations when two strong hypotheses for one of the hands are chosen over two hypotheses for both hands. We achieve this by dedicating separate DPM components to left and right hands and jointly training them within the same detector (see examples in Fig. 3). Note that in contrast to the default setting mirroring is switched off in DPM. At test time we pick the best scoring hypothesis among the components corresponding to left and right hands.
4.1.2 Component Initialization
We capture the variance of hand postures by decomposing the hands’ appearance into multiple modes and representing each mode with a specific DPM component. We found that a rather large number of components is necessary to achieve good detection performance. We initialize the components by clustering the HOG descriptors of the training examples using K-means as in Divvala et al. (2012). The detection further improves by first clustering the training examples by hand orientation and then by HOG.
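The two-stage initialization can be sketched as follows; the orientation-bin count, clusters per bin, and the plain Lloyd iterations are illustrative assumptions, not the exact training setup.

```python
import numpy as np

def init_components(hog_descriptors, orientations, n_orient_bins=4, k_per_bin=4, seed=0):
    """Cluster training examples first by hand orientation, then by HOG appearance.

    hog_descriptors: (N, D) array of HOG features, one per training example.
    orientations:    (N,) array of hand orientations in degrees [0, 360).
    Returns a list of index arrays, one per resulting DPM component.
    """
    rng = np.random.RandomState(seed)
    components = []
    # Stage 1: group examples into discrete orientation bins.
    bin_ids = (orientations // (360.0 / n_orient_bins)).astype(int) % n_orient_bins
    for b in range(n_orient_bins):
        idx = np.where(bin_ids == b)[0]
        if len(idx) == 0:
            continue
        # Stage 2: k-means on the HOG descriptors within each orientation bin.
        centers = hog_descriptors[rng.choice(idx, size=min(k_per_bin, len(idx)), replace=False)]
        for _ in range(10):  # a few Lloyd iterations suffice for initialization
            d = ((hog_descriptors[idx, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(1)
            for c in range(len(centers)):
                members = idx[assign == c]
                if len(members):
                    centers[c] = hog_descriptors[members].mean(0)
        for c in range(len(centers)):
            components.append(idx[assign == c])
    return components
```

Each returned index group would seed one DPM component, so visually distinct hand postures (e.g. open palm vs. gripping a knife) get their own appearance template.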
4.1.3 Body Context
We improve the hand localization by augmenting the hand detector with the context provided by a person detector. We rely on the person detector to constrain the search for hands to the image locations within the extended person bounding box and also constrain the scale of the hands detector to the scale of the person hypothesis.
4.2 Pose Estimation
4.3 Combining Hand Detection and Pose Estimation
4.4 Evaluation: Pose Estimation and Hand Detection
We first evaluate the results on the upper-body pose estimation task. In order to identify the best 2D pose estimation approach we use our 2D body joint annotations (see Sect. 3.3). For evaluating these methods we adopt the PCP measure (percentage of correct parts) proposed by Ferrari et al. (2008). The results are shown in Fig. 4a. The first three lines compare three state-of-the-art methods: the cascaded pictorial structures (CPS, Sapp et al. 2010), the flexible mixture of parts model (FMP, Yang and Ramanan 2011) and the implementation of the pictorial structures model (PS, Andriluka et al. 2011), using their published pose models. Lines 4 and 5 show the models of Yang and Ramanan and Andriluka et al. retrained on our data. Overall the model of Andriluka et al. performs best, achieving 66.0 PCP for all body-parts. We attribute the improvement of PS over FMP to the following. The FMP model encodes different orientations of parts via different appearance templates, whereas the PS model uses a single template that is rotation invariant and is evaluated at all orientations. The FMP model has a larger number of parameters because appearance templates are not shared across different part orientations, which makes it easier to overfit than the PS model. This could explain the performance differences after retraining on our data. It could also be that the finer discretization of body part orientations in the PS model compared to the FMP model is important for good performance. As described above we base our model (FPS) on PS, adding to it flexible part configurations.
The bottom part of Fig. 4a shows that this, as well as our other improvements (more training data compared to Rohrbach et al. (2012a), color features, and hand detections), each helps to improve performance. Overall, compared to PS, we achieve an improvement from 66.0 to 75.9 PCP and, most notably, improvements from 48.9 to 74.4 and from 49.6 to 70.3 for the lower arms, which are most important for recognizing hand-centric activities. We also would like to point out the benefit hand detectors have for pose estimation (compare lines 7 vs 8 and 9 vs 10).
Next we discuss the hand detection results. Our final hand detector handDPM is based on 32 components with 16 components allocated to each of the hands. The components are initialized by first grouping the training examples of each hand into 4 discrete orientations, and then clustering their HOG descriptors. In the experiments on hand localization we use a metric that reflects the localization accuracy and measures the percentage of hand hypotheses within a given distance from the ground truth. We visualize the results by plotting the localization accuracy for a range of distances.
We also compare our hand detector to a state-of-the-art hand detector of Mittal et al. (2011) using the code made publicly available by the authors. We perform the best-case evaluation and assign the hand hypothesis returned by the approach to the closest left and right hand in the ground-truth, as the hand detector does not differentiate between left and right hands. For a fair comparison we also filter the hand detections of Mittal et al. (2011) at irrelevant scales and image locations using body context as explained before. Our detector significantly improves over the hand detector of Mittal et al. (2011), which in addition to hand appearance also relies on color and context features, whereas our hand detector uses hand regions only. Note that there are significant differences between localization accuracy of left and right hands. We attribute this to the fact that the majority of people in our database are right handed. Since people perform many activities with their dominant hand, the pose of the right hand is more likely to be constrained by various activities due to the use of tools such as a knife or peeler. The left hand’s pose is far less deterministic and the hand is often occluded behind the counter or while holding various objects.
5 Approaches for Fine-Grained Activity Recognition and Detection
In this section we focus on fine-grained activity recognition to approach the challenges typical e.g. for assisted daily living. Along with the activities we want to recognize their participating objects. To better understand the state-of-the-art for this challenging task we benchmark three types of approaches on our new dataset. The first type (Sect. 5.1) uses features derived from an upper body model, motivated by the intuition that human body configurations and human body motion should provide strong cues for activity recognition. For body pose estimation we rely on our approach described in Sects. 4.2 and 4.3. The second type (Sect. 5.2) is the state-of-the-art Dense Trajectories (Wang et al. 2013a), which have shown promising results on various datasets. It is a holistic approach in the sense that it extracts visual features over the entire frame. As the third type (Sect. 5.3) we present our hand-centric visual features, targeted at recognizing our hand-centric activities and the participating objects, which are typically in the hands' neighbourhood. For this we rely on our hand detector (Sects. 4.1 and 4.3). Finally, we discuss our approaches to activity classification and detection in Sect. 5.4.
5.1 Pose-Based Approach
Pose-based activity recognition approaches were shown to be effective using inertial sensors (Zinnen et al. 2009). Inspired by Zinnen et al. (2009) we build on a similar feature set, computing it from the temporal sequence of 2D body configurations.
We employ a person detector (Felzenszwalb et al. 2010) and estimate the pose of the person within the detected region with a 50 % border around it. This allows us to reduce the complexity of the pose estimation and simplifies the search to a single scale. To extract the trajectories of body joints we rely on search space reduction (Ferrari et al. 2008) and tracking. To that end we first estimate poses over a sparse set of frames (every 10th frame in our evaluation) and then track over a fixed temporal neighborhood of 50 frames forward and backward. For tracking we match SIFT features for each joint separately across consecutive frames. To discard outliers we find the largest group of features with coherent motion and update the joint position based on the motion of this group. This approach combines the generic appearance model learned at training time with the specific appearance (SIFT) features computed at test time.
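The outlier-discarding step can be sketched as follows, assuming the per-feature displacements from SIFT matching are already given; the coherence tolerance and the mean-motion update are illustrative assumptions.

```python
import numpy as np

def update_joint(joint_xy, displacements, tol=3.0):
    """Update a joint position from matched SIFT feature motions.

    displacements: (M, 2) array of per-feature frame-to-frame motion vectors.
    Features whose motion deviates from the dominant motion by more than
    `tol` pixels are treated as outliers. Returns the new joint position.
    """
    displacements = np.asarray(displacements, dtype=float)
    # Find the largest group of features with coherent motion: for each
    # candidate motion, count how many features move consistently with it.
    best_group = None
    for d in displacements:
        dist = np.linalg.norm(displacements - d, axis=1)
        group = displacements[dist < tol]
        if best_group is None or len(group) > len(best_group):
            best_group = group
    # Move the joint by the mean motion of the coherent group.
    return np.asarray(joint_xy, dtype=float) + best_group.mean(axis=0)
```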
Given the body joint trajectories we compute two different feature representations. The first is manually defined statistics over the body model trajectories, which we refer to as body model features (BM). The second is Fourier transform features (FFT) from Zinnen et al. (2009), which have proven effective for recognizing activities from body-worn wearable sensors.
5.1.1 Body Model Features (BM)
For the BM features we compute the velocity of all joints (similar to gradient calculation in the image domain). We bin it into an 8-bin histogram according to its direction, weighted by its speed (in pixels/frame). This is similar to the approach by Messing et al. (2009), which additionally bins the velocity's magnitude. We repeat this for the acceleration of each joint. Additionally we compute distances between the right and corresponding left joints as well as between all 4 joints on each body half. Similar to the joint trajectories (i.e. trajectories of x,y values) we build corresponding “trajectories” of distance values by stacking the values over temporally adjacent frames. For each distance trajectory we compute statistics (mean, median, standard deviation, minimum, and maximum) as well as a rate-of-change histogram, similar to velocity. Last, we compute the angle trajectories at all inner joints (wrists, elbows, shoulders) and use the statistics (mean etc.) of the angle and angle-speed trajectories. This totals 556 dimensions.
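The velocity part of the BM feature can be sketched as follows; the binning convention (uniform bins over \([-\pi, \pi]\)) is an assumption.

```python
import numpy as np

def velocity_histogram(trajectory):
    """8-bin direction histogram of joint velocity, weighted by speed.

    trajectory: (T, 2) array of a joint's (x, y) positions over T frames.
    Velocity is the per-frame displacement (analogous to an image gradient);
    each vector votes into one of 8 direction bins with weight equal to its
    speed in pixels/frame, as in the BM feature description.
    """
    traj = np.asarray(trajectory, dtype=float)
    vel = np.diff(traj, axis=0)               # (T-1, 2) per-frame displacements
    speed = np.linalg.norm(vel, axis=1)       # pixels per frame
    angle = np.arctan2(vel[:, 1], vel[:, 0])  # direction in [-pi, pi]
    bins = ((angle + np.pi) / (2 * np.pi) * 8).astype(int) % 8
    hist = np.zeros(8)
    np.add.at(hist, bins, speed)              # speed-weighted direction votes
    return hist
```

The acceleration histogram would be computed the same way from `np.diff(vel, axis=0)`.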
5.1.2 Fourier Transform Features (FFT)
The FFT feature contains 4 exponential bands, 10 cepstral coefficients, and the spectral entropy and energy for each x and y coordinate trajectory of all joints, giving a total of 256 dimensions.
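A sketch of such a per-coordinate feature; the exact band edges and cepstrum variant used by Zinnen et al. (2009) may differ, so the definitions below are illustrative.

```python
import numpy as np

def fft_feature(signal, n_ceps=10, n_bands=4):
    """Sketch of the FFT feature for one coordinate trajectory.

    Returns exponential frequency-band energies, cepstral coefficients,
    spectral entropy, and total energy (4 + 10 + 1 + 1 = 16 values per
    coordinate; 16 values * 2 coordinates * 8 joints = 256 dims overall).
    """
    x = np.asarray(signal, dtype=float)
    spec = np.abs(np.fft.rfft(x)) ** 2              # power spectrum
    # Exponential bands: band i covers frequency bins [2^i, 2^(i+1)).
    bands = [spec[2 ** i:min(2 ** (i + 1), len(spec))].sum() for i in range(n_bands)]
    # Cepstrum: inverse FFT of the log power spectrum.
    ceps = np.fft.irfft(np.log(spec + 1e-10))[:n_ceps]
    p = spec / (spec.sum() + 1e-10)                 # normalized spectrum
    entropy = -(p * np.log(p + 1e-10)).sum()        # spectral entropy
    return np.concatenate([bands, ceps, [entropy, spec.sum()]])
```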
5.1.3 Feature Representation
For both features (BM and FFT) we compute a separate codebook for each distinct sub-feature (i.e. velocity, acceleration, exponential bands, etc.), which we found to be more robust than a single codebook. We set the codebook size to twice the respective feature dimension; the codebook is created by computing k-means over all features (over 80,000). We compute both features for trajectories of length 20, 50, and 100 (centered at the frame where the pose was detected) to allow for different motion lengths. The resulting features for different trajectory lengths are combined by stacking, giving a total feature dimension of 3336 for BM and 1536 for FFT.
5.2 Holistic Approach
Most approaches for activity recognition are based on a bag-of-words representation. We pick the state-of-the-art Dense Trajectories approach (Wang et al. 2011, 2013a), which extracts histograms of oriented gradients (HOG), flow (HOF, Laptev et al. 2008), and motion boundary histograms (MBH, Dalal et al. 2006) around densely sampled points, which are tracked for 15 frames by median filtering in a dense optical flow field. The x and y trajectory speed is used as a fourth feature. Using their code and parameters, which showed state-of-the-art performance on several datasets, we extract these features on our data. Following Wang et al. (2013a) we generate a codebook of 4000 words for each of the four features using k-means over a million sampled features.
5.3 Hand-Centric Approach
In domains where people mainly perform hand-related activities it seems intuitive to expect that hand regions contain important and relevant information for recognizing those activities and the participating objects. Thus, in addition to using the holistic and pose-based features, we suggest to focus on the hand regions. To obtain the hand locations we rely on our hand detector described in Sect. 4.1 as well as on the pose estimation method with integrated hand candidates (Sect. 4.3). In order to increase the robustness of the method we use both location candidates (provided by the handDPM detector and the final pose model) and sum the obtained features.
5.3.1 Hand-Trajectories
We want to represent different types of information: hand motion, hand shape and shape variations over time, as well as the appearance of objects manipulated by the hands. We propose to densely sample the neighborhood of each hand and to track those points over time. For tracking, and also for representing the point trajectories with powerful features, we adapt the approach of Wang et al. (2013a). We focus only on densely sampled points around the estimated hand positions instead of sampling the entire video frame. We specify a bounding box around each hand detection and densely sample points inside of it. In our experiments we use a \(120\times 140\) pixel bounding box around each hand to include information about the hands' context. We use a grid spacing of 8 pixels for point sampling, which finally yields 136 interest point tracks per frame. After extracting the features along the computed tracks we create codebooks with 4000 words per feature.
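The dense sampling step can be sketched as follows; the margin handling at the box border is an assumption, so the exact number of sampled points differs from the 136 tracks reported above.

```python
import numpy as np

def sample_hand_points(hand_xy, box_w=120, box_h=140, spacing=8):
    """Densely sample points on a regular grid inside a box around a hand.

    hand_xy: detected hand center (x, y). The box size and grid spacing follow
    the values used in the text (120 x 140 pixel box, 8 pixel spacing); the
    exact margin handling here is an assumption.
    """
    cx, cy = hand_xy
    xs = np.arange(cx - box_w / 2, cx + box_w / 2 + 1, spacing)
    ys = np.arange(cy - box_h / 2, cy + box_h / 2 + 1, spacing)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=1)  # (num_points, 2)
```

Each sampled point would then be handed to the Dense Trajectories tracker in place of the frame-wide sampling grid.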
5.3.2 Hand-cSift
Color information is another important cue for recognizing activities, and even more so for recognizing the participating objects. Similar to the previous approach we densely sample points in the hands' neighborhood and extract color SIFT features on 4 channels (RGB + grey). We quantize them in a codebook of size 4000.
5.4 Fine-Grained Activity Classification and Detection
5.4.1 Activity Classification
Given a long video we assume that it consists of multiple time intervals. Each such interval t depicts a single fine-grained activity and its participating objects (e.g. dry, hands, towel). In the following we refer to both activities and participants as activity attributes \(a_i, (i \in \{1,\ldots ,n\})\), i.e. \(a_i\) can be any attribute, including cut, knife, or cucumber. We train one-vs-all SVM classifiers on the features described in the previous sections, given the ground truth intervals and labels. The classifiers provide us with real-valued confidence score functions \(f^{base}_i:\mathbb {R}^N\mapsto \mathbb {R}\) for attribute \(a_i\) and feature vectors of dimension N. Different features are combined by concatenating, i.e. stacking, the corresponding feature vectors.
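A minimal sketch of the one-vs-all attribute classifiers; a least-squares linear scorer stands in for the SVMs trained in the paper (mean SGD with a \(\chi^2\) kernel approximation), and all data and attribute names are toy values.

```python
import numpy as np

# Toy setup: each interval is a stacked bag-of-words vector and may carry
# several attribute labels (activity + participants), hence one binary
# one-vs-all scorer per attribute rather than a single multi-class model.
rng = np.random.RandomState(0)
X = rng.rand(60, 20)                        # 60 intervals, 20-dim features
attributes = ["cut", "knife", "cucumber"]   # toy attribute vocabulary
Y = (X[:, :3] > 0.5).astype(float) * 2 - 1  # toy multi-label targets in {-1, 1}

Xb = np.hstack([X, np.ones((60, 1))])       # append a bias column
W = np.linalg.lstsq(Xb, Y, rcond=None)[0]   # one linear scorer per attribute

def f(x):
    """Real-valued confidence scores f_i(x), one per attribute a_i."""
    return np.append(x, 1.0) @ W

scores = f(X[0])
```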
5.4.2 Activity Detection
While we use ground truth intervals for training the activity classifiers, we use a sliding window approach to find the correct interval at detection time. To efficiently compute the features of a sliding window we build an integral histogram over the histograms of the codebook features. We use non-maximum suppression over different window lengths: starting with the maximum score, we remove all overlapping windows. In the detection experiments we use a minimum window size of 30 frames with a step size of 6 frames; we increase window and step size by a factor of \(\sqrt{2}\) until we reach a window size of 1800 frames (about 1 min). Although this still does not cover all possible frame configurations, we found it to be a good trade-off between performance and computational cost.
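The detection machinery can be sketched as follows: \(\sqrt{2}\)-spaced window lengths, an integral histogram for constant-time window features, and greedy non-maximum suppression. The overlap criterion used for suppression is an assumption.

```python
import numpy as np

def window_sizes(min_size=30, max_size=1800, factor=2 ** 0.5):
    """Sliding-window lengths from 30 to 1800 frames, growing by sqrt(2)."""
    sizes, s = [], float(min_size)
    while s <= max_size:
        sizes.append(int(round(s)))
        s *= factor
    return sizes

def window_histogram(integral, start, end):
    """O(1) bag-of-words histogram of frames [start, end): integral[t] is the
    sum of per-frame histograms of all frames before t."""
    return integral[end] - integral[start]

# Build the integral histogram once per video from per-frame codeword counts.
frame_hists = np.random.RandomState(0).rand(200, 16)   # 200 frames, 16 words
integral = np.vstack([np.zeros(16), np.cumsum(frame_hists, axis=0)])

def nms(detections, overlap_thresh=0.5):
    """Greedy NMS: keep the top-scoring window, drop overlapping ones, repeat.
    detections: list of (start, end, score)."""
    kept = []
    for s, e, score in sorted(detections, key=lambda d: -d[2]):
        inter = sum(max(0, min(e, ke) - max(s, ks)) for ks, ke, _ in kept)
        if inter <= overlap_thresh * (e - s):
            kept.append((s, e, score))
    return kept
```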
6 Modeling Composite Activities
In the previous section we discussed how we recognize fine-grained activities (such as peeling or washing) and their object participants (such as grater, knife, or cucumber). Now we focus on exploiting the temporal context and on recognizing different composite activities, e.g. preparing a cucumber or cooking pasta.
For this, we first show how we exploit temporal context and co-occurrence to improve the recognition of fine-grained activities and their object participants (Sect. 6.1). Then, we model composite activities as a flexible combination of attributes, where attributes refer jointly to the fine-grained activities and their object participants (Sect. 6.2). We then show how to use prior knowledge (Sect. 6.3) to improve the recognition of composite activities, overcoming the notorious lack of training data and handling the large variability of composite activities. In Sect. 6.4 we discuss how to mine the semantic relatedness from script data. Finally, in Sect. 6.5 we introduce an automatic approach to temporal video segmentation, which removes the necessity to manually annotate the ground truth intervals in a video.
6.1 Recognizing Activity Attributes Using Context and Co-occurrence
6.2 Composite Activity Classification Using Activity Attributes
6.3 Script Data for Recognizing Composite Activities
Composite activities show a high diversity, which is practically impossible to capture in a training corpus. Our system thus needs to be robust against many activity variants that are not present in the training data. The use of attributes allows including external knowledge to determine the relevant attributes for a given composite activity. For this we assume associations between attribute \(a_i\) and composite activity class z in a matrix of weights \(w_{z,i}\), with Z being the number of composite activity classes. The vectors \(w_z\) are L1 normalized, i.e. \(\sum _{i=1}^n w_{z,i}=1\). Our system extracts those associations from script data (see Sect. 6.4), but the approach generalizes to arbitrary other external knowledge sources. We explore three options to use such information, which we detail in the following.
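A toy illustration of the weight matrix; the association counts and the weighted-sum scoring are assumptions for illustration, not the exact variants detailed below.

```python
import numpy as np

# Hypothetical association counts between composites (rows) and attributes
# (columns), e.g. mined from script data; the numbers are made up.
counts = np.array([[4.0, 1.0, 0.0],    # "preparing cucumber": cut, peel, fry
                   [0.0, 2.0, 6.0]])   # "making scrambled egg"

w = counts / counts.sum(axis=1, keepdims=True)   # L1-normalize each w_z
assert np.allclose(w.sum(axis=1), 1.0)           # sum_i w_{z,i} = 1

# A composite score can then be a weighted sum of attribute classifier scores.
attr_scores = np.array([0.9, 0.3, -0.5])         # f_i for one test video
composite_scores = w @ attr_scores               # one score per composite z
```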
6.3.1 Script data
6.3.2 NN + script data
When training data is available we can use a nearest neighbor classifier. Often, only a handful of attributes are likely to be indicative for a composite activity class, while the majority of other attributes will provide irrelevant, potentially noisy information. When searching for nearest neighbors such irrelevant attributes might dominate the distance, resulting in suboptimal performance. To reduce this effect we rely on the script data to constrain the attribute feature vector to the relevant dimensions.
6.3.3 Propagated Semantic Transfer (PST)
As the third approach to integrate external knowledge from script data we use Propagated Semantic Transfer (PST), which we proposed in Rohrbach et al. (2013a) and summarize briefly in the following. The approach builds on Eq. (10) and uses label propagation to exploit the distances within the unlabeled data, i.e. it assumes a transductive setting where all test data is available when predicting a single test label.
For computing the distance between the sequences we use the feature representation \(g^{seq}(S)\), as for the NN-classifier, which is much lower dimensional than the raw video feature representation and provides more reliable distances as we showed in Rohrbach et al. (2013a). We build a k-NN graph by connecting the k closest neighbours. We set the weights of the graph edges between sequences d and e to \(exp( -0.5 \sigma ^{0.5}\Vert g^{seq}(S_d) - g^{seq}(S_e)\Vert )\), where \(\sigma \) is set to the mean of the distances to the nearest neighbours. We initialize this graph with the scores \(s^{PST}_{z,d}\) and propagate them using label propagation from Zhou et al. (2004).
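A compact sketch of this propagation. The edge weights follow the formula above and the propagation uses the iterative form of Zhou et al. (2004) with normalized graph weights; the values of k, alpha, and the iteration count are illustrative.

```python
import numpy as np

def propagate(features, init_scores, k=3, alpha=0.8):
    """Sketch of PST-style label propagation over a k-NN graph.

    features:    (n, d) attribute-score representation g_seq of each sequence.
    init_scores: (n, z) initial composite scores (zeros where unknown).
    """
    n = len(features)
    dist = np.linalg.norm(features[:, None] - features[None, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    sigma = np.mean(np.sort(dist, axis=1)[:, :k])      # mean NN distance
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:              # connect k neighbours
            W[i, j] = W[j, i] = np.exp(-0.5 * sigma ** 0.5 * dist[i, j])
    D = np.diag(1.0 / np.sqrt(W.sum(axis=1) + 1e-10))
    S = D @ W @ D                                      # normalized graph
    F = init_scores.copy()
    for _ in range(50):                                # Zhou et al. iteration
        F = alpha * S @ F + (1 - alpha) * init_scores
    return F
```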
6.4 Prior Knowledge from Script Data
We want to quantify what activities and objects typically occur in a composite activity by leveraging the script data we collected (see Sect. 3.4). In order to use prior knowledge from textual script data, we have to match the (controlled) attribute labels from the video annotations to the (freely) written script instances (Sect. 6.4.1). Based on the matched attributes we compute two different word frequency statistics (Sect. 6.4.2).
6.4.1 Label Matching
- literal: we look for an exact match of the attribute label within the data.
- WordNet: we look for attribute labels and their synonyms. We take synonyms to be members of the same synset according to the WordNet ontology (Fellbaum 1998) and restrict them to words with the same part of speech, i.e. we match only verbal synonyms to activity predicates and only nouns to object terms.
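The two matching strategies can be illustrated as follows; the synonym map is a hand-made stand-in for WordNet synsets (which in the paper are additionally filtered by part of speech).

```python
# Toy stand-in for WordNet synsets; the word groups below are assumptions.
SYNSETS = {
    "cut": {"cut", "slice", "chop"},       # verb synonyms for an activity
    "cucumber": {"cucumber", "cuke"},      # noun synonyms for an object
}

def literal_match(label, tokens):
    """Exact match of the attribute label in the tokenized script text."""
    return label in tokens

def synonym_match(label, tokens):
    """Match the label or any of its synonyms (WordNet-style matching)."""
    return bool(SYNSETS.get(label, {label}) & set(tokens))

tokens = "slice the cucumber into thin rounds".split()
```

On this sentence, literal matching misses the activity "cut" while synonym matching recovers it via "slice".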
6.4.2 Statistics Computed on the Script Data
- freq: word frequency \(freq(a_i,\delta _z)\) for each attribute \(a_i\) and composite activity z.
- tf\(*\)idf (term frequency \(*\) inverse document frequency, Salton and Buckley 1988) is a measure used in Information Retrieval to determine the relevance of a word for a document. Given a document collection \(D=\{\delta _1,...,\delta _z,...,\delta _m\}\), tf\(*\)idf for a term or attribute \(a_i\) and a document \(\delta _z\) is computed as follows:
$$\begin{aligned} tfidf(a_i,\delta _z) = freq(a_i,\delta _z) * \log \frac{|D|}{|\{\delta \in D:a_i \in \delta \}|}, \end{aligned}$$
(14)
where \(\{\delta \in D:a_i \in \delta \}\) is the set of documents containing \(a_i\) at least once. tf\(*\)idf represents the distinctiveness of a term for a document: the value increases if the term occurs often in the document and rarely in other documents.
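Eq. (14) in code, applied to a toy corpus with made-up attribute counts:

```python
import math

def tfidf(freq, attribute, doc, docs):
    """tf*idf per Eq. (14): the attribute's frequency in the document times
    the log of |D| over the number of documents containing the attribute."""
    containing = sum(1 for d in docs if freq(attribute, d) > 0)
    return freq(attribute, doc) * math.log(len(docs) / containing)

# Toy corpus: per-composite attribute counts (numbers are made up).
corpus = {
    "cucumber": {"cut": 5, "peel": 3},
    "pasta":    {"cut": 1, "boil": 7},
    "pizza":    {"slice": 2},
}
freq = lambda a, d: corpus[d].get(a, 0)
score = tfidf(freq, "boil", "pasta", list(corpus))
```

"boil" occurs only in the pasta scripts, so it scores higher than the ubiquitous "cut", matching the distinctiveness intuition above.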
6.5 Automatic Temporal Segmentation
While we assume a segmented video at training time to learn attribute classifiers as described in Sect. 5.4, we want to segment the video automatically at test time. To avoid noisy and small segments we follow the idea we presented in Rohrbach et al. (2014), namely we employ agglomerative clustering. We start with uniform intervals of 60 frames and describe each interval with an attribute-classifier score vector. We merge neighbouring intervals based on the cosine similarity of their score vectors and stop when we reach a threshold (found on the validation set). We aim for a segmentation with granularity similar to the original manual annotation. Afterwards, a separately trained visual background classifier removes irrelevant or noisy segments. In our experiments we show that this leads to composite recognition results similar to using the ground truth intervals for the attributes.
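The merging loop can be sketched as follows; the merge order (most similar neighbouring pair first), the averaging of score vectors, and the threshold value are assumptions.

```python
import numpy as np

def segment(scores, threshold=0.7, init_len=60):
    """Agglomerative temporal segmentation sketch.

    scores: (n_intervals, n_attributes) attribute-classifier score vectors of
    the initial uniform 60-frame intervals. Repeatedly merges the most similar
    pair of neighbouring intervals (cosine similarity of their score vectors)
    until no pair exceeds `threshold`. Returns (start, end) frame intervals.
    """
    segs = [(i * init_len, (i + 1) * init_len, scores[i].astype(float))
            for i in range(len(scores))]

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10)

    while len(segs) > 1:
        sims = [cos(segs[i][2], segs[i + 1][2]) for i in range(len(segs) - 1)]
        i = int(np.argmax(sims))
        if sims[i] < threshold:        # no neighbouring pair is similar enough
            break
        # Merge the most similar neighbours; average their score vectors.
        s, e, v = segs[i][0], segs[i + 1][1], (segs[i][2] + segs[i + 1][2]) / 2
        segs[i:i + 2] = [(s, e, v)]
    return [(s, e) for s, e, _ in segs]
```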
7 Evaluation
In this section we evaluate our approaches to fine-grained and composite activity recognition. We start with the fine-grained activity classification and detection and compare three types of approaches described in Sect. 5, namely pose-based, hand-centric and holistic approaches. Next we evaluate our approaches for composite activity recognition introduced in Sect. 6, evaluating our attributes enhanced with context and co-occurrence, the recognition of composite cooking activities using different levels of supervision, and the zero-shot approach using script data.
7.1 Experimental Setup
This section details our experimental setup. We will release evaluation code to reproduce and compare with our results. See Table 3 for information on our training/validation/test split. We estimate all hyperparameters on the validation set and then retrain the models on the training and validation sets with the best parameters.
7.1.1 Experimental Setup Fine-Grained Activity Classification and Detection
In the fine-grained recognition task we want to distinguish 67 fine-grained activities and 155 participating objects (see Table 7 for the lists of activities and objects). To learn the visual classifiers we use the annotated ground truth intervals provided with the dataset. We train one-vs-all SVMs using mean SGD (Rohrbach et al. 2011) with a \(\chi ^2\) kernel approximation (Vedaldi and Zisserman 2010). For detection we use the midpoint hit criterion to decide on the correctness of a detection, i.e. the midpoint of the detection has to be within the ground truth. If a second detection fires for the same ground-truth label, it is counted as a false positive. In the following we report the mean over the average precision (AP) of each class. Features are combined by stacking the bag-of-words histograms.
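The midpoint hit criterion can be sketched as follows; processing detections in descending score order is an assumption.

```python
def midpoint_hits(detections, ground_truth):
    """Midpoint hit criterion: a detection is correct if its midpoint lies
    inside an unclaimed ground-truth interval; a second detection matching an
    already claimed interval counts as a false positive.

    detections:   list of (start, end, score), any order.
    ground_truth: list of (start, end) intervals of the target class.
    Returns True/False flags in descending-score order (for AP computation).
    """
    claimed = set()
    flags = []
    for start, end, _ in sorted(detections, key=lambda d: -d[2]):
        mid = (start + end) / 2.0
        hit = None
        for gi, (gs, ge) in enumerate(ground_truth):
            if gs <= mid <= ge and gi not in claimed:
                hit = gi
                break
        if hit is not None:
            claimed.add(hit)          # each ground-truth interval fires once
        flags.append(hit is not None)
    return flags
```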
7.1.2 Experimental Setup Composite Activity Recognition
For localizing attributes within composite activities we rely on our automatic segmentation (Sect. 6.5). We aim to recognize 31 composite activities (see bold names in Table 2).
Attribute training on all composites. We use all available 218 training + validation videos for training the attribute classifiers. See left half of Tables 8, 9, and 10.
Attribute training on disjoint composites. We use all available videos apart from those showing the test composite categories (in total 92 videos). This means that attributes and composites are trained on disjoint sets of composite categories and thus also on disjoint sets of videos. This tests how well novel composite categories can be recognized without additional attribute labels. See right half of Tables 8, 9, and 10.
With training data for composites. We train on the 126 training + validation videos whose category is in the set of the 31 test categories. Note that in case of Attribute training on all composites the training videos are also part of the attribute training. See top part of Table 9.
No training data for composites. Here we do not rely on any training labels for the composite activities. See bottom part of Table 9 and all of Table 10. Combined with Attribute training on disjoint composites this is zero-shot recognition.
7.2 Fine-Grained Activity Classification and Detection
7.2.1 Activity Classification
Fine-grained activity and object classification results, mean AP in % (see Sect. 7.2 for discussion)
Approach | Activities | Objects | All |
---|---|---|---|
Pose-based approaches | |||
(1) BM | 18.9 | 13.8 | 15.7 |
(2) FFT | 19.0 | 16.2 | 17.2 |
(3) Combined | 24.1 | 19.0 | 20.8 |
Hand-centric approaches | |||
(4) Hand-cSift | 23.0 | 23.8 | 23.5 |
(5) Hand-trajectories | 45.1 | 31.5 | 36.4 |
(6) Combined | 43.5 | 34.2 | 37.5 |
Holistic approach | |||
(7) Dense trajectories | 44.5 | 31.3 | 36.1 |
Combinations | |||
(8) Dense Traj,BM,FFT | 43.1 | 30.7 | 35.2 |
(9) Dense Traj,Hand-Traj | 52.2 | 37.7 | 42.9 |
(10) Dense Traj,Hand-Traj,-cSift | 51.2 | 39.3 | 43.7 |
The body model features on the joint tracks (BM) achieve a mean average precision (AP) of 18.9 % for activities and 13.8 % for objects. Comparing this to the FFT features, we observe that FFT performs slightly better, improving the AP over BM by 0.1 and 2.4 %, respectively. The combination of BM and FFT features (line 3 in Table 5) yields a significant improvement, reaching an AP of 24.1 % for activities and 19.0 % for objects. We attribute this to the complementary information encoded in the features: while BM encodes, among others, velocity histograms of the joint tracks and statistics between tracks of different joints, FFT features encode FFT coefficients of individual joints. Still, this is a relatively low performance. It can be explained, on the one hand, by failures of the pose estimation method and, on the other hand, by the pose-based features possibly not containing enough information to successfully distinguish the challenging fine-grained activities and participating objects. Next we look at the performance of our proposed hand-centric features. Color Sift features, densely sampled in the hand neighborhood, improve the object recognition AP to 23.8 % (Hand-cSift), indicating their suitability in particular for recognizing objects. Dense Trajectories features computed around the hands (denoted Hand-Trajectories) reach 45.1 and 31.5 % recognition AP for activities and objects, respectively. Combining both features leads to a small decrease for activities, but it further improves the object recognition performance to 34.2 %. Overall our hand-centric approach reaches a recognition AP of 37.5 % for activities and objects together. The state-of-the-art holistic approach of Dense Trajectories (Wang et al. 2013a) obtains 44.5 and 31.3 % recognition AP for activities and objects. Compared to our hand-centric features, this is slightly below the Hand-Trajectories, which are restricted to the areas around the hands.
This supports our hypothesis that the most relevant information for recognizing our fine-grained activities is contained in the hand regions. We also consider several feature combinations (lines 8, 9, 10 in Table 5). Combining Dense Trajectories with the pose-based features does not improve the recognition performance. However, combining them with Hand-Trajectories improves the activity recognition by 7.7 % and the object recognition by 6.4 % (line 7 vs 9 in Table 5). Finally, adding the Hand-cSift features allows us to reach an impressive 43.7 % recognition AP for activities and objects together.
A detailed comparison of Dense Trajectories, Hand-Trajectories, and the final feature combination (line 10 in Table 5) can be found in Table 7. Hand-Trajectories lose to Dense Trajectories on activities that include "coarser" motion, e.g. push down, hang or plug, and on corresponding objects such as hook or teapot. Note that Hand-Trajectories outperform Dense Trajectories for 35 activity classes, while the opposite holds only 25 times (for objects, 65 vs 43 times, respectively). This shows again that the hand-centric features consistently outperform the holistic features in both tasks. Some example cases where the hand-centric approach is significantly better are activities such as rip open, take apart, and grate, and objects such as cauliflower, oven, and cup. At the same time, the final feature combination (line 10 in Table 5) outperforms both aforementioned features in about 60 % of the cases. We show some qualitative results comparing Dense Trajectories to the final feature combination in Table 11. We also looked more closely at the performance of the other features: e.g. the combined pose features (line 3 in Table 5) perform well on "coarser", full-body activities, such as throw in garbage, take out, and move, but rather poorly on more fine-grained activities. The Hand-cSift features, on the other hand, are good at recognizing objects with distinct shapes or colors, e.g. pineapple, carrot, and bowl.
7.2.2 Activity Detection
Fine-grained activity and object detection results, mean AP in % (see Sect. 7.2 for discussion)
Approach | Activities | Objects | All |
---|---|---|---|
Pose-based approaches | |||
(1) BM | 9.7 | 7.6 | 8.3 |
(2) FFT | 10.5 | 8.7 | 9.3 |
(3) Combined | 14.3 | 9.8 | 11.4 |
Hand-centric approaches | |||
(4) Hand-cSift | 10.5 | 10.9 | 10.7 |
(5) Hand-trajectories | 21.3 | 14.0 | 16.6 |
(6) Combined | 26.0 | 20.6 | 22.5 |
Holistic approach | |||
(7) Dense trajectories | 29.5 | 21.5 | 24.4 |
Combinations | |||
(8) Dense Traj,BM,FFT | 30.7 | 21.5 | 24.8 |
(9) Dense Traj,Hand-Traj | 34.3 | 25.2 | 28.5 |
(10) Dense Traj,Hand-Traj,-cSift | 34.5 | 25.3 | 28.6 |
Fine-grained activity and object classification performance of Dense Trajectories, Hand-Trajectories, and their combination including Hand-cSift (line 10 in Table 5) for 67 fine-grained activities and 155 participating objects. AP in %. "–" denotes that the category is not part of the test set and is not evaluated
Activity | Dense Traj | Hand Traj | Combi +cSift | Object | Dense Traj | Hand Traj | Combi +cSift | Object | Dense Traj | Hand Traj | Combi +cSift |
---|---|---|---|---|---|---|---|---|---|---|---|
Add | 19.8 | 16.3 | 24.0 | Apple | – | – | – | Mango | 3.8 | 7.0 | 2.5 |
Arrange | 61.9 | 32.1 | 33.8 | Arils | 19.8 | 57.8 | 12.5 | Masher | – | – | – |
Change temperature | 69.1 | 78.1 | 75.4 | Asparagus | – | – | – | Measuring-pitcher | 0.7 | 5.0 | 5.3 |
Chop | 36.6 | 35.4 | 48.3 | Avocado | 2.5 | 4.3 | 3.8 | Measuring-spoon | 34.1 | 12.6 | 7.3 |
Clean | 32.0 | 33.0 | 33.3 | Bag | – | – | – | Milk | 0.4 | 0.4 | 0.4 |
Close | 76.3 | 68.8 | 77.0 | Baking-paper | – | – | – | Mortar | – | – | – |
Cut apart | 33.8 | 36.2 | 33.5 | Baking-tray | – | – | – | Mushroom | – | – | – |
Cut dice | 39.3 | 45.7 | 44.9 | Blender | – | – | – | Net-bag | 0.3 | 0.2 | 0.7 |
Cut off ends | 21.4 | 52.0 | 31.9 | Bottle | 57.1 | 49.3 | 57.7 | Oil | 52.3 | 47.6 | 55.6 |
Cut out inside | 2.2 | 0.8 | 2.0 | Bowl | 34.7 | 33.1 | 49.0 | Onion | 19.3 | 20.4 | 22.7 |
Cut stripes | 12.9 | 13.0 | 15.4 | Box-grater | – | – | – | Orange | 18.4 | 11.1 | 19.3 |
Cut | 28.3 | 44.9 | 27.2 | Bread | 3.7 | 6.5 | 8.9 | Oregano | – | – | – |
Dry | 81.9 | 85.1 | 84.5 | Bread-knife | 3.0 | 4.0 | 8.1 | Oven | 30.7 | 73.4 | 89.3 |
Enter | 100.0 | 100.0 | 100.0 | Broccoli | 2.0 | 2.3 | 5.7 | Paper | – | – | – |
Fill | 94.3 | 90.8 | 86.2 | Bun | 1.2 | 2.3 | 8.5 | Paper-bag | 20.5 | 10.3 | 33.0 |
Gather | 25.7 | 23.8 | 35.7 | Bundle | 0.5 | 1.1 | 1.4 | Paper-box | 1.0 | 1.2 | 3.6 |
Grate | 66.7 | 100.0 | 100.0 | Butter | 6.2 | 1.9 | 9.6 | Parsley | 23.4 | 25.5 | 49.6 |
Hang | 85.8 | 57.2 | 81.4 | Carafe | 44.4 | 46.7 | 54.4 | Pasta | 26.1 | 16.0 | 40.7 |
Mix | 10.3 | 5.4 | 52.9 | Carrot | 26.5 | 41.3 | 64.9 | Peach | – | – | – |
Move | 75.7 | 75.7 | 78.3 | Cauliflower | 29.3 | 68.9 | 73.8 | Pear | – | – | – |
Open close | 60.8 | 65.7 | 64.7 | Cheese | – | – | – | Peel | 40.3 | 28.6 | 35.2 |
Open egg | 50.0 | 28.1 | 39.2 | Chefs-knife | 59.9 | 73.3 | 63.1 | Pepper | 3.1 | 14.4 | 6.7 |
Open tin | – | – | – | Chili | 0.6 | 0.9 | 1.3 | Peppercorn | – | – | – |
Open | 22.0 | 22.0 | 34.5 | Chive | – | – | – | Pestle | – | – | – |
Package | 0.4 | 1.6 | 1.8 | Chocolate | – | – | – | Philadelphia | – | – | – |
Peel | 55.0 | 67.2 | 58.6 | Coffee | 3.3 | 25.0 | 100.0 | Pineapple | 19.5 | 47.0 | 49.7 |
Plug | 41.6 | 32.6 | 81.0 | Coffee-container | 34.6 | 24.8 | 73.4 | Plastic-bag | 36.4 | 37.7 | 43.6 |
Pour | 44.8 | 44.9 | 45.1 | Coffee-machine | 34.7 | 65.1 | 91.2 | Plastic-bottle | 4.7 | 2.8 | 9.1 |
Pull apart | 38.7 | 53.8 | 45.2 | Coffee-powder | 0.5 | 1.3 | 3.0 | Plastic-box | 2.6 | 9.0 | 5.3 |
Pull up | 79.2 | 21.7 | 75.6 | Colander | 63.4 | 62.2 | 77.9 | Plastic-paper-bag | 0.9 | 14.7 | 19.6 |
Pull | 1.3 | 9.1 | 1.2 | Cooking-spoon | – | – | – | Plate | 65.7 | 69.2 | 73.9 |
Puree | – | – | – | Corn | – | – | – | Plum | 0.7 | 2.5 | 1.3 |
Purge | 0.1 | 0.1 | 0.6 | Counter | 71.8 | 70.3 | 76.5 | Pomegranate | 5.1 | 0.8 | 2.3 |
Push down | 30.7 | 7.6 | 28.0 | Cream | 0.9 | 0.5 | 1.4 | Pot | 84.3 | 88.0 | 91.1 |
Put in | 55.5 | 50.8 | 58.0 | Cucumber | 4.3 | 5.2 | 4.1 | Potato | 0.4 | 0.4 | 0.6 |
Put lid | 87.3 | 85.3 | 90.0 | Cup | 27.0 | 26.7 | 43.6 | Puree | – | – | – |
Put on | 6.2 | 5.6 | 1.2 | Cupboard | 97.5 | 98.0 | 98.4 | Raspberries | – | – | – |
Read | 5.1 | 5.4 | 5.6 | Cutting-board | 84.4 | 85.4 | 88.9 | Salad | – | – | – |
Remove from package | 19.3 | 34.3 | 31.5 | Dough | – | – | – | Salami | – | – | – |
Rip open | 2.8 | 45.0 | 100.0 | Drawer | 98.2 | 98.4 | 98.5 | Salt | 59.8 | 48.7 | 64.1 |
Scratch off | 30.7 | 33.1 | 31.9 | Egg | 12.1 | 3.6 | 7.3 | Seed | – | – | – |
Screw close | 77.3 | 77.5 | 77.5 | Eggshell | 3.5 | 3.6 | 11.2 | Side-peeler | 50.0 | 11.7 | 37.8 |
Screw open | 78.7 | 69.4 | 79.2 | Electricity-column | 89.3 | 82.3 | 98.1 | Sink | 47.0 | 54.0 | 53.9 |
Shake | 73.0 | 75.7 | 77.3 | Electricity-plug | 74.3 | 70.6 | 87.7 | Soup | – | – | – |
Shape | – | – | – | Fig | 1.0 | 1.0 | 0.9 | Spatula | 72.9 | 76.2 | 78.2 |
Slice | 47.2 | 71.3 | 57.4 | Filter-basket | 1.3 | 3.4 | 13.1 | Spice | 19.1 | 13.3 | 12.4 |
Smell | 49.7 | 15.7 | 33.0 | Finger | 18.4 | 15.4 | 8.8 | Spice-holder | 95.6 | 94.4 | 96.3 |
Spice | 88.6 | 89.0 | 89.2 | Flat-grater | 31.7 | 27.7 | 40.9 | Spice-shaker | 88.3 | 87.3 | 91.5 |
Spread | 87.1 | 77.1 | 96.7 | Flower-pot | – | – | – | Spinach | – | – | – |
Squeeze | 90.1 | 92.9 | 91.9 | Food | – | – | – | Sponge | 17.2 | 45.4 | 38.2 |
Stamp | – | – | – | Fork | 8.7 | 7.5 | 10.5 | Sponge-cloth | 67.1 | 68.1 | 75.0 |
Stir | 91.2 | 81.9 | 91.7 | Fridge | 100.0 | 99.8 | 100.0 | Spoon | 2.8 | 5.9 | 8.9 |
Strew | 1.7 | 2.4 | 2.4 | Front-peeler | 21.8 | 6.0 | 17.6 | Squeezer | 52.5 | 67.0 | 59.3 |
Take apart | 1.6 | 32.1 | 53.3 | Frying-pan | 88.7 | 91.9 | 93.6 | Stone | 0.2 | 0.7 | 0.7 |
Take lid | 66.2 | 76.8 | 71.7 | Garbage | 13.7 | 17.9 | 27.5 | Stove | 84.4 | 87.2 | 90.4 |
Take out | 94.1 | 93.9 | 95.1 | Garlic-bulb | 0.3 | 0.6 | 0.8 | Sugar | 22.0 | 24.2 | 29.0 |
Tap | 3.3 | 4.2 | 6.2 | Garlic-clove | 11.7 | 3.6 | 9.3 | Table-knife | – | – | – |
Taste | 9.4 | 21.0 | 22.0 | Ginger | 1.9 | 3.3 | 3.6 | Tap | 70.2 | 71.8 | 79.1 |
Test temperature | 11.3 | 11.8 | 35.1 | Glass | 2.6 | 4.5 | 21.6 | Tea-egg | 37.2 | 28.7 | 36.1 |
Throw in garbage | 96.7 | 96.0 | 97.1 | Green-beans | 21.1 | 24.6 | 23.2 | Tea-herbs | 60.5 | 55.6 | 91.1 |
Turn off | 7.4 | 21.1 | 33.0 | Ham | – | – | – | Teapot | 46.4 | 6.7 | 69.1 |
Turn on | 27.8 | 30.6 | 48.5 | Hand | 95.9 | 95.2 | 96.4 | Teaspoon | 29.2 | 32.4 | 36.5 |
Turn over | – | – | – | Handle | 100.0 | 9.1 | 100.0 | Tin | – | – | – |
Unplug | 8.7 | 3.8 | 20.0 | Hook | 95.6 | 71.2 | 98.3 | Tin-opener | – | – | – |
Wash | 93.4 | 93.9 | 93.7 | Hot-chocolate-powder-bag | – | – | – | Tissue | – | – | – |
Whip | – | – | – | Hot-dog | 2.1 | 2.7 | 8.8 | Toaster | 1.3 | 8.1 | 6.7 |
Wring out | 3.3 | 4.5 | 5.3 | Jar | 5.4 | 14.2 | 17.8 | Tomato | – | – | – |
Ketchup | 2.0 | 3.1 | 19.6 | Tongs | – | – | – | ||||
Kettle-power-base | 14.4 | 9.8 | 41.4 | Top | – | – | – | ||||
Kiwi | 1.1 | 2.9 | 1.5 | Towel | 73.2 | 76.9 | 79.2 | ||||
Knife | 69.6 | 83.5 | 76.8 | Tube | 1.0 | 9.5 | 10.2 | ||||
Knife-sharpener | – | – | – | Water | 55.0 | 46.9 | 57.2 | ||||
Kohlrabi | – | – | – | Water-kettle | 40.7 | 25.9 | 53.7 | ||||
Ladle | – | – | – | Wire-whisk | – | – | – | ||||
Leek | 10.6 | 19.5 | 17.6 | Wrapping-paper | 2.9 | 0.4 | 2.0 | ||||
Lemon | – | – | – | Yolk | 0.5 | 0.5 | 0.3 | ||||
Lid | 67.1 | 70.8 | 71.8 | Zucchini | – | – | – | ||||
Lime | 14.2 | 3.7 | 14.6 |
7.3 Context and Co-occurrence for Fine-Grained Activities
While so far we have looked at individual fine-grained activities, we now evaluate the benefit of co-occurrence and context as introduced in Sect. 6.1. Table 8 provides the results for recognizing activities and their participants, modeled as attributes. We evaluate in two settings. The left two columns of Table 8 show the results for training on all composites in the training set, while the right two columns are trained only on composites absent from the test set (Disjoint composites); the second is the more challenging problem, as there is less training data and the attributes are tested in a different context (Table 7). The performance in the first line is equivalent to the results in Table 5. The leftmost column shows results for Dense Trajectories. When using only temporal context to recognize activity attributes, performance drops from 36.1 % AP for the base classifier to 11.1 % AP. This is the expected result, because the context is similar for all activities of the same sequence and thus cannot discriminate attributes. In contrast, when using co-occurrence only (line 4 in Table 8), the performance increases by 2.0 % compared to the base classifiers due to the high relatedness between the attributes, namely between activities and their participants. Combining context and co-occurrence information with the base classifier gives 37.8 and 38.1 %, respectively. A combination of all training modes achieves a performance of 39.3 % AP, improving the base classifier's result by 3.2 %. While the results for Dense Trajectories are as expected, i.e. adding context and co-occurrence improves performance, the performance drops slightly for the (in general) better performing combined features (second column). However, although the attribute prediction performance drops, we found that context and co-occurrence are still useful for recognizing the composites.
In the second setting, we restrict the training data to composites absent from the test set (right two columns of Table 8), requiring the activity attributes to transfer to different composite activities. When comparing the right two columns to the left two, we notice a significant performance drop for all classifiers and both features. This decrease can mainly be attributed to the strong reduction of training data, to about one third. The base classifier performs best, with the co-occurrence variants slightly below. Variants including context lead to tremendous performance drops in all combinations, because the activity context changes from training to test (having different composite activities).
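The combination of base, context, and co-occurrence scores (\(s^{base}\), \(s^{con}\), \(s^{coocc}\) in Table 8) can be illustrated with a simple late-fusion rule. The weighted sum below is only a sketch; the learned combination of Sect. 6.1 and the uniform weights are assumptions, not the paper's exact formulation.

```python
import numpy as np

def fuse_scores(s_base, s_con=None, s_coocc=None, weights=(1.0, 1.0, 1.0)):
    """Late fusion of per-attribute scores.

    s_base:  SVM scores for the interval itself.
    s_con:   scores aggregated over surrounding intervals (temporal context).
    s_coocc: per-attribute scores re-estimated from co-occurring attributes.
    Any component may be omitted, mirroring lines 1-6 of Table 8.
    """
    total = weights[0] * np.asarray(s_base, dtype=float)
    if s_con is not None:
        total = total + weights[1] * np.asarray(s_con, dtype=float)
    if s_coocc is not None:
        total = total + weights[2] * np.asarray(s_coocc, dtype=float)
    return total
```

Dropping a component (e.g. passing only `s_coocc`) corresponds to the "only" rows of Table 8, while passing all three corresponds to "Base + Cont. + Co-occ".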
7.4 Composite Cooking Activity Classification
Attribute recognition using context and co-occurrence, mean AP in %. Combi+cSift refers to Dense Traj,Hand-Traj,-cSift, see Sect. 7.3 for discussion
Attribute training on: | All composites | | Disjoint composites | |
---|---|---|---|---|
 | Dense Traj | Combi +cSift | Dense Traj | Combi +cSift |
(1) Base (\(s^{base}\)) | 36.1 | 43.7 | 33.5 | 35.9 |
(2) Context only (\(s^{con}\)) | 11.1 | 12.6 | 6.8 | 8.1 |
(3) Base + Context | 37.8 | 41.2 | 28.3 | 32.3 |
(4) Co-occ. only (\(s^{coocc}\)) | 38.1 | 41.7 | 32.6 | 35.3 |
(5) Base + Co-occ. | 38.1 | 41.4 | 32.7 | 35.2 |
(6) Base + Cont. + Co-occ | 39.3 | 41.5 | 30.8 | 32.6 |
Composite cooking activity classification, mean AP in %. Top left quarter: fully supervised; right columns: reduced attribute training data; bottom section: no composite cooking activity training data; bottom right quarter: true zero-shot. See Sect. 7.4 for discussion
Attribute training on: | All composites | | Disjoint composites | |
---|---|---|---|---|
 | Dense Traj | Combi +cSift | Dense Traj | Combi +cSift |
With training data for composites | ||||
Without attributes | ||||
(1) SVM | 39.8 | 41.1 | - | - |
Attributes on gt intervals | ||||
(2) SVM | 43.6 | 52.3 | 32.3 | 34.9 |
Attributes on automatic segmentation | ||||
(3) SVM | 49.0 | 56.9 | 35.7 | 34.8 |
(4) NN | 42.1 | 43.3 | 24.7 | 32.7 |
(5) NN + Script data | 35.0 | 40.4 | 18.0 | 21.9 |
(6) PST + Script data | 54.5 | 57.4 | 32.2 | 32.5 |
No training data for composites | ||||
Attributes on automatic segmentation | ||||
(7) Script data | 36.7 | 29.9 | 19.6 | 21.9 |
(8) PST + Script data | 36.6 | 43.8 | 21.1 | 19.3 |
Examining the results in Table 9 we make several interesting observations. First, training composites on attributes of fine-grained activities and objects (line 3 in Table 9) outperforms low-level features (line 1 in Table 9), supporting our claim that for learning composite activities it is important to share information on an intermediate level of attributes.
The second, somewhat surprising, observation is that recognizing composites based on our segmentation (line 3 in Table 9) outperforms using the ground-truth segments (line 2 in Table 9). We attribute this to the fact that our segmentation is coarser than the ground truth and that we additionally remove noisy and background segments with a background classifier. This leads to more robust attributes and consequently better composite recognition. It also allows us to have separate training sets for composites and attributes. This setting is explored in the top right quarter of Table 9: here the training sequences for attributes are disjoint from those for the composites, i.e. we do not require attribute annotations for the composite training set.
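The NN rows of Table 9 classify a composite by comparing attribute-score vectors of whole videos. A minimal sketch, assuming cosine similarity as the distance and that a video is described by one pooled attribute vector (both illustrative choices, not necessarily the paper's):

```python
import numpy as np

def nn_composite(test_vec, train_vecs, train_labels):
    """Nearest-neighbour composite recognition: a video is described by
    the vector of attribute-classifier scores pooled over its segments,
    and receives the composite label of the most similar training video.

    test_vec:     attribute-score vector of the test video.
    train_vecs:   list of attribute-score vectors of training videos.
    train_labels: composite label per training video.
    """
    t = np.asarray(test_vec, dtype=float)
    t = t / (np.linalg.norm(t) + 1e-12)
    best_label, best_sim = None, -np.inf
    for vec, label in zip(train_vecs, train_labels):
        v = np.asarray(vec, dtype=float)
        v = v / (np.linalg.norm(v) + 1e-12)
        sim = float(np.dot(t, v))       # cosine similarity (assumption)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```

The SVM rows replace this non-parametric lookup with a classifier trained on the same attribute-score vectors, which, as discussed below, can learn which attributes are reliable.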
Variants of script knowledge, AP in %. Combi+cSift refers to Dense Traj,Hand-Traj,-cSift. See Sect. 7.4 for discussion
Attribute training on: | All composites | | Disjoint composites | |
---|---|---|---|---|
 | Dense Traj | Combi +cSift | Dense Traj | Combi +cSift |
No training data for composites | ||||
Script data | ||||
(1) freq-literal | 28.2 | 30.5 | 19.8 | 24.1 |
(2) freq-WN | 25.3 | 28.6 | 17.4 | 20.3 |
(3) tf\(*\)idf-literal | 35.9 | 31.8 | 20.0 | 23.6 |
(4) tf\(*\)idf-WN | 36.7 | 29.9 | 19.6 | 21.9 |
Fourth, using our Propagated Semantic Transfer (PST) approach is in most cases superior to the other variants of incorporating script data (NN + Script data / Script data). Most notably it reaches 57.4 % AP for our combined feature. This is the overall best performance and also outperforms the SVM with 56.9 % AP. PST drops slightly for the last number in the table (19.3 %), which we found is due to rather suboptimal parameters selected on the validation set. We note that in the scenario of Disjoint composites (top right quarter of Table 9) PST + Script data is outperformed by training an SVM. We attribute this to the fact that the attributes are less robust in this scenario (see Table 8) and the SVM can better adjust to that by learning which attributes are reliable and which are not. NN and PST are based on distances between attribute score vectors, so metric learning could be beneficial in these cases.
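To convey the intuition behind PST, the sketch below uses a generic Zhou-style label-propagation update: initial per-composite scores (e.g. derived from script data) are smoothed over a similarity graph of test videos, so that videos with similar attribute-score vectors receive similar composite scores. The actual PST formulation differs in its details (k-NN graph construction, normalisation, parameter choices), so treat this purely as an illustration.

```python
import numpy as np

def propagate(init_scores, sim, alpha=0.75, iters=20):
    """Iterative label propagation over a similarity graph.

    init_scores: (n_videos, n_classes) initial scores, e.g. from script data.
    sim:         (n_videos, n_videos) symmetric similarity matrix built from
                 distances between attribute-score vectors.
    alpha:       trade-off between graph smoothing and the initial scores.
    """
    W = np.asarray(sim, dtype=float).copy()
    np.fill_diagonal(W, 0.0)             # no self-loops
    d = W.sum(axis=1)
    d[d == 0] = 1.0
    S = W / d[:, None]                   # row-normalised transition matrix
    Y = np.asarray(init_scores, dtype=float)
    F = Y.copy()
    for _ in range(iters):
        # blend neighbours' current scores with the initial evidence
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F
```

After propagation, each video is assigned the composite with the highest smoothed score; with zero initial evidence for a class, that class never gains score, which is why the quality of the script-derived initialisation matters.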
Qualitative results for Dense Trajectories and its combination with hand-centric features (line 10 in Table 5) with respect to ground-truth (Color table online)
Sixth, while in Table 9 we always used the variant tf\(*\)idf-WN for Script data, in Table 10 we show the different variants of Script data for the case where they are not combined with NN or PST. The main observation is that freq-WN performs worst in all cases, most likely because the WordNet expansions make the results noisier. While tf\(*\)idf-WN works best in the first column, there is overall no clear winner. However, when incorporating script data in PST, it is more important to select appropriate parameters for PST on the validation set than to select the right variant of Script data.
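The tf\(*\)idf variants weight each attribute term by how often it occurs in the scripts of a composite, discounted by how many composites mention it at all. A minimal sketch of the literal variant, with our own choice of normalisation (the "-WN" variants would additionally expand terms via WordNet before counting):

```python
import math
from collections import Counter

def tfidf_weights(scripts):
    """tf*idf association between composites and attribute terms.

    scripts: dict mapping each composite name to the list of attribute
             terms extracted from its textual descriptions (literal
             string matches in this sketch).
    Returns {composite: {term: weight}}.
    """
    n = len(scripts)
    df = Counter()                       # document frequency per term
    for terms in scripts.values():
        for t in set(terms):
            df[t] += 1
    weights = {}
    for comp, terms in scripts.items():
        tf = Counter(terms)              # term frequency within this composite
        weights[comp] = {
            t: (tf[t] / len(terms)) * math.log(n / df[t])
            for t in tf
        }
    return weights
```

A term mentioned in every composite's scripts (idf \(=\log 1=0\)) gets zero weight, which is exactly what makes tf\(*\)idf more discriminative than the raw frequency (freq) variants.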
Last, we want to highlight an interesting comparison of the first line (SVM without attributes) versus line 8 (PST + Script data), which effectively compares the settings "only composite labels" versus "only attribute labels" (+ Script data). Although the latter does not have any labels for the actual task of composite recognition, it performs either similarly (in the case of Dense Trajectories) or slightly better (for the combined features). This indicates that our PST + Script data approach is very good at transferring information from the original task it was trained on to another task, which is very important for adaptation to novel situations, as they are typical for assisted daily living scenarios.
Table 11 provides qualitative results for three composite videos including how they are decomposed into attributes of fine-grained activities and participating objects.
8 Conclusion
In this work we address two challenges that have not been widely explored so far: fine-grained activity recognition and composite activity recognition. To approach these tasks we propose the large activity database MPII Cooking 2. We recorded and annotated 273 videos totaling more than 27 hours, with 30 human subjects performing a large number of realistic cooking activities. Our database is unique with respect to the size, length, and complexity of the videos, and the available annotations (activities, objects, human pose, text descriptions).
To estimate the complexity of fine-grained activity recognition in our database we compare three types of approaches: pose-based, hand-centric, and holistic. We evaluate on a classification task and on the often neglected detection task. Our results show that for recognizing fine-grained activities and their participating objects it is beneficial to focus on the hand regions, as the activities are hand-centric and the relevant objects are in the hand neighbourhood.
Composite activities are difficult to recognize because of their inherent variability and the lack of training data for specific composites. We show that attribute-based activity recognition allows recognizing composite activities well. Most notably, we describe how textual script data, which is easy to collect, enables an improvement of the composite activity recognition when only little training data is available, and even allows for complete zero-shot transfer.
As part of future work we plan to validate our hand-centric approach in other domains and exploit the scripts for composite activity recognition by modeling the temporal structure of the video.
Acknowledgments
This work was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD), by the Cluster of Excellence “Multimodal Computing and Interaction” of the German Excellence Initiative and the Max Planck Center for Visual Computing and Communication.