Retrieving visual information from large collections of digital images and video is essentially a pattern recognition problem. For video documents, the term “Indexing” is of primary interest, as it means that a specific pattern is present in the spatio-temporal document at a given moment in time. Recognizing a concept in a video document, such as a person's action or an object of a predefined category, can be treated as searching for that concept in a collection of images in an image database. The efficiency of the search depends strongly on the completeness of the content description and on the discriminative power of the classifiers in the proposed description space. Recent research in concept detection shows that this efficiency can be increased when multiple content cues are considered. In video, the challenging task of concept retrieval can be addressed by using all modalities for content description: spatial (key frames), temporal (motion features), and audio (audio features). In this talk we are interested in the recognition of Activities of Daily Living in specific video streams coming from video cameras worn by patients. This concept of egocentric motion has recently gained popularity, and various research has sought to recognize scene elements and actions in such streams. In our solution we develop hierarchical Markov models to represent a document and propose a rich description space. Various combinations of description spaces and subspaces are studied in early, intermediate, and late fusion schemes, yielding promising results.
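
To make the difference between early and late fusion concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the method used in this work: the function names, the modality names, the activity labels, and the weighted-average rule for late fusion are all hypothetical, and intermediate fusion is not shown.

```python
from typing import Dict, List, Sequence


def early_fusion(features: Dict[str, Sequence[float]],
                 order: Sequence[str]) -> List[float]:
    """Early fusion (assumed form): concatenate the per-modality feature
    vectors into one descriptor before any classifier is applied."""
    fused: List[float] = []
    for name in order:
        fused.extend(features[name])
    return fused


def late_fusion(scores: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> Dict[str, float]:
    """Late fusion (assumed form): each modality has already been classified
    separately; combine the per-class scores by a weighted average."""
    total = sum(weights[modality] for modality in scores)
    fused: Dict[str, float] = {}
    for modality, class_scores in scores.items():
        for label, score in class_scores.items():
            fused[label] = fused.get(label, 0.0) + weights[modality] * score / total
    return fused


if __name__ == "__main__":
    # Hypothetical toy descriptors for one video segment.
    segment = {"visual": [1.0, 2.0], "motion": [3.0], "audio": [4.0, 5.0]}
    print(early_fusion(segment, ["visual", "motion", "audio"]))

    # Hypothetical per-modality classifier outputs for two activities.
    per_modality = {
        "visual": {"cooking": 0.8, "reading": 0.2},
        "audio": {"cooking": 0.4, "reading": 0.6},
    }
    print(late_fusion(per_modality, {"visual": 1.0, "audio": 1.0}))
```

In the early scheme a single classifier sees the joint descriptor, whereas in the late scheme the combination happens on classifier outputs, so the modalities can use different models and the weights can reflect how reliable each modality is.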