5th Seminar on Neural Network Applications in Electrical Engineering, NEUREL-2000, Faculty of Electrical Engineering, University of Belgrade, Yugoslavia, September 26-27, 2000
NEURAL NETWORKS APPLICATIONS FOR MULTIMEDIA PROCESSING
Invited paper
Zoran Bojkovic (1), Dragorad Milovanovic (1), Nikos Mastorakis (2)
(1) Faculty of Electrical Engineering, University of Belgrade. Email: [email protected] ac.yu
(2) Military Institutions of University Education, Hellenic Naval Academy, Piraeus, Greece. Email: [email protected]
Abstract - Adaptive neural network (NN) technologies present a unified solution to a broad spectrum of multimedia applications. In this paper we point out that the current research frontier of multimedia signal processing is shifting from coding (MPEG-1, 2, 4) to automatic recognition (MPEG-7). Its research domain will cover techniques for object-based representation/coding, segmentation/tracking, pattern detection/recognition, multimodal signal fusion/conversion/synchronization, as well as content-based indexing and subject-based retrieval/browsing. We also report successful applications of NN to compression, denoising, contrast enhancement, snake-model segmentation/visualization, audio-to-video conversion, and content-based processing/indexing/retrieval.
Keywords: neural networks, multimedia, compression, segmentation, audio-visual integration, MPEG-7
The technology and standards for multimedia systems are evolving quickly, and it is therefore challenging to keep pace with the wide spectrum of these rapidly advancing technologies [1,2]. Among them, neural networks are a core technology, presenting a unified solution to a broad spectrum of multimedia processing applications: image/video segmentation, texture classification, tracking of moving objects, image visualization, face-object detection/recognition, audio classification, multimodal conversion/synchronization, and so on. Generally, multimedia systems can achieve their potential only when they are truly integrated in three key ways: integration of content, integration with human users, and integration with other media systems. This great technical challenge calls for advances in two distinct research areas:
• Computer networking technologies. Novel communications and networking technologies are critical for a multimedia system. A truly integrated media system must connect with individual users and content-addressable multimedia databases. This will involve both logical connection to support information sharing and physical connection via computer networks and data transfer.
• Information processing technology. Multimedia systems must successfully combine digital video and audio, text, animation, graphics, and knowledge about such information units and their interrelationships in real time.
0-7803-5512-1/99/$10.00 © 2000 IEEE
This paper mainly addresses emerging research issues in information-processing technology. Neural networks have recently received increasing attention in the following multimedia applications:
• Human perception: facial expression and emotional categorization, human color perception, multimedia data visualization.
• Computer-human communication: face recognition, lip-reading analysis, human-human and computer-human communication.
• Multimodal representation and information retrieval: hyperlinking of multimedia objects, queries and search of multimedia information, 3D object representation and motion tracking, image sequence generation and animation.
In the first part of the paper, we systematize the main attributes of NN that make it a core technology for multimedia processing. In the second part, we report successful applications of NN to compression, denoising, contrast enhancement, snake-model segmentation/visualization, audio-to-video conversion, and multimedia database indexing/retrieval.
2. NN AND MULTIMEDIA PROCESSING
The main reason why neural networks are a core technology for multimedia processing hinges upon the following attributes:
• Neural networks offer universal approximation capabilities, i.e., they can accurately approximate unknown systems from sparse sets of noisy data. In this context, some neural models have also effectively incorporated statistical signal processing and optimization techniques.
• Neural networks offer unsupervised clustering (i.e., no specific target label/response is provided for any input) and/or supervised learning mechanisms (the input and the corresponding target label/response are both given) for recognition of objects that are deformed or carry incomplete information. Ultimately, a neural information engine (SOFM-Self-organizing feature map, EM-Expectation-maximization for maximum likelihood estimation, PCA/ICA-Principal and independent components analysis) can be trained to see or hear, to recognize objects or speech, or to perceive human gestures.
• Neural networks (MLP-Multilayer perceptron, RBF-Radial basis functions, OCON-One class in one network, DBNN-Decision-based NN, MOE-Mixture of experts) are powerful pattern classifiers and share many similarities with statistical pattern recognition approaches.
resulting codebooks are less sensitive to initial conditions than those of the standard LBG algorithm, and the topological ordering of the entries can be exploited to further increase coding efficiency and reduce computational complexity.
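As an illustration of SOFM-based codebook design for vector quantization, the following minimal Python sketch trains a one-dimensional map on scalar samples and quantizes a value to its nearest codeword. The function names (`train_sofm_codebook`, `nearest_codeword`, `quantization_error`), the linearly decaying learning rate and neighborhood radius, and the triangular neighborhood are assumptions made for this example; they are not taken from the cited work.

```python
import random


def nearest_codeword(codebook, x):
    """Return the index of the codeword closest to the scalar sample x."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - x))


def train_sofm_codebook(samples, size, epochs=20, lr0=0.5, seed=0):
    """Train a 1-D self-organizing map whose units serve as VQ codewords.

    Assumed schedule: the learning rate and the neighborhood radius decay
    linearly with the epoch index. Units start from randomly chosen samples.
    """
    rng = random.Random(seed)
    codebook = [rng.choice(samples) for _ in range(size)]
    radius0 = max(1, size // 2)
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(1, round(radius0 * frac))
        order = list(samples)
        rng.shuffle(order)
        for x in order:
            winner = nearest_codeword(codebook, x)
            first = max(0, winner - radius)
            last = min(size - 1, winner + radius)
            # Move the winner and its map neighbors toward the sample.
            for k in range(first, last + 1):
                # Triangular neighborhood: full strength at the winner.
                h = 1.0 - abs(k - winner) / (radius + 1)
                codebook[k] += lr * h * (x - codebook[k])
    return codebook


def quantization_error(codebook, samples):
    """Mean squared error of quantizing the samples with the codebook."""
    total = 0.0
    for x in samples:
        total += (x - codebook[nearest_codeword(codebook, x)]) ** 2
    return total / len(samples)
```

Because neighboring units are updated together, adjacent codebook entries tend to end up with similar values, which is the topological ordering referred to in the text. For image blocks the scalar samples would be replaced by vectors and the absolute difference by a vector distance.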
3.2 Wavelet neural networks (WNN) for denoising and contrast enhancement
Image denoising and enhancement can be handled successfully with the multiresolution wavelet transform (WT). Applied to noise reduction, the WT computes an orthogonal wavelet decomposition of the image, and an appropriate soft thresholding rule is then applied to the coefficients. Techniques for image enhancement are very close to those for denoising: they are also based on applying an appropriate nonlinear transformation to the WT coefficients. It is important to emphasize that the WT allows the problem to be approached on a multiresolution basis, i.e., through decomposition into coarse and fine details, which provides adaptation to specific image characteristics as well as selective enhancement of the features to be extracted.
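The decompose, threshold and reconstruct procedure described above can be sketched in a few lines of Python. The sketch assumes a single-level orthonormal Haar transform of a 1-D signal of even length and a fixed, hand-picked threshold; the wavelet family, the number of decomposition levels and the threshold selection rule are illustrative choices, not those of the cited work.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """Single-level orthonormal Haar transform of an even-length sequence."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def soft_threshold(c, t):
    """Shrink coefficient c toward zero by t; set it to zero if |c| <= t."""
    if c > t:
        return c - t
    if c < -t:
        return c + t
    return 0.0


def denoise(signal, t):
    """Soft-threshold the detail coefficients and reconstruct the signal."""
    approx, detail = haar_forward(signal)
    shrunk = [soft_threshold(d, t) for d in detail]
    return haar_inverse(approx, shrunk)
```

With `t = 0` the signal is reconstructed unchanged, while a large threshold removes all fine detail and leaves only the pairwise averages. Contrast enhancement follows the same pattern with a different nonlinear function applied to the detail coefficients.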
• Temporal neural models (TDNN-Time delay NN, SRN-Simple recurrent network, RTRL-Real time recurrent learning, BPTT-Back propagation through time, HMM-Hidden Markov model), which are specifically designed to deal with temporal signals, further expand the application domain in multimedia processing, particularly in audio, speech, and audio-visual integration and interaction [1,3].
• A hierarchical network of neural modules will be vital to facilitate the search mechanisms used in a huge, or Web-wide, database. Typically, a tree network structure would be adopted so that kernels common to all the models can be stored at the root of the tree. The leaves of the tree would correspond to the individual neural models, while the paths from root to leaf correspond to the kernels involved.
Recently, attention has been paid to the unification of neural network and wavelet methodologies, resulting in the wavelet neural network (WNN). Namely, it has been shown that wavelets are a very convenient choice of activation function in the basic neuron models (Fig. 1).
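To make the idea of a wavelet activation concrete, the sketch below evaluates a single-input wavelet network whose hidden units are dilated and translated copies of one mother wavelet. The Mexican-hat wavelet and the parameter names (`translations`, `dilations`, `weights`, `bias`) are assumptions for illustration; the network of Fig. 1 may use a different wavelet and structure.

```python
import math


def mexican_hat(u):
    """Mexican-hat mother wavelet (an assumed choice of activation)."""
    return (1.0 - u * u) * math.exp(-0.5 * u * u)


def wnn_output(x, translations, dilations, weights, bias=0.0):
    """Weighted sum of wavelet units, each with its own shift and scale."""
    total = bias
    for t, s, w in zip(translations, dilations, weights):
        total += w * mexican_hat((x - t) / s)
    return total
```

In a trained network the translations, dilations and weights would all be adjusted by a learning rule; here they are simply passed in as fixed lists.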
3. APPLICATIONS
From the commercial system perspective, there are many promising application-driven research problems. These include analysis of multimodal scene change detection, facial expressions and gestures, fusion of gesture/emotion and speech/audio signals, automatic captioning for the hearing impaired or second-language television audiences, multimedia telephone, and interactive multimedia services for audio, speech, image and video content.
Figure 1. Wavelet neural network [9].
3.3 Snake active contour models and NN for image segmentation and visualization
3.1 NN for image compression
A snake is an open or closed elastic curve represented by a set of control points. Contours of distinct features (specified a priori by the user in an energy formulation) are found by gradually deforming and moving the elastic curve from an initial shape placed on the image toward the positions where the distinct features are to be extracted. This deformation process is guided by iteratively searching for a nearby local minimum of an energy function, which consists of the internal energy (a smoothness constraint on the snake curve: tension and bending) and the external energy, which indicates the degree of matching for features (such as high image intensity for bright regions or large gradient strength for edges):
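A commonly used discrete form of this energy, written with the symbols used in this section, is sketched below. The squared-difference form of the tension and bending terms is an assumption based on the standard active contour formulation.

```latex
E_{snake} = \sum_{i=1}^{N} \left[ \alpha_i \lVert v_i - v_{i-1} \rVert^2
          + \beta_i \lVert v_{i-1} - 2 v_i + v_{i+1} \rVert^2
          + E_{ext}(i) \right]
```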
NN are well suited to the problem of image compression due to their massively parallel and distributed architecture. Their features are analogous to some of the features of the human visual system. For example, multilayer perceptrons can be used as nonlinear predictors in differential pulse-code modulation (DPCM). Such predictors have been shown to increase the predictive gain relative to a linear predictor [5]. Another example is the application of Hebbian learning to the extraction of principal components, which are the basis vectors for the optimal linear Karhunen-Loeve transform (KLT) [6,7]. These learning algorithms are iterative, have some computational advantages over standard eigendecomposition techniques, and can be made to adapt to changes in the input signal. Yet another model, the self-organizing feature map (SOFM), has been used with a great deal of success in the design of codebooks for vector quantization (VQ). The
where $N$ is the number of snake points, $v_i = (x_i, y_i)$ is the coordinate of the $i$th snake point, $\alpha_i$ is a constant imposing the tension constraint between two adjacent snake points, $\beta_i$ is a constant imposing the bending constraint among three neighboring snake points, and $E_{ext}(i)$ is usually some sort of image gradient function if edge detection is the goal. In 2D applications, the following equations are iteratively solved to find a local minimum of the snake energy function:
$$A x + f_x(x, y) = 0, \qquad A y + f_y(x, y) = 0$$
where the pentadiagonal matrix $A$, whose band is a function of $\alpha$ and $\beta$, imposes constraints on the relationship among five neighboring snake points, the vectors $x$ and $y$ are the coordinates of the $N$ snake points, and the vectors $f_x(x, y)$ and $f_y(x, y)$ denote the partial derivatives of the external energy at each snake point, i.e.,
$$f_x(x_i, y_i) = \partial E_{ext}(x_i, y_i) / \partial x, \qquad f_y(x_i, y_i) = \partial E_{ext}(x_i, y_i) / \partial y$$
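The iteration toward a solution of these two equations can be illustrated with a small Python sketch. Several simplifications are assumed here and are not taken from the paper: the snake is closed, $\alpha$ and $\beta$ are the same for every point, the update is a plain explicit gradient step rather than a matrix solve, and the external energy is a toy function $E_{ext} = (\sqrt{x^2 + y^2} - R)^2$ that attracts the snake to a circle of radius $R$. All function names are hypothetical.

```python
import math


def internal_force(coords, alpha, beta):
    """Apply the pentadiagonal matrix A to one coordinate vector.

    For constant alpha and beta each row of A has the band
    [beta, -(alpha + 4 beta), 2 alpha + 6 beta, -(alpha + 4 beta), beta],
    wrapped around because the snake is assumed to be closed.
    """
    n = len(coords)
    out = []
    for i in range(n):
        outer = coords[(i - 2) % n] + coords[(i + 2) % n]
        inner = coords[(i - 1) % n] + coords[(i + 1) % n]
        out.append(beta * outer
                   - (alpha + 4.0 * beta) * inner
                   + (2.0 * alpha + 6.0 * beta) * coords[i])
    return out


def circle_energy_gradient(x, y, radius):
    """Gradient (f_x, f_y) of the toy energy (sqrt(x^2 + y^2) - radius)^2."""
    r = math.hypot(x, y)
    if r == 0.0:
        return 0.0, 0.0
    g = 2.0 * (r - radius) / r
    return g * x, g * y


def evolve_snake(xs, ys, alpha=0.1, beta=0.01, radius=1.0,
                 step=0.05, iterations=200):
    """Explicit descent: x <- x - step * (A x + f_x), and likewise for y."""
    xs, ys = list(xs), list(ys)
    for _ in range(iterations):
        ax = internal_force(xs, alpha, beta)
        ay = internal_force(ys, alpha, beta)
        grads = [circle_energy_gradient(x, y, radius)
                 for x, y in zip(xs, ys)]
        xs = [x - step * (a + g[0]) for x, a, g in zip(xs, ax, grads)]
        ys = [y - step * (a + g[1]) for y, a, g in zip(ys, ay, grads)]
    return xs, ys
```

At a fixed point of this update the bracketed terms vanish, which is exactly the pair of equations above. A snake started on a circle of radius 2 settles slightly inside the target circle, because the tension and bending terms keep pulling the closed contour inward. In an image application the toy gradient would be replaced by derivatives of an image-based energy map.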
The formulation of a good external energy function is difficult because images are often too noisy and/or too complex for low-level image-processing techniques to generate a usable energy profile. Thanks to their nonlinear mapping and generalization capability, feed-forward multilayer neural networks can be used to generate the external energy profile systematically through training on data. Snake algorithms have been widely used in many signal-processing applications: tracking the movements of the mouth, extracting lines in fingerprint images, tracking the dynamics of moving objects using a Kalman filter, tracking the 2D heart contour frame by frame, etc. We have successfully used modified snake models for interactive segmentation of ultrasound images in obstetrics, as well as for 3D reconstruction and segmentation of computed tomography images [10, 11, 12].
3.4 Hidden Markov models (HMM) and NN for audio-to-visual mapping
A recent trend in multimedia research is to integrate audio and visual processing in order to exploit media interaction (Fig. 2) [13]. The problem of converting acoustic speech to mouth shape parameters can be approached in many different ways. Since there is a physical relationship between the shape of the vocal tract and the sound that is produced, there may exist a functional relationship between the speech parameters and the visual parameter set. The conversion problem then becomes one of finding the best functional approximation given sets of training data. There are many algorithms that can be modified to perform this task. These approaches include neural networks (NNs) and hidden Markov models with Gaussian mixtures [1] (Fig. 4). A classification-based approach to conversion is shown as a block scheme in Fig. 3. It contains two stages. In the first stage, the acoustics are classified into one of a number of classes. The second stage maps each acoustic class to a corresponding visual output. In the first stage, an NN can be used to divide the acoustic training data into a number of classes. A drawback of the classification-based method is that it does not produce a continuous mapping, but rather a discrete number of output levels. There exist a few application examples that apply temporal neural models to conversion and/or synchronization.
HMMs have been used in speech recognition for many years. Although the majority of speech recognition systems train HMMs on acoustic parameter sets, they can also be used to model visual parameter sets. Consider estimating a single visual parameter $v$ from the corresponding multidimensional acoustic parameter $a$. Defining the combined observation to be $o = [a^T, v]^T$, the audio-to-visual conversion process using HMMs can be treated as a missing data problem. More specifically, a continuous-density HMM is trained with a sequence of $o$ for each word in the vocabulary. In the training phase, the Gaussian mixtures in each state of the HMM model the joint distribution of the audio-visual parameters. When presented with a sequence of acoustic vectors that corresponds to a particular word, conversion can be made by using the HMM to segment the sequence with the Viterbi algorithm [1].
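The segmentation step can be sketched as follows. This is a deliberately reduced illustration, not the system of [1]: the acoustic observation is a single scalar instead of a vector, each state has one Gaussian instead of a mixture, and the visual estimate for a frame is simply the mean visual parameter assigned to its Viterbi state (the names `visual_means` and `audio_to_visual` are hypothetical).

```python
import math


def log_gauss(x, mean, var):
    """Log density of a scalar Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi(obs, init, trans, means, variances):
    """Most likely state sequence for scalar observations.

    init[s] and trans[p][s] must be strictly positive because the
    recursion is carried out in the log domain.
    """
    n = len(init)
    score = [math.log(init[s]) + log_gauss(obs[0], means[s], variances[s])
             for s in range(n)]
    back = []
    for x in obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n):
            best = max(range(n),
                       key=lambda p: prev[p] + math.log(trans[p][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s])
                         + log_gauss(x, means[s], variances[s]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def audio_to_visual(obs, init, trans, means, variances, visual_means):
    """Map each acoustic frame to the visual mean of its Viterbi state."""
    states = viterbi(obs, init, trans, means, variances)
    return [visual_means[s] for s in states]
```

Like the classification-based method, this reduced version produces a piecewise-constant visual trajectory. A fuller treatment would use the joint audio-visual Gaussian of each state to form an estimate of $v$ that also depends on the observed $a$.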
3.5 Content-based processing, indexing and retrieval
As most digital text, audio, and visual archives will exist on various servers all over the world, it will become increasingly difficult to locate and access the information. This necessitates automatic search tools for indexing and access. Theory and tools that facilitate searching and manipulating images are, at present, in their infancy, and many problems remain unsolved. For decades, image processing researchers have focused on problems arising with medical and military images. Access to multimedia material according to its content involves a number of different scientific fields: digital image/audio processing, solving semantic problems with neural networks and artificial intelligence, and so on.
• Image content recognition. This is a general problem, intensively investigated in military applications. The aim of a detection algorithm is to maximise the probability of detection for a given false alarm rate. However, for searching multimedia content, it is necessary to develop new criterion functions, because a false alarm might correspond to pictures the user did not request but might still enjoy seeing.
• Finding regions of interest. Image segmentation is traditionally defined as partitioning an image into homogeneous regions; the dual problem is edge detection. However, optimal image segmentation is unnecessary when searching by content. Here, segmentation is defined as finding the regions of an image that are of interest to the user. A possible solution is a dynamic hierarchy of sets of models optimised for different targets.
• Video parsing. Instead of processing a whole video, it is useful to divide the video into shots (an unbroken sequence of frames from one camera), scenes (a sequence of shots that focus on the same point of interest) and segments (a sequence of scenes that forms a story unit). However, until now, research has been limited to shot detection based on "edge" detection in time, while the detection of scenes and segments is a much more complicated problem that has not been systematically investigated.
• Space-time segmentation. Combining information about space and motion is necessary when finding regions of interest for requests such as "separate the persons who are walking". Until now, only a small number of studies have dealt with this problem.
• Motion analysis. Motion analysis can be divided into scene motion (actions) and camera motion. Recognition of this motion is a completely new field of research (stroboscopic images, mosaic images and summary frames), which includes affine models with parallax correction.
• Compressed-domain image/video indexing and searching. Existing image/video compression techniques are optimised to achieve a maximal compression ratio with minimal perceptual image distortion and limited algorithmic complexity. The "fourth criterion" (the possibility of access to content) has been very poorly researched. It is supposed that there exists a significant synergy between image compression, feature extraction and searching. First, the complexity of processing is significantly reduced due to the smaller amount of data in the compressed domain. Second, a significant quantity of multimedia material is already compressed, so working in the compressed domain reduces the overhead of decoding and of re-encoding existing material. Third, efficient compression techniques already perform some form of information filtering (motion estimation) and content decomposition (spatial-frequency decomposition), which provides a good foundation for subsequent image content analysis. In the ideal case, analysis techniques are applied directly to the compressed data. In the suboptimal case, minimal decoding of the compressed data is still necessary to extract useful data.
The key problem is to define a set of models that enable image compression and data access at the same time, and also to define a measure of image similarity suited to the human visual system.
• Compression of 3D graphical models. Three-dimensional graphic models have become very accessible to general end-users thanks to the rapid development of 3D scanners and virtual reality modeling languages (VRML). However, complex objects modeled with the polygonal mesh method represent a big challenge even for high-end computers (manipulation and visualisation of 3D objects require huge memory resources and high rendering speed), as well as for telecommunication data transmission systems.
Content-based processing is so critical because it offers a very broad application domain, including video coding, compaction, object-oriented representation of video, content-based retrieval in the digital library, video mosaicing, video composition, etc. An NN-based tagging algorithm has been proposed for subject-based retrieval in image and video databases. Object classification for tagging is performed off-line. A hierarchical multiresolution approach is used, which helps cut down the search space when looking for a feature in an image. The system allows a customer to search the image database by semantic subject. A query is answered by searching over the tag database. A video indexing and browsing scheme based on human faces has also been reported. The scheme is implemented by applying face detection and recognition techniques, and it contains three steps. The first step is to segment the video sequence by applying a scene change detection algorithm. After that, a probabilistic NN face detector is invoked to find the segments that most probably contain human faces. Finally, the representative frames from every video shot are annotated and serve as the indexes for browsing.
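The scene change detection used as the first step of such a scheme, and the shot-level "edge detection in time" mentioned under video parsing, can be illustrated with a simple histogram-difference detector. This is a generic sketch, not the algorithm of the reported scheme: frames are assumed to be flat lists of 8-bit gray levels, and the number of bins and the threshold are arbitrary illustrative values.

```python
def histogram(frame, bins=8, max_value=256):
    """Gray-level histogram of a frame given as a flat list of integers."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return counts


def histogram_distance(frame_a, frame_b, bins=8):
    """Normalized L1 distance between two histograms, between 0 and 1."""
    ha = histogram(frame_a, bins)
    hb = histogram(frame_b, bins)
    diff = sum(abs(a - b) for a, b in zip(ha, hb))
    return diff / (len(frame_a) + len(frame_b))


def detect_shot_boundaries(frames, threshold=0.5, bins=8):
    """Indices i where frame i starts a new shot (a temporal 'edge')."""
    return [i for i in range(1, len(frames))
            if histogram_distance(frames[i - 1], frames[i], bins) > threshold]
```

Each detected boundary splits the sequence into shots, from which representative frames can then be passed to a face detector. Gradual transitions such as fades would need a more elaborate test than a single threshold on consecutive frames.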
4. CONCLUSION
Speech, image and video are playing increasingly dominant roles in multimedia information processing. Future multimedia technologies will need to handle this information with an increasing level of intelligence: automatic extraction, recognition, interpretation, and interaction of multimodal signals. Indeed, the technology frontier of information processing is shifting from coding (MPEG-1, 2, 4) to automatic recognition (MPEG-7, the multimedia content description interface [14]). Its research domain will cover techniques for object-based tracking/segmentation, pattern detection/recognition, content-based indexing and retrieval, and fusion of multimodal signals. For these, neural networks can offer a very promising horizon. From a long-term research perspective, there is a need to establish a fundamental and coherent theoretical ground for multimedia technologies. A synergistic balance and interaction between representation and indexing must be carefully investigated. Another fundamental research subject is the modeling and evaluation of perceptual quality in multimodal human communication.
REFERENCES
[1] K.R. Rao, Z. Bojković, D. Milovanović, Multimedia Communication Systems, Prentice Hall, 2001. In progress.
[2] D. Milovanović, Z. Bojković, S. Stanković, "Multimedia and multimedia communications technologies", in Proc. IT, 2000.
[3] S.-Y. Kung, J.-N. Hwang, "Neural networks for intelligent multimedia processing", Proc. of the IEEE, vol. 86, no. 6, pp. 1244-1272, 1998.
[4] D. Milovanović, Z. Bojković, B. Reljin, "Classical image coding techniques and neural network approaches: Comparative performance analysis", in Proc. NEUREL'95, pp. 126-131, 1995.
[5] Z. Bojković, D. Milovanović, "Predictive image coding in neural networks", in Proc. NEUREL'95, pp. 132-135.
[6] D. Milovanović et al., "Advanced video compression techniques in digital multimedia systems", in Proc. YUINFO'95, vol. 2, pp. 393-397, 1995.
[7] D. Milovanović et al., "Implementation of digital video compression algorithms in multimedia information systems", INFOFEST'95, invited paper.
[8] A. Vučković, D. Milovanović, S. Stanković, "Image denoising using wavelet neural networks", in Proc. ETRAN, 1998.
[9] A. Vučković, S. Stanković, D. Milovanović, "Wavelet neural networks", in Proc. TELFOR, pp. 424-427, 1996.
[10] D. Milovanović et al., "Interactive segmentation of ultrasound images in obstetrics using snakes-model of active contours", in Proc. DOCS, pp. 167-170, 1998.
[11] R. Maksimović, S. Stanković, D. Milovanović, "Computed tomography image analyzer: 3D reconstruction and segmentation applying active contour models - SNAKES", International Journal of Medical Informatics, Elsevier Science, 2000. Accepted.
[12] S. Stanković, D. Milovanović, R. Maksimović, M. Milosavljević, Advanced digital signal processing methods in telemedicine, in New challenges in health care, Monograph series 1, pp. 142-158, editors M. Babić and R. Zajtchuk, KBC Bezanijska Kosa Medical Center, University of Belgrade - RUSH University Medical Center, Chicago, 1999.
[13] Z. Bojković, D. Milovanović, "Audio-visual integration in multimedia communications based on MPEG-4 facial animation", Journal Circuits, Systems and Signal Processing, Special issue on Multimedia communication services, Birkhauser, 2001. Submitted.
[14] D. Milovanović, Z. Bojković, J. Stancu, "MPEG-7 - A new standard for multimedia content description", in Proc. TELFOR'98, pp. 573-577, 1998.
Figure 2. Media interaction.
Figure 3. Block scheme of a classification-based approach for converting acoustic speech to mouth shape parameters [1].
Figure 4. Results of (a) the HMM-based method and (b) the neural-network-based method for audio-to-visual mapping, shown as lip height versus time. The dotted line represents the height variation of the mouth when speaking a particular phrase. The solid line represents the estimate.