Information Fusion for Scene Understanding

Philippe Xu, Franck Davoine, Jean-Baptiste Bordes, Thierry Denœux

UMR CNRS 7253 Heudiasyc, Université de Technologie de Compiègne, BP 20529, 60205 Compiègne Cedex, France

LIAMA, CNRS, Key Lab of Machine Perception (MOE), Peking University, Beijing, P.R. China

Received 11 September 2013; revised 2 June 2014; accepted 30 June 2014.



This paper addresses the problem of scene understanding for driver assistance systems. In order to recognize the large number of objects that may be found on the road, several sensors and classification algorithms have to be used. The proposed approach is based on the representation of all available information over the regions of an over-segmented image. The main novelty of the framework is its capability to incorporate new classes of objects as well as new sensors or detection methods. Several classes, such as ground, vegetation, and sky, are considered, along with three different sensors. The approach was evaluated on real and publicly available urban driving scene data.

Extended Abstract

This paper addresses the problem of information fusion for traffic scene understanding. In order to tackle the numerous tasks that may be expected from advanced driver assistance systems, we propose a multimodal information fusion system that is flexible enough to include new sensors, new processing modules, and new classes of objects. Several issues have to be dealt with in multisensor systems. A first issue is combining information from sensors that perceive the environment differently. To address this point, we formulate the problem as an image labelling one. The image acquired by a camera is first over-segmented; the common task of all the modules, whatever data representation they use, is then to label each individual image segment. Another important issue is combining processing modules that deal with different classes of objects. The theory of belief functions is used to overcome this problem, as it can easily represent knowledge over sets of classes.
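The way belief functions let modules that reason on different sets of classes be combined can be sketched with Dempster's rule of combination. The sketch below is illustrative only: the mass dictionaries, class names, and `combine_dempster` helper are assumptions, not the paper's actual frames or implementation.

```python
from itertools import product

def combine_dempster(m1, m2):
    """Combine two mass functions (dicts mapping frozensets of classes
    to masses) with Dempster's rule, normalizing out the conflict."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:  # non-empty intersection: compatible evidence
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:      # empty intersection: conflicting evidence
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources fully contradict")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

# One module only separates ground from non-ground; another only
# vegetation from non-vegetation. Each expresses its evidence on a
# *set* of classes, leaving the rest of its mass on the whole frame.
theta = frozenset({"grass", "road", "tree", "obstacle"})
m_ground = {frozenset({"grass", "road"}): 0.7, theta: 0.3}
m_veg = {frozenset({"grass", "tree"}): 0.6, theta: 0.4}
m_fused = combine_dempster(m_ground, m_veg)
```

Here the two coarse sources jointly concentrate mass on the singleton {grass}, even though neither module knows that class individually.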

This paper shows how to construct mass functions using a distance-to-model formulation. The parameters of the mass functions are optimized by minimizing a loss function defined from the contour functions. A first module is built to detect the ground from 3D information computed by a stereo camera system. The 3D point cloud generated from a disparity map is used to estimate the ground plane. For each image segment, a mass function is then computed from the distance between the segment and the estimated ground plane. The same formulation is used to detect the ground from 3D information acquired by a LiDAR sensor. A texture-based monocular module is then considered to detect the sky and vegetation. The texture of an image segment is encoded by its Walsh-Hadamard coefficients, and a model is built from a bag-of-words approach. Finally, a temporal propagation module is proposed to link the segments of two consecutive images.
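The ground module's pipeline, plane estimation followed by a distance-to-model mass function, can be sketched as follows. This is a minimal sketch under stated assumptions: the least-squares plane fit stands in for whatever robust estimator the paper uses, and `alpha`, `gamma`, and the exponential form are illustrative placeholders for the parameters the paper learns from the contour functions.

```python
import numpy as np

def fit_ground_plane(points):
    """Least-squares plane z = a*x + b*y + c through an (N, 3) point
    cloud. In practice a robust fit (e.g. RANSAC) would be needed to
    reject points belonging to obstacles."""
    A = np.c_[points[:, :2], np.ones(len(points))]
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return coeffs  # (a, b, c)

def ground_mass(d, alpha=0.9, gamma=2.0):
    """Map a segment-to-plane distance d to a mass function over the
    frame {ground, not_ground}: segments close to the plane support
    {'ground'}; the remaining mass goes to the whole frame, which
    encodes ignorance rather than support for 'not_ground'."""
    s = alpha * float(np.exp(-gamma * d ** 2))
    return {frozenset({"ground"}): s,
            frozenset({"ground", "not_ground"}): 1.0 - s}

# Synthetic cloud scattered around the plane z = 0
rng = np.random.default_rng(0)
pts = np.c_[rng.uniform(-5, 5, (200, 2)), rng.normal(0, 0.01, 200)]
a, b, c = fit_ground_plane(pts)
m_near = ground_mass(0.05)  # close to the plane: strong ground support
m_far = ground_mass(2.0)    # far away: nearly all mass on the frame
```

Putting the residual mass on the full frame, instead of on the complement, is what lets a module stay non-committal when its sensor is uninformative for a segment.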

The KITTI Vision Benchmark Suite was used to validate our approach, considering two color cameras and a Velodyne 64-beam LiDAR. However, only one layer of the Velodyne LiDAR was used, in order to simulate a single-layer LiDAR commonly employed in mobile robotics. A total of 110 images were manually annotated, 70 for training and 40 for testing. Several modules were first combined for a simplified task: ground/non-ground classification. The ability of the proposed approach to process any number of classes was then illustrated by adding vegetation and sky detection modules. Overall, five classes were defined: grass, road, tree, obstacle and sky. The grass and road classes were defined by intersecting the ground class with the vegetation and non-vegetation classes, respectively. Similarly, the tree class was defined as the intersection of the non-ground class and the vegetation class, while the obstacle class referred to anything that was neither the sky, the ground, nor vegetation.
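The class intersections described above can be made concrete with plain set operations. The coarse-frame names below follow the paper's description of the modules, but the `refinement` dictionary itself is an illustrative encoding, not taken from the paper.

```python
# Each module reasons on a coarse frame; each coarse label refines to
# a subset of the five final classes (grass, road, tree, obstacle, sky).
refinement = {
    "ground":         {"grass", "road"},
    "non_ground":     {"tree", "obstacle", "sky"},
    "vegetation":     {"grass", "tree"},
    "non_vegetation": {"road", "obstacle", "sky"},
    "non_sky":        {"grass", "road", "tree", "obstacle"},
}

# The final classes are recovered as intersections of coarse labels.
grass = refinement["ground"] & refinement["vegetation"]
road = refinement["ground"] & refinement["non_vegetation"]
tree = refinement["non_ground"] & refinement["vegetation"]
obstacle = (refinement["non_ground"]
            & refinement["non_vegetation"]
            & refinement["non_sky"])
```

Because every module's output is expressed on subsets of one common refined frame, adding a new module (or a new class) only requires extending this dictionary, not retraining the others.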




Keywords: information fusion, traffic scene understanding, theory of belief functions, intelligent vehicles.



1. Introduction
2. Annotation of Over-Segmented Images
3. Theory of Belief Functions
4. Application to Traffic Scene Understanding
5. Experimental Results
6. Conclusions and Perspectives
