This paper presents the DESAM project, which was divided into two parts. The first part was devoted to the theoretical and experimental study of parametric and non-parametric techniques for decomposing audio signals into sound elements. The second part focused on some musical applications of these decompositions. Most aspects considered in this project have led to new methods, which have been grouped together in the DESAM Toolbox, a set of Matlab® functions dedicated to the estimation of widely used spectral models for music signals. Although these models can be used in Music Information Retrieval (MIR) tasks, the core functions of the toolbox do not focus on any specific application. The toolbox rather aims at providing a range of state-of-the-art signal processing tools that decompose music recordings according to different signal models, giving rise to different “mid-level” representations.
Analyzing a polyphonic recording in order to extract or modify its musical content (e.g. the instruments, the beat, or the notes) is a difficult exercise, even for an experienced musician. The tools described in this paper aim at enabling a computer to perform such tasks. Let us mention three of them:
1. Pitch estimation. Estimating the pitch of a sound (on a scale from low to high) is critical for identifying musical notes, but remains difficult in a polyphonic recording, because of the overlap of sounds in the time and frequency domains.
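For the monophonic case, the principle of pitch estimation can be sketched with a textbook autocorrelation method: the fundamental period shows up as the lag of the largest autocorrelation peak in the plausible pitch range. The following Python/NumPy example is purely illustrative (the DESAM tools themselves are Matlab® functions, and the project's multipitch estimators are far more elaborate); the function name and signal are hypothetical.

```python
import numpy as np

def estimate_pitch_autocorr(x, fs, fmin=50.0, fmax=2000.0):
    """Single-pitch estimate from the autocorrelation peak (monophonic only)."""
    x = x - np.mean(x)
    # Autocorrelation for non-negative lags.
    r = np.correlate(x, x, mode='full')[len(x) - 1:]
    # Restrict the search to lags corresponding to [fmin, fmax].
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag

fs = 44100
t = np.arange(int(0.1 * fs)) / fs
# A 220 Hz tone with one harmonic partial, mimicking a very simple note.
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
print(estimate_pitch_autocorr(x, fs))  # close to 220 Hz
```

In a polyphonic mixture, overlapping periodicities make such a single-peak criterion break down, which is precisely why dedicated multipitch methods are needed.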
2. Automatic transcription. While producing a sound from a musical score is relatively easy for both a skilled musician and a computer, the inverse problem, called “automatic transcription”, which aims at recovering a musical score from a recording, is much more complex and requires expert skills.
3. Audio coding. Storing and transmitting an ever-increasing volume of musical recordings requires coding these data in as compact a format as possible. This involves a tradeoff between the quantity of coded information and the quality of the reproduced sound.
In order to perform these tasks, one needs a model for polyphonic music. However, no single model can account for all the characteristics of musical tones in general, nor for how they are intertwined with one another to form music. Musical notes are primarily characterized by their pitch and their timbre, the latter being specific to the instrument. They can thus be modeled as a mixture of sinusoids, whose frequencies and amplitudes are related to the pitch and timbre of the sound. In order to estimate the fine time variations of these two parameters, one needs precise analysis methods, such as the so-called “high-resolution” methods. Besides, since a musical piece is composed of multiple notes played at different times, it is naturally described as a combination of elementary sound elements (which can be isolated notes, combinations of notes, or parts of notes). Such a representation is called “sparse”, since a very limited number of such sound elements, if well selected, should approximately describe the whole musical content. A complementary framework is based on a mathematical tool called “Non-negative Matrix Factorization” (NMF). It exploits the redundancies in a musical piece (a single tone generally being repeated within the piece) in order to identify the sound elements via their spectral characteristics and their various occurrences through time.
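The NMF idea can be illustrated with the standard multiplicative updates of Lee and Seung, applied here to a toy “spectrogram” built from two known spectral templates. This Python/NumPy sketch uses the plain Euclidean cost; the NMF variants developed within the project (e.g. Bayesian formulations with harmonicity and temporal-continuity constraints) are considerably richer.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(V, k, n_iter=500, eps=1e-9):
    """Plain Lee-Seung multiplicative updates for V ~ W @ H (Euclidean cost)."""
    f, t = V.shape
    W = rng.random((f, k)) + eps   # spectral templates (one per sound element)
    H = rng.random((k, t)) + eps   # time activations of each template
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": two spectral templates switched on and off over time,
# mimicking two notes repeated within a piece.
W0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
H0 = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
V = W0 @ H0
W, H = nmf(V, k=2)
print(np.linalg.norm(V - W @ H))  # small residual: the factorization is recovered
```

The columns of W play the role of the spectral characteristics of the sound elements, and the rows of H their occurrences through time, exactly the decomposition described above.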
Within the DESAM project, funded by the French ANR (Agence Nationale de la Recherche), we have developed a number of analysis tools:
– an original pitch estimation method, capable of estimating up to ten simultaneous notes, which has been used in an automatic transcription algorithm for piano music.
– another transcription scheme based on NMF, which has been developed for a larger class of instruments.
– a coding method based on high-resolution analysis, which reaches very low bitrates (high compression ratio).
– a scalable audio coder based on sparse decompositions, which can reach transparency (perceptually, the compressed sound cannot be distinguished from the original one).
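The greedy pursuit principle behind such sparse decompositions can be sketched with a textbook matching pursuit over an orthonormal DCT dictionary. This Python/NumPy example is an illustrative simplification (a real coder would use much larger, redundant time-frequency dictionaries and quantize the selected atoms); the signal and function names are hypothetical.

```python
import numpy as np

def matching_pursuit(x, D, n_atoms):
    """Greedy sparse approximation: repeatedly pick the dictionary atom
    most correlated with the residual (columns of D are unit-norm)."""
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        c = D.T @ residual               # correlations with every atom
        i = np.argmax(np.abs(c))         # best-matching atom
        coeffs[i] += c[i]
        residual -= c[i] * D[:, i]       # subtract its contribution
    return coeffs, residual

# Orthonormal DCT-II dictionary: atom j is cos(pi*(m+0.5)*j/n), normalized.
n = 64
k = np.arange(n)
D = np.cos(np.pi * (k[:, None] + 0.5) * k[None, :] / n)
D /= np.linalg.norm(D, axis=0)

# A signal made of 3 atoms is recovered exactly after 3 greedy steps.
x = 2.0 * D[:, 5] - 1.5 * D[:, 20] + 0.7 * D[:, 40]
coeffs, residual = matching_pursuit(x, D, n_atoms=3)
print(np.linalg.norm(residual))  # ~0: 3 well-selected atoms describe x entirely
```

This is the sense in which a sparse representation is compact: a few well-selected sound elements suffice, and the coder only needs to transmit their indices and coefficients.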
Most aspects considered in this project have led to new algorithms, which have been grouped together in the DESAM Toolbox, a set of Matlab® functions dedicated to the estimation of widely used spectral models for music signals. This paper briefly presents the innovative tools that were developed to build these systems. They are divided into two main parts: the first is devoted to the theoretical and experimental study of parametric and non-parametric techniques for decomposing audio signals into sound elements; the second focuses on some musical applications of these decompositions.
Although these models can be used in a wide range of Music Information Retrieval (MIR) tasks, the core functions of the toolbox do not focus on any specific application. Their goal is rather to provide a broad set of state-of-the-art signal processing tools that decompose music recordings according to different signal models, giving rise to different “mid-level” representations.
This article presents an overview of the results of the ANR DESAM project (Décompositions en éléments sonores et applications musicales, i.e. decompositions into sound elements and musical applications). The project comprised two parts: the first dealt with theoretical advances in decomposition techniques for digital audio signals, and the second with musical applications of these decompositions. Most aspects addressed in the project have given rise to new methods and algorithms, gathered in a toolbox, the DESAM Toolbox. It consists of a set of Matlab® functions dedicated to the estimation of spectral models widely used for music signals. The methods studied in this project can of course be useful for automatic information retrieval in music signals, but above all they constitute a collection of recent tools for decomposing signals according to different models, yielding a variety of mid-level representations that may be useful in other application domains.
audio processing, spectral models, sound modeling.