
Sequence-to-Sequence Autoencoder for Audio Fingerprinting

Nolan Shah (1417950) and Daniel Sitonic (1426332)
COSC6342: Machine Learning

The methodology in this paper is part of ongoing research. As a result, the code is not provided.

I. Introduction

State-of-the-art audio fingerprinting techniques include the Shazam technique by Wang [3], which builds a hashed time-frequency constellation analysis of audio, and the Philips technique by Haitsma [6], which builds hashes based on energy differences in sub-spectral bands. These techniques, while providing useful encodings of audio signals with fast lookup for similarity detection, do not preserve the audio's information space. We design and implement an audio semantic hashing technique using an unsupervised sequence-to-sequence autoencoder that generates distance-preserving hashes useful for audio fingerprinting.

II. Background

Mel-Frequency Cepstral Coefficients. Audio signals are encoded as a signal-versus-time graph; however, from a machine learning or information theory perspective, the raw signal is not particularly useful. Therefore, we convert the audio signal into a set of acoustic features per frame. Typically this is done by analyzing the frequency components that compose the signal using a Fourier transform (or the more computationally efficient FFT). The most commonly used acoustic features are Mel Frequency Cepstral Coefficients (MFCCs), described by Logan [7]. In the MFCC algorithm, the audio waveform is first split into frames. Then a Fourier transform is applied and the log of the amplitude spectrum is calculated. Next, mel-scaling and smoothing are applied, and finally a discrete cosine transform produces the feature vector. Part of the utility of MFCCs comes from modeling pitch perception similarly to the human auditory system; this is achieved by replicating the "filters" that the human ear naturally has. As Logan shows, MFCCs fare better than linear spectral modeling at both speech and music analysis.

Figure 2.1. A sample speech waveform and its MFCC.

Audio Fingerprinting Techniques. An audio fingerprint is a compact content-based signature that summarizes an audio recording. It has received a lot of attention through song and audio identification systems such as Shazam. State-of-the-art audio fingerprinting techniques include the Shazam technique by Wang [3] and the Philips technique by Haitsma [6].

The Shazam technique generates a spectrogram of an audio signal, then creates a constellation map of the peaks (high points in amplitude) in the spectrogram. Hashes are generated by hashing the frequency difference and time difference between peaks. The technique is very sensitive to the version of the audio file being hashed; in particular, it is not robust to a live performance of a song compared against its original. However, it performs exceptionally well at recognizing recordings of identical audio and identifying individual audio sources in overlapping signals.

The Philips technique generates a spectrogram of the audio signal, then, for each frame, generates a hash based on the differences in energy between non-overlapping frequency bands. This technique exploits the idea that the sign of energy differences between these bands is robust to many kinds of audio processing. It performs well with resampled audio, the addition of noise, and some simple transformations; however, large linear speed changes generally result in high bit error rates. A minimal sketch of this idea is given below.
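To make the energy-difference idea concrete, the following is an illustrative sketch, not the system described in [6]: it loosely follows the sign-of-differences bit derivation but assumes numpy/scipy, linearly spaced bands, and arbitrary frame parameters chosen here for brevity.

```python
# Illustrative sketch of a Philips-style sub-band energy-difference hash.
# Assumptions (not from the paper): linearly spaced bands over 300-2000 Hz and
# scipy/numpy spectrogram parameters chosen only for demonstration.
import numpy as np
from scipy.signal import spectrogram

def energy_difference_hash(signal, sample_rate, n_bands=33, f_lo=300.0, f_hi=2000.0):
    """Return a (frames-1, n_bands-1) array of hash bits."""
    freqs, _, spec = spectrogram(signal, fs=sample_rate, nperseg=2048, noverlap=1024)
    edges = np.linspace(f_lo, f_hi, n_bands + 1)           # band boundaries
    # Energy per band per frame: shape (frames, n_bands)
    energy = np.stack(
        [spec[(freqs >= lo) & (freqs < hi)].sum(axis=0) for lo, hi in zip(edges[:-1], edges[1:])],
        axis=1)
    # Bit = sign of the frame-to-frame difference of adjacent-band energy differences.
    diff_bands = energy[:, :-1] - energy[:, 1:]             # adjacent band differences
    diff_frames = diff_bands[1:] - diff_bands[:-1]          # adjacent frame differences
    return (diff_frames > 0).astype(np.uint8)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(sr)
    bits = energy_difference_hash(tone, sr)
    print(bits.shape)    # bit matrix, one row of 32 bits per frame transition
```

The actual system in [6] uses logarithmically spaced bands and heavily overlapping frames; those details are omitted here for brevity.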
Long Short-term Memory Cells (LSTM). In comparison to other neural network architectures such as feedforward and convolutional neural networks, recurrent neural networks (RNNs) are able to store and process temporal information seen in prior data by means of information persistence. This allows RNNs to learn sequences and perform well in problems such as speech, text-to-speech, and audio processing. RNNs are composed of three components: the input, the cell, and the output. The cell unit feeds back into itself, providing the persistence of stored information. A useful way of visualizing RNNs is to "unroll" them with respect to time t.

Figure 2.2. A simple RNN cell unrolled t steps. Figure from [5].

As seen in Figure 2.2, the cell passes a hidden state from time t-1 to the cell at time t, and the output h depends on both the input x and the hidden state. The hidden state passed from t to t+1 is then updated given the prior state and the input. This design introduces a difficulty in the training process: the model cannot propagate error backwards through the RNN cell an infinite or arbitrary number of steps, so a truncated backpropagation must be used, where the error is only propagated backwards through the cell a fixed number of steps N.

While RNNs are capable of connecting prior information to the present task, they are not strong at retaining long-term dependencies [5]. (We omit citing the papers that describe LSTMs and the problems that motivated them in favor of Olah's simpler explanation.) LSTM cells are a form of RNN cell that is more capable of learning long-term dependencies. Unlike a basic RNN cell, an LSTM cell has the ability to selectively forget and remember by separating out the persistent state (cell state) and selectively removing or adding information to it over time through the use of gates. The three gates used by LSTM cells are the "forget" gate, the "input" gate, and the "output" gate.

Figure 2.3. Long Short-term Memory Cell. Figure from [5].

Sequence to Sequence and Autoencoder Models. Deep neural network architectures in general require the input and output dimensions to be fixed. This is a significant limitation in tasks such as text-to-speech, where the dimensions of the input and output can vary. A solution to this problem is the sequence to sequence model.

Figure 2.4. Sequence to Sequence. Figure from [2].

In a sequence to sequence model, the input is encoded by the RNN network, and the final state of the encoder is then fed to the decoder, which generates an output sequence based on that state. This allows the model to learn relationships between sequences even when the input and output dimensions vary [2]. Furthermore, the model is a kind of autoencoder, with an encoder transforming the input space into a minimized feature space and a decoder transforming the minimized feature space into an output space. The autoencoder architecture allows the model to learn unsupervised by comparing the decoded output reconstruction against the original input sequence.

Semantic Hashing. The chief methodology for deep-learning-generated feature spaces with distance preservation is Semantic Hashing, introduced by Salakhutdinov [4], which builds hashes for text documents. Traditional methods of finding similar documents, such as TF-IDF, have several limitations, including slow computation for large vocabularies and no use of semantic similarity between words. Semantic hashing addresses these problems by taking advantage of inherent properties of computer memory.

Figure 2.5. Semantic Hashing. Figure from [4].

An audio equivalent of semantic hashing would, given an audio sample, use the encoder to encode the audio sample into, for example, a 32-bit address representation. Any other audio sample similar to the first will be located nearby in the address (hash) space, so similar audio samples can be found by examining nearby addresses (a small Hamming ball around the hash). This takes advantage of the way computer memory is addressed and the memory bus controller's ability to compute memory address intersections efficiently [4]. A small retrieval sketch is given below.
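As a toy illustration of this retrieval idea (not the learning procedure of [4], and not this paper's evaluation procedure), the sketch below binarizes real-valued codes by thresholding at zero, indexes items by their binary code, and retrieves anything within a small Hamming ball of a query; all sizes and names are illustrative.

```python
# Toy illustration of semantic-hashing-style retrieval: binarize learned codes and
# look up near-duplicates within a small Hamming ball of the query code.
# All sizes and names here are illustrative, not taken from the paper.
import itertools
import numpy as np

def binarize(codes):
    """Threshold real-valued codes at 0 to get binary hash codes."""
    return (codes > 0).astype(np.uint8)

def build_index(hashes):
    """Map each binary code (as a tuple of bits) to the list of item ids sharing it."""
    index = {}
    for item_id, bits in enumerate(hashes):
        index.setdefault(tuple(bits), []).append(item_id)
    return index

def hamming_ball_lookup(index, query_bits, radius=1):
    """Return item ids whose codes are within `radius` bit flips of the query."""
    query = np.asarray(query_bits, dtype=np.uint8)
    hits = list(index.get(tuple(query), []))
    for r in range(1, radius + 1):
        for positions in itertools.combinations(range(len(query)), r):
            probe = query.copy()
            probe[list(positions)] ^= 1       # flip `r` bits of the query code
            hits.extend(index.get(tuple(probe), []))
    return sorted(set(hits))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codes = rng.uniform(-1, 1, size=(1000, 16))   # pretend 16-bit learned codes
    hashes = binarize(codes)
    index = build_index(hashes)
    print(hamming_ball_lookup(index, hashes[0], radius=1))
```

Exhaustively enumerating the Hamming ball is only practical for short codes or tiny radii; the accuracy metric in Section III instead uses a linear scan for the minimum Hamming distance.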
III. Methodology

At an abstract level, our goal is to build a deep learning model that can create representations of audio signals with semantic meaning. To do this, we use a sequence-to-sequence autoencoder model composed of two modules: an encoder and a decoder. The encoder takes a sequence of audio features (MFCCs) generated from the preprocessed audio signal and transforms it into a compressed feature representation. The decoder takes the compressed feature representation as input and transforms it into a reconstruction of the original audio features. The objective function of this model minimizes the reconstruction error between the encoder's audio feature inputs and the decoder's audio feature reconstruction.

Model Architecture. The architecture of our model is shown in Figure 3.1. The audio feature input sequence X is created by generating mel-frequency cepstral coefficients (MFCCs) from a preprocessed and normalized audio signal. X describes one second of audio discretized into one hundred steps with thirteen features in each step. X̂ is the audio feature reconstruction sequence generated by the model; it has the same characteristics as X. The encoder and decoder are each composed of a single LSTM cell; however, they share the same weights W_I and W_O. The LSTM cells have a state size of 128. The encoder's final state H is the compressed feature representation of the audio signal and is the first state of the decoder.

Figure 3.1. Model Architecture.

The loss function to be minimized is the reconstruction error, defined as Loss = ‖X − X̂‖². The expectation is that the autoencoder model will generate a high-quality feature compression that both describes and minimizes the audio features such that it can recreate the audio features from the compression.

The compressed acoustic features H are represented as a tuple of 128 floating-point numbers in the range −1 to 1. We convert these acoustic features H into a binary hash Ĥ to further reduce the size of the hashes in storage and to apply our performance metrics described below. The conversion to hashes is done by applying a thresholding function at 0, so H_i > 0 results in bit 1 and H_i ≤ 0 results in bit 0.

It is important to note that the learning process is in no way dependent on the acoustic features, so there is no guarantee of success on the task at hand. The model has to learn good hashes through reconstruction error minimization alone; this is a kind of unsupervised hashing procedure. A sketch of this architecture is given below.
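The following is a minimal sketch of this architecture, not the authors' implementation (the code is not provided): it assumes PyTorch, fills in details the paper leaves unspecified (zero-valued decoder inputs, the LSTM hidden state h as the code H, Adam as the optimizer), and shares a single LSTM between encoder and decoder so that the weights are tied as described.

```python
# Minimal sketch of the sequence-to-sequence autoencoder described above (PyTorch).
# Framework, decoder inputs, and optimizer are assumptions; the paper does not specify them.
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, n_features=13, state_size=128):
        super().__init__()
        # One LSTM used for both encoding and decoding, so the weights are shared.
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=state_size, batch_first=True)
        self.project = nn.Linear(state_size, n_features)   # map hidden states back to MFCCs

    def encode(self, x):
        # x: (batch, 100, 13); h_n, c_n: (1, batch, 128)
        _, (h_n, c_n) = self.lstm(x)
        return h_n, c_n

    def forward(self, x):
        h_n, c_n = self.encode(x)
        # Decoder starts from the encoder's final state and receives zero inputs (assumption).
        decoded, _ = self.lstm(torch.zeros_like(x), (h_n, c_n))
        return self.project(decoded), h_n.squeeze(0)        # reconstruction X_hat and code H

if __name__ == "__main__":
    model = Seq2SeqAutoencoder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(8, 100, 13)                  # dummy batch of MFCC sequences
    optimizer.zero_grad()
    x_hat, code = model(x)
    loss = ((x - x_hat) ** 2).sum()              # Loss = ||X - X_hat||^2
    loss.backward()
    optimizer.step()
    hash_bits = (code > 0).to(torch.uint8)       # threshold at 0 to get the 128-bit hash
    print(x_hat.shape, hash_bits.shape)          # shapes: (8, 100, 13) and (8, 128)
```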
Dataset & Preprocessing. We generated a music covers dataset composed of eighty songs: twenty original songs and sixty covers (three cover songs per original). Ten of the originals are country songs; the other ten are classical pieces. Most covers and originals are by different artists.

Figure 3.2. Data Preprocessing.

The preprocessing procedure is described by Figure 3.2. We begin with the raw audio signal in uncompressed WAV format. The first channel is used if the audio is not mono. The sample rate is typically greater than or equal to 16 kHz; however, in order to create consistent and comparable audio features, we reduce the sample rate to 16 kHz for all audio files. Next, mel-frequency cepstral coefficients (MFCCs) are generated with the parameters described in Table 3.1. A new one-second-long MFCC spectrogram is generated every 0.25 seconds, and the resulting dimension of each spectrogram is 100×13, where 100 is the number of steps in a one-second-long audio segment and 13 is the number of cepstral features. Figure 3.3 shows a sample MFCC, and a preprocessing sketch follows it.

Window Length: 25 ms
Window Step: 10 ms
Num. Cepstrals: 13
Num. Filters: 26
Num. FFT: 512
Table 3.1. MFCC Generation Parameters.

Figure 3.3. Sample MFCC from the first second of Beethoven's Moonlight Sonata, 3rd Movement. Colorized for visual clarity: blue is low amplitude, red is high amplitude, white is 0.
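A rough sketch of this preprocessing pipeline is shown below; it assumes the scipy and python_speech_features libraries (the paper does not say which tools were used), and the file path is hypothetical.

```python
# Rough sketch of the preprocessing pipeline: load a WAV file, keep the first channel,
# resample to 16 kHz, compute 13 MFCCs with the Table 3.1 parameters, then slice into
# one-second (100-frame) segments every 0.25 seconds. Library choice is an assumption.
import math
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly
from python_speech_features import mfcc

TARGET_SR = 16000

def preprocess(path):
    rate, data = wavfile.read(path)
    if data.ndim > 1:                        # use the first channel if not mono
        data = data[:, 0]
    data = data.astype(np.float32)
    if rate != TARGET_SR:                    # resample to 16 kHz
        g = math.gcd(TARGET_SR, rate)
        data = resample_poly(data, TARGET_SR // g, rate // g)
    # Table 3.1 parameters: 25 ms window, 10 ms step, 13 cepstrals, 26 filters, 512-point FFT.
    feats = mfcc(data, samplerate=TARGET_SR, winlen=0.025, winstep=0.01,
                 numcep=13, nfilt=26, nfft=512)             # shape: (frames, 13)
    # One-second segments (~100 frames) taken every 0.25 seconds (~25 frames).
    segments = [feats[i:i + 100] for i in range(0, len(feats) - 100 + 1, 25)]
    return np.stack(segments) if segments else np.empty((0, 100, 13))

if __name__ == "__main__":
    X = preprocess("some_song.wav")           # hypothetical file path
    print(X.shape)                            # (num_segments, 100, 13)
```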
Performance Metrics. To evaluate the model, we use three metrics. The first is quantitative: the classification accuracy of matching covers to their original songs. The second is also quantitative: the collision and unique-hash rates generated by the model over the dataset. The third is qualitative: the structure of the compressed feature representation as seen with Principal Component Analysis (PCA).

The classification accuracy metric measures the model's ability to perform the task of cover song identification. It is calculated as follows. We begin by generating a database of "known" songs, which for this dataset consists of the original songs. These original songs are converted into their hashes Ĥ by use of the model encoder and the thresholding function. We then create a list of queries from the cover songs, which are converted into their hashes by the same procedure. Individual hashes are one second long; however, we concatenate sequential hashes to create longer hashes for longer sequences of audio (e.g., a 1-second audio sequence results in a 128-bit hash, while 3 seconds result in a 9×128 = 1152-bit hash). The database and query hashes are the same length. A search of the database is done for each query; if a hash with Hamming distance equal to zero is not found, the hash with the smallest Hamming distance is chosen. If the chosen hash belongs to the query's original, the classification is considered correct. If the chosen hash is not the query's original, or multiple hashes tie for the smallest Hamming distance, the classification is considered incorrect. Accuracy is the number of correct classifications divided by the total number of classifications.

The collision and unique-hash statistics measure the model's ability to perform the more abstract task of audio fingerprinting. Collisions are calculated over the binarized hashes produced from the test set. There is a theoretical maximum of 2^128, or approximately 10^38, possible hashes, so it is highly unlikely that we will exhaust the hash space. This hashing algorithm differs from most others in that distance is preserved and hashes carry some useful semantic meaning. This means that, while it is unlikely that the hash space will be exhausted, the probability of a particular hash is not strictly guaranteed to be 1/N by the algorithm and depends on the learned features.

The hash feature space visualization allows for the identification of structure and patterns learned by the model as they relate to the labels. Since this is a more qualitative metric, we need to evaluate how well the cover song hashes cluster with the original song hashes. Since the model is unsupervised, there is no explicit learning of hashes that describe the cover-original relation; however, we at least expect to see some preservation of gender and genre information.

IV. Results

After training our model for 40 epochs, taking approximately one day, the training loss decreased from 8.883 at epoch 0 to 1.662. Over the testing set, our model scores a classification accuracy of 10.42% with 1 hash (1 second) and 20.28% with 9 hashes (3 seconds). Given the complexity of the task (recognizing a cover song's original from a sample as small as one second), low accuracies are expected. A longer hash provides more information about the audio sequence and makes similar hashes easier to find. It is clear that the model does not excel at this task; however, the improvement with a longer hash shows that the semantic information in the hash features is structured in such a way that covers can be found.

Our model's collision rate is 20.09% (the unique-hash rate is therefore 79.91%); however, the collision rate had not reached a steady state by epoch 40, and it is likely that continuing the training procedure would reduce it further. The model is clearly learning the hash space and reducing the collision rate to better differentiate segments, so this result is expected.

We visualized the hash feature space at epoch 40; Figure 4.1 shows the result. Each color is a distinct original artist, and each point represents a different audio segment by a variety of covering artists. The results show clusters forming for each original artist, although the spread is quite large for some artists. The individual cover artists are not shown in the figure; however, they tend to be found within or near their true original artist's cluster.

Figure 4.1. PCA of hash features (H) of cover songs from the validation set at epoch 40. Each color represents a distinct original artist. Each point is a segment from a cover song.

V. Conclusion

Our unsupervised sequence-to-sequence autoencoder model performs poorly on the task of cover detection with a relatively small audio sample, but improves slightly with a larger audio sample. It is likely that further increasing the sample length would yield greater improvements and make accurate classification easier. The model is relatively uncomplicated for the task at hand, since overfitting did not occur in the reported runs or in other tests run for longer. Increasing the number of RNN layers, increasing the hash size, or increasing the complexity of the objective function may enhance the model and improve its ability to classify cover songs. For the task of audio fingerprinting, the model was able to build a distance-preserving space while maintaining a relatively low collision rate, which makes it well suited to this task and an improvement over prior state-of-the-art methods. We note that robustness was not properly tested or compared against the prior methods, so it may be the case that a tradeoff exists between distance preservation and robustness to certain transformations.
VI. References

[1] X. Huang, A. Acero, and H. Hon. 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development.
[2] I. Sutskever, O. Vinyals, and Q. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems.
[3] A. Wang. 2003. An industrial-strength audio search algorithm. In Proceedings of the 4th International Conference on Music Information Retrieval.
[4] R. Salakhutdinov and G. Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning.
[5] C. Olah. 2015. Understanding LSTM Networks.
[6] J. Haitsma and T. Kalker. 2002. A Highly Robust Audio Fingerprinting System. In Proceedings of the International Symposium on Music Information Retrieval.
[7] B. Logan. 2000. Mel Frequency Cepstral Coefficients for Music Modeling. In Proceedings of the International Society for Music Information Retrieval.
[8] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. 2005. A Review of Audio Fingerprinting. Journal of VLSI Signal Processing.