Audio Fingerprinting: A Quick Startup Guide

20 Jul, 2020

What is Audio Fingerprinting?

Audio fingerprinting is widely used for music and sound identification. With as little as 2 seconds of recorded audio, the technology can identify the original sound that matches the recording with ~95.6% certainty. It has also opened up avenues of research such as identifying birds in the wild by their unique acoustic signatures, and it helps mitigate privacy concerns for products like Amazon Echo and Apple’s Siri, which use voice commands to turn their services on.

An audio fingerprint is a digital summary of an audio signal. The summary is created by identifying the most prominent parts of the signal. Traditionally, this is done by creating a spectrogram of the signal and running a peak identification algorithm on it to extract the peaks. Neighboring peaks are then combined to create a unique hash signature, which is stored in a database.
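
To make the spectrogram, peak, and hash steps more concrete, here is a minimal sketch using only the Python standard library. It is not Dejavu's implementation: the function names, the frame size, the fan-out, and the "strongest bin per frame" peak picker are all simplifying assumptions chosen for illustration, and the input is a synthetic tone rather than real audio.

```python
import cmath
import hashlib
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Magnitude spectrogram via a naive DFT (slow, but fine for a toy signal)."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        mags = []
        for k in range(frame_size // 2):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n, x in enumerate(frame))
            mags.append(abs(coeff))
        frames.append(mags)
    return frames


def find_peaks(spec):
    """Simplified peak picking (assumption): strongest frequency bin per frame."""
    return [(t, max(range(len(mags)), key=mags.__getitem__))
            for t, mags in enumerate(spec)]


def hash_peaks(peaks, fan_out=3):
    """Pair each peak with the next few peaks and hash (f1, f2, time delta)."""
    fingerprints = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            key = f"{f1}|{f2}|{t2 - t1}".encode()
            # Store the hash together with the frame index where the pair starts
            fingerprints.append((hashlib.sha1(key).hexdigest()[:20], t1))
    return fingerprints


# Toy input: half a second of 440 Hz, then half a second of 880 Hz, at 4 kHz
rate = 4000
signal = [math.sin(2 * math.pi * (440 if n < rate // 2 else 880) * n / rate)
          for n in range(rate)]

peaks = find_peaks(spectrogram(signal))
fingerprints = hash_peaks(peaks)
print(len(fingerprints), fingerprints[0])
```

Hashing pairs of peaks together with their time difference, rather than single peaks, is what lets a short recording be matched against a database: the same pair produces the same hash wherever it occurs, and the stored frame index gives the alignment.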

Getting Started with Audio Fingerprinting

Even if you do not know much about audio fingerprinting, don’t worry: you can still get started with an open-source project called Dejavu. I highly recommend reading the article [Audio Fingerprinting with Python and Numpy](https://willdrevo.com/fingerprinting-and-audio-recognition-with-python/), written by the author of Dejavu, before continuing, as it explains the internals of the algorithm.

From here, simply follow the README in the GitHub repository, which will guide you through configuring your setup with Docker.

My experience using Dejavu

Initially, I had no clue that the Dejavu project existed. So, naturally, I spent two weeks trying to put together my very own audio fingerprinting algorithm by reading blog post after blog post. It worked, but the algorithm was very naive. Simply put, my approach did not produce enough entropy to generate a unique audio fingerprint. Before long, I stumbled across Dejavu, which solved all my headaches.

I downloaded fifteen 30-minute segments from CBS 13, a TV station based in Portland, Maine. I then split each 30-minute video into three equal 10-minute chunks by calling FFmpeg from Python (code below). Next, I manually went through the videos and extracted the short segments where the news transitions came up, and fingerprinted these transitions into the Dejavu database so they could be used for identification later. After this, I ran the audio identification algorithm on each 10-minute chunk, and it reported the time at which it believed the transition started within that chunk. Finally, I manually cross-checked the algorithm’s time offsets against the actual offsets to measure its accuracy.

Python Code to split video files into three 10-minute chunks.

import os
import subprocess

path = './test/FullVideos/'

# (start, end) timestamps of the three 10-minute chunks
chunks = [("00:00:00", "00:10:00"), ("00:10:00", "00:20:00"), ("00:20:00", "00:30:00")]

for filename in os.listdir(path):
    if filename.endswith(".mp4"):
        link = os.path.join(path, filename)
        # splitext only strips the final extension, so dots in the name survive
        base = os.path.splitext(filename)[0]

        for part, (start, end) in enumerate(chunks, start=1):
            output_link = f"{base}_part_{part}.mp4"
            # Pass the arguments as a list so file names with spaces need no shell quoting.
            # "-c copy" cuts without re-encoding; check=True raises if ffmpeg fails.
            subprocess.run(["ffmpeg", "-i", link, "-ss", start, "-to", end, "-c", "copy", output_link], check=True)

Python Script to run the audio fingerprinting algorithm on data and create a spreadsheet with results

from dejavu import Dejavu
from dejavu.logic.recognizer.file_recognizer import FileRecognizer
import os
import pandas as pd

# Dejavu config (could also be loaded from a JSON file or anything outputting a python dictionary)
config = {
    "database": {
        "host": "db",
        "user": "postgres",
        "password": "password",
        "database": "dejavu"
    },
    "database_type": "postgres"
}

if __name__ == '__main__':

    # create a Dejavu instance
    djv = Dejavu(config)

    # Fingerprint every .mp4 and .mp3 file in the "mp3" directory
    djv.fingerprint_directory("mp3", [".mp4", ".mp3"])

    path = './test/'

    # Collect one row per file; the rows become the dataframe below
    rows = []

    for filename in os.listdir(path):
        if filename.endswith(".mp4"):
            results = djv.recognize(FileRecognizer, os.path.join(path, filename))

            # Skip files with no match instead of failing on results['results'][0]
            if not results['results']:
                print(f"No transition found in: {filename}")
                print()
                continue

            # Only the first returned match is used
            match = results['results'][0]
            confidence = match['fingerprinted_confidence'] * 100
            # offset_seconds is in seconds; the sign is flipped before reporting
            offset = match['offset_seconds'] * -1

            print(f"Algorithm runtime: {results['total_time']} seconds")
            print(f"File being recognized: {filename}")
            print(f"Transition found: {match['song_name']}")
            print(f"Fingerprint Confidence: {confidence}%")
            print(f"Offset: {offset} seconds")
            print(f"Fingerprint_sha1_signature: {match['file_sha1']}")
            print()
            print()

            rows.append({
                "File": filename,
                "Transition": match['song_name'],
                "Total_time": results['total_time'],
                "Fingerprint_Confidence": confidence,
                "Offset": offset,
                "Hash": str(match['file_sha1']),
            })

    # Put the data inside the dataframe
    df = pd.DataFrame(rows, columns=["File", "Transition", "Total_time", "Fingerprint_Confidence", "Offset", "Hash"])

    # Output results to a CSV
    df.to_csv("Results.csv")

These were my results when I ran the algorithm on the segments:

Video Segments	Accuracy Of Identification
Start	92.8%
Middle	64.2%
Ending	14.2%

The algorithm identified the timing of the transitions in the beginning segments of the news data much better than in the middle and ending segments. The discrepancy may be because the transitions differed slightly over the course of the news broadcasts. One way to fix this would be to identify each unique transition and add its fingerprint to the database. The algorithm also reported which kind of transition it had matched, and it almost always detected the sports transition as well as the weather report transition.
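
The accuracy figures come from cross-checking the reported offsets against manually noted ones. A comparison like that could be automated along the following lines. This is only a sketch: the function name, the 2-second tolerance, and the offsets are hypothetical and are not the data or the criterion behind the table.

```python
def identification_accuracy(predicted, actual, tolerance=2.0):
    """Percentage of segments whose predicted offset lies within
    `tolerance` seconds of the manually noted offset."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return 100.0 * hits / len(predicted)


# Made-up offsets in seconds, for illustration only (not the measured data)
predicted_offsets = [12.4, 305.0, 88.1, 430.9]
manual_offsets = [12.0, 301.5, 88.0, 431.2]

print(f"{identification_accuracy(predicted_offsets, manual_offsets):.1f}%")  # prints 75.0%
```

The tolerance matters: a tight value penalizes matches that land a second or two off the true transition, while a loose one can hide genuinely wrong identifications.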

[Video: Shazam’s founder Chris Barton discussing the creation of the algorithm]

Sources

Drevo, Will. “Audio Fingerprinting with Python and Numpy”. 15 Nov. 2013, willdrevo.com/fingerprinting-and-audio-recognition-with-python/.
Sloane, Garett. “What Is Acoustic Fingerprinting”. Digiday, 29 Mar. 2016, digiday.com/media/what-is-acoustic-fingerprinting/.