Theory and Algorithms for Statistical Content Identification
Funded by the National Science Foundation
Researchers: Pierre Moulin and Sujoy Roy (Institute for Infocomm Research, Singapore)
Automatic content identification is an emerging technology with applications in broadcast monitoring, connected audio, content tracking, digital asset management, near-duplicate identification, and contextual advertising, as well as filtering for file-sharing systems. Content identification algorithms must be robust to common signal degradations, and they operate on highly compressed data (robust hashes, also known as content fingerprints) to meet storage, communication, and computing constraints. The goals of this project are to develop an analytical framework for content identification based on fundamental principles and modern methods of statistical inference and information theory, and to develop novel content identification algorithms grounded in that theory. Our focus is on the following four research topics:
- Hash-Based Inference. Dithered nested lattice quantization methods are being developed to construct a new family of hash functions with provable optimality properties.
- Information-Theoretic Analysis. We are formulating content identification as a communication problem with storage constraints and studying fundamental performance limits, in the form of a capacity region. Finite-blocklength effects are being analyzed using strong large-deviations analysis, which provides sharp asymptotics for the error probabilities.
- Code Design. A learning-theoretic approach is being developed for statistical modeling of content fingerprints and degradation channels from training data, and for design of hashing codes and decoding metrics that are optimally matched to the statistics.
- Applications. Content identification for audio, RGB+D images, and video is being explored, along with forensic analysis and security.
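To make the hash-based inference idea concrete, the following is a minimal sketch of dithered quantization hashing. It uses scalar quantization with a coarse modulo reduction, loosely in the spirit of nested-lattice constructions; the function name, step size, alphabet size, and noise level are all illustrative assumptions, not the project's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def dithered_quantize_hash(x, dither, step=0.5, alphabet=4):
    """Quantize features with a shared pseudorandom dither; keep the
    quantization indices modulo a small alphabet as the hash.
    (Illustrative sketch only -- parameters are assumptions.)"""
    indices = np.floor((x + dither) / step).astype(int)
    return indices % alphabet

features = rng.normal(size=64)                 # toy feature vector
dither = rng.uniform(0.0, 0.5, size=64)        # dither shared by hasher and matcher

h_clean = dithered_quantize_hash(features, dither)
h_noisy = dithered_quantize_hash(features + rng.normal(scale=0.02, size=64), dither)

# Mild degradation should leave most hash symbols unchanged.
agreement = np.mean(h_clean == h_noisy)
print(f"hash agreement under mild degradation: {agreement:.2f}")
```

The shared dither randomizes cell boundaries so that, averaged over the dither, the quantization error is independent of the input; the modulo reduction is what makes the representation compact.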
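As one concrete instance of the kind of fundamental limit studied in the information-theoretic analysis, suppose (purely for illustration, not as the project's model) that each binary fingerprint symbol passes through a binary symmetric channel with crossover probability $p$. The relevant mutual-information rate is then

```latex
I(X;Y) = 1 - h(p), \qquad h(p) = -p \log_2 p - (1-p)\log_2(1-p),
```

so that, ignoring storage constraints and finite-blocklength effects, on the order of $2^{L\,I(X;Y)}$ items can be reliably distinguished from length-$L$ fingerprints. The project's capacity region additionally accounts for the storage (hash-rate) constraint.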
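The code-design topic can likewise be illustrated with a toy sketch of a decoding metric learned from training data. Here the degradation channel is modeled as independent bit flips, the flip rate is estimated from (clean, degraded) training pairs, and candidates are scored with the matched log-likelihood (a weighted Hamming metric). All names, sizes, and the i.i.d. channel model are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n_items, L = 100, 256
database = rng.integers(0, 2, size=(n_items, L))   # stored binary fingerprints

p_true = 0.1                                       # degradation strength, unknown to the decoder
def degrade(f):
    """Model degradation as independent bit flips (an illustrative BSC assumption)."""
    return f ^ (rng.random(L) < p_true)

# Estimate the channel statistic from (clean, degraded) training pairs.
train_pairs = [(database[i], degrade(database[i])) for i in range(20)]
p_hat = float(np.mean([np.mean(a != b) for a, b in train_pairs]))

def loglik(f, q):
    """Decoding metric matched to the learned statistics: BSC log-likelihood."""
    d = int(np.sum(f != q))
    return d * np.log(p_hat) + (L - d) * np.log(1.0 - p_hat)

query = degrade(database[42])                      # degraded copy of item 42
best = max(range(n_items), key=lambda i: loglik(database[i], query))
print(f"estimated flip rate {p_hat:.3f}; best match = item {best}")
```

The point of the sketch is the separation of concerns: the channel statistics are learned from data, and the decoding metric is then matched to those statistics rather than fixed a priori.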