We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos from EPIC-KITCHENS-100. We propose an annotation pipeline in which annotators temporally label distinguishable audio segments and describe the action that could have caused each sound. We identify actions that can be discriminated purely from audio by grouping these free-form descriptions of audio into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object colliding with a wooden surface), which we verify against visual labels, discarding ambiguous cases. Overall, EPIC-SOUNDS includes 75.9k segments of audible events and actions, distributed across 44 classes. We train and evaluate two state-of-the-art audio recognition models on our dataset, highlighting the importance of audio-only labels and the limitations of current models in recognising actions that sound.
The dataset is now publicly available for download from here
EPIC-SOUNDS: Audio-Based Interaction Recognition - code and models are available from here. The leaderboard for submitting results on the test set is here
EPIC-SOUNDS: Audio-Based Interaction Detection - code and models are available from here. The leaderboard for submitting results on the test set is here
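To give a feel for how the annotations can be used, below is a minimal Python sketch that loads a training-annotation file and slices one labelled segment out of a video's audio track. The file name, column names (`video_id`, `start_sample`, `stop_sample`, `class`, `description`) and the 24 kHz sample rate are assumptions for illustration; adjust them to match the files you actually download from the links above.

```python
# Minimal sketch: load EPIC-SOUNDS annotations and extract one labelled audio segment.
# File names, column names and the sample rate below are assumptions, not the
# authoritative format -- check the downloaded annotation files.
import pandas as pd
import soundfile as sf

SAMPLE_RATE = 24000  # assumed sample rate of the extracted audio

# Hypothetical local paths.
annotations = pd.read_csv("EPIC_Sounds_train.csv")

# Per-class segment counts (the dataset spans 44 classes overall).
print(annotations["class"].value_counts())

# Take one annotated segment and cut it from the corresponding audio file.
row = annotations.iloc[0]
audio, sr = sf.read(f"audio/{row['video_id']}.wav")
assert sr == SAMPLE_RATE, "adjust SAMPLE_RATE to the audio you extracted"
segment = audio[int(row["start_sample"]):int(row["stop_sample"])]
print(row["class"], row["description"], segment.shape)
```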
When using this work, please cite the EPIC-SOUNDS TPAMI (or ICASSP) paper, as well as the IJCV paper describing the original audio-visual recordings, Rescaling Egocentric Vision.
@ARTICLE{EPICSOUNDS2025,
title={{EPIC-SOUNDS}: {A} {L}arge-{S}cale {D}ataset of {A}ctions that {S}ound},
author={Huh, Jaesung and Chalk, Jacob and Kazakos, Evangelos and Damen, Dima and Zisserman, Andrew},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
year = {2025}
}
@inproceedings{EPICSOUNDS2023,
title={{EPIC-SOUNDS}: {A} {L}arge-{S}cale {D}ataset of {A}ctions that {S}ound},
author={Huh, Jaesung and Chalk, Jacob and Kazakos, Evangelos and Damen, Dima and Zisserman, Andrew},
booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year = {2023}
}
Also cite the EPIC-KITCHENS-100 paper where the videos originate:
@ARTICLE{Damen2022RESCALING,
title={Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100},
author={Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria and Furnari, Antonino
and Ma, Jian and Kazakos, Evangelos and Moltisanti, Davide and Munro, Jonathan
and Perrett, Toby and Price, Will and Wray, Michael},
journal = {International Journal of Computer Vision (IJCV)},
year = {2022},
volume = {130},
pages = {33--55},
url = {https://doi.org/10.1007/s11263-021-01531-2}
}
The underlying data powering EPIC-SOUNDS, the EPIC-KITCHENS-100 recordings, were collected as a tool for research in computer vision. The dataset may have unintended biases (including those of a societal, gender or racial nature).
The EPIC-SOUNDS dataset is copyright by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.
For commercial licenses of EPIC-KITCHENS, email us at uob-epic-kitchens@bristol.ac.uk
The work on EPIC-SOUNDS was supported by: