ASPED: Audio Sensing for PEdestrian Detection Dataset

Summary

ASPED (Audio Sensing for PEdestrian Detection) is a large-scale audio and video dataset for pedestrian detection using sound. ASPED comprises almost 2,600 hours of audio, more than 3.4 million continuously captured video frames, and corresponding pedestrian-count annotations for all audio and video recordings.
For more information, please refer to our paper (in progress).


Download ASPED v1.0

  • We understand that our data package is quite extensive! Feel free to start with our mini test package for a trial; it includes one audio segment along with the matching annotation file and the metadata file.
  • Our model is available in our GitHub repository, here.
| File | Description | Size | Download |
| --- | --- | --- | --- |
| Mini Test Data Package | ASPED v1.0 Session 1 Mini Package: one audio file (.FLAC), one annotation file (.CSV), and the metadata file (.XLSX) | – | Download |
| Audio Data Package | ASPED v1.0 Session 1-5 Audio Recordings (.FLAC) and ASPED v1.0 Session 1-5 Annotations (.CSV) | 580 GB (approx. 3 TB after decoding) | Download |
| Video Data Package | ASPED v1.0 Session 1-5 Video Recordings (.MP4) | Approx. 2 TB | Download |
| Metadata | ASPED v1.0 Session 1-5 Metadata (.XLSX): session times, location maps, device coordinates, and start/end times of recordings | 3.12 MB | Download |

Data Description

  1. Session Details
  2. Audio Data
  3. Video Data
  4. Annotation Data
  5. Metadata

1. Session Details

| Session | Date | Location | # of Video Recorders | Total Video Frames | # of Audio Recorders |
| --- | --- | --- | --- | --- | --- |
| Session 1 | May 24-26, 2023 | Cadell Courtyard | 1 | 160,473 | 6 |
| Session 1 | May 24-26, 2023 | Tech Walkway | 4 | 616,582 | 9 |
| Session 2 | June 1-3, 2023 | Cadell Courtyard | 1 | 163,678 | 6 |
| Session 2 | June 1-3, 2023 | Tech Walkway | 3 | 460,926 | 6 |
| Session 3 | June 7-9, 2023 | Cadell Courtyard | 1 | 163,914 | 6 |
| Session 3 | June 7-9, 2023 | Tech Walkway | 3 | 467,919 | 7 |
| Session 4 | June 21-23, 2023 | Cadell Courtyard | 1 | 156,008 | 6 |
| Session 4 | June 21-23, 2023 | Tech Walkway | 4 | 586,494 | 9 |
| Session 5 | June 28-30, 2023 | Cadell Courtyard | 1 | 163,903 | 6 |
| Session 5 | June 28-30, 2023 | Tech Walkway | 3 | 466,075 | 7 |


2. Audio Data

|--Metadata.xlsx
|--Session_5242023
    |--Cadell
        |--Audio          * each location has multiple recorders
            |--Recorder1_[device-name]
                |--[audio1].wav
                |--[audio2].wav
                ...
            |--Recorder2_[device-name]
            ...
        |--Labels          * each location has one annotation file
            |--5-24-Cadell.csv
    |--TechwayA
    |--TechwayB
    |--TechwayC
    |--TechwayD
|--Test_6012023
|--Test_6072023
|--Test_6212023
|--Test_6282023
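For reference, here is a minimal Python sketch for iterating over this layout, pairing each location's recorder folders with its annotation file. The root path and glob patterns are assumptions based on the tree above; adjust them to your local copy.

    from pathlib import Path

    # Hypothetical path to one extracted session folder.
    session_dir = Path("ASPED_v1/Session_5242023")

    for location in sorted(p for p in session_dir.iterdir() if p.is_dir()):
        # Each location has one annotation file under Labels/ ...
        label_files = sorted((location / "Labels").glob("*.csv"))
        # ... and one or more recorder folders under Audio/.
        for recorder in sorted((location / "Audio").glob("Recorder*")):
            wav_files = sorted(recorder.glob("*.wav"))  # decoded audio segments
            print(location.name, recorder.name, len(wav_files), [f.name for f in label_files])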

[Image: audio recorder]

All audio files are encoded as FLAC files. FLAC is a widely used lossless audio coding format. For more information about FLAC, please refer to this Wikipedia page.

For decoding, you can use ffmpeg by running
ffmpeg -i /PATH/TO/FLAC.flac /PATH/TO/WAV.wav
or use Python packages such as pydub. The decoded WAV audio totals about 3 TB, while the FLAC files total about 580 GB.
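If you prefer to decode from Python instead of calling ffmpeg directly, a minimal sketch using pydub (which itself requires ffmpeg to be installed) could look like the following; the file paths are placeholders.

    from pydub import AudioSegment

    # Placeholder paths; point these at a FLAC file from the audio package.
    flac_path = "PATH/TO/AUDIO.flac"
    wav_path = "PATH/TO/AUDIO.wav"

    # Decode the lossless FLAC segment and write it out as uncompressed WAV.
    audio = AudioSegment.from_file(flac_path, format="flac")
    audio.export(wav_path, format="wav")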



3. Video Data

|--Metadata.xlsx
|--Session_5242023
    |--Cadell
        |--Video          * each location has one camera
            |--Camera1_[device-name]
                |--[video1].MP4
                |--[video2].MP4
                ...
    |--TechwayA
    |--TechwayB
    |--TechwayC
    |--TechwayD
|--Test_6012023
|--Test_6072023
|--Test_6212023
|--Test_6282023

We set up five video cameras to capture footage from which the ground-truth counts of pedestrians walking past the audio recorders were determined. Each recording session captured almost two days' worth of footage at a rate of 1 frame per second, resulting in approximately 10 days of recordings across the entire dataset.
Each camera covered multiple audio recording devices, as depicted in the following map, which is also included in the metadata file.

[Image: video camera installation]
[Image: Tech Walkway location map]

This dataset can be utilized either to reproduce the study or to train a computer vision model for pedestrian detection.
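Because the footage was captured at 1 frame per second, each video frame corresponds to roughly one second of audio. A minimal sketch for stepping through the frames of one recording with OpenCV is shown below; the file path is a placeholder.

    import cv2

    # Placeholder path to one of the MP4 recordings.
    cap = cv2.VideoCapture("PATH/TO/VIDEO.MP4")

    frame_idx = 0
    while True:
        ok, frame = cap.read()  # at 1 fps, one frame ~ one second of recording
        if not ok:
            break
        # frame is a BGR numpy array; run a detector or save it to disk here.
        frame_idx += 1

    cap.release()
    print(f"Read {frame_idx} frames")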



4. Annotation Data

Each annotation file lists the number of detected pedestrians within 1-, 3-, 6-, and 9-meter radii around the monitored audio recorders. Every row contains the timestamp, the frame number, and the pedestrian count for each audio recorder at the four specified radii.

[Sample Annotation Data]
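As a starting point for working with the annotations, the sketch below loads one annotation file with pandas. The exact column names depend on the recorders present at each location, so the column-selection pattern here is an assumption; check the header of the file you download.

    import pandas as pd

    # Placeholder path to one annotation file (e.g., 5-24-Cadell.csv).
    df = pd.read_csv("PATH/TO/ANNOTATION.csv")

    print(df.columns.tolist())  # timestamp, frame number, and per-recorder counts per radius
    print(df.head())

    # Hypothetical example: rows where at least one count column is non-zero,
    # assuming the per-recorder count columns contain "Recorder" in their names.
    count_cols = [c for c in df.columns if "Recorder" in c]
    active = df[(df[count_cols] > 0).any(axis=1)]
    print(f"{len(active)} of {len(df)} rows have at least one pedestrian nearby")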
We used the Masked-attention Mask Transformer (Mask2Former) model to detect and annotate the pedestrians passing by the audio recorders in the video recordings, with a prediction confidence threshold of 0.7. For this research, we used the Mask2Former implementation from OpenMMLab, trained on the Microsoft COCO dataset.
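For readers who want to approximate this annotation step, the sketch below runs one extracted frame through a COCO-trained Mask2Former checkpoint with MMDetection and keeps detections above the 0.7 threshold. It is a minimal sketch assuming the MMDetection 3.x API; the config and checkpoint paths are placeholders, and class index 0 is 'person' in COCO.

    from mmdet.apis import init_detector, inference_detector

    # Placeholder config/checkpoint for a COCO-trained Mask2Former from OpenMMLab.
    config = "mask2former_config.py"
    checkpoint = "mask2former_coco.pth"

    model = init_detector(config, checkpoint, device="cuda:0")
    result = inference_detector(model, "frame.jpg")  # one extracted video frame

    # MMDetection 3.x returns a DetDataSample; keep confident 'person' instances only.
    inst = result.pred_instances
    keep = (inst.labels == 0) & (inst.scores > 0.7)
    print(f"{int(keep.sum())} pedestrians detected in this frame")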