Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning

Ludovic Tuncay¹, Etienne Labbe¹, Emmanouil Benetos², Thomas Pellegrini¹
ludovic.tuncay@irit.fr, etienne.labbe@irit.fr, emmanouil.benetos@qmul.ac.uk, thomas.pellegrini@irit.fr
¹ IRIT, Université de Toulouse, CNRS, Toulouse INP, Toulouse, France
² School of Electronic Engineering and Computer Science, Queen Mary University of London, UK

Abstract—Building on the Joint-Embedding Predictive Architecture (JEPA) paradigm, a recent self-supervised learning framework that predicts latent representations of masked regions in high-level feature spaces, we propose Audio-JEPA (Audio Joint-Embedding Predictive Architecture), tailored specifically for audio data. Audio-JEPA uses a simple Vision Transformer backbone to predict latent representations of masked spectrogram patches rather than reconstructing raw audio. We pre-train on unlabeled AudioSet clips (10 s, 32 kHz) with random patch masking on Mel-spectrograms. We evaluate on the X-ARES suite covering speech, music, and environmental-sound tasks. Although our implementation is a straightforward translation of the original model to audio, the results still show comparable performance to wav2vec 2.0 and data2vec while using less than one-fifth of their training data and with no hyper-parameter tuning. All code and pretrained checkpoints will be released on GitHub¹.

Index Terms—Self-supervised learning, audio representation, joint-embedding predictive architecture, Audio-JEPA, AudioSet

I. Introduction

Self-Supervised Learning (SSL) has revolutionized representation learning for speech and audio, enabling models to learn from unlabeled data and excel in diverse downstream tasks [1, 2, 3, 4]. Early SSL approaches for audio, such as contrastive predictive coding and wav2vec 2.0, learned latent speech representations by masking the input and solving a contrastive task over latent codes [5].
Follow-up methods like HuBERT [1] introduced offline clustering to generate pseudo-labels for masked audio segments, and WavLM [6] applied data augmentation and denoising to improve robustness in speech representation learning. More recently, latent-prediction approaches have gained traction: data2vec [7] and its efficient successor data2vec 2.0 [8] employ a teacher–student framework to predict contextualized latent representations of the input, achieving strong results across vision, speech, and language tasks. In the audio domain, Niizumi et al. introduced Masked Modeling Duo (M2D) [4], which uses two networks (an online and a momentum encoder) to predict masked patch embeddings and attained state-of-the-art results on numerous audio benchmarks.

In computer vision, a new paradigm called Joint-Embedding Predictive Architecture (JEPA) [9, 10, 11] has been proposed to predict hidden content in a high-level latent space instead of pixel space. Notably, the image-based I-JEPA [10] model demonstrated that predicting representations of masked image regions can yield powerful visual features. The JEPA approach differs from prior masked-reconstruction methods by focusing on semantic latent prediction rather than lower-level signal reconstruction. Inspired by these advances, an audio version of JEPA (termed A-JEPA) [12] was recently described by Fei et al. Their A-JEPA encodes a spectrogram “context” part and predicts the latent representations of masked “target” regions using a momentum-updated target encoder. During pre-training, they anneal the mask from fully random toward a SpecAugment-style structured scheme. In contrast, our Audio-JEPA keeps purely random masking throughout, for simplicity and maximal generality. At the time of writing, no official implementation or checkpoints for A-JEPA are available, motivating our from-scratch development.

¹ https://github.com/LudovicTuncay/Audio-JEPA
In the musical domain, Stem-JEPA [13] adapts the JEPA paradigm to multi-track recordings by jointly training an encoder and predictor to forecast embeddings of compatible instrument stems. While it builds on the JEPA backbone, Stem-JEPA differs methodologically in that it masks entire instrument stems instead of individual spectrogram patches.

Our work targets the ICME 2025 Audio Encoder Capability Challenge, where the goal is to learn general audio representations that perform well across a broad suite of tasks. We present our Audio-JEPA implementation, developed from scratch following the I-JEPA paradigm, and benchmark it on the challenge’s eXtensive Audio Representation and Evaluation Suite (X-ARES). Our contributions include: (1) adapting the JEPA masked latent prediction framework to audio spectrogram inputs using a Vision Transformer (ViT) backbone; (2) an extensive evaluation against prior self-supervised audio models on standard downstream tasks, assessing Audio-JEPA via both linear probing and k-nearest-neighbor evaluation.

II. Related Work

A. Self-Supervised Audio Representation Learning

Early work in SSL for audio focused on predicting future or missing parts of the waveform. wav2vec 2.0 [5] pioneered masking in the latent speech representations and training the model to identify the true quantized latent of a masked segment among distractors (a contrastive loss). This approach enabled models to learn rich speech features and achieved remarkable results on speech recognition with limited labeled data. Building on this, HuBERT (Hidden-Unit BERT) [1] introduced a BERT-like masked-prediction task where the model predicts cluster assignments of masked audio frames.

Fig. 1. Audio-JEPA architecture. Mel-spectrogram patches are split into visible and masked sets. A context encoder embeds visible patches, a lightweight predictor reconstructs masked-patch embeddings, and a momentum-updated target encoder provides targets. Training minimizes the average L2 (Euclidean) distance.
The dashed arrow denotes a stop-gradient.

HuBERT uses offline k-means on acoustic features to provide target labels and, by iteratively refining these labels, learns high-level speech units, matching wav2vec 2.0 performance on ASR benchmarks. Facebook’s data2vec presented a modality-general SSL approach: instead of contrastive or classification targets, data2vec [7] trains a student network to regress the contextualized embeddings produced by a teacher network (an Exponential Moving Average (EMA) of the student) for masked portions of the input. data2vec 2.0 [8] improved the efficiency of this method by not encoding masked tokens and using a lightweight decoder, achieving similar accuracies to Masked Autoencoders [14] in a fraction of the training time and matching wav2vec 2.0 on speech tasks with over a 10× speedup. These latent-regression approaches eliminate the need for discrete targets and have set strong baselines in audio.

B. Masked Prediction with Dual Networks

The use of two networks (online/target) for masked prediction has also been explored in specialized audio SSL methods. M2D (Masked Modeling Duo) [4] employs an online network that sees the unmasked patches and a momentum target network that encodes only the masked patches. The online network predicts the target network’s representation of the masked region, encouraging both networks to effectively model the input. This design, inspired by Masked Autoencoders but working in representation space, led M2D to state-of-the-art results on a range of audio classification tasks (environmental sound, speaker ID, music genre, etc.). Notably, M2D achieved top performance on datasets like UrbanSound8K [15], VoxCeleb1 [16], GTZAN [17], and SpeechCommands [18] with a single universal model. Such results highlight the power of using latent prediction instead of raw signal reconstruction for learning transferable audio features.
Other contemporary models include WavLM [6], which extended HuBERT with simulated noisy inputs and achieved strong results on both speech recognition and classification tasks.

C. Joint-Embedding Predictive Architectures

Rather than predicting low-level details of masked inputs, JEPA methods aim to predict higher-level representations. The image-based I-JEPA [10] demonstrated that a ViT can learn excellent representations by predicting the latent representations of masked image patches, as opposed to generating pixels. By operating in the feature space, I-JEPA forces the model to capture abstract semantic information and ignore minute pixel-level differences. The concept has since been extended to other modalities and combinations (e.g., TIJEPA [19] for text–image, GeoJEPA [20] for geospatial data), showing JEPA’s flexibility.

For audio, Fei et al. recently proposed A-JEPA [12], applying the same principle to spectrogram inputs. While A-JEPA and M2D both adopt a dual-network masked-prediction framework, they differ in how the target encoder is applied: M2D’s target encoder processes only the masked spectrogram patches, whereas A-JEPA’s processes the entire spectrogram (context + masked). This richer context enables more detailed representations of the masked regions. Their design uses a context encoder to process the unmasked spectrogram patches and a target encoder (the EMA of the context network) to encode the masked regions, with a lightweight predictor network aligning the two in latent space. Our work is directly inspired by this approach; in contrast to M2D, JEPA requires no data augmentation, and the whole spectrogram is seen by the target encoder. We evaluate Audio-JEPA in our experiments, underlining how it bridges the gap between vision-style masked modeling and audio understanding.

III. Proposed Method: JEPA for Audio

In this section, we describe our adaptation of the JEPA paradigm to the audio domain, which we call Audio-JEPA.
We first present the overall architecture, then detail the self-supervised training objective, and finally highlight the audio-specific design choices that make Audio-JEPA effective on diverse sound data.

A. Overall architecture

As shown in Fig. 1, our Audio-JEPA model consists of three main modules:
1) Context encoder: processes the “visible” subset of Mel-spectrogram patches;
2) Target encoder: provides stable target embeddings via an Exponential Moving Average (EMA) of the context encoder’s parameters;
3) Lightweight predictor network: takes context embeddings and predicts latent representations for the masked (“target”) patches.

After converting an input waveform to a Mel-spectrogram and partitioning it into non-overlapping time–frequency patches, we randomly mask a proportion of the patches. The context encoder embeds the remaining visible patches, producing a context representation. The lightweight predictor network then reconstructs embeddings for the masked patches. In parallel, the target encoder (updated by EMA rather than gradient descent) encodes the true masked patches. Training minimizes the average L2 distance between the predictor’s outputs and the target encoder’s embeddings, with stop-gradient applied between the predictor and the target encoder. This implementation is a direct adaptation of I-JEPA to the audio domain, treating the spectrogram as a single-channel, possibly non-square, image.

B. Training objective

We train Audio-JEPA using the average L2 distance between the predicted patch-level representations and the target patch-level representations in the masked parts. Formally, let

c_i = f_ctx(x\M)_i,  ĉ_j = g_pred(c)_j,  t_j = f_tgt(x)_j  (1)

where x are the Mel-spectrogram patches, x\M are the visible patches, f_ctx and f_tgt are the context and target encoders respectively, and g_pred the lightweight predictor.
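The per-step computation described above (random mask sampling, context encoder on visible patches, predictor, full-input target encoder with stop-gradient, L2 loss, EMA update) can be sketched in PyTorch. This is a minimal illustration, not our released code: the tiny linear “encoders” and the pooled-context predictor stand in for the actual ViT modules, and all names and dimensions are ours.

```python
# Minimal sketch of one Audio-JEPA training step. Linear "encoders" and a
# pooled-context predictor stand in for the ViT modules; shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledPredictor(nn.Module):
    """Stand-in predictor: conditions a learned positional query for each
    masked patch on mean-pooled context embeddings."""
    def __init__(self, dim, n_patches):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(n_patches, dim) * 0.02)
        self.net = nn.Linear(2 * dim, dim)

    def forward(self, c, masked_idx):
        ctx = c.mean(dim=1, keepdim=True).expand(-1, len(masked_idx), -1)
        q = self.pos[masked_idx].unsqueeze(0).expand(c.size(0), -1, -1)
        return self.net(torch.cat([ctx, q], dim=-1))

def jepa_step(patches, ctx_enc, tgt_enc, pred, opt, tau=0.996):
    """patches: (B, N, D_patch) flattened Mel-spectrogram patches."""
    N = patches.size(1)
    ratio = torch.empty(1).uniform_(0.4, 0.6).item()   # mask 40-60% of patches
    perm = torch.randperm(N)
    masked, visible = perm[:int(N * ratio)], perm[int(N * ratio):]

    c = ctx_enc(patches[:, visible])                   # context embeddings
    c_hat = pred(c, masked)                            # predictions for masked patches
    with torch.no_grad():                              # stop-gradient: target branch
        t = tgt_enc(patches)[:, masked]                # targets see the full input

    loss = F.mse_loss(c_hat, t)                        # mean squared L2 distance
    opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                              # EMA update of the target encoder
        for p_t, p_c in zip(tgt_enc.parameters(), ctx_enc.parameters()):
            p_t.mul_(tau).add_((1 - tau) * p_c)
    return loss.item()

# Toy usage: 16 patches of dimension 32, embedding dimension 64.
ctx_enc, tgt_enc = nn.Linear(32, 64), nn.Linear(32, 64)
tgt_enc.load_state_dict(ctx_enc.state_dict())
for p in tgt_enc.parameters():
    p.requires_grad_(False)
pred = PooledPredictor(64, 16)
opt = torch.optim.AdamW(list(ctx_enc.parameters()) + list(pred.parameters()), lr=3e-4)
loss = jepa_step(torch.randn(4, 16, 32), ctx_enc, tgt_enc, pred, opt)
```

In the full model, τ additionally follows a BYOL-style schedule rather than staying fixed at the default shown here.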
The loss is then

ℒ = (1 / |M|) Σ_{j∈M} ‖ĉ_j − t_j‖²₂  (2)

We update the parameters of f_ctx and g_pred via backpropagation, while the parameters of f_tgt are updated as

θ_tgt ← τ θ_tgt + (1 − τ) θ_ctx  (3)

with τ the EMA decay factor. This design stabilizes the target representations and prevents collapse.

C. Evaluation

Our assessment follows the eXtensive Audio Representation and Evaluation Suite (X-ARES)², which brings together 21 publicly available audio datasets spanning a variety of tasks and domains. Using the frozen target encoder for evaluation, we employ two complementary evaluation strategies drawn from X-ARES:

a) Linear Probing (MLP): For each downstream task, we freeze the pre-trained encoder and attach a single linear layer. This classifier is trained on the task’s labeled data using a fixed set of hyperparameters. By holding the original model weights constant, this procedure reveals how readily the learned representations can be linearly separated and adapted to new tasks.

b) k-Nearest Neighbors (kNN): Without any additional training, we directly apply a kNN classifier to the frozen embeddings. This non-parametric evaluation highlights the raw discriminative power of the representations. Although it may underperform more sophisticated fine-tuning methods, kNN offers a strict baseline for the intrinsic quality of the learned features.

Due to the architecture and loss, the model’s outputs are not guaranteed to be linearly separable, as explained in the V-JEPA paper [11]. We therefore do not expect strong results under linear probing, but we do expect decent performance under kNN.

² https://github.com/jimbozhang/xares

IV. Implementation Details

In this section we summarize the key implementation choices and hyperparameters used to train Audio-JEPA. Table I collects the most important settings; for clarity we defer dataset and hardware details to Section V.

A.
Data processing

We work with 1,921,982 AudioSet clips resampled to 32 kHz and 10 s duration, totaling 5,338 hours of audio. Each waveform is converted to a 128-band Mel-spectrogram with 256 time bins, with the frame size and hop chosen such that the frame is 2.5 times the hop (for 10 s at 32 kHz, this corresponds to a hop of 1,250 samples and a frame of 3,125 samples). Per example, we randomly sample 40%–60% of the patch indices to mask (the exact ratio is sampled uniformly per batch). Each batch contains 256 audio clips. Preliminary experiments showed that the block-masking strategy from I-JEPA yielded lower performance than random masking.

B. Model Architecture

Table I lists the exact ViT hyperparameters used for each module. The context and target encoders share the same ViT configuration, with 16×16 patches, a 768-dimensional embedding, 12 layers, 12 attention heads, and an MLP ratio of 4.0. The target encoder is architecturally identical to the context encoder and is updated via EMA, where the parameter τ is set in the same way as in BYOL [21]. The predictor uses an embedding size of 384, 12 attention heads, and 6 layers; its MLP ratio is likewise 4.0. After its ViT blocks, the predictor projects the embeddings back to 768 dimensions so that its outputs can be compared with those of the target encoder. The total number of trainable parameters during training is 96.7M, with 85.4M parameters used at inference since the predictor is not needed.

C. Optimization and Scheduling

We train using AdamW [22] (β₁ = 0.9, β₂ = 0.95, weight decay = 0.05) with an initial learning rate of 3 × 10⁻⁴. A warmup-cosine scheduler ramps from 1 × 10⁻⁶ to 3 × 10⁻⁴ over 1,000 steps, then anneals to zero.

TABLE I
Encoders and predictor architecture hyperparameters. The total number of trainable parameters during training is 96.7M, with 85.4M used at inference since the predictor is not needed.
Hyperparameter    Context / Target Encoders    Predictor
Patch size        16 × 16                      -
Embedding dim     768                          384
Depth             12                           6
Heads             12                           12
MLP ratio         4.0                          4.0
# Parameters      85.4M (each)                 11.3M

TABLE II
Linear-probing evaluation results on the X-ARES evaluation suite, comparing our Audio-JEPA implementation with wav2vec 2.0 and data2vec. Scores are reported as given by X-ARES; bold indicates the best performance, and underline indicates the second-best.

Dataset                          Audio-JEPA (ours)   Wav2Vec2 [5]   Data2Vec [7]
ASV2015 [23]                     0.898               0.924          0.937
Clotho [24]                      0.014               0.014          0.008
CREMA-D [25, 26]                 0.427               0.541          0.523
DESED [27, 28]                   0.306               0.313          0.136
ESC-50 [29]                      0.338               0.510          0.229
Fluent Speech Commands [18]      0.025               0.468          0.978
Free Music Archive small [30]    0.553               0.469          0.334
FSD18-Kaggle [31]                0.212               0.241          0.153
FSD50k [32]                      0.151               0.166          0.085
GTZAN Genre [17]                 0.628               0.630          0.448
LibriCount [33]                  0.471               0.583          0.492
LibriSpeech-MF [34]              0.883               0.948          0.752
NSynth-Instruments [35]          0.404               0.443          0.336
RAVDESS [36]                     0.303               0.442          0.467
Speech Commands V1 [18]          0.152               0.714          0.927
UrbanSound 8k [15]               0.585               0.659          0.426
Vocal Imitation [37]             0.056               0.147          0.128
VocalSound [38]                  0.526               0.768          0.803
VoxCeleb1 [16]                   0.041               0.340          0.105
VoxLingua33 [39]                 0.093               0.553          0.620

V. Experimental Setup

We train on 4 NVIDIA V100 GPUs with a total batch size of 256 clips; each batch therefore contains roughly 42.7 minutes of audio. Training ran for 100,000 steps (~13 epochs) and took 14 hours to complete. This required significantly fewer resources than wav2vec 2.0 Base [5] and data2vec [7], which each trained for 400k steps with larger batches of 1.6 hours and 63 minutes of audio, respectively.

VI. Results

A. Linear-probe (MLP) performance

We first probe Audio-JEPA’s representations with a small MLP head to assess their linear separability across tasks. Table II reports Audio-JEPA’s performance on 20 of the 21 X-ARES datasets (all except LibriSpeech-100h [34]).
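The linear-probing protocol behind these scores can be sketched as follows. This is a schematic illustration only, not the exact X-ARES pipeline: the stand-in linear “encoder”, the class count, and all dimensions are made up for the example.

```python
# Schematic linear probe: freeze the pre-trained encoder and train a single
# linear layer on its pooled embeddings. A linear stand-in replaces the ViT.
import torch
import torch.nn as nn

encoder = nn.Linear(32, 64)                 # frozen pre-trained encoder (stand-in)
for p in encoder.parameters():
    p.requires_grad_(False)
probe = nn.Linear(64, 10)                   # single linear layer, 10 toy classes
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

x = torch.randn(128, 16, 32)                # toy batch of patch sequences
y = torch.randint(0, 10, (128,))
for _ in range(5):                          # a few probe-only training steps
    with torch.no_grad():
        z = encoder(x).mean(dim=1)          # pooled frozen embeddings
    loss = ce(probe(z), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

The kNN evaluation replaces the trained probe with a non-parametric nearest-neighbor vote over the same frozen embeddings z.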
The model reaches first or second place on several benchmarks but falls to last on roughly half of them. In particular, Audio-JEPA underperforms significantly on Fluent Speech Commands and Speech Commands V1. As noted in Section III.C, Audio-JEPA is at a disadvantage under linear-probe evaluation, as its embedding space is not guaranteed to be linearly separable: the training objective favors embedding cohesion rather than linear separability. As shown in [11], attentive pooling might help in this situation.

B. kNN performance

We next evaluate pure embedding quality with k-nearest-neighbor classification on the 16 X-ARES tasks compatible with this probe. Table III shows that Audio-JEPA achieves first place on 3 datasets (ESC-50, FMA-small, GTZAN) and second place on 7 more, outperforming wav2vec 2.0 and data2vec despite using far less pre-training data and compute. Conversely, it ranks last on about a third of the tasks, underperforming substantially on the same datasets as under linear probing. Because kNN probes frozen embeddings directly, without extra classifier capacity, these results may better reflect the true representational power of Audio-JEPA across the music, environmental-sound, and speech domains.

C. Closing summary

Overall, Audio-JEPA demonstrates that a simple mask-prediction objective can yield high-quality audio embeddings with far less pre-training data. Under kNN evaluation, Audio-JEPA often matches or surpasses baselines on music and general-sound tasks, even though those baselines underwent far more extensive pre-training, confirming the strength of latent-space prediction. Linear-probe results expose the limitations of a single-layer head for Audio-JEPA, but also point to clear remedies (attentive pooling or small multilayer probes). Across both evaluation methods, Audio-JEPA is weakest on tasks requiring fine-grained speech discrimination (e.g.
speaker verification and keyword spotting), indicating that specialized data could improve these cases. These results validate Audio-JEPA as a data-efficient foundation for audio representation learning and set the stage for the architectural and tuning improvements detailed in the Conclusion.

TABLE III
kNN evaluation results on the X-ARES evaluation suite, directly comparing frozen embeddings from Audio-JEPA against wav2vec 2.0 and data2vec. Scores are reported as given by X-ARES; bold indicates the best performance, and underline indicates the second-best.

Dataset                          Audio-JEPA (ours)   Wav2Vec2 [5]   Data2Vec [7]
ASV2015 [23]                     0.927               0.858          0.942
CREMA-D [25]                     0.267               0.221          0.351
ESC-50 [29]                      0.140               0.081          0.040
Fluent Speech Commands [18]      0.009               0.017          0.630
Free Music Archive small [30]    0.449               0.251          0.106
GTZAN Genre [17]                 0.452               0.303          0.108
LibriCount [33]                  0.307               0.311          0.176
LibriSpeech-MF [34]              0.545               0.606          0.724
NSynth-Instruments [35]          0.170               0.251          0.179
RAVDESS [36]                     0.215               0.169          0.313
Speech Commands V1 [18]          0.044               0.208          0.852
UrbanSound 8k [15]               0.303               0.339          0.156
Vocal Imitation [37]             0.017               0.010          0.018
VocalSound [38]                  0.256               0.269          0.308
VoxCeleb1 [16]                   0.002               0.003          0.033
VoxLingua33 [39]                 0.057               0.034          0.058

VII. Conclusion

We introduced Audio-JEPA, the first open-source, from-scratch adaptation of the Joint-Embedding Predictive Architecture to audio. Pre-trained on AudioSet with off-the-shelf hyper-parameters, Audio-JEPA delivers competitive performance on X-ARES. The method compares especially well under kNN evaluation, confirming that predicting masked latent targets, rather than the waveform or discrete labels, is a useful inductive bias for general-purpose audio representation learning.

Looking forward, we identify three straightforward upgrades:
1) Attention-pooling head. Replacing the single-frame MLP used in our linear-probe evaluation with a lightweight attention-pooling block, as proposed for V-JEPA [11], could yield fairer comparisons and narrow the current linear-probe gap.
2) Modern backbones and positional encodings. Swapping the vanilla ViT for recent audio transformers (e.g. ConvFormer [40] or CAFormer [41]) and testing rotary or conditional sine-cosine encodings should improve modelling of long-range temporal cues.
3) Hyper-parameter tuning. A systematic sweep of mask ratio, EMA decay and optimizer settings, currently untouched, is likely to uncover additional headroom.

By open-sourcing our code and checkpoints, we hope to establish Audio-JEPA as a solid starting point for the community to explore these directions and to further unify JEPA research across vision and now audio.

Acknowledgement

This work was granted access to the HPC resources of IDRIS under the allocation AD011014754 made by GENCI. Support from the ANR-3IA Artificial and Natural Intelligence Toulouse Institute ANITI (ANR-19-PI3A-0004) is gratefully acknowledged.

References

[1] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units.” [Online]. Available: http://arxiv.org/abs/2106.07447
[2] H. Dinkel, Y. Wang, Z. Yan, J. Zhang, and Y. Wang, “CED: Consistent ensemble distillation for audio tagging.” [Online]. Available: http://arxiv.org/abs/2308.11957
[3] H. Dinkel, Z. Yan, Y. Wang, J. Zhang, Y. Wang, and B. Wang, “Scaling up masked audio encoder learning for general audio classification.” [Online]. Available: http://arxiv.org/abs/2406.06992
[4] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, “Masked Modeling Duo: Towards a Universal Audio Pre-training Framework.” [Online]. Available: http://arxiv.org/abs/2404.06095
[5] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” [Online]. Available: http://arxiv.org/abs/2006.11477
[6] S.
Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, Oct. 2022, doi: 10.1109/JSTSP.2022.3188113.
[7] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language.” [Online]. Available: http://arxiv.org/abs/2202.03555
[8] A. Baevski, A. Babu, W.-N. Hsu, and M. Auli, “Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language.” [Online]. Available: http://arxiv.org/abs/2212.07525
[9] Y. LeCun, “A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27,” Jun. 2022.
[10] M. Assran et al., “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture.” [Online]. Available: http://arxiv.org/abs/2301.08243
[11] A. Bardes et al., “Revisiting Feature Prediction for Learning Visual Representations from Video.” [Online]. Available: http://arxiv.org/abs/2404.08471
[12] Z. Fei, M. Fan, and J. Huang, “A-JEPA: Joint-Embedding Predictive Architecture Can Listen.” [Online]. Available: http://arxiv.org/abs/2311.15830
[13] A. Riou, S. Lattner, G. Hadjeres, M. Anslow, and G. Peeters, “Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation.” [Online]. Available: http://arxiv.org/abs/2408.02514
[14] P.-Y. Huang et al., “Masked Autoencoders that Listen.” [Online]. Available: http://arxiv.org/abs/2207.06405
[15] J. Salamon, C. Jacoby, and J. P. Bello, “A Dataset and Taxonomy for Urban Sound Research,” in Proceedings of the 22nd ACM International Conference on Multimedia, MM '14. New York, NY, USA: Association for Computing Machinery, Nov. 2014, pp. 1041–1044, doi: 10.1145/2647868.2655045.
[16] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in Interspeech 2017, Aug. 2017, pp. 2616–2620.
doi: 10.21437/Interspeech.2017-950.
[17] B. L. Sturm, “The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use,” Journal of New Music Research, vol. 43, no. 2, pp. 147–172, Apr. 2014, doi: 10.1080/09298215.2014.894533.
[18] P. Warden, “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.” [Online]. Available: http://arxiv.org/abs/1804.03209
[19] K. H. N. Vo, D. P. T. Nguyen, T. Nguyen, and T. T. Quan, “TIJEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems.” [Online]. Available: http://arxiv.org/abs/2503.06380
[20] T. Lundqvist and L. Delvret, “GeoJEPA: Towards Eliminating Augmentation- and Sampling Bias in Multimodal Geospatial Learning.” [Online]. Available: http://arxiv.org/abs/2503.05774
[21] J.-B. Grill et al., “Bootstrap your own latent: A new approach to self-supervised learning.” [Online]. Available: http://arxiv.org/abs/2006.07733
[22] I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization.” [Online]. Available: http://arxiv.org/abs/1711.05101
[23] Z. Wu, T. Kinnunen, N. Evans, and J. Yamagishi, “Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database.” [Online]. Available: https://datashare.ed.ac.uk/handle/10283/853
[24] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An Audio Captioning Dataset.” [Online]. Available: http://arxiv.org/abs/1910.09387
[25] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014, doi: 10.1109/TAFFC.2014.2336244.
[26] M. K. Keutmann, S. L. Moore, A. Savitt, and R. C. Gur, “Generating an item pool for translational social cognition research: methodology and initial validation,” Behavior Research Methods, vol. 47, no. 1, pp. 228–234, Mar. 2015, doi: 10.3758/s13428-014-0464-0.
[27] N. Turpault, R. Serizel, A. P. Shah, and J.
Salamon, “Sound event detection in domestic environments with weakly labeled data and soundscape synthesis,” presented at the Workshop on Detection and Classification of Acoustic Scenes and Events, 2019. Accessed: May 07, 2025. [Online]. Available: https://inria.hal.science/hal-02160855
[28] R. Serizel, N. Turpault, A. Shah, and J. Salamon, “Sound event detection in synthetic domestic environments,” presented at the ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing, May 2020. Accessed: May 07, 2025. [Online]. Available: https://inria.hal.science/hal-02355573
[29] K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proceedings of the 23rd ACM International Conference on Multimedia, MM '15. New York, NY, USA: Association for Computing Machinery, Oct. 2015, pp. 1015–1018, doi: 10.1145/2733373.2806390.
[30] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A Dataset For Music Analysis.” [Online]. Available: http://arxiv.org/abs/1612.01840
[31] E. Fonseca et al., “General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline.” [Online]. Available: http://arxiv.org/abs/1807.09902
[32] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: An Open Dataset of Human-Labeled Sound Events.” [Online]. Available: http://arxiv.org/abs/2010.00475
[33] F.-R. Stöter, S. Chakrabarty, E. Habets, and B. Edler, “LibriCount, a dataset for speaker count estimation.” [Online]. Available: https://zenodo.org/records/1216072
[34] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5206–5210, doi: 10.1109/ICASSP.2015.7178964.
[35] J. Engel et al., “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.” [Online]. Available: http://arxiv.org/abs/1704.01279
[36] S. R. Livingstone and F.
A. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS).” [Online]. Available: https://zenodo.org/records/1188976
[37] “Vocal Imitation Set v1.1.3: Thousands of vocal imitations of hundreds of sounds from the AudioSet ontology.” [Online]. Available: https://zenodo.org/records/1340763
[38] Y. Gong, J. Yu, and J. Glass, “Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022, pp. 151–155, doi: 10.1109/ICASSP43922.2022.9746828.
[39] N. Yadong, “voxlingua33 in WebDataset Format.” [Online]. Available: https://zenodo.org/records/14723799
[40] X. Lin, Z. Yan, X. Deng, C. Zheng, and L. Yu, “ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation.” [Online]. Available: http://arxiv.org/abs/2309.05674
[41] W. Yu et al., “MetaFormer Baselines for Vision.” [Online]. Available: http://arxiv.org/abs/2210.13452