Hello, I am studying the hidden states output by the audio_encoder of CLAP. I take hidden_states = encoder_output.hidden_states[-1], and printing its shape gives torch.Size([1, 768, 8, 8]). What does each dimension mean? Is it [batch_size, channels, time, freq]? Does the time dimension represent frames? For example, is a piece of audio divided into 8 frames, with each frame then divided into 8 frequency bins, so that each frequency bin of each frame ends up with a 768-dimensional feature vector? I'm very confused and hope to get an answer. Thank you very much.
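
To make the interpretation I am asking about concrete, here is a minimal sketch. It does not load CLAP: the random tensor is only a stand-in with the same shape as my hidden_states, and the layout [batch_size, channels, axis_a, axis_b] is my assumption, not something I have confirmed. Whether axis_a and axis_b really correspond to time and frequency is exactly what I do not know, and the names grid, cell, and tokens are mine. The sketch only shows how to read one 768-dimensional vector per grid position under that assumption.

```python
import torch

# Stand-in for encoder_output.hidden_states[-1]; only the shape matches my case.
hidden_states = torch.randn(1, 768, 8, 8)

# Assumed layout (unconfirmed): [batch_size, channels, axis_a, axis_b].
batch_size, channels, axis_a, axis_b = hidden_states.shape

# Move the channel axis last so each grid position holds one feature vector.
grid = hidden_states.permute(0, 2, 3, 1)  # shape (1, 8, 8, 768)

# Feature vector at grid position (0, 3) of the first batch item.
cell = grid[0, 0, 3]
print(cell.shape)  # torch.Size([768])

# Flatten the 8 x 8 grid into a sequence of 64 vectors of size 768.
tokens = hidden_states.flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 64, 768])

# Position (i, j) in the grid maps to index i * axis_b + j in the sequence.
print(torch.equal(tokens[0, 0 * axis_b + 3], cell))  # True
```

If this layout is right, the 8 x 8 grid gives 64 positions, each with a 768-dimensional feature vector. What I still cannot tell is what those positions mean physically.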