The output dimensions of the hidden layer and what each dimension means

Hello, I am studying the hidden states output by CLAP's audio_encoder. I take hidden_states = encoder_output.hidden_states[-1], and printing its shape gives torch.Size([1, 768, 8, 8]). What does each dimension mean? Is it [batch_size, channels, time, freq]? Does the time dimension correspond to frames? For example, is a piece of audio divided into 8 frames, each frame divided into 8 frequency bins, so that every frequency bin of every frame has a 768-dimensional feature vector? I'm very confused and hope to get an answer. Thank you very much.
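To make the shape question concrete, here is a minimal sketch that uses a random tensor with the same shape as the one printed above (it does not load CLAP). It assumes the layout is [batch, channels, height, width]; whether the two grid axes correspond to time and frequency is exactly what is being asked, so the names height and width are placeholders, not a confirmed interpretation.

```python
import torch

# Stand-in for encoder_output.hidden_states[-1]: random values, same shape as in the question.
hidden_states = torch.randn(1, 768, 8, 8)

# Assumed layout (not confirmed by the source): [batch, channels, height, width].
batch, channels, height, width = hidden_states.shape

# Flatten the 8 x 8 grid into 64 positions and move the channel axis last,
# giving one 768-dimensional vector per grid position.
tokens = hidden_states.flatten(2).transpose(1, 2)  # [batch, height * width, channels]

# Average over all grid positions to get a single vector per clip.
pooled = tokens.mean(dim=1)  # [batch, channels]

print(tuple(tokens.shape), tuple(pooled.shape))
```

Under this assumption, tokens[0, i * width + j] is the 768-dimensional feature at grid position (i, j), which is convenient for inspecting or pooling individual positions.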

The output dimensions of the hidden layer and what each dimension means #179
