The output dimensions of the hidden layer and what each dimension means #179

@Nora-jsu

Hello, I am studying the hidden states of the CLAP audio_encoder output. I take hidden_states = encoder_output.hidden_states[-1], and printing its shape gives torch.Size([1, 768, 8, 8]). What does each dimension mean? Is it [batch_size, channels, time, freq]? Does time refer to frames, i.e. the audio is split into 8 frames and each frame is split into 8 frequency bins, so that each frequency bin of each frame ends up with a 768-dimensional feature vector? I am quite confused and would appreciate an answer. Thank you very much.
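
To make the reading I am asking about concrete, here is a minimal sketch of the arithmetic it implies. The axis names (batch_size, channels, time, freq) are only my guess and are not confirmed by the CLAP code; the sketch works on the printed shape alone and does not call the model.

```python
from math import prod

# Shape printed for encoder_output.hidden_states[-1] in the question.
shape = (1, 768, 8, 8)

# Hypothetical axis labels: this is the interpretation being asked about,
# not something confirmed by the CLAP implementation.
axis_names = ("batch_size", "channels", "time", "freq")
labeled = dict(zip(axis_names, shape))

# Under that reading, each (time, freq) cell holds one feature vector.
num_positions = labeled["time"] * labeled["freq"]
feature_dim = labeled["channels"]

# Flattening the last two axes and moving channels last would give a
# (batch, positions, features) layout with the same number of elements.
flattened_shape = (labeled["batch_size"], num_positions, feature_dim)
assert prod(shape) == prod(flattened_shape)

print(labeled)
print(flattened_shape)
```

If the guess is right, this would mean 64 positions per clip, each described by a 768-dimensional vector. Whether the two trailing axes really correspond to time and frequency is exactly what I would like to have confirmed.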
