Convert provgigapath embeddings to parquet by slide/tile #32

@seandavi

Description

The current prov-gigapath embeddings are stored as CSV files whose cells contain text representations of Python objects. This format makes the data very difficult to access and use.
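To illustrate the extra parsing this format forces on every consumer, here is a minimal sketch. The exact cell format is not shown in this issue, so the `tensor([...])` style strings below are an assumption, and `parse_embedding_cell` is a hypothetical helper, not existing project code.

```python
import ast
import re

# Matches a wrapper such as "tensor(...)" or "np.array(...)" around the payload.
# ASSUMPTION: cells look like a constructor call around a nested list literal.
_WRAPPER = re.compile(r"^\s*[A-Za-z_][\w.]*\((.*)\)\s*$", re.DOTALL)


def parse_embedding_cell(cell):
    """Turn one stringified embedding cell into nested Python lists."""
    match = _WRAPPER.match(cell)
    inner = match.group(1) if match else cell
    # Drop trailing keyword arguments such as ", dtype=torch.float32".
    inner = re.sub(r",\s*\w+\s*=.*$", "", inner, flags=re.DOTALL)
    # literal_eval only accepts literals, so no code from the file is executed.
    return ast.literal_eval(inner.strip())


rows = parse_embedding_cell("tensor([[0.1, 0.2], [0.3, 0.4]])")
print(len(rows), len(rows[0]))
```

Every downstream user currently needs some variant of this step before the numbers are usable.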

Proposal

Convert all the tile-level and slide-level prov-gigapath embeddings to parquet format: one file per level, each with one or more metadata columns (slide id, tile location, image name) and one column with the actual tensor data (14 x 768 array).

Advantages

  • Much easier data management: one file for tile-level data and one for slide-level data cover ALL of TCGA.
  • Dataset becomes more AI-ready
  • Language-agnostic representation (any language can read parquet files)
  • Data access code becomes trivial (read parquet file)

Pseudocode

  1. Read in the embeddings from each per-sample CSV file
  2. Build the metadata for each CSV file and collect it in a data.frame
  3. Convert each CSV file's embedding to a matrix and add it as a new column in the data.frame from step 2
  4. Write out the full data.frame as a parquet file

Result

  1. tile-level provgigapath embeddings in a parquet file
  2. slide-level provgigapath embeddings in a parquet file

Fully language-agnostic and AI/ML ready...

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), help wanted (Extra attention is needed)

    Type

    No type

    Projects

    Status

    Backlog

