Convert provgigapath embeddings to parquet by slide/tile #32

The current prov-gigapath embeddings are stored as CSV files whose cells contain text representations of Python objects. Recovering the numeric data means parsing those strings back into arrays, which makes the data difficult to access and use.

## Proposal

Convert all the tile-level and slide-level prov-gigapath embeddings to parquet format, with one file per level. Each file has one or more metadata columns (slide id, tile location, image name) and one column with the actual tensor data (14 x 768 array).

## Advantages

- Much easier data management: one file for tile-level data and one for slide-level data cover all of TCGA
- Dataset becomes more AI-ready: it loads directly into data frames and ML pipelines without custom parsing
- Language-agnostic representation (any language with a parquet library can read the files)
- Data access code becomes trivial (read the parquet file)
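
To illustrate the last point, here is a minimal sketch of what reading could look like in Python. It rests on assumptions that are not part of this issue: the `pyarrow` package is installed, the column names passed in are whatever the final schema ends up using, and the tensor column stores each matrix flattened row by row (parquet has no native 2-D array type, so some convention like this is needed). `load_columns` and `to_matrix` are hypothetical helper names.

```python
def load_columns(parquet_path, columns=None):
    """Read only the requested columns; columns=None reads every column."""
    # Imported here so the pure-Python helper below works without pyarrow.
    import pyarrow.parquet as pq  # third-party: pip install pyarrow

    return pq.read_table(parquet_path, columns=columns).to_pydict()


def to_matrix(flat, n_cols=768):
    """Rebuild a row-major matrix (list of rows) from a flattened embedding."""
    if len(flat) % n_cols != 0:
        raise ValueError(f"{len(flat)} values is not a multiple of {n_cols}")
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]
```

Because parquet is columnar, a call such as `load_columns("tile_embeddings.parquet", columns=["slide_id"])` (file and column names are placeholders) would read the metadata without loading the tensor column at all.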

## Pseudocode

1. Read in the embeddings from each per-sample CSV file
2. Build the metadata for each CSV file and collect it in a data frame
3. Convert each CSV file's embedding to a matrix and add it as a new column in the data frame from step 2
4. Write out the full data frame as a parquet file
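
The steps above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the exact text format of the embeddings is not shown in this issue, so the parser assumes something like `tensor([[...], [...]])` with every value printed (no `...` truncation); the column name `embedding`, the use of the file name as `slide_id`, and the flattened row-major storage with an `n_rows` column are all hypothetical choices; and writing requires the third-party `pyarrow` package.

```python
import csv
import re
from pathlib import Path

# Embedding cells can be long; raise the csv module's per-field size limit.
csv.field_size_limit(2**31 - 1)

# Matches signed decimals and scientific notation, e.g. -1.5, 3., 2e-3
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_embedding(text, n_cols=768):
    """Parse a text repr such as 'tensor([[0.1, 0.2]])' into rows of floats."""
    # Keep only the bracketed data so suffixes like 'dtype=torch.float32'
    # do not contribute stray numbers.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no bracketed array found")
    values = [float(tok) for tok in _NUMBER.findall(text[start:end + 1])]
    if not values or len(values) % n_cols != 0:
        raise ValueError(f"{len(values)} values do not fill rows of {n_cols}")
    return [values[i:i + n_cols] for i in range(0, len(values), n_cols)]


def collect_rows(csv_dir, text_column="embedding", n_cols=768):
    """Steps 1-3: read every CSV, keep metadata, attach the parsed embedding."""
    rows = []
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        with open(csv_path, newline="") as handle:
            for record in csv.DictReader(handle):
                matrix = parse_embedding(record[text_column], n_cols)
                # Every other CSV column is carried through as metadata.
                row = {k: v for k, v in record.items() if k != text_column}
                row["slide_id"] = csv_path.stem  # assumed: file name = slide id
                row["n_rows"] = len(matrix)
                row["embedding"] = [v for r in matrix for v in r]  # row-major
                rows.append(row)
    return rows


def write_parquet(rows, out_path):
    """Step 4: write the collected rows to a single parquet file."""
    import pyarrow as pa  # third-party: pip install pyarrow
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), out_path)
```

Collecting all rows in memory keeps the sketch short. For all of TCGA at tile level, writing in batches would likely be needed instead.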

## Result

1. Tile-level prov-gigapath embeddings in a parquet file
2. Slide-level prov-gigapath embeddings in a parquet file

Both files are fully language-agnostic and AI/ML-ready.
