Open
Labels
enhancement, help wanted
Description
The current prov-gigapath files are CSV files with embedded text representations of Python classes. The embeddings have to be parsed out of these strings before they can be used, which makes the data very difficult to access.
Proposal
Convert all the tile-level and slide-level prov-gigapath embeddings to Parquet files, each with one or more metadata columns (slide id, tile location, image name) and one column holding the actual tensor data (14 x 768 array).
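To make the proposed layout concrete, here is a minimal sketch of what one row could look like. The field names, the string encoding of the tile location, and the `EmbeddingRow` class itself are illustrative assumptions, not an agreed schema; only the 14 x 768 shape comes from the proposal.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Tensor shape stated in the proposal.
EXPECTED_SHAPE: Tuple[int, int] = (14, 768)


@dataclass
class EmbeddingRow:
    """One row of the proposed table (hypothetical field names)."""

    slide_id: str
    tile_location: str  # assumption: encoded as a string such as "x_y"
    image_name: str
    embedding: List[List[float]]  # nested lists standing in for the tensor

    def shape(self) -> Tuple[int, int]:
        """Return (rows, columns) of the embedding, based on its first row."""
        rows = len(self.embedding)
        cols = len(self.embedding[0]) if rows else 0
        return (rows, cols)

    def is_valid(self) -> bool:
        """Check that every row has the expected width and the shape matches."""
        same_width = all(len(r) == EXPECTED_SHAPE[1] for r in self.embedding)
        return same_width and self.shape() == EXPECTED_SHAPE
```

A check like `is_valid()` could run before writing, so that a malformed CSV cell is caught early rather than ending up in the shared file.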
Advantages
- Much easier data management: one file for tile-level data and one for slide-level data covers ALL of TCGA.
- Dataset becomes more AI-ready
- Language-agnostic representation (any language can read parquet files)
- Data access code becomes trivial (read parquet file)
Pseudocode
1. Read in the embeddings from each per-sample CSV file.
2. Build the metadata for each CSV file and collect it in a data frame.
3. Convert each CSV file's embedding to a matrix and add it as a new column in the data frame from step 2.
4. Write out the full data frame as a Parquet file.
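The steps above could be sketched in Python roughly as follows. This is not a tested converter for the real files: it assumes each embedding cell looks like `tensor([[...]])`, that the column is named `embedding`, and that the CSV file name encodes the slide id. None of these are confirmed by the current data, and a truncated tensor repr (one containing `...`) would not parse. The helper names are hypothetical.

```python
import ast
import csv
import re
from pathlib import Path

# Embedding cells can be large, so raise the csv module's per-field limit.
csv.field_size_limit(2**31 - 1)

# Assumption: a cell looks like "tensor([[0.1, 0.2], [0.3, 0.4]])".
_TENSOR_RE = re.compile(r"^\s*tensor\((.*)\)\s*$", re.DOTALL)


def parse_embedding(text):
    """Turn a tensor repr string into nested Python lists of floats."""
    match = _TENSOR_RE.match(text)
    body = match.group(1) if match else text
    # Drop trailing keyword arguments such as ", dtype=torch.float64".
    body = re.sub(r",\s*(dtype|device)=.*$", "", body, flags=re.DOTALL)
    return ast.literal_eval(body)


def csv_to_records(csv_path, embedding_column="embedding"):
    """Steps 1-3: read one per-sample CSV and attach metadata to each row."""
    slide_id = Path(csv_path).stem  # assumption: file name encodes the slide id
    records = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            matrix = parse_embedding(row.pop(embedding_column))
            records.append({"slide_id": slide_id, **row, "embedding": matrix})
    return records


def write_parquet(records, out_path):
    """Step 4: write all records to a single Parquet file."""
    # Imported here so the parsing helpers above work without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pylist(records)
    pq.write_table(table, out_path)
```

Calling `csv_to_records` on every per-sample file, concatenating the lists, and passing the result to `write_parquet` would give one file per level. Reading it back is then a single call such as `pyarrow.parquet.read_table(path)`.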
Result
- Tile-level prov-gigapath embeddings in a Parquet file
- Slide-level prov-gigapath embeddings in a Parquet file
Fully language-agnostic and AI/ML ready...
billila