Variation Normalizer Manuscript

This repo contains the analysis notebooks used in the manuscript The Clinical Genomic Variation Landscape.

Small output files can be found in this repo. Larger files can be found in our public s3 bucket: s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/2025. There are notebooks that provide functions for programmatically downloading files from the s3 bucket.
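
The notebooks ship their own download helpers, which should be preferred. Purely as an illustration of what a programmatic download involves, the sketch below splits the bucket URI above into a bucket name and key prefix and fetches a single object with boto3. The function names, the `file_name` and `dest_dir` arguments, and the assumption that the bucket accepts anonymous (unsigned) reads are hypothetical and not taken from the repo.

```python
from pathlib import Path
from urllib.parse import urlparse

S3_URI = "s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/2025"


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an s3:// URI: {uri}")
    return parsed.netloc, parsed.path.strip("/")


def download_public_file(file_name: str, dest_dir: Path, uri: str = S3_URI) -> Path:
    """Download one object stored under the prefix (hypothetical helper).

    Assumes boto3 is installed and the bucket allows anonymous reads.
    """
    # Imported lazily so split_s3_uri works even without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    bucket, prefix = split_s3_uri(uri)
    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / file_name
    client.download_file(bucket, f"{prefix}/{file_name}", str(dest))
    return dest


print(split_s3_uri(S3_URI))
```

Only `split_s3_uri` runs here; `download_public_file` needs network access and a real object name from the bucket.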

After running the notebooks, users will be able to create figures that demonstrate the results of the analysis, such as the figure below.

Variant normalization allows patient samples from AACR Project GENIE to be matched to normalized variants in the CIViC, MOAlmanac, and ClinVar knowledgebases.

Set Up

Before running the notebooks, you must set up your environment.

Prerequisites

Docker
Python 3.13
- We recommend using uv to install.
libpq
postgresql

MacOS

You can use Homebrew to install the prerequisites. See the Homebrew documentation for how to install. Make sure Homebrew is up-to-date by running brew update.

brew install libpq
brew install postgresql@14

Ubuntu

sudo apt install gcc libpq-dev python3-dev

Creating the virtual environment

uv

From the root directory, run the following to create the venv and install exact packages:

uv python pin 3.13
uv venv
source .venv/bin/activate
uv sync --all-extras
git submodule update --init --recursive

pip

python3.13 -m venv .venv
source .venv/bin/activate
python3.13 -m pip install -r requirements.txt
git submodule update --init --recursive

Set Up Backend Services

This analysis relies on backend services, which you must set up yourself:

Biocommons SeqRepo
Variation Normalizer: Docker Container

1. Biocommons SeqRepo

Biocommons SeqRepo is used for fast access to sequence data. This analysis uses 2024-12-20 SeqRepo data.

Follow the Quick Start Documentation for setting up SeqRepo (2024-12-20).

SeqRepo Verification

To verify, run the following inside your virtual environment:

╰─$ python3
Python 3.13.1 (main, Dec 31 2024, 13:03:34) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo(root_dir="/usr/local/share/seqrepo/2024-12-20")
>>> sr["NC_000001.11"][780000:780020]
'TGGTGGCACGCGCTTGTAGT'
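
If the session above fails before returning a sequence, it can help to first confirm that the snapshot directory is where you expect it. The following is a minimal sketch, assuming a SeqRepo snapshot contains an `aliases.sqlite3` file and a `sequences` directory at its top level; the helper name `missing_seqrepo_paths` is hypothetical.

```python
from pathlib import Path

# Default snapshot location used elsewhere in this README.
DEFAULT_ROOT = Path("/usr/local/share/seqrepo/2024-12-20")


def missing_seqrepo_paths(root_dir: Path = DEFAULT_ROOT) -> list[Path]:
    """Return the expected SeqRepo paths that are absent under root_dir.

    The expected layout (aliases.sqlite3 plus a sequences/ directory) is an
    assumption about how SeqRepo snapshots are organized on disk.
    """
    expected = [root_dir, root_dir / "aliases.sqlite3", root_dir / "sequences"]
    return [path for path in expected if not path.exists()]


if __name__ == "__main__":
    missing = missing_seqrepo_paths()
    if missing:
        print("Missing:", ", ".join(str(p) for p in missing))
    else:
        print("SeqRepo snapshot looks complete.")
```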

Troubleshooting SeqRepo

Ensure you have the correct rsync executable: GNU rsync, NOT openrsync. Recent macOS releases ship the latter rather than the former, but GNU rsync can be installed with Homebrew and passed to SeqRepo, e.g. seqrepo --rsync-exe $(brew --prefix)/bin/rsync.

If you encounter a permission error like this:

PermissionError: [Errno 13] Permission denied: '/usr/local/share/seqrepo/2024-02-20._fkuefgd' -> '/usr/local/share/seqrepo/2024-02-20'

Try moving data manually with sudo:

sudo mv /usr/local/share/seqrepo/$SEQREPO_VERSION.* /usr/local/share/seqrepo/$SEQREPO_VERSION

2. Variation Normalizer: Docker Container

Important

This section assumes you have a local SeqRepo installed at /usr/local/share/seqrepo/2024-12-20. If you have it installed elsewhere, please add a SEQREPO_ROOT_DIR environment variable to compose.yaml and .env.shared.
If you're using Docker Desktop, you must go to Settings -> Resources -> File sharing and add /usr/local/share/seqrepo under the Virtual file shares section. Otherwise, you will get the following error: OSError: Unable to open SeqRepo directory /usr/local/share/seqrepo/2024-12-20.
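
For your own scripts, the same override-or-default behavior can be mirrored in a few lines. This is a minimal sketch, not how the containers themselves read the variable; the function name `resolve_seqrepo_root` is hypothetical.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Default location assumed by this section.
DEFAULT_SEQREPO_ROOT = "/usr/local/share/seqrepo/2024-12-20"


def resolve_seqrepo_root(env: Mapping[str, str] | None = None) -> str:
    """Return SEQREPO_ROOT_DIR if set and non-empty, else the default path."""
    env = os.environ if env is None else env
    return env.get("SEQREPO_ROOT_DIR") or DEFAULT_SEQREPO_ROOT


print(resolve_seqrepo_root())
```

Passing a plain dict instead of `os.environ` makes the lookup easy to check without touching your shell environment.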

To build, (re)create, and start containers

docker volume create --name=uta_vol
docker compose \
  -p variation-normalizer-manuscript \
  -f submodules/compose.yaml \
  -f compose.yaml \
  up

Tip

If you want a clean slate, run docker compose down -v to remove containers and volumes, then docker compose -p variation-normalizer-manuscript -f submodules/compose.yaml -f compose.yaml up to rebuild and start fresh containers.

Note

We use python-dotenv to load environment variables needed for analysis notebooks that run the Variation Normalizer. Environment variables can be located at .env.shared.

In Docker Desktop, you should see the following for a successful setup:

cool-seq-tool-data-1 exits after its download is complete. The other three containers (api-1, uta-1, and gene-dynamodb-local-1) should all be running.
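
The same check can be done from a script instead of Docker Desktop. The sketch below is an illustration under assumptions: it relies on `docker compose ps --all --format json` reporting `Name` and `State` fields, and on container names ending in the suffixes listed above (for example `<project>-api-1`). All function names are hypothetical.

```python
import json
import subprocess

PROJECT = "variation-normalizer-manuscript"
# Expected state per container-name suffix, following the description above.
EXPECTED = {
    "cool-seq-tool-data-1": "exited",
    "api-1": "running",
    "uta-1": "running",
    "gene-dynamodb-local-1": "running",
}


def parse_ps_output(raw: str) -> dict[str, str]:
    """Map container name -> state from `docker compose ps --format json`.

    Accepts either one JSON array or one JSON object per line, because the
    output shape differs between Docker Compose versions.
    """
    raw = raw.strip()
    if not raw:
        return {}
    if raw.startswith("["):
        rows = json.loads(raw)
    else:
        rows = [json.loads(line) for line in raw.splitlines() if line.strip()]
    return {row["Name"]: row["State"] for row in rows}


def unexpected_states(states: dict[str, str]) -> dict[str, str]:
    """Return {suffix: actual state, or 'missing'} for containers that differ."""
    problems = {}
    for suffix, wanted in EXPECTED.items():
        actual = next(
            (
                state
                for name, state in states.items()
                if name == suffix or name.endswith("-" + suffix)
            ),
            "missing",
        )
        if actual != wanted:
            problems[suffix] = actual
    return problems


def current_states() -> dict[str, str]:
    """Ask Docker Compose for every container in the project (needs Docker)."""
    result = subprocess.run(
        ["docker", "compose", "-p", PROJECT, "ps", "--all", "--format", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ps_output(result.stdout)
```

Calling `unexpected_states(current_states())` returns an empty dict when the setup matches the description; while cool-seq-tool-data-1 is still downloading, it is reported as `running`.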

Running Notebooks

This section describes the notebooks and the order in which they should be run. Notebooks that have a Table of Contents link from it to their own sections. You must use VS Code for the Table of Contents links to work.

Important

You must have the Docker containers running.

Run the following notebook:
- analysis/download_s3_files.ipynb
  - Downloads the ClinVar CNV and NCH CNV files that the notebooks need from the public s3 bucket.
    - The following notebooks were used to create the files that are downloaded in this notebook. You do not need to re-run them; if you do choose to re-run, order does not matter:
      - analysis/cnvs/prep_clinvar_cnvs.ipynb
        - Creates ClinVar-CNVs-normalized.csv.gzip and NCH-normalizer-results.json
      - analysis/cnvs/parse_prep_normalize_nch_cnvs.ipynb
        - Creates NCH-microarray-CNVs-cleaned.csv
Run the following notebooks (order does not matter):
- analysis/civic/variation_analysis/civic_variation_analysis.ipynb
  - Runs CIViC variant data through the Variation Normalizer
- analysis/clinvar/clinvar_variation_analysis.ipynb
  - Analysis on ClinVar variant data
- analysis/genie/pre_variant_analysis/genie_pre_variant_analysis.ipynb
  - Runs GENIE variant data through the Variation Normalizer
- analysis/moa/feature_analysis/moa_feature_analysis.ipynb
  - Runs MOA feature data through the Variation Normalizer
Run the following notebooks (order does not matter):
- analysis/civic/variation_analysis/transcript_variation_analysis.ipynb
  - Analysis on CIViC variants in the Transcript category
- analysis/civic/evidence_analysis/civic_evidence_analysis.ipynb
  - Analysis on CIViC evidence items
- analysis/cnvs/query_match_nch_clinvar_cnvs.ipynb
  - Analysis on feature overlap in NCH and ClinVar CNVs
- analysis/genie/variant_analysis/genie_search_analysis.ipynb
  - Analysis on matched normalized GENIE variants and normalized variants from CIViC, MOA, and ClinVar
- analysis/moa/assertion_analysis/moa_assertion_analysis.ipynb
  - Analysis on MOA assertions
Run the following notebook:
- analysis/merged_moa_civic/merged_moa_civic_evidence_analysis.ipynb
  - Combined analysis on CIViC evidence items and MOA assertions
Run the following notebook:
- analysis/performance_analysis/merged_performance_analysis.ipynb
  - Analysis on Variation Normalizer performance on CIViC, MOA, and ClinVar
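
The run order above can be written down as data, which makes it easy to confirm that every notebook is present before starting. This is a minimal sketch: the helper names are hypothetical, and the `jupyter nbconvert` commands are an alternative to running the notebooks interactively that assumes nbconvert is installed in your environment; this README itself only describes running them in VS Code.

```python
from pathlib import Path

# Stages in the order listed above; notebooks within a stage may run in any order.
STAGES = [
    ["analysis/download_s3_files.ipynb"],
    [
        "analysis/civic/variation_analysis/civic_variation_analysis.ipynb",
        "analysis/clinvar/clinvar_variation_analysis.ipynb",
        "analysis/genie/pre_variant_analysis/genie_pre_variant_analysis.ipynb",
        "analysis/moa/feature_analysis/moa_feature_analysis.ipynb",
    ],
    [
        "analysis/civic/variation_analysis/transcript_variation_analysis.ipynb",
        "analysis/civic/evidence_analysis/civic_evidence_analysis.ipynb",
        "analysis/cnvs/query_match_nch_clinvar_cnvs.ipynb",
        "analysis/genie/variant_analysis/genie_search_analysis.ipynb",
        "analysis/moa/assertion_analysis/moa_assertion_analysis.ipynb",
    ],
    ["analysis/merged_moa_civic/merged_moa_civic_evidence_analysis.ipynb"],
    ["analysis/performance_analysis/merged_performance_analysis.ipynb"],
]


def missing_notebooks(repo_root: Path) -> list[str]:
    """List notebooks from STAGES that do not exist under repo_root."""
    return [nb for stage in STAGES for nb in stage if not (repo_root / nb).is_file()]


def nbconvert_commands() -> list[list[str]]:
    """Build one `jupyter nbconvert` command per notebook, in stage order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb]
        for stage in STAGES
        for nb in stage
    ]


if __name__ == "__main__":
    for number, stage in enumerate(STAGES, start=1):
        print(f"Stage {number}: {len(stage)} notebook(s)")
```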

Running Notebooks in Visual Studio Code (VS Code)

VS Code is a lightweight source code editor for Windows, Linux, and macOS.

1. Download VS Code.
2. Open a notebook and click Select Kernel at the top right. Select the option where the path is .venv/3.13/bin/python or .venv/bin/python. See Manage Jupyter Kernels in VS Code for more information on managing Jupyter Kernels in VS Code.
3. Run the notebooks.
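
To confirm that the selected kernel is the project environment, a quick check can be run in the first cell of a notebook. This is a sketch with a hypothetical helper name; it only tests for Python 3.13 and for a `.venv` component in the interpreter path.

```python
import sys
from pathlib import Path


def kernel_problems(version_info=sys.version_info, executable=sys.executable) -> list[str]:
    """Return human-readable reasons the interpreter may be the wrong kernel."""
    problems = []
    if tuple(version_info[:2]) != (3, 13):
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} is active; expected 3.13"
        )
    if ".venv" not in Path(executable).parts:
        problems.append(f"Interpreter {executable} is not inside .venv")
    return problems


print(kernel_problems() or "Kernel matches the project environment.")
```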

Analysis with macOS Environments

These notebooks were run using these macOS specs:

Model Year	CPU Architecture	Total RAM	Hard drive capacity
2023	M2 Pro	32 GB	1 TB
2023	M3 Pro	36 GB	1 TB
2024	M4 Pro	48 GB	1 TB

Help

If you have any questions or problems, please open an issue in the repo and our team will be happy to assist.
