Variation Normalizer Manuscript

This repo contains analysis notebooks used in The Clinical Genomic Variation Landscape manuscript.

Small output files can be found in this repo. Larger files can be found in our public S3 bucket: s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/2025. Some notebooks provide functions for programmatically downloading files from the S3 bucket.
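
As an illustration of what such a download helper might look like, here is a minimal sketch that uses only the Python standard library. It assumes the bucket permits anonymous HTTPS reads through the standard S3 virtual-hosted endpoint. The function names and the example file name are hypothetical and are not taken from the notebooks.

```python
from pathlib import Path
from urllib.parse import quote, urlparse
from urllib.request import urlretrieve

# Prefix quoted in this README.
S3_PREFIX = "s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/2025"


def s3_uri_to_https(s3_uri: str) -> str:
    """Convert an s3://bucket/key URI to a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    key = quote(parsed.path.lstrip("/"))  # keeps "/" separators intact
    return f"https://{parsed.netloc}.s3.amazonaws.com/{key}"


def download(file_name: str, dest_dir: str = ".") -> Path:
    """Download one file under S3_PREFIX (assumes anonymous reads are allowed)."""
    url = s3_uri_to_https(f"{S3_PREFIX}/{file_name}")
    dest = Path(dest_dir) / Path(file_name).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, dest)
    return dest
```

Under these assumptions, `download("example.csv", "downloads")` would save a file named `example.csv` (a placeholder name) into a local `downloads` directory. The notebooks' own helpers remain the reference for which files exist.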

After running the notebooks, you will be able to create figures that illustrate the results of the analysis, such as the figure below.

Variant normalization allows patient samples from AACR Project GENIE to be matched to normalized variants in the CIViC, MOAlmanac, and ClinVar knowledgebases.

Patient Matching with GENIE

Set Up

Before running the notebooks, you must set up your environment.

Prerequisites

  • Docker
  • Python 3.13
    • We recommend using uv to install.
  • libpq
  • postgresql

macOS

You can use Homebrew to install the prerequisites. See the Homebrew documentation for how to install. Make sure Homebrew is up-to-date by running brew update.

brew install libpq
brew install postgresql@14

Ubuntu

sudo apt install gcc libpq-dev python3-dev

Creating the virtual environment

uv

From the root directory, run the following to create the venv and install exact packages:

uv python pin 3.13
uv venv
source .venv/bin/activate
uv sync --all-extras
git submodule update --init --recursive

pip

python3.13 -m venv .venv
source .venv/bin/activate
python3.13 -m pip install -r requirements.txt
git submodule update --init --recursive

Set Up Backend Services

This analysis relies on backend services, which you must set up yourself:

  1. Biocommons SeqRepo
  2. Variation Normalizer: Docker Container

1. Biocommons SeqRepo

Biocommons SeqRepo is used for fast access to sequence data. This analysis uses 2024-12-20 SeqRepo data.

Follow the Quick Start Documentation for setting up SeqRepo (2024-12-20).

SeqRepo Verification

To verify, run the following inside your virtual environment:

$ python3
Python 3.13.1 (main, Dec 31 2024, 13:03:34) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo(root_dir="/usr/local/share/seqrepo/2024-12-20")
>>> sr["NC_000001.11"][780000:780020]
'TGGTGGCACGCGCTTGTAGT'

Troubleshooting SeqRepo

  • Ensure you have the correct rsync executable: GNU rsync, NOT openrsync. Recent macOS releases ship the latter instead of the former, but GNU rsync can be installed with Homebrew and passed to SeqRepo with seqrepo --rsync-exe $(brew --prefix)/bin/rsync.

  • If you encounter a permission error like this:

    PermissionError: [Errno 13] Permission denied: '/usr/local/share/seqrepo/2024-02-20._fkuefgd' -> '/usr/local/share/seqrepo/2024-02-20'

Try moving data manually with sudo:

sudo mv /usr/local/share/seqrepo/$SEQREPO_VERSION.* /usr/local/share/seqrepo/$SEQREPO_VERSION
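
For the rsync item above, one way to tell which implementation is on your PATH is to inspect the banner printed by rsync --version. The sketch below is a heuristic: it assumes that openrsync identifies itself by name in that banner and that other builds start with a line of the form "rsync  version ...". The function names are hypothetical.

```python
import shutil
import subprocess


def classify_rsync(version_output: str) -> str:
    """Return 'openrsync', 'gnu', or 'unknown' from `rsync --version` text."""
    text = version_output.lower()
    if "openrsync" in text:
        return "openrsync"
    first_line = text.splitlines()[0] if text.strip() else ""
    if first_line.startswith("rsync") and "version" in first_line:
        return "gnu"
    return "unknown"


def detect_rsync(executable: str = "rsync") -> str:
    """Run `<executable> --version` and classify its banner."""
    path = shutil.which(executable)
    if path is None:
        return "unknown"
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds may print the banner to stderr, so check both streams.
    return classify_rsync(result.stdout + result.stderr)
```

If `detect_rsync()` returns "openrsync", pointing SeqRepo at the Homebrew-installed executable as described above is the likely fix.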

2. Variation Normalizer: Docker Container

Important

This section assumes you have a local SeqRepo installed at /usr/local/share/seqrepo/2024-12-20. If you have it installed elsewhere, please add a SEQREPO_ROOT_DIR environment variable pointing to that location in compose.yaml and .env.shared.
If you're using Docker Desktop, you must go to Settings -> Resources -> File sharing and add /usr/local/share/seqrepo under the Virtual file shares section. Otherwise, you will get the following error: OSError: Unable to open SeqRepo directory /usr/local/share/seqrepo/2024-12-20.
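
Before starting the containers, it can help to confirm that the SeqRepo directory is where the configuration expects it. The following sketch resolves the directory from SEQREPO_ROOT_DIR, falling back to the default path above, and reports anything missing. The expected entries (aliases.sqlite3 and a sequences directory) are an assumption about the usual layout of a SeqRepo snapshot, and the function names are hypothetical.

```python
import os
from pathlib import Path

# Default location assumed by this README.
DEFAULT_SEQREPO_ROOT = "/usr/local/share/seqrepo/2024-12-20"


def resolve_seqrepo_root() -> Path:
    """Use SEQREPO_ROOT_DIR when set, otherwise the README default."""
    return Path(os.environ.get("SEQREPO_ROOT_DIR", DEFAULT_SEQREPO_ROOT))


def missing_seqrepo_paths(root: Path) -> list[str]:
    """List expected SeqRepo paths that do not exist under root."""
    # Assumed layout: an alias database plus a directory of sequence files.
    expected = [root, root / "aliases.sqlite3", root / "sequences"]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    problems = missing_seqrepo_paths(resolve_seqrepo_root())
    print("SeqRepo looks OK" if not problems else f"Missing: {problems}")
```

This only checks the host file system. It does not confirm that Docker Desktop has been granted access to the directory.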

To build, (re)create, and start containers

docker volume create --name=uta_vol
docker compose \
  -p variation-normalizer-manuscript \
  -f submodules/compose.yaml \
  -f compose.yaml \
  up

Tip

If you want a clean slate, run docker compose down -v to remove containers and volumes, then docker compose -p variation-normalizer-manuscript -f submodules/compose.yaml -f compose.yaml up to rebuild and start fresh containers.

Note

We use python-dotenv to load the environment variables needed by the analysis notebooks that run the Variation Normalizer. These environment variables are defined in .env.shared.
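
To see which variables .env.shared defines without changing your environment, a simplified reader like the one below may be useful. It is a stand-in written with the standard library, not python-dotenv's parser: it handles only plain KEY=VALUE lines, comments, an optional export prefix, and matching surrounding quotes. The function names are hypothetical.

```python
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip().removeprefix("export ").strip()
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key] = value
    return values


def read_env_file(path: str = ".env.shared") -> dict[str, str]:
    """Read and parse an env file from disk."""
    return parse_env_text(Path(path).read_text(encoding="utf-8"))
```

In the notebooks themselves, python-dotenv's `load_dotenv` handles loading, so this sketch is intended for inspection only.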

In Docker Desktop, you should see the following for a successful setup:

Docker Desktop Container

cool-seq-tool-data-1 exits after its download is complete. The other three containers (api-1, uta-1, and gene-dynamodb-local-1) should all be running.
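
If you are not using Docker Desktop, the same check can be approximated from the command line. The sketch below parses the output of docker compose ps --all --format json and reports services that are not in the expected state. The service names are assumptions inferred from the container names above, and the helper names are hypothetical. Because the JSON output may be a single array or one object per line depending on the Compose version, the parser accepts both.

```python
import json
import subprocess

# Assumed service names, inferred from the container names in this README.
LONG_RUNNING = {"api", "uta", "gene-dynamodb-local"}
ONE_SHOT = {"cool-seq-tool-data"}


def parse_compose_ps(output: str) -> list[dict]:
    """Parse `docker compose ps --format json` output (array or one object per line)."""
    output = output.strip()
    if not output:
        return []
    if output.startswith("["):
        return json.loads(output)
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def find_problems(records: list[dict]) -> list[str]:
    """Describe services that are not in the state this README expects."""
    by_service = {record.get("Service"): record for record in records}
    problems = []
    for service in sorted(LONG_RUNNING):
        state = by_service.get(service, {}).get("State", "missing")
        if state != "running":
            problems.append(f"{service}: expected running, found {state}")
    for service in sorted(ONE_SHOT):
        record = by_service.get(service, {})
        if record.get("State") != "exited" or record.get("ExitCode") != 0:
            found = record.get("State", "missing")
            problems.append(f"{service}: expected a clean exit, found {found}")
    return problems


def compose_ps() -> str:
    """Run docker compose ps from the repo root with the flags used for `up`."""
    cmd = [
        "docker", "compose",
        "-p", "variation-normalizer-manuscript",
        "-f", "submodules/compose.yaml",
        "-f", "compose.yaml",
        "ps", "--all", "--format", "json",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Calling `find_problems(parse_compose_ps(compose_ps()))` from the root directory returns an empty list when everything matches. While the data download is still in progress, the one-shot service will be reported as not yet exited.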

Running Notebooks

This section describes the notebooks and the order in which they should be run. The Table of Contents, in the notebooks that have one, links to the sections of that notebook. You must use VS Code for the Table of Contents links to work.

Important

You must have the Docker containers running.

  1. Run the following notebook:
  2. Run the following notebooks (order does not matter):
  3. Run the following notebooks (order does not matter):
  4. Run the following notebook:
  5. Run the following notebook:

Running Notebooks in Visual Studio Code (VS Code)

VS Code is a lightweight source code editor for Windows, Linux, and macOS.

  1. Download VS Code
  2. Open a notebook and click Select Kernel at the top right. Select the option where the path is .venv/3.13/bin/python or .venv/bin/python. See Manage Jupyter Kernels in VS Code for more information on managing Jupyter Kernels in VS Code.
  3. Run the notebooks

Analysis with macOS Environments

These notebooks were run using these macOS specs:

Model Year | CPU Architecture | Total RAM | Hard drive capacity
-----------|------------------|-----------|--------------------
2023       | M2 Pro           | 32 GB     | 1 TB
2023       | M3 Pro           | 36 GB     | 1 TB
2024       | M4 Pro           | 48 GB     | 1 TB

Help

If you have any questions or problems, please open an issue in the repo and our team will be happy to assist.
