Imputation of genomic data

System for imputation of genomic data.

About the Project

This project implements a sequence-to-sequence (seq2seq) model with an attention mechanism in PyTorch. The model is an encoder-decoder architecture built from GRU (Gated Recurrent Unit) cells, with attention to improve sequence processing. It is designed to learn patterns in input sequences and generate the corresponding output sequences, with a focus on genetic data imputation.

Key Features

  • Train the model to predict masked values from the visible markers;
  • Evaluate prediction accuracy on a test set;
  • Configuration through parameters;
  • VCF file processing.
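
To make the masking and accuracy ideas concrete, here is a minimal sketch using only the Python standard library. It is not the project's actual code: the function names, the `<mask>` token, and the choice to score only the hidden positions are assumptions made for illustration. It relies only on the standard VCF layout, where each data line has eight fixed columns, then FORMAT, then one column per sample.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a hidden genotype


def sample_sequence(vcf_lines, sample_index):
    """Collect one sample's GT value at every variant (one per data line)."""
    sequence = []
    for line in vcf_lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns 0-7 are fixed, column 8 is FORMAT, samples start at column 9.
        gt_position = fields[8].split(":").index("GT")
        sample_field = fields[9 + sample_index]
        sequence.append(sample_field.split(":")[gt_position])
    return sequence


def mask_markers(genotypes, mask_rate, rng):
    """Hide a random subset of markers; return the masked copy and hidden indices."""
    masked, hidden = [], []
    for i, genotype in enumerate(genotypes):
        if rng.random() < mask_rate:
            masked.append(MASK)
            hidden.append(i)
        else:
            masked.append(genotype)
    return masked, hidden


def masked_accuracy(predicted, truth, hidden):
    """Fraction of hidden positions where the prediction matches the truth."""
    if not hidden:
        return 0.0
    correct = sum(predicted[i] == truth[i] for i in hidden)
    return correct / len(hidden)
```

A typical flow under these assumptions would be: build a sequence with `sample_sequence`, hide part of it with `mask_markers(sequence, 0.3, random.Random(0))`, let the model fill in the hidden positions, and score the result with `masked_accuracy`.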

Technical Implementation

The model uses a seq2seq architecture with an attention mechanism. The encoder consists of an embedding layer that processes the inputs and a GRU that handles the sequences. The decoder uses attention to focus on the relevant parts of the input sequence, followed by a GRU that decodes the encoded sequences. Training runs for a configurable number of iterations, uses SGD optimization with an adjustable learning rate, and can optionally apply teacher forcing. Evaluation functions support both spot checks on random samples and a more comprehensive analysis through accuracy metrics.
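
The architecture described above can be sketched in PyTorch roughly as follows. This is an illustrative outline, not the project's implementation: the class names, the `train_step` helper, the `sos_token` and `teacher_forcing_ratio` parameters, and the choice of dot-product attention are all assumptions, since the exact attention variant and hyperparameters are not stated here.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, tokens):  # tokens: (batch, src_len)
        outputs, hidden = self.gru(self.embedding(tokens))
        return outputs, hidden  # (batch, src_len, H), (1, batch, H)


class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden, enc_outputs):  # token: (batch, 1)
        embedded = self.embedding(token)  # (batch, 1, H)
        # Dot-product attention (assumed variant): score every encoder
        # state against the current decoder hidden state.
        query = hidden[-1].unsqueeze(1)  # (batch, 1, H)
        scores = torch.bmm(query, enc_outputs.transpose(1, 2))
        weights = F.softmax(scores, dim=-1)  # (batch, 1, src_len)
        context = torch.bmm(weights, enc_outputs)  # (batch, 1, H)
        gru_input = torch.cat([embedded, context], dim=-1)
        output, hidden = self.gru(gru_input, hidden)
        return self.out(output.squeeze(1)), hidden, weights


def train_step(encoder, decoder, optimizer, src, tgt, sos_token,
               teacher_forcing_ratio=0.5):
    """One optimization step; returns the mean per-position loss."""
    optimizer.zero_grad()
    enc_outputs, hidden = encoder(src)
    token = torch.full((src.size(0), 1), sos_token, dtype=torch.long)
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    loss = 0.0
    for t in range(tgt.size(1)):
        logits, hidden, _ = decoder(token, hidden, enc_outputs)
        loss = loss + F.cross_entropy(logits, tgt[:, t])
        if use_teacher_forcing:
            token = tgt[:, t:t + 1]  # feed the ground-truth token
        else:
            token = logits.argmax(dim=-1, keepdim=True).detach()
    loss.backward()
    optimizer.step()
    return loss.item() / tgt.size(1)
```

In this sketch, a training loop would create one optimizer over both modules, for example `torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.01)`, and call `train_step` for the configured number of iterations. The attention weights returned by the decoder sum to 1 over the source positions, which makes them convenient to inspect when checking which markers influenced a prediction.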