Coronavirus Sequence Prediction with Transformer Models
Tri Songz
Leveraging natural language Transformer models (BERT, XLNet, etc.) to develop methods for identifying, classifying, and predicting DNA and protein sequences using the publicly available NCBI Virus database.
Project status: Concept
Intel Technologies
DevCloud
Overview / Usage
Developing a natural language model to identify and predict protein-coding DNA sequences of the coronavirus. Because DNA sequences are strings over the four-letter nucleotide alphabet (A, T, G, C), they can be tokenized and used to train Transformer-based language models for task-specific functions such as classification, recognition, and sequence prediction.
If validated, the method would work toward predicting key parts of the sequence that could be used to further research into developing vaccines and cures.
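As a rough illustration of the tokenization idea (a minimal sketch, not the project's code: the k-mer length, vocabulary construction, and helper names are assumptions), a nucleotide string can be split into overlapping k-mers and mapped to integer IDs before being fed to a BERT/XLNet-style encoder:

```python
from itertools import product

K = 6  # assumed k-mer length; 3-6 is a common choice for DNA language models

# Fixed vocabulary: special tokens plus all 4^K possible k-mers.
SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]
VOCAB = {tok: i for i, tok in
         enumerate(SPECIAL + ["".join(p) for p in product("ACGT", repeat=K)])}

def tokenize(seq, k=K):
    """Split a nucleotide string into overlapping k-mers (stride 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def encode(seq):
    """Map a sequence to token IDs, wrapped in [CLS]/[SEP]; unknown k-mers become [UNK]."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokenize(seq)]
    ids.append(VOCAB["[SEP]"])
    return ids

if __name__ == "__main__":
    fragment = "ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTC"  # example nucleotide string
    print(tokenize(fragment)[:3])  # ['ATGTTT', 'TGTTTG', 'GTTTGT']
    print(len(encode(fragment)))
```

The resulting ID lists can then be padded, batched, and passed to a Transformer encoder in PyTorch or TensorFlow for the downstream classification or prediction task.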
Current progress: analyzing a dataset compiled from NCBI to identify commonalities that can inform tokenization.
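The kind of exploratory analysis described above might look like the following sketch (an assumed workflow, not the project's actual code; the file name `sequences.fasta` and the helper functions are placeholders): for each k-mer, count how many NCBI records contain it, to surface subsequences shared across genomes.

```python
from collections import Counter

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def shared_kmers(path, k=6):
    """Count, for each k-mer, the number of records in which it appears."""
    presence = Counter()
    for _, seq in read_fasta(path):
        presence.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return presence

if __name__ == "__main__":
    # "sequences.fasta" stands in for a FASTA download from NCBI Virus.
    common = shared_kmers("sequences.fasta", k=6)
    for kmer, n_records in common.most_common(10):
        print(kmer, n_records)
```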
Technologies Used
- PyTorch/PyTorch Lightning
- TensorFlow
- TPUs