Leveraging Large Language Models for Predicting Microbial Virulence from Protein Structure and Sequence

F. Quintana, T. Treangen, and L. Kavraki, “Leveraging Large Language Models for Predicting Microbial Virulence from Protein Structure and Sequence,” in Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, New York, NY, USA, 2023.


In the aftermath of COVID-19, screening for pathogens has never been a more relevant problem. However, computational screening for pathogens is challenging due to a variety of factors, including (i) the complexity and role of the host, (ii) virulence factor divergence and dynamics, and (iii) population and community-level dynamics. Considering a potential pathogen’s molecular interactions, specifically individual proteins and protein interactions, can help pinpoint the potential of a given microbe’s proteins to cause disease. However, existing tools for pathogen screening rely on existing annotations (KEGG, GO, etc.), making the assessment of novel and unannotated proteins more challenging. Here, we present an LLM-inspired approach that considers protein sequence and structure to predict protein virulence. We present a two-stage model that combines evolutionary features captured by the DistilProtBert language model with protein structure in a graph convolutional network. Our model performs better than sequence alone at predicting virulence function when high-quality structures are present, thus representing a path forward for virulence prediction of novel and unannotated proteins.
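
The two-stage idea in the abstract (per-residue language-model features fed through a graph convolution over the protein structure) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the single-layer design, the mean pooling, the sigmoid output, and the toy weights are all assumptions made for illustration, and the residue embeddings are plain lists standing in for what a model such as DistilProtBert would produce.

```python
import math

def normalized_adjacency(n, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected residue contact graph."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(x, y):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]

def gcn_layer(a_hat, h, w):
    """One graph convolution step: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

def mean_pool(h):
    """Average node features into one protein-level vector."""
    return [sum(col) / len(h) for col in zip(*h)]

def predict_virulence(edges, residue_embeddings, w_hidden, w_out, bias=0.0):
    """Hypothetical pipeline: embeddings -> GCN layer -> pooling -> sigmoid score."""
    a_hat = normalized_adjacency(len(residue_embeddings), edges)
    hidden = gcn_layer(a_hat, residue_embeddings, w_hidden)
    pooled = mean_pool(hidden)
    logit = sum(p * w for p, w in zip(pooled, w_out)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy input: 3 residues in a chain, 2-dimensional stand-in embeddings.
toy_edges = [(0, 1), (1, 2)]
toy_embeddings = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8]]
toy_w_hidden = [[0.6, -0.2], [0.1, 0.9]]   # illustrative, untrained weights
toy_w_out = [1.0, -0.5]
score = predict_virulence(toy_edges, toy_embeddings, toy_w_hidden, toy_w_out)
```

In practice the weights would be learned from labeled virulence data and the contact graph derived from an experimental or predicted structure; the sketch only shows how sequence-derived features and structure can meet in one forward pass.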

Publisher: http://dx.doi.org/10.1145/3584371.3612953

PDF preprint: http://kavrakilab.org/publications/quintana2023llm.pdf