7RGR

Lysozyme 056 from Deep neural language modeling


Experimental Data Snapshot

  • Method: X-RAY DIFFRACTION
  • Resolution: 2.48 Å
  • R-Value Free: 0.296 
  • R-Value Work: 0.259 
  • R-Value Observed: 0.261 



This is version 1.2 of the entry.


Literature

Large language models generate functional protein sequences across diverse families.

Madani, A., Krause, B., Greene, E.R., Subramanian, S., Mohr, B.P., Holton, J.M., Olmos Jr., J.L., Xiong, C., Sun, Z.Z., Socher, R., Fraser, J.S., Naik, N.

(2023) Nat Biotechnol 

  • DOI: https://doi.org/10.1038/s41587-022-01618-2
  • Primary Citation of Related Structures:  
    7RGR

  • PubMed Abstract: 

    Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.


  • Organizational Affiliation

    Salesforce Research, Palo Alto, CA, USA. ali@madani.ai.


Macromolecules
Entity ID: 1
  • Molecule: Artificial protein L056
  • Chains: A [auth B], B [auth A]
  • Sequence Length: 168
  • Organism: synthetic construct
  • Details: Mutation(s): 0; EC: 3.2.1.17
Small Molecules
Ligands: 2 unique

  • NHE: 2-[N-CYCLOHEXYLAMINO]ETHANE SULFONIC ACID
    Chains: C [auth B]
    Formula: C8 H17 N O3 S
    InChI Key: MKWKNSIESPFAQN-UHFFFAOYSA-N
  • CL: CHLORIDE ION
    Chains: D [auth A]
    Formula: Cl
    InChI Key: VEXZGXHMUGYJMC-UHFFFAOYSA-M
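
The ligand formulas listed above can be turned into approximate formula masses. The following is a minimal sketch, not part of the entry: the `formula_mass` helper and the `ATOMIC_WEIGHT` table are illustrative names, the atomic weights are abridged standard values, and the parser assumes the space-separated formula style used in this record.

```python
# Abridged standard atomic weights in g/mol (assumed values, covering only
# the elements that occur in the two ligands of this entry).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007,
                 "O": 15.999, "S": 32.06, "Cl": 35.45}


def formula_mass(formula):
    """Sum atomic weights for a space-separated formula such as 'C8 H17 N O3 S'."""
    total = 0.0
    for token in formula.split():
        symbol = token.rstrip("0123456789")      # element symbol, e.g. 'C' or 'Cl'
        count = int(token[len(symbol):] or 1)    # a missing count means one atom
        total += ATOMIC_WEIGHT[symbol] * count
    return total


nhe_mass = formula_mass("C8 H17 N O3 S")  # NHE
cl_mass = formula_mass("Cl")              # CL (charge ignored)
print(f"NHE: {nhe_mass:.2f} g/mol, CL: {cl_mass:.2f} g/mol")
```

The sketch ignores ionic charge, so the chloride value is simply the atomic weight of chlorine.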
Experimental Data & Validation

Experimental Data

Unit Cell:
  • a = 61.15 Å, α = 90°
  • b = 68.1 Å, β = 90°
  • c = 95.41 Å, γ = 90°
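
As a worked example, the unit-cell volume can be computed from the lengths and angles listed above. The sketch below uses the general triclinic volume formula, which reduces to a × b × c when all three angles are 90°, as they are here. The function name `unit_cell_volume` is illustrative and not part of the entry.

```python
import math


def unit_cell_volume(a, b, c, alpha_deg, beta_deg, gamma_deg):
    """General (triclinic) unit-cell volume; lengths in Å give a result in Å^3."""
    ca, cb, cg = (math.cos(math.radians(x))
                  for x in (alpha_deg, beta_deg, gamma_deg))
    # V = abc * sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
    #                + 2 cos(alpha) cos(beta) cos(gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


# Cell parameters as reported for this entry.
volume = unit_cell_volume(61.15, 68.1, 95.41, 90, 90, 90)
print(f"Unit-cell volume: {volume:.0f} Å^3")
```

For this cell the result is roughly 3.97 × 10^5 Å^3.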
Software Package:
  • PHENIX: refinement
  • BUCCANEER: model building
  • XDS: data reduction
  • XSCALE: data scaling
  • PHASER: phasing


Entry History & Funding Information

Deposition Data


  • Funding Organization: National Institutes of Health/National Institute of General Medical Sciences (NIH/NIGMS)
    Location: United States
    Grant Number: GM123159
  • Funding Organization: National Institutes of Health/National Institute of General Medical Sciences (NIH/NIGMS)
    Location: United States
    Grant Number: GM124149
  • Funding Organization: National Institutes of Health/National Institute of General Medical Sciences (NIH/NIGMS)
    Location: United States
    Grant Number: GM124169

Revision History

  • Version 1.0: 2021-07-28
    Type: Initial release
  • Version 1.1: 2023-02-08
    Changes: Database references
  • Version 1.2: 2024-04-03
    Changes: Data collection, Refinement description