Input File

Note

This documentation is automatically generated from the YAML template file.

This file defines a set of fields which are then read and used by the workflow. Every field is explained and example values are provided.

Project metadata

The following metadata has no effect on the workflow itself, but it will be written to the output metadata file. These values will be uploaded to the database and then exposed in the project overview. They may be also useful to search this project using the browser.

`name`

Write a brief description or title for this trajectory for the overview page. This name may be used by the client to search the trajectory in the database. The name is displayed in the overview page.

Default: My project

`description`

Write additional comments about the project. The description is displayed in the overview page.

description: My project description

`authors`

Write author names. Authors are displayed in the overview page.

Warning

Check already existing values for this field to avoid making duplicates.

authors:
  - My name
  - My partner's name

`groups`

Write author group name/s. The group is displayed in the overview page.

Warning

Check already existing values for this field to avoid making duplicates.

groups:
  - My group
  - My partner's group

`contact`

How to contact the authors (e.g. a mail). The contact is displayed in the overview page.

`program`

Program (software) name which carried the trajectory and its version. Program and version are both displayed in the overview page.

Warning

Check already existing values for this field to avoid making duplicates.

program: GROMACS

`version`

version: 2024.2

`type`

Type of molecular dynamics. At this moment there are only two options in this field: trajectory and ensemble.

Note

This field has an effect on the client: some time-dependent analysis will change the labels of their axes in order to make sense. For instance RMSD X axis will be labeled as frames instead of time.

type: trajectory

`method`

MD method is displayed in the overview. e.g. Classical MD, Targeted MD, Biased MD (Accelerated Weighted Ensemble), Enhanced sampling (Hamiltonian Replica Exchange) …

method: Classical MD

`license`

License and link to the license web page. The license is displayed in the overview page. Under the license there is a More information button. The link is used to redirect the user when this button is clicked.

license: This trajectory dataset is released under a Creative Commons Attribution 4.0 International Public License

`linkcense`

linkcense: https://creativecommons.org/licenses/by/4.0/

`citation`

Citation for refering this simulation or related paper. The citation is displayed in the overview page. To set a citation use the following instructions: to add a line break type ‘(br)’ inside the citation string, to add superior text type ‘^’ before each character.

`thanks`

Acknowledgements to be shown in the overview page.

thanks: I would like to thank my funders...

`accession`

OPTIONAL: Use this field to force a custom accession. Accession is a short code to refer this project in your local database. If no accession is forced then a default formatted accession will be generated when loading the project.

References

References to other databases to enrich our data.

`links`

Links to somewhere related to the simulation. These links are displayed in the overview page. MolSSI uses this field to find simulations in our database and place the embed viewer in their website. You must fit to the standard when adding a new MolSSI simulation.

Note

This field has no effect in our workflow BUT others may rely on it.

links:
  - name: First data source
  url: https://data.source.org/
  - name: Second data source
  url: https://mydata.com/

`pdb_ids`

Set the source pdb ids of the trajectory structure Additional data from the pdb is harvested by the loader while uploading to the database This data is displayed in the overview page

pdb_ids:
  - 6ACS
  - 6M0J

`forced_references`

Set which reference sequences must be used in order to map residues in the structure of the simulation. UniProt accession ids are accepted. Forced references may be not provided or just cover the structure partially. Then a blast will be run for each orphan chain sequence. In addition, UniProt accession ids may be guessed from the PDB ids, when provided.

Forced references may be provided as a list. In this scenario UniProt sequences are aligned to chain sequences to guess which UniProt belongs to each chain. Forced references may be provided as a dictionary. Then the user specifies which reference belongs to each chain. Use the “noref” flag to mark a chain as “no referable” (e.g. antibodies, synthetic constructs).

forced_references:
  - Q9BYF1
  - P0DTC2
forced_references:
  A: Q9BYF1
  B: P0DTC2
  C: noref

`ligands`

Set ligands in the simulation. The workflow identifies ligands by their pubchem accession. If a pubchem accession is passed then it is used. Otherwise, the pubchem accession is found using other database accessions. Each ligand must have at least one of the following attributes:

pubchem: the PubChem accession
drugbank: the DrugBank accession
chembl: the ChEMBL accession

Optionally, a list of vmd selections may be provided to force the mapping

vmd_selection: a list of vmd selections (chain D)

Ligands are mapped in the standard topology file. In addition, an RMSD analysis is run for every defined ligand.

ligands:
  - pubchem: 1986
  - drugbank: DB00945

Simulation metadata

Simulation parameters.

Tip

Someday this will be automatically mined.

`framestep`

Time framestep in nanoseconds (ns). Time between each snapshot/frame. May be None if this is not a trajectory, but an ensemble. Framestep is an important value since it is used in many graph axes in the web client.

framestep: 0.01 # ns

`timestep`

The rest of values are displayed in the web client as trajectory metadata. These values do not affect other outcomes in the workflow. Simulation timestep in femtoseconds (fs). Time step used for the numerical integration of the simulation.

timestep: 2 # fs

`temp`

Temperature in Kelvin (K).

temp: 310 # K

`ensemble`

Ensemble e.g. NVT, NPT, etc.

Warning

Check already existing values for this field to avoid making duplicates.

ensemble: NPT

`ff`

Force fields.

Warning

Check already existing values for this field to avoid making duplicates.

ff:
  - Amber ff14SB
  - GLYCAM-06j

`wat`

Water force fields.

Warning

Check already existing values for this field to avoid making duplicates.

wat: TIP3P

`boxtype`

Boxtype e.g. Triclinic, Cubic, Dodecahedron.

Warning

Check already existing values for this field to avoid making duplicates.

Analysis parameters

These fields have an impact in the analysis workflow.

`interactions`

Set which are the interesting interactions to be analyzed. A bunch of interaction-specific analyses will be run for each interaction and displayed in the web client.

Interactions are defined by the ‘agents’ which are meant to interact pairwise. An ‘agent’ may be anything, even a group of unrelated molecules. Atoms of different agents which are close enought will be considered as interface atoms. These atoms will be the ones considered in interface analyses. If no interface atoms are found then the interaction is considered not valid and the user is warned.

Interactions are uploaded to the database as part of the project metadata and as an independent analysis. Project metadata includes the interaction name, agents name and agent atom selections (VMD syntax). Analysis data includes also every agent atom indices (both the whole agent and the interface only).

Each interaction has the following attributes:

name: a string tag used to relate interaction analyses data with their corresponding atoms. In addition, the name is used to label the corresponding analyses in the web client.
agent_1: the name of the first agent in the interaction, which is used to label in the client.
selection_1: the VMD selection of the first agent in the interaction.
agent_2: the name of the second agent in the interaction, which is used to label in the client.
selection_2: the VMD selection of the second agent in the interaction.
distance_cutoff (optional): the distance used to determine which atoms are in the interface (in Å).

The default value is intended for atomistic simulations. Thus coarse grain interactions may need manual input distance cutoff.

interactions:
  - name: protein-ligand interaction
  agent_1: protein
  selection_1: not resname lig
  agent_2: ligand
  selection_2: resname lig
  - name: domain-domain interaction
  agent_1: domain 1
  selection_1: resid 2 to 291
  agent_2: domain 2
  selection_2: resid 2 to 291
  - name: dna-dna hybridization
  agent_1: strain A
  selection_1: chain A
  agent_2: strain B
  selection_2: chain B
  distance_cutoff: 10

`pbc_selection`

Set those residues which are under periodic boundary conditions (PBC). These residues are excluded from the imaging centering and fitting. These residues are excluded in the follwoing analyses:

RMSD: Sudden jumps in PBC residues result in non-sense high peaks
RMSD per residue: Sudden jumps in PBC residues result in non-sense high peaks
RMSD pairwise: Sudden jumps in PBC residues result in non-sense high peaks
TM score: Sudden jumps in PBC residues result in non-sense high peaks
RGYR: Sudden jumps in PBC residues result in non-sense high changes
RMSF: Sudden jumps in PBC residues result in non-sense high peaks
PCA: Sudden jumps make not sense in PCA and they eclipse non-PBC movements
SASA: Residues close to the boundary will be considered exposed to solvent while they may be not
Pockets: Residues close to the boundary may be considered to have pockets while they have not.

Tip

This isn’t really possible because fpocket doesn’t allow you to intelligently “discard” atoms. If you remove atoms so it doesn’t find pockets in them then pockets can appear in the sites occupied by those atoms. For now, we discard the entire analysis when there’s something in PBC and that’s it.

Clusters: Since Clustering is RMSD-based it has the same limitations

If this field is set to ‘auto’ then PBC residues are set automatically. Solvent, counter ions and membrane lipids are selected in this cases. Note that these are the most tipical residues under periodic boundary conditions.

This field is also useful for those scenarions with several protein or nucleic acid molecules floating around. In this situation you can not image and fit all molecules. You must focus in one molecule and let the others stay in periodic boundary conditions.

These residues are defined using VMD selection language.

pbc_selection: water or ions

Default: auto

`cg_selection`

Set those atoms which are not actual “atoms” but coarse grained (CG) beads.

Warning

EXPERIMENTAL INPUT

`dummy_selection`

Set those atoms which are not real atoms but dummy atoms thus representing other entities. These dummy atoms are used in some methodologies such as alchemical bonds

Warning

EXPERIMENTAL INPUT

Default: auto

Representation parameters

These fields have an impact in the display of the simulation once in the web client

`chainnames`

Set optional custom chain names which may be longer than a single letter. This names are used to label chains in the web client.

chainnames:
  A: Protein
  B: Ligand

`customs`

The web client sets some default representations (molecular viewer configurations). They highlight important features in the structure according to the topology reference or interactions. In addition, you may set extra customized representations which are interesting for you. These representations will be available in the web client.

Warning

Make sure whatever you want to represent is not already represented by default or it will be duplicated.

customs:
  - name: A custom view focusing on an interesting residue
  representations:
  - name: The very interesing residue
  selection: VIR
  type: ball+stick
  color: element
  - name: The resting boring molecule
  selection: not VIR
  type: cartoon
  color: chainid

`orientation`

Set a specific starting orientation for the web client viewer. Normally this is done once the simulation has been uploaded since there is no easy way to get the orientation before.

[
72.05997406618104,
21.871748915422142,
47.89720038949639,
0,
34.3234627961572,
42.053333152877315,
-70.84188126104011,
0,
-39.93012781662099,
75.61943426331311,
25.542927052994127,
0,
-63.015499114990234,
-33.07249975204468,
-39.439000606536865,
1
]

Others

Other metadata

`multimeric`

Set if we have any multimeric form: monomer, dimer, trimer… This field was requested by the referees. Its only use for now is as a parameter in project queries.

Tip

This is temporary, it would be best to automate it.

multimeric:
  - monomer
  - trimer

Collections

Set also additional collection related metadata values.

`collections`

Set to which collection does this simulation belong to. This allows to freely group simulations together The collection may also provide additional reference data further For instance a description of the whole dataset If you want to include this simulation in an already exsiting collection then use its id If you want to create a new “collection id” please use short ids with no weird signs e.g. cv19, model, abc, …

`cv19_unit`

BioExcel-CV19 specific metadata fields. Set which family does this trajectory belong to. Supported units:

RBD-ACE2
RBD
ACE2
Spike
3CLpro
PLpro
Polymerase
E protein
Exoribonuclease
Other

`cv19_startconf`

Set some additional inputs requested by the referees.

Starting conformation of the spike (options: close, 1 open, 2 open, 3 open)
Are there antibodies? (e.g. true)
Are there nanobodies? (e.g. false)

`cv19_abs`

`cv19_nanobs`

Input files

Directories for every MD in the project. Input file paths for topology, trajectory and structure. Note that all these values may be specified thorugh command line as well.

Note

These files have been passed to the workflow through command line traditionally.

Note

Now the inputs file also provides this option.

`input_topology_filepath`

Each projects contain one and only one topology Exceptionally a structure may be used instead of a topology if the topology is missing To do so, set ‘input_topology_filepath: no’ and pass the input_structure_filepath argument

`input_structure_filepath`

The input structure file is optional (except when the topology is missing)

`input_trajectory_filepaths`

You can set input trajectory filepath generic for the whole project here. However it makes more sense setting this value separatedly for each MD.

`mds`

Each project may contain several Molecular Dynamics (MD). Each MD is to be stored in an independent folder when running the workflow. Each MD must have a different name, which will be used also to assign directories in the workflow.

MDs may include additional metadata to overwrite the project metadata for a specific case.

mds:
  - name: replica 310 K
  mdir: replica_1
  input_trajectory_filepaths: my_raw_trajectory_1.nc
  temp: 310
  - name: replica 311 K
  mdir: replica_2
  input_trajectory_filepaths: my_raw_trajectory_2.nc
  temp: 311

`mdref`

Also the reference MD is to be defined by providing the index of the MDs list. If there is not an MD which is more important than others then simply set the first MD (0) as the reference.

mdref: 0

`dataset_path`

Path to the dataset storage file for the Dataset object.

Input File

Project metadata

name

description

authors

groups

contact

program

version

type

method

license

linkcense

citation

thanks

accession

References

links

pdb_ids

forced_references

ligands

Simulation metadata

framestep

timestep

temp

ensemble

ff

wat

boxtype

Analysis parameters

interactions

pbc_selection

cg_selection

dummy_selection

Representation parameters

chainnames

customs

orientation

Others

multimeric

Collections

collections

cv19_unit

cv19_startconf

cv19_abs

cv19_nanobs

Input files

input_topology_filepath

input_structure_filepath

input_trajectory_filepaths

mds

mdref

dataset_path

`name`

`description`

`authors`

`groups`

`contact`

`program`

`version`

`type`

`method`

`license`

`linkcense`

`citation`

`thanks`

`accession`

`links`

`pdb_ids`

`forced_references`

`ligands`

`framestep`

`timestep`

`temp`

`ensemble`

`ff`

`wat`

`boxtype`

`interactions`

`pbc_selection`

`cg_selection`

`dummy_selection`

`chainnames`

`customs`

`orientation`

`multimeric`

`collections`

`cv19_unit`

`cv19_startconf`

`cv19_abs`

`cv19_nanobs`

`input_topology_filepath`

`input_structure_filepath`

`input_trajectory_filepaths`

`mds`

`mdref`

`dataset_path`