Dataset
A Dataset is a collection of Projects that contain molecular dynamics simulations or related data and that share some metadata and characteristics because of how they were generated. In the context of the MDDB Workflow, a Project is a set of simulations (replicas) with one or more trajectory files and a common topology file. To complete the definitions, each individual simulation or replica is referred to as an MD.
The main functionality of this class is keeping track of the state of many Projects: whether they are still running, whether they are done, or whether they failed and what caused the error. The only adjustment needed is to provide the path where the main SQLite storage file will be kept. We can do this with the dataset_path flag during the workflow execution:
mwf run ... --dataset_path path/to/our_dataset.db
Or, if we do not want to write the flag every time, by using the dataset_path field in the inputs.yaml config file:
- dataset_path: path/to/our_dataset.db
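To make the idea of a state-tracking storage file more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `projects`, its columns, and the helper `set_state` are assumptions made purely for illustration; they are not the schema that the MDDB Workflow actually uses.

```python
import sqlite3

# Hypothetical schema: one row per project, keyed by its relative path.
conn = sqlite3.connect(":memory:")  # a real dataset would use a file path
conn.execute(
    "CREATE TABLE projects ("
    " rel_path TEXT PRIMARY KEY,"
    " state TEXT NOT NULL,"
    " message TEXT NOT NULL)"
)

def set_state(conn, rel_path, state, message):
    """Insert the project row, or overwrite it if the path is already tracked."""
    conn.execute(
        "INSERT OR REPLACE INTO projects (rel_path, state, message) VALUES (?, ?, ?)",
        (rel_path, state, message),
    )
    conn.commit()

set_state(conn, "project_1", "running", "Workflow started")
set_state(conn, "project_1", "error", "InputError: missing MD directories")
set_state(conn, "project_2", "done", "Done!")

rows = conn.execute(
    "SELECT rel_path, state FROM projects ORDER BY rel_path"
).fetchall()
print(rows)
```

Keeping one row per project means the latest state always replaces the previous one, which matches the "still running / done / failed" view described above.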
Creating a new Dataset
However, modifying the inputs file for every project of the dataset can be very cumbersome, as Datasets can be formed by hundreds or thousands of projects. For this we can make use of another feature of this class: automatic inputs file generation.
Directory Structure
We start from a root folder. Everyone organizes theirs in their own way, but it normally follows a hierarchical structure containing all the projects, which may look something like this:
new_dataset/
├── project_1/
├── project_2/
├── project_3/
├── project_4/
├── ...
├──── special_cases/
├────── case_1/
├────── case_2/
├────── ...
├──── wrong_cases/
├────── case_1/
├────── case_2/
├────── ...
├── scripts/
├── project_logs/
└── ...
Note that we do not specify anything about MDs yet; we will take care of that later.
[1]:
import os
# Create directory structure
dataset_dir = "new_dataset"
dirs = [
dataset_dir+"/project_1",
dataset_dir+"/project_2",
dataset_dir+"/project_3",
dataset_dir+"/project_4",
dataset_dir+"/special_cases/case_1",
dataset_dir+"/special_cases/not_this_one",
dataset_dir+"/to_remove/case_1",
dataset_dir+"/to_remove/case_2",
dataset_dir+"/scripts",
dataset_dir+"/project_logs",
]
for dir_path in dirs:
    os.makedirs(dir_path, exist_ok=True)
[2]:
%load_ext autoreload
%autoreload 2
from mddb_workflow.core.dataset import Dataset
# Create test directory structure
dataset_dir = "new_dataset"
# Initialize the Dataset
db_path = dataset_dir+"/new_dataset.db"
# Remove database in case the notebook is re-run
if os.path.exists(db_path):
    os.remove(db_path)
# Create dataset and scan for projects and MDs
ds = Dataset(dataset_path=db_path)
Adding entries
Adding entries to the dataset is the first step: it selects which projects we are going to keep track of.
For this, we specify the root folders and the ones to ignore (those not containing projects, e.g., scripts, logs, etc.). We can do this by passing absolute paths, relative paths, or glob patterns. For example:
[3]:
# CLI: mwf dataset add new_dataset.db -p project_* special_cases/case_1 to_remove/* --ignore-dirs */logs
ds.add_entries([dataset_dir+'/project_*',
dataset_dir+'/special_cases/case_1',
dataset_dir+'/to_remove/*'],
ignore_dirs=[dataset_dir+'/*logs'],
verbose=True)
Ignoring project: project_logs
Adding project: project_2 (UUID: 4f835985-9ffd-40c6-90f6-d4510ac01785)
Adding project: project_1 (UUID: bae0e3d6-7f2d-4d17-98a0-672a5763eff6)
Adding project: project_4 (UUID: 90e60590-b7e1-41bf-a2c2-8c159c12a35d)
Adding project: project_3 (UUID: 2faff7cd-0335-44f6-b196-55a74a84347e)
Adding project: special_cases/case_1 (UUID: 352ab0d3-c20e-4c40-9e2f-c870514670a8)
Adding project: to_remove/case_1 (UUID: 8d39cd59-d19e-4ace-b273-7d30c3a58468)
Adding project: to_remove/case_2 (UUID: 64f86660-a622-4d26-9e5c-91dcd017a25c)
Some useful glob patterns:
*: matches all the folders.
**/*: matches all subfolders.
**/[0-9]*: matches subfolders starting with a digit.
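To get a feel for these patterns before adding entries, they can be tried out with Python's standard glob module. This is only a sketch: the helper `match_dirs` is hypothetical, and the workflow's own matching rules may differ in details from those of glob.

```python
import glob
import os
import tempfile

def match_dirs(root, pattern):
    """Return the directories under root that match a glob pattern, relative to root."""
    hits = glob.glob(os.path.join(root, pattern), recursive=True)
    return sorted(
        os.path.relpath(hit, root).replace(os.sep, "/")
        for hit in hits
        if os.path.isdir(hit)
    )

with tempfile.TemporaryDirectory() as root:
    # Small throwaway tree, loosely modelled on the structure above
    for d in ["project_1", "project_2", "special_cases/case_1",
              "special_cases/1_digit", "scripts"]:
        os.makedirs(os.path.join(root, d))

    top_level = match_dirs(root, "*")            # direct children only
    projects = match_dirs(root, "project_*")     # children with a given prefix
    digit_dirs = match_dirs(root, "**/[0-9]*")   # any depth, name starts with a digit

print(top_level)
print(projects)
print(digit_dirs)
```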
Adding already run projects
If the MDDB Workflow has already been run for some of the projects, we can also add them to the dataset. For this, we just have to specify the root folder and the workflow will automatically scan for all the projects and MDs.
[4]:
# Remove database as a way to reset the dataset
if os.path.exists(db_path):
    os.remove(db_path)
# Re-instantiate the dataset and scan for projects and MDs
ds = Dataset(dataset_path=db_path)
ds.scan(dataset_dir, verbose=True)
Adding project: project_2 (UUID: 4f835985-9ffd-40c6-90f6-d4510ac01785)
Adding project: project_1 (UUID: bae0e3d6-7f2d-4d17-98a0-672a5763eff6)
Adding project: to_remove/case_1 (UUID: 8d39cd59-d19e-4ace-b273-7d30c3a58468)
Adding project: to_remove/case_2 (UUID: 64f86660-a622-4d26-9e5c-91dcd017a25c)
Adding project: project_4 (UUID: 90e60590-b7e1-41bf-a2c2-8c159c12a35d)
Adding project: project_3 (UUID: 2faff7cd-0335-44f6-b196-55a74a84347e)
Adding project: special_cases/case_1 (UUID: 352ab0d3-c20e-4c40-9e2f-c870514670a8)
This is also useful to correctly track projects after moving them to a new location, as the dataset will be able to find them and update their paths in the database.
[5]:
!mv {dataset_dir}/project_4 {dataset_dir}/project_4_renamed
[6]:
ds.scan(dataset_dir, verbose=True)
Updating project path: project_4_renamed (UUID: 90e60590-b7e1-41bf-a2c2-8c159c12a35d) from project_4 to project_4_renamed
Project already exists: project_2 (UUID: 4f835985-9ffd-40c6-90f6-d4510ac01785)
Project already exists: project_1 (UUID: bae0e3d6-7f2d-4d17-98a0-672a5763eff6)
Project already exists: to_remove/case_1 (UUID: 8d39cd59-d19e-4ace-b273-7d30c3a58468)
Project already exists: to_remove/case_2 (UUID: 64f86660-a622-4d26-9e5c-91dcd017a25c)
Project already exists: project_3 (UUID: 2faff7cd-0335-44f6-b196-55a74a84347e)
Project already exists: special_cases/case_1 (UUID: 352ab0d3-c20e-4c40-9e2f-c870514670a8)
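How the workflow links a folder to its UUID is not described here. The outputs above show the same UUIDs after the database was deleted and rebuilt, which suggests that the identifier travels with the project folder. One plausible mechanism, assumed purely for illustration, is a small marker file inside each project directory. The file name `.project_uuid` and the `scan` helper below are hypothetical.

```python
import os
import tempfile
import uuid

MARKER = ".project_uuid"  # hypothetical marker file name

def scan(root, known):
    """Walk root, register new projects and update the paths of moved ones.

    `known` maps UUID -> relative path and is updated in place.
    Returns a list of (action, uuid, rel_path) tuples.
    """
    events = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if MARKER not in filenames:
            continue
        with open(os.path.join(dirpath, MARKER)) as f:
            project_id = f.read().strip()
        rel_path = os.path.relpath(dirpath, root).replace(os.sep, "/")
        if project_id not in known:
            events.append(("added", project_id, rel_path))
        elif known[project_id] != rel_path:
            events.append(("moved", project_id, rel_path))
        else:
            events.append(("unchanged", project_id, rel_path))
        known[project_id] = rel_path
    return events

with tempfile.TemporaryDirectory() as root:
    project_dir = os.path.join(root, "project_4")
    os.makedirs(project_dir)
    project_id = str(uuid.uuid4())
    with open(os.path.join(project_dir, MARKER), "w") as f:
        f.write(project_id)

    known = {}
    first = scan(root, known)   # project is registered
    os.rename(project_dir, os.path.join(root, "project_4_renamed"))
    second = scan(root, known)  # same UUID, new path

print(first)
print(second)
```

Because the identity lives in the folder rather than in its path, renaming or moving a project only changes the stored path and keeps its history.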
Removing entries
If we later find that a project should be deleted, or if the glob pattern added folders we did not want, we can remove the matching entries in a similar way to how we added them:
[7]:
# CLI: mwf dataset remove new_dataset.db to_remove/*
ds.remove_entry('to_remove/*')
Deleted project with UUID '8d39cd59-d19e-4ace-b273-7d30c3a58468'
Deleted project with UUID '64f86660-a622-4d26-9e5c-91dcd017a25c'
Showing the dataset
Once we have initialized the entries, we can show the dataset to check if it is correct.
Dataset tables
[8]:
ds.dataframe
[8]:
| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| bae0e3d6 | | projects | project_1 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 4f835985 | | projects | project_2 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 2faff7cd | | projects | project_3 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 90e60590 | | projects | project_4_renamed | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 352ab0d3 | | projects | special_cases/case_1 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
[9]:
ds.get_dataframe(query_path='project_*')
[9]:
| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| bae0e3d6-7f2d-4d17-98a0-672a5763eff6 | | projects | project_1 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 4f835985-9ffd-40c6-90f6-d4510ac01785 | | projects | project_2 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 2faff7cd-0335-44f6-b196-55a74a84347e | | projects | project_3 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 90e60590-b7e1-41bf-a2c2-8c159c12a35d | | projects | project_4_renamed | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
[10]:
# CLI:
!mwf dataset show {db_path}
MDDB Dataset (5 rows)
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ uuid ┃ projec… ┃ scope ┃ rel_pa… ┃ num_mds ┃ state ┃ message ┃ last_m… ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ bae0e3… │ │ projec… │ ../pro… │ 0 │ new │ No │ 13:23:… │
│ │ │ │ │ │ │ informa… │ 13-02-… │
│ │ │ │ │ │ │ recorded │ │
│ │ │ │ │ │ │ yet. │ │
│ 4f8359… │ │ projec… │ ../pro… │ 0 │ new │ No │ 13:23:… │
│ │ │ │ │ │ │ informa… │ 13-02-… │
│ │ │ │ │ │ │ recorded │ │
│ │ │ │ │ │ │ yet. │ │
│ 2faff7… │ │ projec… │ ../pro… │ 0 │ new │ No │ 13:23:… │
│ │ │ │ │ │ │ informa… │ 13-02-… │
│ │ │ │ │ │ │ recorded │ │
│ │ │ │ │ │ │ yet. │ │
│ 90e605… │ │ projec… │ ../pro… │ 0 │ new │ No │ 13:23:… │
│ │ │ │ │ │ │ informa… │ 13-02-… │
│ │ │ │ │ │ │ recorded │ │
│ │ │ │ │ │ │ yet. │ │
│ 352ab0… │ │ projec… │ ../spe… │ 0 │ new │ No │ 13:23:… │
│ │ │ │ │ │ │ informa… │ 13-02-… │
│ │ │ │ │ │ │ recorded │ │
│ │ │ │ │ │ │ yet. │ │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────┴──────────┴─────────┘
[11]:
# For a more specific subset of the dataset, use the different flags:
!mwf dataset show -h
usage: mwf dataset show [-h] [-p [QUERY_PATH ...]] [-st [QUERY_STATE ...]]
[-sc QUERY_SCOPE] [-ms QUERY_MESSAGE] [-s SORT_BY]
[-n N_ROWS] [-l] [-m]
[dataset_path]
positional arguments:
dataset_path
Path to the dataset storage file, normally an .db file. If not provided,
the first *.db file found in the current directory will be used.
options:
-h, --help
show this help message and exit
-p, --query_path [QUERY_PATH ...]
If provided, filters rows whose 'rel_path' matches these glob patterns.
Default: ['*']
-st, --query_state [QUERY_STATE ...]
If provided, filters rows whose 'state' matches this value/list of
values.
-sc, --query_scope QUERY_SCOPE
If provided, filters rows whose 'scope' matches this value
('project'/'p' or 'md'/'m').
-ms, --query_message QUERY_MESSAGE
If provided, filters rows whose 'message' matches these glob patterns
(e.g., 'URLError*').
-s, --sort_by SORT_BY
Column name to sort the dataset by.
Default: last_modified
-n, --n_rows N_ROWS
Number of rows to display. 0 for all rows.
Default: 50
-l, --include_logs
If True, adds 'log_file' and 'err_file' columns with HTML links to the
latest log files.
-m, --summary
Get a summary of the state of the projects.
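The filters listed in the help text can be pictured as successive conditions applied to each row. The following sketch reproduces that idea with plain dictionaries and the standard fnmatch module; the function `filter_rows` and the sample rows are hypothetical and only mimic the option names. Note that fnmatch lets `*` match across `/`, which may or may not be how the workflow treats paths.

```python
from fnmatch import fnmatch

# Sample rows with the same column names as the dataset table
rows = [
    {"rel_path": "project_1", "state": "error", "message": "InputError: no MDs"},
    {"rel_path": "project_2", "state": "done", "message": "Done!"},
    {"rel_path": "special_cases/case_1", "state": "new",
     "message": "No information recorded yet."},
]

def filter_rows(rows, query_path=("*",), query_state=None, query_message=None):
    """Keep rows matching any path pattern, any listed state and the message pattern."""
    selected = []
    for row in rows:
        if not any(fnmatch(row["rel_path"], pattern) for pattern in query_path):
            continue
        if query_state and row["state"] not in query_state:
            continue
        if query_message and not fnmatch(row["message"], query_message):
            continue
        selected.append(row)
    return selected

by_path = [r["rel_path"] for r in filter_rows(rows, query_path=["project_*"])]
by_state = [r["rel_path"] for r in filter_rows(rows, query_state=["new", "error"])]
by_message = [r["rel_path"] for r in filter_rows(rows, query_message="InputError*")]

print(by_path)
print(by_state)
print(by_message)
```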
Dataset summary
[12]:
ds.summary()
[12]:
| | state | count |
|---|---|---|
| 0 | new | 5 |
[13]:
# CLI:
!mwf dataset show {db_path} -m
==================================================================
Summary of project states:
==================================================================
state count
0 new 5
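Conceptually, the summary is a count of rows per state. A minimal sketch of that aggregation with the standard library is shown below; the list of states is made up for the example and this is not necessarily how `summary()` is implemented.

```python
from collections import Counter

# Made-up states, one per project
states = ["new", "done", "error", "done", "new", "new"]

summary = Counter(states)
for state, count in summary.most_common():
    print(f"{state:<8}{count}")
```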
Specific rows
[14]:
ds.get_status(dataset_dir+'/special_cases/case_1')
[14]:
{'uuid': '352ab0d3-c20e-4c40-9e2f-c870514670a8',
'rel_path': 'special_cases/case_1',
'num_mds': 0,
'state': 'new',
'message': 'No information recorded yet.',
'last_modified': '13:23:54 13-02-2026',
'scope': 'Project'}
[15]:
# CLI
!mwf dataset status {db_path} -p {dataset_dir}'/special_cases/case_1'
UUID: 352ab0d3-c20e-4c40-9e2f-c870514670a8
Path: special_cases/case_1
State: new
Scope: Project
MDs: 0
Last Modified: 13:23:54 13-02-2026
Message: No information recorded yet.
Running the workflow
Generating inputs files programmatically
The first step to run the workflow is generating the inputs files for each project. This can be done programmatically using the generate_inputs_yaml method of the Dataset class. This method generates an inputs.yaml file for each project in the dataset, with the same content we would otherwise write manually, but with the advantage that we can use variables that are replaced by the actual values for each project.
Jinja2 templates
This is done by using Jinja2 template syntax. For example, we can use the {{DATASET}} variable to refer to the dataset path and the {{DIR}} variable to refer to the project directory name. This way, we can write a single inputs file template that is used for all projects in the dataset, instead of writing a different inputs file for each project.
[16]:
inputs_template_str = """
authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: {{DATASET}}
description: 10 ns simulation of {{DIR}} pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project {{DIR}}
"""
inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)
[17]:
# CLI: mwf dataset inputs new_dataset.db -it inputs_template.yaml -o
ds.generate_inputs_yaml(inputs_template, overwrite=True)
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_1/inputs.yaml for project project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_2/inputs.yaml for project project_2
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_3/inputs.yaml for project project_3
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_4_renamed/inputs.yaml for project project_4_renamed
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_1/inputs.yaml for project special_cases/case_1
[18]:
# Notice how the {{DATASET}} and {{DIR}} variables have been replaced by the dataset path and the project directory name, respectively.
!cat {dataset_dir}/project_1/inputs.yaml
authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: ../new_dataset.db
description: 10 ns simulation of project_1 pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project project_1
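Judging from the output above, {{DATASET}} is rendered as the path to the storage file relative to each project directory, and {{DIR}} as the last component of the project path. The sketch below reproduces those two values with the standard library. The helpers `build_context` and `render` are hypothetical, and `render` is a simplified regex stand-in for Jinja2 that only handles plain variable substitution.

```python
import os
import re

def build_context(project_dir, dataset_db):
    """Compute the two template variables for one project directory."""
    return {
        "DIR": os.path.basename(os.path.normpath(project_dir)),
        "DATASET": os.path.relpath(dataset_db, start=project_dir).replace(os.sep, "/"),
    }

def render(template, context):
    """Replace every {{NAME}} placeholder; unknown names are left untouched."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(context.get(m.group(1), m.group(0))),
        template,
    )

template = "name: Project {{DIR}}\ndataset_path: {{DATASET}}\n"
ctx_top = build_context("new_dataset/project_1", "new_dataset/new_dataset.db")
ctx_nested = build_context("new_dataset/special_cases/case_1",
                           "new_dataset/new_dataset.db")

rendered_top = render(template, ctx_top)
rendered_nested = render(template, ctx_nested)
print(rendered_top)
print(rendered_nested)
```

A nested project gets one more `../` in its dataset path, so every inputs file points at the same storage file regardless of depth.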
Adding custom fields
To generate more complex inputs files, we can make use of more advanced features of Jinja2 templates, such as conditionals, loops and custom variables computed with Python code.
The template will receive a dictionary generated by a custom function that we write, which takes the project directory as its argument:
Project directory -> Custom function -> Dictionary -> Template -> Rendered inputs.yaml
[19]:
inputs_template_str = """
name: Project {{DIR}}
dataset_path: {{DATASET}}
{%- if is_special_case %}
description: Special case description for {{DIR}}
{%- else %}
description: 10 ns simulation of {{DIR}} pdb structure
{% endif %}
"""
# Save the template to a file
inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)
# Define the custom function
def inputs_generator(project_dir: str):
    """Generate a dictionary with the information that we want to use in the template.
    This function will be called for each project directory, and the returned dictionary will be passed to the template as variables.
    """
    if "special_cases" in project_dir:
        return {'is_special_case': True}
[20]:
# CLI: mwf dataset inputs new_dataset.db -it inputs_template.yaml -ig inputs_generator.py -o
ds.generate_inputs_yaml(inputs_template, overwrite=True,
inputs_generator=inputs_generator)
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_1/inputs.yaml for project project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_2/inputs.yaml for project project_2
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_3/inputs.yaml for project project_3
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_4_renamed/inputs.yaml for project project_4_renamed
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_1/inputs.yaml for project special_cases/case_1
[21]:
# Notice how the project in the special_cases directory has a different description than the rest of the projects.
!cat {dataset_dir}/project_1/inputs.yaml
!cat {dataset_dir}/special_cases/case_1/inputs.yaml
name: Project project_1
dataset_path: ../new_dataset.db
description: 10 ns simulation of project_1 pdb structure
name: Project case_1
dataset_path: ../../new_dataset.db
description: Special case description for case_1
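The generator above returns a dictionary only for special cases and implicitly returns None otherwise. This works because Jinja2 treats an undefined variable as false inside an if block by default. The sketch below shows the same logic in plain Python; `make_context` and `describe` are hypothetical helpers, and treating None as an empty dictionary is an assumption about how the returned value is merged.

```python
def inputs_generator(project_dir):
    """Same shape as the generator above: returns a dict only for special cases."""
    if "special_cases" in project_dir:
        return {"is_special_case": True}

def make_context(project_dir, generator=None):
    """Merge the built-in variables with the generator output (None counts as empty)."""
    context = {"DIR": project_dir.rstrip("/").split("/")[-1]}
    extra = generator(project_dir) if generator else None
    context.update(extra or {})
    return context

def describe(context):
    """Plain-Python equivalent of the if/else branch in the template."""
    if context.get("is_special_case"):
        return f"Special case description for {context['DIR']}"
    return f"10 ns simulation of {context['DIR']} pdb structure"

regular = describe(make_context("new_dataset/project_1", inputs_generator))
special = describe(make_context("new_dataset/special_cases/case_1", inputs_generator))
print(regular)
print(special)
```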
Similarly, we can use the CLI to generate the inputs files with the custom function. The only difference is that we pass a Python file instead of a function.
IMPORTANT: in this file there must be a function called inputs_generator.
[22]:
python_file_str = """
def inputs_generator(project_dir: str):
    if "special_cases" in project_dir:
        return {'is_special_case': True}
"""
inputs_generator_py = dataset_dir+'/inputs_generator.py'
with open(inputs_generator_py, 'w') as f:
    f.write(python_file_str)
!mwf dataset inputs {db_path} -it {inputs_template} -ig {inputs_generator_py} -o
Loading inputs generator from file: new_dataset/inputs_generator.py
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_1/inputs.yaml for project project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_2/inputs.yaml for project project_2
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_3/inputs.yaml for project project_3
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_4_renamed/inputs.yaml for project project_4_renamed
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_1/inputs.yaml for project special_cases/case_1
[23]:
!cat {dataset_dir}/project_1/inputs.yaml
!cat {dataset_dir}/special_cases/case_1/inputs.yaml
name: Project project_1
dataset_path: ../new_dataset.db
description: 10 ns simulation of project_1 pdb structure
name: Project case_1
dataset_path: ../../new_dataset.db
description: Special case description for case_1
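Loading a function from a Python file given by its path can be done with the standard importlib machinery. The sketch below shows one way to do it and why the function name matters; `load_function` and the module name `user_inputs_generator` are hypothetical and do not describe the actual CLI internals.

```python
import importlib.util
import os
import tempfile

def load_function(py_file, func_name="inputs_generator"):
    """Import a Python file by path and return the named function from it."""
    spec = importlib.util.spec_from_file_location("user_inputs_generator", py_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    func = getattr(module, func_name, None)
    if not callable(func):
        raise AttributeError(f"{py_file} must define a function called {func_name}")
    return func

source = (
    "def inputs_generator(project_dir):\n"
    "    if 'special_cases' in project_dir:\n"
    "        return {'is_special_case': True}\n"
)

with tempfile.TemporaryDirectory() as tmp:
    py_file = os.path.join(tmp, "inputs_generator.py")
    with open(py_file, "w") as f:
        f.write(source)
    generator = load_function(py_file)
    special = generator("new_dataset/special_cases/case_1")
    regular = generator("new_dataset/project_1")

print(special, regular)
```

Looking the function up by a fixed name is what makes a file without `inputs_generator` fail early with a clear message.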
A more realistic example: handling multiple and variable number of MDs
In this case, we want to generate an inputs file that contains a list of all the MDs that we have in each project, but the number of MDs is not the same for all projects. For this, we can write a custom function that will look for all the MDs in each project, ignore any irrelevant files (equilibration trajectories, for example), and return a dictionary with the list of MDs, that we can then use in the template to generate the inputs file.
[24]:
# Create directory structure
dirs = [
dataset_dir+"/many_mds",
dataset_dir+"/many_mds/project_1",
dataset_dir+"/many_mds/project_2",
]
files = [
# A project with 3 equilibration and 3 MD replicas
dataset_dir+"/many_mds/project_1/equil_1.traj",
dataset_dir+"/many_mds/project_1/equil_2.traj",
dataset_dir+"/many_mds/project_1/equil_3.traj",
dataset_dir+"/many_mds/project_1/prod_1.traj",
dataset_dir+"/many_mds/project_1/prod_2.traj",
dataset_dir+"/many_mds/project_1/prod_3.traj",
# A project with 2 equilibration and 2 MD replicas
dataset_dir+"/many_mds/project_2/equil_1.traj",
dataset_dir+"/many_mds/project_2/equil_2.traj",
dataset_dir+"/many_mds/project_2/prod_1.traj",
dataset_dir+"/many_mds/project_2/prod_2.traj",
]
for dir_path in dirs:
    os.makedirs(dir_path, exist_ok=True)
for file_path in files:
    with open(file_path, 'w') as f:
        f.write("DUMMY TRAJ FILE\n")
[25]:
ds.add_entries([dataset_dir+'/many_mds/*'], verbose=True)
Adding project: many_mds/project_2 (UUID: edc04884-e9cc-4f75-8e5e-80a7f864d9d2)
Adding project: many_mds/project_1 (UUID: c25f8ed5-917e-4bc1-bd13-e51329b33cb3)
[26]:
ds.dataframe
[26]:
| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| c25f8ed5 | | projects | many_mds/project_1 | 0 | new | No information recorded yet. | 13:24:16 13-02-2026 |
| edc04884 | | projects | many_mds/project_2 | 0 | new | No information recorded yet. | 13:24:16 13-02-2026 |
| bae0e3d6 | | projects | project_1 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 4f835985 | | projects | project_2 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 2faff7cd | | projects | project_3 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 90e60590 | | projects | project_4_renamed | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 352ab0d3 | | projects | special_cases/case_1 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
[27]:
inputs_template_str = """
name: Project {{DIR}}
mds:
{% for md in mds %}
  -
    mdir: {{ md.mdir }}
    input_trajectory_filepaths: {{ md.traj }}
{% endfor %}
"""
inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)
[28]:
from pathlib import Path
def mds_generator(project_dir: str):
    """Generate a list of MD replicas based on the traj files in the project directory."""
    mds = []
    project_path = Path(project_dir)
    prod_trajs = sorted(project_path.glob('prod_*.traj'))
    num_replicas = len(prod_trajs)
    for i in range(num_replicas):
        mds.append({
            'mdir': f'md_replica_{i+1}',
            'traj': prod_trajs[i].relative_to(project_path).as_posix(),
        })
    return {'mds': mds}
[29]:
# Check that the function works as expected
mds_generator(dataset_dir+'/many_mds/project_1')
[29]:
{'mds': [{'mdir': 'md_replica_1', 'traj': 'prod_1.traj'},
{'mdir': 'md_replica_2', 'traj': 'prod_2.traj'},
{'mdir': 'md_replica_3', 'traj': 'prod_3.traj'}]}
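One detail to keep in mind with this kind of generator: a plain sorted() call orders names as text, so with ten or more replicas prod_10.traj would come before prod_2.traj. If the replica order matters, a natural sort key is a possible fix. The helper `natural_key` below is a hypothetical addition and not part of the workflow.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]

names = ["prod_10.traj", "prod_2.traj", "prod_1.traj"]

plain = sorted(names)                     # text order
natural = sorted(names, key=natural_key)  # numeric-aware order

print(plain)
print(natural)
```

Inside a generator like the one above, the same key could be applied to the file names, for example `sorted(paths, key=lambda p: natural_key(p.name))`.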
[30]:
ds.generate_inputs_yaml(inputs_template,
inputs_generator=mds_generator,
overwrite=True,
query_path='*many_mds*'
)
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/many_mds/project_1/inputs.yaml for project many_mds/project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/many_mds/project_2/inputs.yaml for project many_mds/project_2
[31]:
# Generated inputs.yaml for project with 3 replicas
!cat new_dataset/many_mds/project_1/inputs.yaml
name: Project project_1
mds:
  -
    mdir: md_replica_1
    input_trajectory_filepaths: prod_1.traj
  -
    mdir: md_replica_2
    input_trajectory_filepaths: prod_2.traj
  -
    mdir: md_replica_3
    input_trajectory_filepaths: prod_3.traj
[32]:
# Generated inputs.yaml for project with 2 replicas
!cat new_dataset/many_mds/project_2/inputs.yaml
name: Project project_2
mds:
  -
    mdir: md_replica_1
    input_trajectory_filepaths: prod_1.traj
  -
    mdir: md_replica_2
    input_trajectory_filepaths: prod_2.traj
Launching the workflow
Python
Once the inputs files are generated, we can launch the workflow for all projects in the dataset. The launch method provides several options for running the workflow:
Sequential execution: Run projects one after another (default)
Parallel execution: Run multiple projects simultaneously using a process pool
SLURM execution: Submit jobs to a SLURM cluster
Filtering projects to run
The method also supports filtering which projects to run using the same query parameters we’ve seen before (query_path, query_state, query_message).
Run only for projects in the special_cases directory:
ds.launch(query_path=['*/special_cases/*'])
Run only for projects that are in ‘new’ state
ds.launch(query_state=['new'])
Run for projects matching a specific pattern and state
ds.launch(
query_path=['project_*'],
query_state=['new', 'error']
)
Number of projects to run
Run 4 projects:
ds.launch(n_jobs=4)
Run all projects:
ds.launch(n_jobs=-1)
Parallel execution
To run multiple projects simultaneously, use the pool_size parameter to specify the number of parallel workers. Use pool_size=-1 to use all available CPU cores:
Run with 4 parallel workers:
ds.launch(pool_size=4)
Use all available CPU cores:
ds.launch(pool_size=-1)
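The idea behind pool-based execution can be sketched with the standard library: each project directory gets its own command invocation, and a pool caps how many run at once. This is an illustration only. The functions `run_in_dir` and `launch_all` are hypothetical, a thread pool that starts subprocesses is used here instead of a process pool, and a trivial Python one-liner stands in for the workflow command.

```python
import os
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor

def run_in_dir(cmd, cwd):
    """Run one command inside a project directory and report its final state."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return cwd, "done" if result.returncode == 0 else "error"

def launch_all(project_dirs, cmd, pool_size=1):
    """Run cmd in every directory with at most pool_size concurrent jobs.

    pool_size=-1 means one worker per available CPU core.
    """
    workers = (os.cpu_count() or 1) if pool_size == -1 else pool_size
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda d: run_in_dir(cmd, d), project_dirs))

# Stand-in commands; in practice this would be the workflow command
ok_cmd = [sys.executable, "-c", "print('ok')"]
bad_cmd = [sys.executable, "-c", "raise SystemExit(1)"]

with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    states = launch_all([a, b], ok_cmd, pool_size=2)
    failed = launch_all([a], bad_cmd)

print(sorted(states.values()), list(failed.values()))
```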
Custom workflow command
By default, the workflow runs mwf run for each project. You can customize this command using the cmd parameter:
Run with custom flags, e.g., only include specific tasks:
ds.launch(cmd='mwf run --include meta network')
Run with debug mode enabled (only print the commands without executing them):
ds.launch(debug=True)
Using the CLI
All of the above functionality is also available through the command line interface:
Run sequentially for all projects:
mwf dataset run {db_path}
Run with filtering:
mwf dataset run {db_path} -p 'project_*' -st new error
Run n projects:
mwf dataset run {db_path} -n 4
Run with parallel workers:
mwf dataset run {db_path} -ps 4
Run with custom command:
mwf dataset run {db_path} -c 'mwf run --include meta network'
See all available options:
mwf dataset run -h
A real example:
[33]:
ds.launch(query_path='project_1')
Running job for dataset entry project_1
Running MDDB workflow (v0.1.8-212-gfa7cc53d)
Processing project at current directory
⚠ WARNING: Missing input "mds" -> Using default value: None
InputError: Impossible to know which are the MD directories. You can either declare them using the "-md" option or by providing an inputs file
[34]:
ds.get_status(dataset_dir+'/project_1')
[34]:
{'uuid': 'bae0e3d6-7f2d-4d17-98a0-672a5763eff6',
'rel_path': 'project_1',
'num_mds': 0,
'state': 'error',
'message': 'InputError: Impossible to know which are the MD directories. You can either declare them using the "-md" option or by providing an inputs file',
'last_modified': '13:24:20 13-02-2026',
'scope': 'Project'}
[35]:
# Here the path to the dataset file is relative to the project directory
# We also remove the generated inputs.yaml to use the one
ds.launch(cmd='mwf run -proj A0001 -smp -i download -ds ../new_dataset.db', query_path='project_[2-3]')
Running job for dataset entry project_2
Running MDDB workflow (v0.1.8-212-gfa7cc53d)
Processing project at current directory
Downloading inputs file (source_irb_A0001_inputs.yaml)
1 MDs are to be run
https://irb-dev.mddbr.eu/api/rest/current/projects/A0001/files/topology.prmtop -> source_irb_A0001_topology.prmtop
* Field "input_topology_filepath" in the inputs file will be permanently modified
Downloading file "topology.prmtop" in source_irb_A0001_topology.prmtop
Processing MD at replica_1
https://irb-dev.mddbr.eu/api/rest/current/projects/A0001.1/files/trajectory.xtc -> source_irb_A0001.1_trajectory.xtc
* Field "mds.0.input_trajectory_filepaths" in the inputs file will be permanently modified
Downloading main trajectory (replica_1/source_irb_A0001.1_trajectory.xtc)
Progress: 0.00B [00:00, ?B/s]
https://irb-dev.mddbr.eu/api/rest/current/projects/A0001.1/files/structure.pdb -> source_irb_A0001.1_structure.pdb
Progress: 149kB [00:00, 508kB/s]
* Field "mds.0.input_structure_filepath" in the inputs file will be permanently modified
Downloading standard structure (replica_1/source_irb_A0001.1_structure.pdb)
Workflow finished in 0.03 minutes
Done!
Running job for dataset entry project_3
Running MDDB workflow (v0.1.8-212-gfa7cc53d)
Processing project at current directory
Downloading inputs file (source_irb_A0001_inputs.yaml)
1 MDs are to be run
https://irb-dev.mddbr.eu/api/rest/current/projects/A0001/files/topology.prmtop -> source_irb_A0001_topology.prmtop
* Field "input_topology_filepath" in the inputs file will be permanently modified
Downloading file "topology.prmtop" in source_irb_A0001_topology.prmtop
Processing MD at replica_1
https://irb-dev.mddbr.eu/api/rest/current/projects/A0001.1/files/structure.pdb -> source_irb_A0001.1_structure.pdb
* Field "mds.0.input_structure_filepath" in the inputs file will be permanently modified
Downloading standard structure (replica_1/source_irb_A0001.1_structure.pdb)
https://irb-dev.mddbr.eu/api/rest/current/projects/A0001.1/files/trajectory.xtc -> source_irb_A0001.1_trajectory.xtc
* Field "mds.0.input_trajectory_filepaths" in the inputs file will be permanently modified
Downloading main trajectory (replica_1/source_irb_A0001.1_trajectory.xtc)
Progress: 0.00B [00:00, ?B/s]
Progress: 149kB [00:00, 480kB/s]
Workflow finished in 0.03 minutes
Done!
[36]:
# Now our dataset should have some information about the status of the projects, which we can check with the dataframe:
ds.dataframe
[36]:
| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| 2faff7cd | | projects | project_3 | 1 | done | Done! | 13:24:32 13-02-2026 |
| 1f574789 | 2faff7cd | mds | project_3/replica_1 | | done | Done! | 13:24:32 13-02-2026 |
| 4f835985 | | projects | project_2 | 1 | done | Done! | 13:24:26 13-02-2026 |
| f457693e | 4f835985 | mds | project_2/replica_1 | | done | Done! | 13:24:26 13-02-2026 |
| bae0e3d6 | | projects | project_1 | 0 | error | InputError: Impossible to know which are the M... | 13:24:20 13-02-2026 |
| c25f8ed5 | | projects | many_mds/project_1 | 0 | new | No information recorded yet. | 13:24:16 13-02-2026 |
| edc04884 | | projects | many_mds/project_2 | 0 | new | No information recorded yet. | 13:24:16 13-02-2026 |
| 90e60590 | | projects | project_4_renamed | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
| 352ab0d3 | | projects | special_cases/case_1 | 0 | new | No information recorded yet. | 13:23:54 13-02-2026 |
SLURM
For computing clusters using SLURM, you can submit each project as a separate job. This requires a job template file that defines the SLURM configuration.
The job template is a Jinja2 template that will be rendered for each project. It should contain the SLURM directives and the command to run. The template has access to the following variables:
{{DIR}}: Absolute path to the project directory.
Every field available in the inputs.yaml.
Here’s an example job template:
[ ]:
job_template_str = """#!/bin/bash
#SBATCH --job-name=mddb_workflow
#SBATCH --output=mwf_%j.out
#SBATCH --error=mwf_%j.err
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
# Load required modules
module load anaconda3
# Activate virtual environment if needed
conda activate mwf_env
# Change to project directory
cd {{DIR}}
# Run the workflow command
mwf run -filt -fit -e energies clusters pockets -m largeaa
"""
# Save the template to a file
job_template_path = dataset_dir + '/slurm_job_template.sh'
with open(job_template_path, 'w') as f:
    f.write(job_template_str)
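To picture what happens to such a template for one project, the sketch below fills in {{DIR}} with the absolute project path, writes the result to a script and builds the corresponding sbatch command without running it. The helper `render_job`, the script name `job.sh` and the regex substitution (a simplified stand-in for Jinja2) are assumptions made for illustration.

```python
import os
import re
import tempfile

def render_job(template, project_dir):
    """Fill {{DIR}} with the absolute project path; other placeholders are kept."""
    context = {"DIR": os.path.abspath(project_dir)}
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: context.get(m.group(1), m.group(0)),
        template,
    )

template = (
    "#!/bin/bash\n"
    "#SBATCH --job-name=mddb_workflow\n"
    "cd {{DIR}}\n"
    "mwf run\n"
)

with tempfile.TemporaryDirectory() as project_dir:
    script_path = os.path.join(project_dir, "job.sh")  # hypothetical file name
    with open(script_path, "w") as f:
        f.write(render_job(template, project_dir))

    # The submission command is only built here, not executed
    submit_cmd = ["sbatch", script_path]

    with open(script_path) as f:
        script = f.read()
    expected_cd = "cd " + os.path.abspath(project_dir)

print(script)
```

Rendering one script per project keeps each SLURM job independent, so a failure in one project does not affect the others.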
Once you have a job template, you can submit jobs using the slurm=True parameter and providing the path to the template:
[ ]:
# Submit all projects as SLURM jobs
ds.launch(
slurm=True,
job_template=job_template_path
)
# Submit filtered projects as SLURM jobs
ds.launch(
query_path=['project_*'],
query_state=['new'],
slurm=True,
job_template=job_template_path
)
# Use custom workflow command with SLURM
ds.launch(
slurm=True,
job_template=job_template_path,
cmd='mwf run --include meta network minimal'
)
[ ]:
# Submit all projects as SLURM jobs
!mwf dataset run {db_path} --slurm --job-template {job_template_path}
# Submit with filtering
!mwf dataset run {db_path} -p 'project_*' -st new --slurm -jt {job_template_path}
# Submit with custom workflow command
!mwf dataset run {db_path} --slurm -jt {job_template_path} -c 'mwf run --include meta network'
When running workflows (either locally or via SLURM), the dataset automatically tracks the state of each project. You can monitor progress using:
[40]:
ds.summary()
[40]:
| | state | count |
|---|---|---|
| 0 | new | 7 |
[ ]:
# Check the summary of project states
ds.summary()
# View the full dataset with log files
ds.get_dataframe(include_logs=True)
# Filter to see only running or error states
ds.get_dataframe(query_state=['running', 'error'])
# CLI: Watch the dataset in real-time (updates every few seconds)
# mwf dataset watch new_dataset.db
Tips
You can use mwf ds as a shortcut for mwf dataset in the CLI.
By default, mwf dataset commands look for a dataset file named *dataset*.db in the current directory, so you can execute them with just something like mwf ds show instead of mwf dataset --dataset_path path/to/your_dataset.db show.
Dataset limitations
Concurrent access: issues may arise if the storage file is accessed by multiple processes simultaneously, especially when using sshfs or network filesystems that may not have proper locking mechanisms. This can lead to data corruption or loss if not handled carefully.
Flag history: the flags used for each run are not stored in the dataset, so if we change the flags used for a project, the dataset will not be aware of it and may show wrong information about the state of the project.
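For code that writes to a shared SQLite file, a common mitigation on local filesystems is to combine SQLite's own lock timeout with a short retry loop. The sketch below shows that pattern; `execute_with_retry` and the `projects` table are hypothetical, and this does not help on filesystems where file locking itself is unreliable.

```python
import os
import sqlite3
import tempfile
import time

def execute_with_retry(db_path, sql, params=(), retries=5, wait=0.2):
    """Run one write statement, retrying while another process holds the lock.

    Returns the number of retries that were needed.
    """
    for attempt in range(retries):
        # timeout: seconds SQLite itself waits for a lock before raising
        conn = sqlite3.connect(db_path, timeout=5.0)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(sql, params)
            return attempt
        except sqlite3.OperationalError as err:
            if "locked" not in str(err) or attempt == retries - 1:
                raise
            time.sleep(wait * (attempt + 1))  # back off a little more each time
        finally:
            conn.close()

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "dataset.db")
    execute_with_retry(db_path, "CREATE TABLE projects (rel_path TEXT, state TEXT)")
    attempts = execute_with_retry(
        db_path, "INSERT INTO projects VALUES (?, ?)", ("project_1", "new")
    )
    conn = sqlite3.connect(db_path)
    count = conn.execute("SELECT COUNT(*) FROM projects").fetchone()[0]
    conn.close()

print(attempts, count)
```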