Dataset

A Dataset is a collection of Projects that contain molecular dynamics simulations or related data and that share some metadata and characteristics because of how they were generated. In the context of the MDDB Workflow, a Project is a set of simulations/replicas with one or more trajectory files and a common topology file. To complete the definitions, each individual simulation or replica is referred to as an MD.

The main functionality of this class is keeping track of the state of many Projects: whether they are still running, whether they are done, or whether they failed and what caused the error. The only adjustment needed is to provide the path where the main SQLite storage file will be kept. We can do this with the dataset_path flag during the workflow execution:

mwf run ... --dataset_path path/to/our_dataset.db

Or, if we do not want to write the flag every time, by using the dataset_path field in the inputs.yaml config file:

dataset_path: path/to/our_dataset.db

Creating a new Dataset

However, modifying the inputs file for every project of the dataset can be very cumbersome, as Datasets can be formed by hundreds or thousands of projects. For this we can make use of another feature of this class: automatic inputs file generation.

Directory Structure

We start from a root folder. Everyone may organize it in their own way, but it normally follows a hierarchical structure containing all the projects, which may look something like this:

new_dataset/
├── project_1/
├── project_2/
├── project_3/
├── project_4/
├── ...
├── special_cases/
│   ├── case_1/
│   ├── case_2/
│   └── ...
├── wrong_cases/
│   ├── case_1/
│   ├── case_2/
│   └── ...
├── scripts/
├── project_logs/
└── ...

Note that we do not specify anything about MDs yet, as we will take care of that later.

[1]:

import os

# Create directory structure
dataset_dir = "new_dataset"
dirs = [
    dataset_dir+"/project_1",
    dataset_dir+"/project_2",
    dataset_dir+"/project_3",
    dataset_dir+"/project_4",
    dataset_dir+"/special_cases/case_1",
    dataset_dir+"/special_cases/not_this_one",
    dataset_dir+"/to_remove/case_1",
    dataset_dir+"/to_remove/case_2",
    dataset_dir+"/scripts",
    dataset_dir+"/project_logs",
]

for dir_path in dirs:
    os.makedirs(dir_path, exist_ok=True)

[2]:

%load_ext autoreload
%autoreload 2
from mddb_workflow.core.dataset import Dataset

# Root folder of the dataset created in the previous cell
dataset_dir = "new_dataset"
# Path to the SQLite storage file of the Dataset
db_path = dataset_dir+"/new_dataset.db"
# Remove database in case the notebook is re-run
if os.path.exists(db_path):
    os.remove(db_path)

# Create the dataset (entries are added in the next steps)
ds = Dataset(dataset_path=db_path)

Adding entries

Adding entries to the dataset is the first step: it selects the projects we are going to keep track of.

For this, we specify the root folders and the ones to ignore (those not containing projects, e.g., scripts, logs, etc.). We can do this by passing absolute paths, relative paths or glob patterns. For example:

[3]:

# CLI: mwf dataset add new_dataset.db -p project_* special_cases/case_1 to_remove/* --ignore-dirs */logs
ds.add_entries([dataset_dir+'/project_*',
                dataset_dir+'/special_cases/case_1',
                dataset_dir+'/to_remove/*'],
                ignore_dirs=[dataset_dir+'/*logs'],
                verbose=True)

Ignoring project: project_logs
Adding project: project_2 (UUID: 4f835985-9ffd-40c6-90f6-d4510ac01785)
Adding project: project_1 (UUID: bae0e3d6-7f2d-4d17-98a0-672a5763eff6)
Adding project: project_4 (UUID: 90e60590-b7e1-41bf-a2c2-8c159c12a35d)
Adding project: project_3 (UUID: 2faff7cd-0335-44f6-b196-55a74a84347e)
Adding project: special_cases/case_1 (UUID: 352ab0d3-c20e-4c40-9e2f-c870514670a8)
Adding project: to_remove/case_1 (UUID: 8d39cd59-d19e-4ace-b273-7d30c3a58468)
Adding project: to_remove/case_2 (UUID: 64f86660-a622-4d26-9e5c-91dcd017a25c)

Some useful glob patterns:

*: matches all the folders.
**/*: matches all subfolders.
*/*/: matches all subfolders of the first level.
**/[0-9]*: matches subfolders starting with a digit.
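
To get a feeling for these patterns before adding entries, we can try them on a throwaway folder with Python's standard glob module. This is only a sketch: the helper name matching_dirs is hypothetical, and the workflow's own matching may differ in details from Python's glob (for example, in whether top-level folders are included for **/*).

```python
import glob
import os
import tempfile


def matching_dirs(root: str, pattern: str) -> list[str]:
    """Return the directories under `root` that match `pattern`, relative to `root`.

    Hypothetical helper for illustration; it is not part of the MDDB Workflow.
    """
    full_pattern = os.path.join(glob.escape(root), pattern)
    hits = glob.glob(full_pattern, recursive=True)
    return sorted(
        {os.path.relpath(hit, root).replace(os.sep, "/") for hit in hits if os.path.isdir(hit)}
    )


# Build a small throwaway tree to try the patterns on
root = tempfile.mkdtemp()
for rel in ["project_1", "project_2", "special_cases/case_1", "special_cases/1abc"]:
    os.makedirs(os.path.join(root, rel))

# Print what each pattern selects according to Python's glob rules
for pattern in ["*", "*/*/", "**/*", "**/[0-9]*"]:
    print(f"{pattern!r:12} -> {matching_dirs(root, pattern)}")
```

Checking a pattern this way first can save a later call to remove unwanted entries.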

Adding already run projects

If the MDDB Workflow has already been run for some of the projects, we can also add them to the dataset. For this, we just have to specify the root folder and the workflow will automatically scan for all the projects and MDs.

[4]:

# Remove database as a way to reset the dataset
if os.path.exists(db_path):
    os.remove(db_path)

# Re-create the dataset and scan for projects and MDs
ds = Dataset(dataset_path=db_path)
ds.scan(dataset_dir, verbose=True)

Adding project: project_2 (UUID: 4f835985-9ffd-40c6-90f6-d4510ac01785)
Adding project: project_1 (UUID: bae0e3d6-7f2d-4d17-98a0-672a5763eff6)
Adding project: to_remove/case_1 (UUID: 8d39cd59-d19e-4ace-b273-7d30c3a58468)
Adding project: to_remove/case_2 (UUID: 64f86660-a622-4d26-9e5c-91dcd017a25c)
Adding project: project_4 (UUID: 90e60590-b7e1-41bf-a2c2-8c159c12a35d)
Adding project: project_3 (UUID: 2faff7cd-0335-44f6-b196-55a74a84347e)
Adding project: special_cases/case_1 (UUID: 352ab0d3-c20e-4c40-9e2f-c870514670a8)

This is also useful to correctly track projects after moving them to a new location, as the dataset will be able to find them and update their paths in the database.

[5]:

!mv {dataset_dir}/project_4 {dataset_dir}/project_4_renamed

[6]:

ds.scan(dataset_dir, verbose=True)

Updating project path: project_4_renamed (UUID: 90e60590-b7e1-41bf-a2c2-8c159c12a35d) from project_4 to project_4_renamed
Project already exists: project_2 (UUID: 4f835985-9ffd-40c6-90f6-d4510ac01785)
Project already exists: project_1 (UUID: bae0e3d6-7f2d-4d17-98a0-672a5763eff6)
Project already exists: to_remove/case_1 (UUID: 8d39cd59-d19e-4ace-b273-7d30c3a58468)
Project already exists: to_remove/case_2 (UUID: 64f86660-a622-4d26-9e5c-91dcd017a25c)
Project already exists: project_3 (UUID: 2faff7cd-0335-44f6-b196-55a74a84347e)
Project already exists: special_cases/case_1 (UUID: 352ab0d3-c20e-4c40-9e2f-c870514670a8)

Removing entries

If we later find that a project should be deleted, or if a glob pattern added folders we did not want, we can remove the matching entries in the same way we added them:

[7]:

# CLI: mwf dataset remove new_dataset.db to_remove/*
ds.remove_entry('to_remove/*')

Deleted project with UUID '8d39cd59-d19e-4ace-b273-7d30c3a58468'
Deleted project with UUID '64f86660-a622-4d26-9e5c-91dcd017a25c'

Showing the dataset

Once we have initialized the entries, we can show the dataset to check if it is correct.

Dataset tables

[8]:

ds.dataframe

[8]:

	scope	rel_path	num_mds	state	message	last_modified
uuid
bae0e3d6	projects	project_1	0	new	No information recorded yet.	13:23:54 13-02-2026
4f835985	projects	project_2	0	new	No information recorded yet.	13:23:54 13-02-2026
2faff7cd	projects	project_3	0	new	No information recorded yet.	13:23:54 13-02-2026
90e60590	projects	project_4_renamed	0	new	No information recorded yet.	13:23:54 13-02-2026
352ab0d3	projects	special_cases/case_1	0	new	No information recorded yet.	13:23:54 13-02-2026

[9]:

ds.get_dataframe(query_path='project_*')

[9]:

	scope	rel_path	num_mds	state	message	last_modified
uuid
bae0e3d6-7f2d-4d17-98a0-672a5763eff6	projects	project_1	0	new	No information recorded yet.	13:23:54 13-02-2026
4f835985-9ffd-40c6-90f6-d4510ac01785	projects	project_2	0	new	No information recorded yet.	13:23:54 13-02-2026
2faff7cd-0335-44f6-b196-55a74a84347e	projects	project_3	0	new	No information recorded yet.	13:23:54 13-02-2026
90e60590-b7e1-41bf-a2c2-8c159c12a35d	projects	project_4_renamed	0	new	No information recorded yet.	13:23:54 13-02-2026

[10]:

# CLI:
!mwf dataset show {db_path}

                             MDDB Dataset (5 rows)
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ uuid    ┃ projec… ┃ scope   ┃ rel_pa… ┃ num_mds ┃ state ┃ message  ┃ last_m… ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ bae0e3… │         │ projec… │ ../pro… │ 0       │ new   │ No       │ 13:23:… │
│         │         │         │         │         │       │ informa… │ 13-02-… │
│         │         │         │         │         │       │ recorded │         │
│         │         │         │         │         │       │ yet.     │         │
│ 4f8359… │         │ projec… │ ../pro… │ 0       │ new   │ No       │ 13:23:… │
│         │         │         │         │         │       │ informa… │ 13-02-… │
│         │         │         │         │         │       │ recorded │         │
│         │         │         │         │         │       │ yet.     │         │
│ 2faff7… │         │ projec… │ ../pro… │ 0       │ new   │ No       │ 13:23:… │
│         │         │         │         │         │       │ informa… │ 13-02-… │
│         │         │         │         │         │       │ recorded │         │
│         │         │         │         │         │       │ yet.     │         │
│ 90e605… │         │ projec… │ ../pro… │ 0       │ new   │ No       │ 13:23:… │
│         │         │         │         │         │       │ informa… │ 13-02-… │
│         │         │         │         │         │       │ recorded │         │
│         │         │         │         │         │       │ yet.     │         │
│ 352ab0… │         │ projec… │ ../spe… │ 0       │ new   │ No       │ 13:23:… │
│         │         │         │         │         │       │ informa… │ 13-02-… │
│         │         │         │         │         │       │ recorded │         │
│         │         │         │         │         │       │ yet.     │         │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────┴──────────┴─────────┘

[11]:

# For a more specific subset of the dataset, use the available flags:
!mwf dataset show -h

usage: mwf dataset show [-h] [-p [QUERY_PATH ...]] [-st [QUERY_STATE ...]]
                        [-sc QUERY_SCOPE] [-ms QUERY_MESSAGE] [-s SORT_BY]
                        [-n N_ROWS] [-l] [-m]
                        [dataset_path]

positional arguments:
  dataset_path
      Path to the dataset storage file, normally an .db file. If not provided,
      the first *.db file found in the current directory will be used.

options:
  -h, --help
      show this help message and exit
  -p, --query_path [QUERY_PATH ...]
      If provided, filters rows whose 'rel_path' matches these glob patterns.
      Default: ['*']
  -st, --query_state [QUERY_STATE ...]
      If provided, filters rows whose 'state' matches this value/list of
      values.
  -sc, --query_scope QUERY_SCOPE
      If provided, filters rows whose 'scope' matches this value
      ('project'/'p' or 'md'/'m').
  -ms, --query_message QUERY_MESSAGE
      If provided, filters rows whose 'message' matches these glob patterns
      (e.g., 'URLError*').
  -s, --sort_by SORT_BY
      Column name to sort the dataset by.
      Default: last_modified
  -n, --n_rows N_ROWS
      Number of rows to display. 0 for all rows.
      Default: 50
  -l, --include_logs
      If True, adds 'log_file' and 'err_file' columns with HTML links to the
      latest log files.
  -m, --summary
      Get a summary of the state of the projects.

Dataset summary

[12]:

ds.summary()

[12]:

	state	count
0	new	5

[13]:

# CLI:
!mwf dataset show {db_path} -m

==================================================================
Summary of project states:
==================================================================
  state  count
0   new      5

Specific rows

[14]:

ds.get_status(dataset_dir+'/special_cases/case_1')

[14]:

{'uuid': '352ab0d3-c20e-4c40-9e2f-c870514670a8',
 'rel_path': 'special_cases/case_1',
 'num_mds': 0,
 'state': 'new',
 'message': 'No information recorded yet.',
 'last_modified': '13:23:54 13-02-2026',
 'scope': 'Project'}

[15]:

# CLI
!mwf dataset status {db_path} -p {dataset_dir}'/special_cases/case_1'

UUID:          352ab0d3-c20e-4c40-9e2f-c870514670a8
Path:          special_cases/case_1
State:         new
Scope:         Project
MDs:           0
Last Modified: 13:23:54 13-02-2026
Message:       No information recorded yet.

Running the workflow

Generating inputs files programmatically

The first step to run the workflow is generating the inputs file for each project. This can be done programmatically with the generate_inputs_yaml method of the Dataset class. This method generates an inputs.yaml file for each project in the dataset, with the same content we would otherwise write manually, but with the advantage that we can use variables that are replaced by the actual values when the inputs files are generated.

Jinja2 templates

This is done using Jinja2 template syntax. For example, we can use the {{DATASET}} variable to refer to the dataset path and the {{DIR}} variable to refer to the project directory name. This way, we can write a single inputs file template that is used for all projects in the dataset, and we do not have to write a different inputs file for each project.

[16]:

inputs_template_str = """
authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: {{DATASET}}
description: 10 ns simulation of {{DIR}} pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project {{DIR}}
"""

inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)

[17]:

# CLI: mwf dataset inputs new_dataset.db -it inputs_template.yaml -o
ds.generate_inputs_yaml(inputs_template, overwrite=True)

Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_1/inputs.yaml for project project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_2/inputs.yaml for project project_2
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_3/inputs.yaml for project project_3
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_4_renamed/inputs.yaml for project project_4_renamed
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_1/inputs.yaml for project special_cases/case_1

[18]:

# Notice how the {{DATASET}} and {{DIR}} variables have been replaced by the dataset path and the project directory name, respectively.
!cat {dataset_dir}/project_1/inputs.yaml


authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: ../new_dataset.db
description: 10 ns simulation of project_1 pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project project_1
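
Note that the rendered dataset_path is written relative to the project directory rather than to the folder where the notebook runs. As a minimal sketch of how such a relative path can be computed, the following uses os.path.relpath; the helper name dataset_path_from is hypothetical, and we are only assuming that the workflow derives the value in a similar way.

```python
import os


def dataset_path_from(project_dir: str, db_path: str) -> str:
    """Express `db_path` relative to `project_dir`, using forward slashes.

    Hypothetical helper for illustration; it is not part of the MDDB Workflow.
    """
    return os.path.relpath(db_path, start=project_dir).replace(os.sep, "/")


# A project directly under the dataset root
print(dataset_path_from("new_dataset/project_1", "new_dataset/new_dataset.db"))
# A project nested one level deeper
print(dataset_path_from("new_dataset/special_cases/case_1", "new_dataset/new_dataset.db"))
```

The deeper a project sits below the dataset root, the more ../ segments appear in its dataset_path.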

Adding custom fields

To generate more complex inputs files, we can use more advanced features of Jinja2 templates, such as custom filters and functions, which are basically Python code that we can use in the templates to generate the inputs files.

The template will receive a dictionary generated by a custom function that we write, which takes the project directory as its argument:

Project directory -> Custom function -> Dictionary -> Template -> Rendered inputs.yaml

[19]:

inputs_template_str = """
name: Project {{DIR}}
dataset_path: {{DATASET}}
{%- if is_special_case %}
description: Special case description for {{DIR}}
{%- else %}
description: 10 ns simulation of {{DIR}} pdb structure
{% endif %}
"""

# Save the template to a file
inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)


# Define the custom function
def inputs_generator(project_dir: str):
    """Generate a dictionary with the information that we want to use in the template.
    This function will be called for each project directory, and the returned dictionary will be passed to the template as variables.
    """
    if "special_cases" in project_dir:
        return {'is_special_case': True}

[20]:

# CLI: mwf dataset inputs new_dataset.db -it inputs_template.yaml -ig inputs_generator.py -o
ds.generate_inputs_yaml(inputs_template, overwrite=True,
                        inputs_generator=inputs_generator)

Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_1/inputs.yaml for project project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_2/inputs.yaml for project project_2
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_3/inputs.yaml for project project_3
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_4_renamed/inputs.yaml for project project_4_renamed
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_1/inputs.yaml for project special_cases/case_1

[21]:

# Notice how the project in the special_cases directory has a different description than the rest of the projects.
!cat {dataset_dir}/project_1/inputs.yaml
!cat {dataset_dir}/special_cases/case_1/inputs.yaml


name: Project project_1
dataset_path: ../new_dataset.db
description: 10 ns simulation of project_1 pdb structure

name: Project case_1
dataset_path: ../../new_dataset.db
description: Special case description for case_1

Similarly, we can use the CLI to generate the inputs files with the custom function. The only difference is that we pass a file instead of a function.

IMPORTANT: in this file there must be a function called inputs_generator.

[22]:

python_file_str = """
def inputs_generator(project_dir: str):
    if "special_cases" in project_dir:
        return {'is_special_case': True}
"""

inputs_generator_py = dataset_dir+'/inputs_generator.py'
with open(inputs_generator_py, 'w') as f:
    f.write(python_file_str)

!mwf dataset inputs {db_path} -it {inputs_template} -ig {inputs_generator_py} -o

Loading inputs generator from file: new_dataset/inputs_generator.py
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_1/inputs.yaml for project project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_2/inputs.yaml for project project_2
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_3/inputs.yaml for project project_3
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_4_renamed/inputs.yaml for project project_4_renamed
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_1/inputs.yaml for project special_cases/case_1

[23]:

!cat {dataset_dir}/project_1/inputs.yaml
!cat {dataset_dir}/special_cases/case_1/inputs.yaml


name: Project project_1
dataset_path: ../new_dataset.db
description: 10 ns simulation of project_1 pdb structure

name: Project case_1
dataset_path: ../../new_dataset.db
description: Special case description for case_1

A more realistic example: handling a variable number of MDs

In this case, we want to generate an inputs file that contains a list of all the MDs that we have in each project, but the number of MDs is not the same for all projects. For this, we can write a custom function that will look for all the MDs in each project, ignore any irrelevant files (equilibration trajectories, for example), and return a dictionary with the list of MDs, that we can then use in the template to generate the inputs file.

[24]:

# Create directory structure
dirs = [
    dataset_dir+"/many_mds",
    dataset_dir+"/many_mds/project_1",
    dataset_dir+"/many_mds/project_2",
]
files = [
    # A project with 3 equilibration and 3 MD replicas
    dataset_dir+"/many_mds/project_1/equil_1.traj",
    dataset_dir+"/many_mds/project_1/equil_2.traj",
    dataset_dir+"/many_mds/project_1/equil_3.traj",
    dataset_dir+"/many_mds/project_1/prod_1.traj",
    dataset_dir+"/many_mds/project_1/prod_2.traj",
    dataset_dir+"/many_mds/project_1/prod_3.traj",
    # A project with 2 equilibration and 2 MD replicas
    dataset_dir+"/many_mds/project_2/equil_1.traj",
    dataset_dir+"/many_mds/project_2/equil_2.traj",
    dataset_dir+"/many_mds/project_2/prod_1.traj",
    dataset_dir+"/many_mds/project_2/prod_2.traj",
]
for dir_path in dirs:
    os.makedirs(dir_path, exist_ok=True)

for file_path in files:
    with open(file_path, 'w') as f:
        f.write("DUMMY TRAJ FILE\n")

[25]:

ds.add_entries([dataset_dir+'/many_mds/*'], verbose=True)

Adding project: many_mds/project_2 (UUID: edc04884-e9cc-4f75-8e5e-80a7f864d9d2)
Adding project: many_mds/project_1 (UUID: c25f8ed5-917e-4bc1-bd13-e51329b33cb3)

[26]:

ds.dataframe

[26]:

	scope	rel_path	num_mds	state	message	last_modified
uuid
c25f8ed5	projects	many_mds/project_1	0	new	No information recorded yet.	13:24:16 13-02-2026
edc04884	projects	many_mds/project_2	0	new	No information recorded yet.	13:24:16 13-02-2026
bae0e3d6	projects	project_1	0	new	No information recorded yet.	13:23:54 13-02-2026
4f835985	projects	project_2	0	new	No information recorded yet.	13:23:54 13-02-2026
2faff7cd	projects	project_3	0	new	No information recorded yet.	13:23:54 13-02-2026
90e60590	projects	project_4_renamed	0	new	No information recorded yet.	13:23:54 13-02-2026
352ab0d3	projects	special_cases/case_1	0	new	No information recorded yet.	13:23:54 13-02-2026

[27]:

inputs_template_str = """
name: Project {{DIR}}
mds:
{% for md in mds %}
  -
    mdir: {{ md.mdir }}
    input_trajectory_filepaths: {{ md.traj }}
{% endfor %}
"""

inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)

[28]:

from pathlib import Path


def mds_generator(project_dir: str):
    """Generate a list of MD replicas based on the traj files in the project directory."""
    mds = []
    project_path = Path(project_dir)
    prod_trajs = sorted(project_path.glob('prod_*.traj'))
    num_replicas = len(prod_trajs)
    for i in range(num_replicas):
        mds.append({
            'mdir': f'md_replica_{i+1}',
            'traj': prod_trajs[i].relative_to(project_path).as_posix(),
        })
    return {'mds': mds}

[29]:

# Check that the function works as expected
mds_generator(dataset_dir+'/many_mds/project_1')

[29]:

{'mds': [{'mdir': 'md_replica_1', 'traj': 'prod_1.traj'},
  {'mdir': 'md_replica_2', 'traj': 'prod_2.traj'},
  {'mdir': 'md_replica_3', 'traj': 'prod_3.traj'}]}
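
One detail to keep in mind: sorted() orders paths lexicographically, so in a project with ten or more replicas prod_10.traj would come before prod_2.traj. A minimal sketch of a numeric sort key that avoids this is shown below; the helper name replica_index is hypothetical and the file naming (prod_<number>.traj) is the one assumed in this example.

```python
import re
import tempfile
from pathlib import Path


def replica_index(path: Path) -> int:
    """Extract the trailing integer from names like 'prod_12.traj'.

    Hypothetical helper for illustration; returns 0 when no number is found.
    """
    match = re.search(r"(\d+)$", path.stem)
    return int(match.group(1)) if match else 0


# Throwaway project with 11 dummy production trajectories
project_path = Path(tempfile.mkdtemp())
for i in range(1, 12):
    (project_path / f"prod_{i}.traj").write_text("DUMMY TRAJ FILE\n")

# Compare the default ordering with the numeric ordering
lexicographic = [p.name for p in sorted(project_path.glob("prod_*.traj"))]
numeric = [p.name for p in sorted(project_path.glob("prod_*.traj"), key=replica_index)]
print(lexicographic[:3])
print(numeric[:3])
```

Passing key=replica_index to the sorted() call inside a generator function would keep md_replica_N aligned with prod_N.traj for any number of replicas.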

[30]:

ds.generate_inputs_yaml(inputs_template,
                        inputs_generator=mds_generator,
                        overwrite=True,
                        query_path='*many_mds*'
                        )

Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/many_mds/project_1/inputs.yaml for project many_mds/project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/many_mds/project_2/inputs.yaml for project many_mds/project_2

[31]:

# Generated inputs.yaml for project with 3 replicas
!cat new_dataset/many_mds/project_1/inputs.yaml


name: Project project_1
mds:

  -
    mdir: md_replica_1
    input_trajectory_filepaths: prod_1.traj

  -
    mdir: md_replica_2
    input_trajectory_filepaths: prod_2.traj

  -
    mdir: md_replica_3
    input_trajectory_filepaths: prod_3.traj

[32]:

# Generated inputs.yaml for project with 2 replicas
!cat new_dataset/many_mds/project_2/inputs.yaml


name: Project project_2
mds:

  -
    mdir: md_replica_1
    input_trajectory_filepaths: prod_1.traj

  -
    mdir: md_replica_2
    input_trajectory_filepaths: prod_2.traj

Launching the workflow

Python

Once the inputs files are generated, we can launch the workflow for all projects in the dataset. The launch method provides several options for running the workflow:

Sequential execution: Run projects one after another (default)
Parallel execution: Run multiple projects simultaneously using a process pool
SLURM execution: Submit jobs to a SLURM cluster

Filtering projects to run

The method also supports filtering which projects to run using the same query parameters we’ve seen before (query_path, query_state, query_message).

Run only for projects in the special_cases directory:

ds.launch(query_path=['*/special_cases/*'])

Run only for projects that are in ‘new’ state

ds.launch(query_state=['new'])

Run for projects matching a specific pattern and state

ds.launch(
    query_path=['project_*'],
    query_state=['new', 'error']
)
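
To make the effect of combining filters more concrete, here is a small stand-alone sketch that mimics the selection on a list of rows. It assumes glob-style matching on rel_path (as described in the CLI help) and that the path and state filters are combined with a logical AND; both are assumptions about the behaviour, and select_rows is a hypothetical helper, not the actual implementation.

```python
from fnmatch import fnmatch

# Rows shaped like the dataset table (only the columns used for filtering)
rows = [
    {"rel_path": "project_1", "state": "error"},
    {"rel_path": "project_2", "state": "done"},
    {"rel_path": "project_3", "state": "done"},
    {"rel_path": "project_4_renamed", "state": "new"},
    {"rel_path": "special_cases/case_1", "state": "new"},
]


def select_rows(rows, query_path=("*",), query_state=None):
    """Return the rel_path of rows matching any path pattern AND any listed state.

    Hypothetical helper for illustration; not part of the MDDB Workflow.
    """
    selected = []
    for row in rows:
        path_ok = any(fnmatch(row["rel_path"], pattern) for pattern in query_path)
        state_ok = query_state is None or row["state"] in query_state
        if path_ok and state_ok:
            selected.append(row["rel_path"])
    return selected


print(select_rows(rows, query_path=["project_*"], query_state=["new", "error"]))
```

Note that in fnmatch the * wildcard also matches the / separator, which may not hold for every glob implementation.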

Number of projects to run

Run 4 projects:

ds.launch(n_jobs=4)

Run all projects:

ds.launch(n_jobs=-1)

Parallel execution

To run multiple projects simultaneously, use the pool_size parameter to specify the number of parallel workers. Use pool_size=-1 to use all available CPU cores:

Run with 4 parallel workers:

ds.launch(pool_size=4)

Use all available CPU cores:

ds.launch(pool_size=-1)
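
Conceptually, running with a pool means executing one command per project directory with a bounded number of workers. The sketch below illustrates that idea with the standard library only. It is not the implementation behind launch: the function names are hypothetical, and it uses a thread pool (instead of a process pool) because each job is already a separate subprocess.

```python
import os
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor


def resolve_pool_size(pool_size: int) -> int:
    """Map -1 to the number of CPU cores; otherwise use at least one worker."""
    if pool_size == -1:
        return os.cpu_count() or 1
    return max(1, pool_size)


def run_in_pool(project_dirs, cmd, pool_size=1):
    """Run `cmd` (a list of arguments) once inside each project directory.

    Hypothetical helper for illustration; returns {project_dir: return code}.
    """
    def run_one(project_dir):
        result = subprocess.run(cmd, cwd=project_dir, capture_output=True, text=True)
        return project_dir, result.returncode

    with ThreadPoolExecutor(max_workers=resolve_pool_size(pool_size)) as pool:
        return dict(pool.map(run_one, project_dirs))


# Demo with a harmless command in three throwaway directories
demo_dirs = [tempfile.mkdtemp() for _ in range(3)]
demo_cmd = [sys.executable, "-c", "print('ok')"]
print(run_in_pool(demo_dirs, demo_cmd, pool_size=2))
```

A return code of 0 for a directory means the command finished without error there.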

Custom workflow command

By default, the workflow runs mwf run for each project. You can customize this command using the cmd parameter:

Run with custom flags, e.g., only include specific tasks:

ds.launch(cmd='mwf run --include meta network')

Run with debug mode enabled (only print the commands without executing them):

ds.launch(debug=True)

Using the CLI

All of the above functionality is also available through the command line interface:

Run sequentially for all projects: mwf dataset run {db_path}
Run with filtering: mwf dataset run {db_path} -p 'project_*' -st new error
Run n projects: mwf dataset run {db_path} -n 4
Run with parallel workers: mwf dataset run {db_path} -ps 4
Run with custom command: mwf dataset run {db_path} -c 'mwf run --include meta network'
See all available options: mwf dataset run -h

A real example:

[33]:

ds.launch(query_path='project_1')

Running job for dataset entry project_1

Running MDDB workflow (v0.1.8-212-gfa7cc53d)
Processing project at current directory
⚠  WARNING: Missing input "mds" -> Using default value: None
InputError: Impossible to know which are the MD directories. You can either declare them using the "-md" option or by providing an inputs file

[34]:

ds.get_status(dataset_dir+'/project_1')

[34]:

{'uuid': 'bae0e3d6-7f2d-4d17-98a0-672a5763eff6',
 'rel_path': 'project_1',
 'num_mds': 0,
 'state': 'error',
 'message': 'InputError: Impossible to know which are the MD directories. You can either declare them using the "-md" option or by providing an inputs file',
 'last_modified': '13:24:20 13-02-2026',
 'scope': 'Project'}

[35]:

# Here the path to the dataset file is relative to the project directory
# We also remove the generated inputs.yaml to use the one
ds.launch(cmd='mwf run -proj A0001 -smp -i download -ds ../new_dataset.db', query_path='project_[2-3]')

Running job for dataset entry project_2

Running MDDB workflow (v0.1.8-212-gfa7cc53d)
Processing project at current directory
Downloading inputs file (source_irb_A0001_inputs.yaml)

  1 MDs are to be run
https://irb-dev.mddbr.eu/api/rest/current/projects/A0001/files/topology.prmtop -> source_irb_A0001_topology.prmtop
* Field "input_topology_filepath" in the inputs file will be permanently modified
Downloading file "topology.prmtop" in source_irb_A0001_topology.prmtop


 Processing MD at replica_1
https://irb-dev.mddbr.eu/api/rest/current/projects/A0001.1/files/trajectory.xtc -> source_irb_A0001.1_trajectory.xtc
* Field "mds.0.input_trajectory_filepaths" in the inputs file will be permanently modified
Downloading main trajectory (replica_1/source_irb_A0001.1_trajectory.xtc)


 Progress: 0.00B [00:00, ?B/s]

https://irb-dev.mddbr.eu/api/rest/current/projects/A0001.1/files/structure.pdb -> source_irb_A0001.1_structure.pdb

 Progress: 149kB [00:00, 508kB/s]

* Field "mds.0.input_structure_filepath" in the inputs file will be permanently modified
Downloading standard structure (replica_1/source_irb_A0001.1_structure.pdb)


 Workflow finished in 0.03 minutes 
Done!

Running job for dataset entry project_3

Running MDDB workflow (v0.1.8-212-gfa7cc53d)
Processing project at current directory
Downloading inputs file (source_irb_A0001_inputs.yaml)

  1 MDs are to be run
https://irb-dev.mddbr.eu/api/rest/current/projects/A0001/files/topology.prmtop -> source_irb_A0001_topology.prmtop
* Field "input_topology_filepath" in the inputs file will be permanently modified
Downloading file "topology.prmtop" in source_irb_A0001_topology.prmtop


 Processing MD at replica_1
https://irb-dev.mddbr.eu/api/rest/current/projects/A0001.1/files/structure.pdb -> source_irb_A0001.1_structure.pdb
* Field "mds.0.input_structure_filepath" in the inputs file will be permanently modified
Downloading standard structure (replica_1/source_irb_A0001.1_structure.pdb)

https://irb-dev.mddbr.eu/api/rest/current/projects/A0001.1/files/trajectory.xtc -> source_irb_A0001.1_trajectory.xtc
* Field "mds.0.input_trajectory_filepaths" in the inputs file will be permanently modified
Downloading main trajectory (replica_1/source_irb_A0001.1_trajectory.xtc)


 Progress: 0.00B [00:00, ?B/s]
 Progress: 149kB [00:00, 480kB/s]


 Workflow finished in 0.03 minutes 
Done!

[36]:

# Now our dataset should have some information about the status of the projects, which we can check with the dataframe:
ds.dataframe

[36]:

	project_uuid	scope	rel_path	num_mds	state	message	last_modified
uuid
2faff7cd		projects	project_3	1	done	Done!	13:24:32 13-02-2026
1f574789	2faff7cd	mds	project_3/replica_1		done	Done!	13:24:32 13-02-2026
4f835985		projects	project_2	1	done	Done!	13:24:26 13-02-2026
f457693e	4f835985	mds	project_2/replica_1		done	Done!	13:24:26 13-02-2026
bae0e3d6		projects	project_1	0	error	InputError: Impossible to know which are the M...	13:24:20 13-02-2026
c25f8ed5		projects	many_mds/project_1	0	new	No information recorded yet.	13:24:16 13-02-2026
edc04884		projects	many_mds/project_2	0	new	No information recorded yet.	13:24:16 13-02-2026
90e60590		projects	project_4_renamed	0	new	No information recorded yet.	13:23:54 13-02-2026
352ab0d3		projects	special_cases/case_1	0	new	No information recorded yet.	13:23:54 13-02-2026

SLURM

For computing clusters using SLURM, you can submit each project as a separate job. This requires a job template file that defines the SLURM configuration.

The job template is a Jinja2 template that will be rendered for each project. It should contain the SLURM directives and the command to run. The template has access to the following variables:

{{DIR}}: Absolute path to the project directory
Every field available in the inputs.yaml.

Here’s an example job template:

[ ]:

job_template_str = """#!/bin/bash
#SBATCH --job-name=mddb_workflow
#SBATCH --output=mwf_%j.out
#SBATCH --error=mwf_%j.err
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Load required modules
module load anaconda3

# Activate virtual environment if needed
conda activate mwf_env

# Change to project directory
cd {{DIR}}

# Run the workflow command
mwf run -filt -fit -e energies clusters pockets -m largeaa
"""

# Save the template to a file
job_template_path = dataset_dir + '/slurm_job_template.sh'
with open(job_template_path, 'w') as f:
    f.write(job_template_str)

Once you have a job template, you can submit jobs using the slurm=True parameter and providing the path to the template:

[ ]:

# Submit all projects as SLURM jobs
ds.launch(
    slurm=True,
    job_template=job_template_path
)

# Submit filtered projects as SLURM jobs
ds.launch(
    query_path=['project_*'],
    query_state=['new'],
    slurm=True,
    job_template=job_template_path
)

# Use custom workflow command with SLURM
ds.launch(
    slurm=True,
    job_template=job_template_path,
    cmd='mwf run --include meta network minimal'
)

[ ]:

# Submit all projects as SLURM jobs
!mwf dataset run {db_path} --slurm --job-template {job_template_path}

# Submit with filtering
!mwf dataset run {db_path} -p 'project_*' -st new --slurm -jt {job_template_path}

# Submit with custom workflow command
!mwf dataset run {db_path} --slurm -jt {job_template_path} -c 'mwf run --include meta network'

When running workflows (either locally or via SLURM), the dataset automatically tracks the state of each project. You can monitor progress using:

[40]:

ds.summary()

[40]:

	state	count
0	new	7

[ ]:

# Check the summary of project states
ds.summary()

# View the full dataset with log files
ds.get_dataframe(include_logs=True)

# Filter to see only running or error states
ds.get_dataframe(query_state=['running', 'error'])

# CLI: Watch the dataset in real-time (updates every few seconds)
# mwf dataset watch new_dataset.db

Tips

You can use mwf ds as a shortcut for mwf dataset in the CLI.
By default, mwf dataset commands look for a dataset file named *dataset*.db in the current directory, so you can execute them with just something like mwf ds show instead of mwf dataset --dataset_path path/to/your_dataset.db show.

Dataset limitations

Concurrent access to the dataset file may cause issues if the storage file is accessed by multiple processes simultaneously, especially when using sshfs or network filesystems that may not have proper locking mechanisms. This can lead to data corruption or loss if not handled carefully.
The history of flags used is not stored in the dataset, so if we change the flags used for a project, the dataset will not be aware of it and may show wrong information about the state of the project.
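
Regarding concurrent access: since the storage is an SQLite file, it can be inspected with Python's sqlite3 module. A cautious way to do so while workflows may be writing is to open it read-only and with a lock timeout, as in this minimal sketch. The helper name list_tables is hypothetical, no table names of the real dataset file are assumed, and a timeout does not make up for filesystems without proper locking.

```python
import sqlite3
import tempfile
from pathlib import Path


def list_tables(db_path: str, timeout_s: float = 30.0) -> list[str]:
    """List the table names of an SQLite file through a read-only connection.

    Hypothetical helper for illustration; `timeout_s` is how long to wait
    for a lock held by another process before raising an error.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True, timeout=timeout_s)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()


# Demo on a throwaway database with a single table
demo_db = str(Path(tempfile.mkdtemp()) / "demo.db")
setup_conn = sqlite3.connect(demo_db)
setup_conn.execute("CREATE TABLE demo (id INTEGER)")
setup_conn.commit()
setup_conn.close()

print(list_tables(demo_db))
```

Opening the file with mode=ro guarantees that the inspection itself cannot modify the dataset.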