In the dynamic world of machine learning, where innovation and advancement are the norm, the role of configurations cannot be overstated.

Configurations serve as the backbone that defines how a machine learning model operates, influencing everything from hyperparameters and data preprocessing to where your data lives.

However, for those new to the field, navigating the intricacies of configurations can be a challenge, often resulting in a steep learning curve that hinders progress. This is where the concept of "config-driven development" comes into play, enabling quick iterations and empowering developers to focus on refining their models.

In this beginner's guide, we will unravel the significance of configurations in the context of machine learning, explore the hurdles that beginners commonly encounter, and delve into the solution that is poised to revolutionize how we manage configurations with Hydra.

So let’s discover how tools like Hydra can simplify and enhance your machine learning projects by streamlining configuration management, fostering reproducibility, and enabling seamless collaboration.

Understanding the Significance of Configurations

Configurations are the blueprint that governs how a machine learning model behaves.

Imagine them as the knobs and switches that determine the behaviour of a complex machine. In the realm of machine learning, these configurations encompass a wide range of parameters, from model architecture and hyperparameters to data preprocessing steps and training settings.

Anything you might want to adjust for a new model run should find a place in your config.

They play a pivotal role in the reproducibility of experiments, as different configurations can lead to vastly different results.

Consider a scenario in Earth sciences, where accurate weather forecasting relies on intricate configurations that capture the interplay of various atmospheric variables. A slight tweak in these configurations can have a profound impact on the accuracy of predictions, highlighting the critical nature of configuration management.

So in addition to defining configurations, we ideally want to track them in our experiment logging setup.

Introducing Hydra

Enter Hydra – a versatile and widely acclaimed tool, developed by Facebook / Meta, for managing configurations in machine learning projects.

Hydra doesn't just simplify configuration management; it revolutionizes it. It is renowned for its ability to integrate seamlessly with various frameworks and libraries, making it a favourite among the machine learning practitioners I've talked to.

The nice thing about Hydra is that it hooks into your main training loop with a single decorator.

import hydra
from omegaconf import DictConfig


@hydra.main(version_base=None, config_path="configs", config_name="config")
def train(cfg: DictConfig) -> None:
    print(cfg.hardware.paths)


if __name__ == "__main__":
    train()

We can then define defaults, modify them in an experiment config, or even override them from the command line.

The configs themselves are simple YAML files that look like this.

training:
  epochs: 100
  lr: 1.0e-4
logging:
  wandb: true

Just really useful!

Installing and Setting Up Hydra

To embark on your journey with Hydra, the first step is installation. Fear not, as the process is straightforward.

Using pip, you can quickly install Hydra and lay the foundation for streamlined configuration management.

pip install hydra-core

Once installed, organizing your project structure is essential. Create a dedicated directory for configurations and ensure that it's integrated seamlessly into your project.
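For orientation, a minimal layout (the file names here are illustrative) could look like this:

my-project/
├── configs/
│   └── config.yaml
├── train.py
└── requirements.txt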

This initial setup paves the way for harnessing the power of Hydra.

Creating Configurations with Hydra

Hydra's structured approach to configurations sets it apart.

With Hydra, you define your configuration files using a clean and organized YAML syntax. It encourages you to group related parameters, making it easier to manage complex configurations.

Imagine you're developing an image classification pipeline.

With Hydra, you can effortlessly configure various aspects such as model architecture, optimizer settings, and data augmentation techniques.

model:
  activation: SiLU
  loss: CrossEntropy
optimizer:
  name: AdamW
  lr: 1.0e-3
  momentum: true
data:
  augmentation:
    enable: true
    types:
      - affine
      - hue

Then in our code, we simply access the config values with the dot notation:

@hydra.main(version_base=None, config_path="configs", config_name="config")
def train(cfg):
    activation = cfg.model.activation
    loss = cfg.model.loss

    ...
    model.add(Dense(activation=activation))
    ...

    model.fit(data, loss=loss)

This modular approach enhances readability, reusability, and experimentation.

Managing Overrides and Composition

One of Hydra's standout features is its ability to handle configuration overrides. In the world of machine learning, where projects often span multiple environments (development, production, testing), overrides are a game-changer.

They enable you to fine-tune configurations for each environment, ensuring consistency while adapting to specific requirements.

Additionally, Hydra's composition feature empowers you to build configurations by combining smaller, reusable components. This flexibility allows you to create intricate configurations without sacrificing simplicity.

Here we can, for example, keep one config for the data paths, one for logging, and one for training settings. With those paths.yaml, logging.yaml, and training.yaml in place, we can then create a main config.yaml that lists them as defaults to combine:

defaults:
  - paths: paths
  - logging: logging
  - training: training

Hydra neatly composes these into a singular config object.

We can easily swap which main config is used, for example replacing it with an inference.yaml, by calling:

fancy-model --config-name=inference

which goes into the config directory and chooses a different main config that points to other configs.

Alternatively, we can override values in the command-line directly:

fancy-model --config-name=inference logging.enabled=false

These different layers of overrides give us a ton of flexibility when developing models together.

Pitfalls and Design Choices in Configuration Management

As with any powerful tool, there are potential pitfalls to watch out for.

In the excitement of exploring Hydra's capabilities, beginners might be tempted to create overly complex configurations. While Hydra promotes flexibility, striking a balance is crucial.

Design choices should prioritize clarity and simplicity to avoid drowning in a sea of configurations. The key is to start small, iterate, and gradually build complexity as needed.

Additionally, Hydra has made some design choices you need to keep in mind.

Its launcher feature only works for sweeps, so you can't use it when you're not doing a hyperparameter search. Odd choice, but it is what it is.
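Concretely, sweeps are started with Hydra's multirun flag. Reusing the hypothetical fancy-model entry point from above, a sweep over two learning rates and two epoch counts looks like:

fancy-model --multirun training.lr=1e-4,1e-3 training.epochs=50,100

This launches one run per combination (four in total), and only in this multirun mode does the launcher come into play.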

With Hydra, you can resolve configs at run-time, which enables config settings to rely on dynamic values. For example, say you locate your checkpoints based on an environment variable. Then the config looks like this:

checkpoints: ${oc.env:PROJECT_NUMBER}

However, these values are resolved only when they’re accessed.

I made the questionable choice of saving logs based on the current datetime using ${now:%Y-%m-%d-%H-%M-%S}. When this is resolved during inference, you get very different values than those used at training time in your saved checkpoints. Worse, if you're unlucky in your HPC queue system, you can end up with multiple log folders for the same experiment; or, if you are particularly eager, with multiple experiments in the exact same folder. Overall, a bad design choice on my part, exacerbated by Hydra's lazy resolution.

To avoid problems with unresolved configs in your checkpoints, resolve the config with OmegaConf early in your main training loop, because environment variables may not be available during inference or when porting models.

Integrating Hydra with Machine Learning Projects

The true power of Hydra shines when it seamlessly integrates with your machine learning code.

Accessing configuration parameters within your code is straightforward, allowing you to dynamically adjust settings without manual intervention. This becomes particularly valuable during hyperparameter tuning, where Hydra's organized configuration structure accelerates experimentation.

Whether you're fine-tuning a weather forecasting model or experimenting with a novel algorithm, Hydra simplifies the process.

But it also enables rapid feature development. Instead of adjusting the function signature for every new part we implement, we can pass the config object and simply add new config settings. This accelerates development tremendously in the early phases.

Additionally, as long as we track our configs, there is no mystery about which default values were used to train your very best model.

If you want to go advanced mode, you can even use Hydra utilities to instantiate classes from settings (think layers, optimisers, or different losses), but that is beyond the scope of this blog post.

Improving Reproducibility and Collaboration

Hydra's impact goes beyond configuration management; it significantly contributes to reproducibility and collaboration.

By encapsulating configurations, you ensure that your experiments are reproducible across different environments.

This aids collaboration among team members, enabling everyone to work with a shared understanding of the project's configuration.

Furthermore, version control becomes smoother as configurations are neatly organized and separated from the codebase.

Most loggers expect a dictionary, so there you can simply use OmegaConf to create a container as such: logger.log(OmegaConf.to_container(config, resolve=True)).

Hydra creates a fully merged config with all the information you provide.

Best Practices and Tips

When navigating the Hydra landscape, adhering to best practices can enhance your experience.

Begin by organizing your configuration files logically and grouping parameters intuitively.

You want to be able to access all your training parameters by simply going to config.training.bla and your logging parameters with config.logging.bla.

As your project evolves, resist the urge to overcomplicate configurations; prioritize clarity over complexity, and delete entries when they're no longer needed or can be simplified.

Resolve your config early, so you don't accidentally save environment variables as their key rather than their value.

Conclusion

In the spirit of continuous learning and community-driven growth, embracing Hydra as your configuration management tool of choice opens doors to new possibilities.

As you embark on your machine learning journey, remember that configurations are not mere settings but powerful orchestrators of your models' behaviour.

With Hydra, you're equipped to build powerful config-driven models. As you venture deeper into the world of machine learning with Hydra, remember that every line of configuration you write is a step toward shaping the future of our world.

I first wrote about config-driven development in my newsletter in July 2022.