
Thesis: Distributed development of AI models
My bachelor's thesis focused on distributed machine learning and received the Dean's Award.
Overview
For my bachelor's thesis, I explored distributed machine learning, focusing on techniques that allow training AI models across multiple devices or nodes. The goal was to understand how the stages of model development (training, hyperparameter tuning, inference) can be distributed across machines to make efficient use of the available computational resources.
Thesis Instructions
The development of machine learning models is predominantly executed on single machines with multiple processing units. This work explores the potential for distributing various stages of the development process—such as training, hyperparameter tuning, and inference—across multiple machines connected over the internet. Emphasis is placed on hyperparameter optimization, with an evaluation of existing methods and algorithms.
The main objectives of the thesis are:
- Study the use of distributed computing in machine learning, including model training and inference, with a primary focus on hyperparameter tuning.
- Analyze existing hyperparameter optimization methods and algorithms, with consideration of their applicability in distributed environments.
- Design and implement an API for collaborative hyperparameter tuning across multiple machines.
- Demonstrate the practical use of the API by connecting at least three machines to collaboratively perform hyperparameter tuning for a machine learning model.
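One property that makes hyperparameter tuning a good fit for distribution is that many optimization methods evaluate configurations independently. Random search is a common example; the thesis text above does not name the specific algorithm implemented, so the following is only an illustrative sketch, and the search space shown is hypothetical:

```python
import random

def sample_configuration(space, rng=None):
    """Draw one hyperparameter configuration by random search.

    `space` maps each hyperparameter name to a list of candidate values.
    Random search suits distributed tuning because every draw is
    independent: any client can sample and evaluate a configuration
    without coordinating with the others.
    """
    rng = rng or random.Random()
    return {name: rng.choice(values) for name, values in space.items()}

# Hypothetical search space, for illustration only.
space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10],
}
config = sample_configuration(space, random.Random(0))
```

Because each sampled configuration is self-contained, the server can hand them out to clients in any order without tracking dependencies between evaluations.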
Thesis Project
The project involved designing and implementing a system that allows multiple machines to collaborate on hyperparameter tuning for machine learning models. The system consists of a central server that coordinates the tuning process and multiple client machines that perform the actual training and evaluation of models with different hyperparameter configurations.
The server exposes a RESTful API built with Flask, through which clients request hyperparameter configurations. For communication during the training process I used WebSockets, allowing real-time updates and coordination between the server and clients.
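The thesis text does not reproduce the server code, so the following is a minimal sketch of the bookkeeping a Flask server like this might wrap; the class and method names are hypothetical. Each REST route would call one method: one endpoint hands out the next untried configuration, another records a client's reported score.

```python
import itertools
import threading

class TuningCoordinator:
    """Server-side bookkeeping behind the tuning API (illustrative sketch).

    Hands out one untried configuration per request and records the
    score a client reports back. A Flask route handler would simply
    wrap `next_configuration` and `report_result`.
    """

    def __init__(self, configurations):
        self._pending = list(configurations)
        self._assigned = {}   # job id -> configuration
        self._results = {}    # job id -> reported score
        self._ids = itertools.count()
        self._lock = threading.Lock()  # routes may run concurrently

    def next_configuration(self):
        """Return (job_id, config) for an untried config, or None when done."""
        with self._lock:
            if not self._pending:
                return None
            job_id = next(self._ids)
            self._assigned[job_id] = self._pending.pop(0)
            return job_id, self._assigned[job_id]

    def report_result(self, job_id, score):
        with self._lock:
            self._results[job_id] = score

    def best(self):
        """Return (config, score) of the best result reported so far."""
        with self._lock:
            if not self._results:
                return None
            job_id = max(self._results, key=self._results.get)
            return self._assigned[job_id], self._results[job_id]
```

Keeping the coordination state in one server-side object means clients stay stateless: any machine can join mid-run, ask for work, and report back.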
Clients connect to the server, request hyperparameter configurations, train the model using those parameters, and send back the results. The server collects these results and uses them for further analysis and optimization.
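The client's request-train-report cycle can be sketched as a simple loop. This is not the thesis code; the function names are illustrative, and the two transport callbacks stand in for the HTTP calls to the server, while `evaluate` would train and score a model (scikit-learn in the thesis) for the given hyperparameters.

```python
def run_client(fetch_config, evaluate, submit_result):
    """One client's work loop (sketch; names are illustrative).

    `fetch_config` asks the server for a (job_id, config) pair,
    `evaluate` trains a model with that config and returns its score,
    and `submit_result` reports the score back to the server.
    Returns how many configurations this client evaluated.
    """
    completed = 0
    while True:
        job = fetch_config()
        if job is None:          # server has no work left
            return completed
        job_id, config = job
        score = evaluate(config)
        submit_result(job_id, score)
        completed += 1
```

Because the loop only ever holds one job at a time, a slow machine simply completes fewer configurations; no static partitioning of the search space is needed.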
Results
The system was tested with multiple machines, demonstrating its ability to effectively distribute the hyperparameter tuning process. Although fairly simple, the implementation showcased the potential of distributed machine learning and provided insights into the challenges and considerations involved in such setups.
The system also remained usable for clients with unstable or unreliable internet connections.
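The thesis text does not describe the exact mechanism used to tolerate unreliable connections; retrying failed calls with exponential backoff is one standard approach, sketched below with hypothetical names.

```python
import time

def with_retries(call, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky network call with exponential backoff (sketch).

    Retries `call` up to `attempts` times, doubling the wait after
    each failure; the final failure is re-raised to the caller.
    `sleep` is injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A client wrapping its fetch and submit calls this way can ride out short outages instead of dropping out of the tuning run.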
Technologies Used
- Python
- Flask (for building the RESTful API)
- WebSockets (for real-time communication)
- Scikit-learn (for machine learning model training)