
Cool! A heterogeneous cluster of old devices including Phone, iPad, and MacBook can run Llama 3

2024-07-16


Machine Heart Report

Synced Editorial Department

If you have a spare device, you might want to give it a try.

This time, the devices you already have on hand can also do real work in the field of AI.

By combining an iPhone, an iPad, and a MacBook, you can assemble a "heterogeneous cluster inference solution" and run the Llama 3 model smoothly.



It is worth mentioning that this heterogeneous cluster can mix Windows, Linux, and iOS systems, and support for Android is coming soon.

The heterogeneous cluster in operation.



According to the project author @evilsocket, this heterogeneous cluster includes an iPhone 15 Pro Max, an iPad Pro, a MacBook Pro (M1 Max), an NVIDIA GeForce 3080, and 2x NVIDIA Titan X Pascal. All the code has been uploaded to GitHub.

After seeing this, netizens remarked that the author clearly knows what he is doing.



However, some netizens began to worry about energy consumption: speed aside, the electricity bill would be hard to afford, and moving data back and forth between devices introduces too much overhead.





Project Introduction

All of this is made possible by a Rust framework called Cake. Cake performs distributed inference of large models (such as Llama 3) and aims to combine consumer-grade hardware running multiple operating systems, including iOS, Android, macOS, Linux, and Windows, into heterogeneous clusters, making AI more accessible.



Project address: https://github.com/evilsocket/cake

The main idea of Cake is to shard transformer blocks across multiple devices so that inference can be run on models that would normally not fit in the GPU memory of a single device. Inference on consecutive transformer blocks hosted on the same worker is batched to minimize the latency caused by data transfer.
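To make the idea concrete, here is a minimal Rust sketch (our illustration, not Cake's actual code or API) of this batching strategy: given an assignment of each transformer layer to a worker, consecutive layers hosted on the same worker are merged into a single run that can be executed with one network round trip. The Assignment type and the worker names are hypothetical.

// Illustrative sketch only -- not Cake's real implementation.
use std::ops::RangeInclusive;

/// Hypothetical assignment of one transformer layer to a named worker.
struct Assignment {
    layer: usize,
    worker: String,
}

/// Merge consecutive layers that live on the same worker into one "run",
/// so each run needs a single network round trip instead of one per layer.
fn batch_runs(assignments: &[Assignment]) -> Vec<(String, RangeInclusive<usize>)> {
    let mut runs: Vec<(String, RangeInclusive<usize>)> = Vec::new();
    for a in assignments {
        // Extend the current run if this layer directly follows the previous
        // one and is hosted on the same worker.
        if let Some((worker, range)) = runs.last_mut() {
            if *worker == a.worker && *range.end() + 1 == a.layer {
                *range = *range.start()..=a.layer;
                continue;
            }
        }
        // Otherwise start a new run on this worker.
        runs.push((a.worker.clone(), a.layer..=a.layer));
    }
    runs
}

fn main() {
    // Example: layers 0-2 on worker0, layers 3-4 on worker1.
    let assignments: Vec<Assignment> = (0..=2)
        .map(|l| Assignment { layer: l, worker: "worker0".into() })
        .chain((3..=4).map(|l| Assignment { layer: l, worker: "worker1".into() }))
        .collect();

    for (worker, range) in batch_runs(&assignments) {
        println!("{} serves layers {}..={}", worker, range.start(), range.end());
    }
}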

Cake currently supports the following systems and devices:



Compile

After installing Rust, run the following command:

cargo build --release

To generate the iOS bindings for use in an application, run:

make ios

Usage

Run the worker node:

cake-cli --model /path/to/Meta-Llama-3-8B \ # model path, read below on how to optimize model size for workers
         --mode worker \                    # run as worker
         --name worker0 \                   # worker name in topology file
         --topology topology.yml \          # topology
         --address 0.0.0.0:10128            # bind address

Run the master node:

cake-cli --model /path/to/Meta-Llama-3-8B \
         --topology topology.yml

The topology.yml determines which layers are served by which worker:

linux_server_1:
  host: 'linux_server.host:10128'
  description: 'NVIDIA Titan X Pascal (12GB)'
  layers:
    - 'model.layers.0-5'
linux_server_2:
  host: 'linux_server2.host:10128'
  description: 'NVIDIA GeForce 3080 (10GB)'
  layers:
    - 'model.layers.6-16'
iphone:
  host: 'iphone.host:10128'
  description: 'iPhone 15 Pro Max'
  layers:
    - 'model.layers.17'
ipad:
  host: 'ipad.host:10128'
  description: 'iPad'
  layers:
    - 'model.layers.18-19'
macbook:
  host: 'macbook.host:10128'
  description: 'M1 Max'
  layers:
    - 'model.layers.20-31'
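As a rough illustration (not part of Cake itself), the Rust sketch below parses a topology file like the one above and checks that every one of Llama 3 8B's 32 transformer layers is assigned to exactly one worker. The struct and function names, and the use of the serde (with the "derive" feature) and serde_yaml crates, are assumptions made for this example.

// Illustrative sketch only -- not shipped with Cake.
// Assumed dependencies: serde = { version = "1", features = ["derive"] }, serde_yaml = "0.9".
use serde::Deserialize;
use std::collections::BTreeMap;

#[derive(Deserialize)]
struct Node {
    host: String,
    description: String,
    layers: Vec<String>, // entries like 'model.layers.0-5' or 'model.layers.17'
}

/// Parse "model.layers.0-5" into (0, 5) and "model.layers.17" into (17, 17).
fn parse_range(entry: &str) -> Option<(usize, usize)> {
    let idx = entry.rsplit('.').next()?;
    match idx.split_once('-') {
        Some((a, b)) => Some((a.parse().ok()?, b.parse().ok()?)),
        None => {
            let n = idx.parse().ok()?;
            Some((n, n))
        }
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string("topology.yml")?;
    let topology: BTreeMap<String, Node> = serde_yaml::from_str(&text)?;

    // Llama 3 8B has 32 transformer layers; count how many workers claim each one.
    let mut owners = vec![0usize; 32];
    for (name, node) in &topology {
        for entry in &node.layers {
            let (start, end) = parse_range(entry).ok_or(format!("bad layer spec: {entry}"))?;
            println!("{name} at {} ({}) serves layers {start}-{end}", node.host, node.description);
            for l in start..=end {
                if l < owners.len() {
                    owners[l] += 1;
                }
            }
        }
    }
    for (l, count) in owners.iter().enumerate() {
        if *count != 1 {
            eprintln!("warning: layer {l} is assigned {count} time(s)");
        }
    }
    Ok(())
}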

To optimize memory and disk space, users may want to give each worker only the data it actually needs from the model, rather than the entire folder. For this, cake-split-model can be used. For example, to generate a smaller version of the Llama 3 safetensors, run:

cake-split-model --model-path path/to/Meta-Llama-3-8B \ # source model to split
                 --topology path/to/topology.yml \      # topology file
                 --output output-folder-name            # output folder

Reference link: https://x.com/tuturetom/status/1812654489972973643