
The whole team left their old company: Stable Diffusion's core authors start fresh and beat Midjourney v6

2024-08-02


Machine Heart Report

Editors: Du Wei, Jia Qi

The field of AI image and video generation has a powerful new player.

Remember Robin Rombach, the research scientist who left the AI startup Stability AI at the end of March this year? One of the two lead authors of the text-to-image model Stable Diffusion, he joined Stability AI in 2022.



Now, a little over four months after leaving Stability AI, Robin Rombach has tweeted the good news that he is starting his own company!

He founded Black Forest Labs, which aims to advance SOTA high-quality generative deep learning models for images and video and to make them available to as many people as possible.



The team consists of outstanding AI researchers and engineers whose previous representative work includes VQGAN and Latent Diffusion; the Stable Diffusion family of image and video models (including Stable Diffusion XL, Stable Video Diffusion, and Rectified Flow Transformers); and Adversarial Diffusion Distillation for ultra-fast real-time image synthesis.

Notably, in addition to Robin Rombach, three other Stable Diffusion authors are on the founding team: Andreas Blattmann, Dominik Lorenz, and Patrick Esser. All of them left Stability AI earlier this year, and some speculated at the time that they were leaving to start their own company.



Currently, the lab has completed a $31 million seed round led by Andreessen Horowitz. Other investors include angel investors Brendan Iribe, Michael Ovitz, Garry Tan, Timo Aila, and Vladlen Koltun, along with several well-known AI researchers and founders. It has also received follow-on investment from General Catalyst and MätchVC.

The lab has also established an advisory board whose members include Michael Ovitz, a tech mogul with deep experience in the content industry, and Professor Matthias Bethge, a pioneer of neural style transfer and a leading figure in open AI research in Europe.

Alongside the announcement, Black Forest Labs launched its first model series, FLUX.1, which includes the following three variants.



The first is FLUX.1 [pro], a new SOTA text-to-image model with extremely rich image detail, strong prompt adherence, and a wide range of styles. It is currently available through an API.

API address: https://docs.bfl.ml/



The second is FLUX.1 [dev], an open-weight, non-commercial variant of FLUX.1 [pro], distilled directly from it. The model outperforms other image models such as Midjourney and Stable Diffusion 3. Inference code and weights are available on GitHub. The figure below shows a comparison with competing image models.

GitHub address: https://github.com/black-forest-labs/flux



The third is the open-source FLUX.1 [schnell], a highly efficient 4-step model released under the Apache 2.0 license. It comes close to [dev] and [pro] in performance and can be used on Hugging Face.

Hugging Face address: https://huggingface.co/black-forest-labs/FLUX.1-schnell





Meanwhile, Black Forest Labs has also begun promoting itself.



The next goal is to launch a SOTA text-to-video model that is available to everyone, so stay tuned!



A hit on the first attempt: the text-to-image model series FLUX.1 is here

The three models launched by Black Forest Labs all use a hybrid architecture of multimodal and parallel diffusion Transformer blocks. Unlike other companies that split a model series into small, medium, and large tiers by parameter count, every member of the FLUX.1 family is scaled up to 12 billion parameters.



The research team upgraded the previous SOTA diffusion models using the Flow Matching framework. From the comments in the official blog, it can be inferred that they are continuing the Rectified Flow + Transformer approach they proposed in March of this year, while still at Stability AI.



Paper link: https://arxiv.org/pdf/2403.03206.pdf
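The rectified-flow idea behind that paper can be sketched compactly: a noise sample and a data sample are joined by a straight line, and the network is trained to regress the constant velocity along that line. Below is a minimal, illustrative NumPy sketch; the toy `model`, shapes, and function names are assumptions for illustration, not FLUX internals:

```python
import numpy as np

def rectified_flow_loss(model, x1, rng):
    """One training step's loss for rectified flow (straight-line flow matching).

    x1: batch of data samples; x0: Gaussian noise.
    Along the straight path x_t = (1 - t) * x0 + t * x1, the target
    velocity is constant: v = x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)       # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))   # per-sample timestep in [0, 1]
    xt = (1.0 - t) * x0 + t * x1             # point on the straight path
    v_target = x1 - x0                       # constant velocity target
    v_pred = model(xt, t)                    # network's velocity prediction
    return np.mean((v_pred - v_target) ** 2)

# Toy "model" that predicts zero velocity everywhere.
zero_model = lambda xt, t: np.zeros_like(xt)
rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 2))
loss = rectified_flow_loss(zero_model, x1, rng)
```

Because the target paths are straight, a well-trained velocity field can be integrated in very few steps, which is what makes heavily distilled variants like the 4-step [schnell] plausible.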

They also introduced rotary position embeddings (RoPE) and parallel attention layers. These techniques improve the quality of the generated images and speed up generation on hardware.
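Rotary position embeddings encode a token's position by rotating consecutive pairs of query/key features through position-proportional angles, so attention scores end up depending only on relative offsets. A small NumPy sketch of the idea (illustrative only; the FLUX implementation has not been disclosed):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to the last axis of x.

    x:   (..., d) features with d even; consecutive pairs are rotated.
    pos: scalar token position.
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per feature pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin       # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Attention scores under RoPE depend only on the relative offset:
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
s_a = rope(q, 5) @ rope(k, 3)  # positions (5, 3): offset 2
s_b = rope(q, 9) @ rope(k, 7)  # positions (9, 7): offset 2
```

Since each pair is rotated in the same 2-D plane, the two dot products above are equal, which is the relative-position property that makes RoPE attractive for attention.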

Black Forest Labs has not disclosed the models' technical details this time, but says a more detailed technical report will be released soon.

The three models set new standards in their respective niches. Whether measured by the aesthetics of the generated images, how closely the images follow the text prompts, the variability of size and aspect ratio, or the diversity of output formats, FLUX.1 [pro] and FLUX.1 [dev] surpass a series of popular image generation models such as Midjourney v6.0, DALL·E 3 (HD), and their former employer's SD3-Ultra.

FLUX.1 [schnell] is the most advanced few-step model to date, outperforming not only similar competitors but also strong non-distilled models like Midjourney v6.0 and DALL·E 3 (HD).

The models are fine-tuned specifically to preserve the full output diversity of the pre-training phase, and the FLUX.1 family still leaves plenty of headroom over the current state of the art.



All models of the FLUX.1 series support a wide range of aspect ratios and resolutions, from 0.1 to 2 megapixels.
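To make that range concrete, here is a small hypothetical helper (not part of any FLUX API) that converts a megapixel budget and an aspect ratio into a width/height pair, rounded down to multiples of 64 as latent diffusion pipelines commonly require:

```python
import math

def dims_for(megapixels: float, aspect: float, multiple: int = 64):
    """Return (width, height) near the megapixel budget with the given
    width/height aspect ratio, rounded down to a multiple of `multiple`."""
    pixels = megapixels * 1_000_000
    height = math.sqrt(pixels / aspect)
    width = height * aspect
    snap = lambda v: max(multiple, int(v) // multiple * multiple)
    return snap(width), snap(height)

# e.g. a 16:9 image at the 2-megapixel top end of FLUX.1's stated range:
w, h = dims_for(2.0, 16 / 9)
```

Rounding down keeps the result inside the pixel budget while staying close to the requested aspect ratio.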



Quick-moving netizens have already tried it for themselves, and it seems the "strongest" label Black Forest Labs keeps emphasizing is not mere self-praise.

Results like this can be produced with simple prompts; look closely at the pattern on the llama's mat and you can see it shows no twisting or warping.



Prompt: An emerald Emu riding on top of a white llama.

If you weren't told this image was AI-generated, it would be hard to tell it apart from a photographer's shot.



Prompt: A horse is playing with two aligators at the river.

Images containing text are also handled with ease, and the depth of field matches a realistic camera look.



Among the three, FLUX.1 [schnell], though slightly weaker in performance, is fast and capable in use. Netizens who ran it on a Mac couldn't help marveling that it was practically instant.



Netizens unfamiliar with the "love-hate" saga between the Stable Diffusion authors and Stability AI lamented: who knows where this text-to-image model came from; it's frighteningly good.



For the story of the Stable Diffusion authors and their former company Stability AI, see Machine Heart's earlier report: when the company was worth $100 million, the team behind Stable Diffusion began to tear itself apart over who the real "official" version was.

Beyond these three top image generation models, Black Forest Labs has another card up its sleeve: such a strong image generation capability lays a solid foundation for video generation models. As they have announced, these top computer vision scientists are working toward bringing state-of-the-art generative video technology to everyone.

Company blog: https://blackforestlabs.ai/announcements/