
The strongest open-source image generation model changes hands overnight! Built by the original SD team, which is also about to release a SOTA video generation model

2024-08-02


Zhidongxi (public account: zhidxcom)
Author: Vanilla
Editor: Li Shuiqing

The most powerful open-source text-to-image model changed hands overnight!

Zhidongxi reported on August 2 that last night the original team behind Stable Diffusion, the leading open-source text-to-image model, announced a new image generation model: FLUX.1.

FLUX.1 comes in three models: a professional edition ([pro]), a developer edition ([dev]), and a fast edition ([schnell]). The first two beat mainstream models such as SD3-Ultra, and the smaller FLUX.1[schnell] also surpasses larger models such as Midjourney v6.0 and DALL·E 3.


▲ Comparison of FLUX.1 ELO scores with mainstream models

FLUX.1 excels at text generation, complex instruction following, and hand generation. Below is an example image generated by its most powerful model, the professional FLUX.1[pro]. Even when generating large blocks of text and multiple characters, there are no errors in the details of characters, hands, and so on.


▲Example of image generated by FLUX.1[pro]

FLUX.1 is now available on the model-hosting platform Replicate. Given the prompt "The world's smallest Black Forest cake, finger-sized, surrounded by Black Forest trees", the three models took 17.5 s, 12.2 s, and 1.5 s, respectively, to generate an image.


▲Comparison of three model generation

FLUX.1 also offers an API (application programming interface), priced per image. The three models cost $0.055, $0.03, and $0.003 per image, respectively (approximately RMB 0.4, 0.22, and 0.022).
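As a rough illustration of per-image pricing, the sketch below estimates what a batch of images would cost at the quoted prices. It assumes the three prices are listed in the order pro, dev, schnell (the order used elsewhere in this article); the function name `batch_cost_usd` is made up for this example.

```python
from decimal import Decimal

# Per-image API prices in USD as quoted in the article.
# Assumption: the three prices are listed in the order pro, dev, schnell.
PRICE_PER_IMAGE_USD = {
    "pro": Decimal("0.055"),
    "dev": Decimal("0.03"),
    "schnell": Decimal("0.003"),
}


def batch_cost_usd(model: str, num_images: int) -> Decimal:
    """Return the estimated cost in USD of generating num_images images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return PRICE_PER_IMAGE_USD[model] * num_images


if __name__ == "__main__":
    for name in PRICE_PER_IMAGE_USD:
        print(f"{name}: 1,000 images cost about ${batch_cost_usd(name, 1000)}")
```

Using `Decimal` rather than floats keeps the small per-image amounts exact when they are multiplied by large batch sizes.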

The company behind FLUX.1 is called Black Forest Labs. It was founded by the original Stable Diffusion team and several former Stability AI researchers. Like Stability AI, Black Forest Labs is committed to developing high-quality multimodal models and open-sourcing them. It has completed a $31 million (approximately RMB 225 million) seed round.

Black Forest Labs also announced that it will soon release a state-of-the-art (SOTA) video model. Judging from the demo it released, the model is top-tier in fluency, stability, and physical simulation. The company may become a dark horse in video generation.


▲Video generation model preview

Trial links for the three models:

https://replicate.com/black-forest-labs/flux-pro

https://replicate.com/black-forest-labs/flux-dev

https://replicate.com/black-forest-labs/flux-schnell

1. Good at generating text and human hands; all three models generate images in seconds

FLUX.1 performs strongly in visual quality, image detail, and output diversity. It has three standout strengths: text generation, complex composition, and hand rendering.

Text rendering is very important in image and video generation, and many models tend to confuse letters that look similar. FLUX.1 can handle tricky words with repeated letters, for example when generating a "Black Forest Flux Schnell Cake".


▲Black Forest Flux Schnell Cake

When it comes to composition, FLUX.1 excels at following complex instructions about where things should be in the image. For example, FLUX.1 perfectly interprets this prompt: Three magic wizards stand on a yellow table, each holding a sign. On the left, a wizard in black robes holds a sign that says "AI"; in the middle, a witch in red robes holds a sign that says "is"; on the right, a wizard in blue robes holds a sign that says "cool".


▲ Complex composition

Human hands have always been a major problem for multimodal generative models. The hands generated by FLUX.1 are not perfect, but they show great progress.


▲ Human hands

FLUX.1 comes in three versions: professional ([pro]), developer ([dev]), and fast ([schnell]).

Among them, FLUX.1[pro] is the most advanced version, with top-notch prompt following, visual quality, image detail, and output diversity. It offers customized enterprise solutions for professional users.


▲Example of image generated by FLUX.1[pro]

FLUX.1[dev] is aimed at non-commercial applications. It is derived from FLUX.1[pro] and has similar quality and capabilities, while being more efficient than standard models of the same size.


▲FLUX.1[dev] generated image example

FLUX.1[schnell] is the fastest of the three models. It is tailored for local development and personal use, and is publicly available under the Apache 2.0 license.


▲FLUX.1[schnell] generated image example

FLUX.1 is now available on the model-hosting platform Replicate, where it can be run in the cloud with just one line of code. Users can also download the model weights and run them programmatically. The FLUX.1 API is open as well, priced per image at $0.055, $0.03, and $0.003 for the three models, respectively (approximately RMB 0.4, 0.22, and 0.022).
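A cloud call of this kind might look like the sketch below. The model identifiers are taken from the trial URLs listed in this article. It assumes the third-party `replicate` Python client and its `replicate.run` function, an API token in the `REPLICATE_API_TOKEN` environment variable, and a `prompt` input field; the helper names `build_request` and `generate` are hypothetical.

```python
import os

# Model identifiers taken from the trial URLs listed in the article.
MODEL_IDS = {
    "pro": "black-forest-labs/flux-pro",
    "dev": "black-forest-labs/flux-dev",
    "schnell": "black-forest-labs/flux-schnell",
}


def build_request(variant: str, prompt: str) -> tuple[str, dict]:
    """Return the (model id, input payload) pair for a FLUX.1 variant."""
    # Assumption: the hosted models accept a "prompt" input field.
    return MODEL_IDS[variant], {"prompt": prompt}


def generate(variant: str, prompt: str):
    """Run the model on Replicate; needs network access and an API token."""
    import replicate  # third-party client: pip install replicate

    model_id, payload = build_request(variant, prompt)
    return replicate.run(model_id, input=payload)


# Only attempt a real call when an API token is configured.
if __name__ == "__main__" and os.environ.get("REPLICATE_API_TOKEN"):
    print(generate("schnell", "The world's smallest Black Forest cake"))
```

Separating `build_request` from `generate` lets the request be inspected without network access or an installed client.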

2. Beats Midjourney v6 and DALL·E 3; technical report to be released soon

In terms of performance, FLUX.1 has been specially fine-tuned to preserve the full output diversity from pre-training, setting new standards in areas such as prompt following, visual quality, and size/aspect-ratio variability.

The FLUX.1[pro] and [dev] models outperformed popular models such as Midjourney v6.0, DALL·E 3 and SD3-Ultra in all five evaluation criteria.

As a lightweight model, FLUX.1[schnell] outperforms not only similar competitors but also powerful non-distilled models such as Midjourney v6.0 and DALL·E 3.


▲Comparison of FLUX.1 performance with mainstream models

Additionally, all FLUX.1 models support multiple aspect ratios and resolutions between 0.1 and 2.0 megapixels.
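To make the 0.1 to 2.0 megapixel range concrete, the sketch below picks a width and height for a given aspect ratio and pixel budget. Rounding each side to a multiple of 16 is an assumption (a common constraint for latent diffusion models) that the article does not confirm, and the function name `dimensions_for` is hypothetical.

```python
import math

MIN_MP, MAX_MP = 0.1, 2.0  # supported range quoted in the article


def dimensions_for(aspect_w: int, aspect_h: int, megapixels: float,
                   multiple: int = 16) -> tuple[int, int]:
    """Pick a (width, height) close to the requested aspect ratio and size.

    Assumption: sides are rounded to a multiple of 16; the article does not
    state the actual constraint used by FLUX.1.
    """
    if not MIN_MP <= megapixels <= MAX_MP:
        raise ValueError(f"megapixels must be between {MIN_MP} and {MAX_MP}")
    total_pixels = megapixels * 1_000_000
    # Solve width * height = total_pixels with width / height = aspect ratio.
    height = math.sqrt(total_pixels * aspect_h / aspect_w)
    width = height * aspect_w / aspect_h

    def snap(value: float) -> int:
        # Round to the nearest allowed multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    print(dimensions_for(16, 9, 2.0))  # widescreen at the top of the range
    print(dimensions_for(1, 1, 0.1))   # square at the bottom of the range
```

Because of the rounding step, the resulting pixel count can land slightly above or below the requested budget.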


▲Aspect ratio/resolution changes

How is such powerful performance achieved?

In terms of model architecture, FLUX.1 adopts a hybrid architecture based on multimodal and parallel diffusion Transformer blocks, scaled up to 12B parameters.

The team improved on previous state-of-the-art diffusion models by building on flow matching, and increased model performance and hardware efficiency by incorporating rotary positional embeddings and parallel attention layers. A more detailed technical report will be released soon.
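The article does not describe how FLUX.1 applies flow matching, and the technical report has not been released. Purely as background, the sketch below shows one common form of a flow-matching training target: a straight-line path between a noise sample and a data sample, with the constant velocity along that path as the regression target. This is an illustrative assumption, not FLUX.1's actual training code, and all names are hypothetical.

```python
import random


def flow_matching_pair(noise, data, t):
    """Return (x_t, target_velocity) for a linear noise-to-data path.

    x_t = (1 - t) * noise + t * data moves from noise (t=0) to data (t=1);
    the velocity along this straight path is constant: data - noise.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def squared_error(predicted, target):
    """Mean squared error, a typical regression loss for the velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [0.5, -1.0, 2.0]  # stand-in for an image latent
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    x_t, v = flow_matching_pair(noise, data, t=0.25)
    # A real model would predict the velocity from (x_t, t); here an
    # all-zero guess is scored just to show how the loss is computed.
    print(squared_error([0.0] * len(v), v))
```

Some formulations run time in the opposite direction (data at t=0, noise at t=1); the sketch fixes one convention and states it in the docstring.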

3. Original SD team, RMB 225 million seed round, SOTA video model on the way

Black Forest Labs was founded by the original team behind Stable Diffusion, whose previous work also includes the high-quality image generation model VQGAN and the video generation model Stable Video Diffusion.

Of the five original authors of Stable Diffusion, the four who joined Stability AI and continued to develop later versions of SD (Robin Rombach, Andreas Blattmann, Dominik Lorenz, and Patrick Esser) are all on the founding team of Black Forest Labs.


▲ Stable Diffusion authors and the Black Forest Labs founding team

The team said its core beliefs are to develop widely accessible models, foster innovation and collaboration within the research and academic communities, and increase model transparency.

Black Forest Labs announced the completion of a $31 million (about RMB 225 million) seed round led by the well-known venture capital firm a16z (Andreessen Horowitz), with participation from experts and AI companies, including Brendan Iribe, former CEO of VR manufacturer Oculus; Garry Tan, CEO of startup incubator YC; and Timo Aila, a researcher at Nvidia. It also received follow-on investment from top-tier funds such as General Catalyst.

The team’s advisory board includes Michael Ovitz, former Disney president with extensive experience in the content creation industry, and Professor Matthias Bethge, a pioneer in neural style transfer.

Andrej Karpathy, the AI expert who recently started his own company, sent his congratulations to the Black Forest Labs team and said that "the open source FLUX.1 image generation model looks very powerful."


▲ Karpathy's comments

Emad Mostaque, former CEO of Stability AI and the founding team's former boss, also sent a congratulatory message, saying "It has been an honor to work with them before, and I am sure they will continue to push the boundaries in their journey to generate every pixel."


▲ Mostaque's comments

As a next step, Black Forest Labs will release a SOTA text-to-video model, "enabling everyone to turn text into video". The model will be built on FLUX.1, "enabling precise creation and editing in high definition and unprecedented speed".


▲Video generation model preview

Conclusion: A dark horse emerges in the field of multimodal large models

While many large companies and startups are racing to develop text-to-video models, a dark horse has suddenly appeared in text-to-image. The "out of nowhere" FLUX.1 not only shows excellent performance but also makes headway on hard problems such as text generation, complex composition, and hand rendering, and its multiple versions meet the needs of different users.

Backed by the strength of the original Stable Diffusion team, Black Forest Labs has secured a generous seed round and attracted the attention and support of many industry leaders. The video model it plans to release could inject new vitality into the text-to-video field.