
Jia Yangqing's classic work wins the Test of Time Award! ICML 2024 announces its ten Best Papers, with SD3 and Google among the big winners.

2024-07-24



New Intelligence Report

Editors: Peach, So Sleepy

[New Intelligence Introduction] The annual ICML awards have finally been announced! This year, ten papers won the Best Paper Award, and three of them are household names: the image generation model SD3, the video generation model VideoPoet, and the foundation world model Genie. In addition, the Test of Time Award went to DeCAF, a framework proposed by Jia Yangqing and his team ten years ago.

The ICML 2024 awards are out!

The opening ceremony of ICML has just been held. Ten Best Paper Awards were announced at the ceremony, and a paper written ten years ago won the Test of Time Award.

Among the best papers are several popular works in AI image and video generation, including the SD3 technical report, the CMU and Google video generation model VideoPoet, and Google's foundation world model Genie.



It is worth mentioning that DeCAF, the paper published in October 2013 by AI heavyweight Jia Yangqing and his collaborators, won the Test of Time Award.

He has since posted that he is deeply honored to receive the award.


Russ Salakhutdinov, professor at CMU and vice president of Meta GenAI, summarized the overall acceptance statistics of ICML 2024:

This year's conference received a total of 9,473 submissions, of which 2,610 were accepted, for an acceptance rate of 27.55%; 144 were Orals and 191 were Spotlights.

This year, we introduced a new Position Paper category, with 286 submissions and 75 acceptances (26%), including 15 Orals and 11 Spotlights.

In addition, there were 145 workshop proposals, of which 30 were accepted, and 55 tutorial proposals, of which 12 were accepted.


ICML 2024 is the 41st edition of the annual conference, held in Vienna, Austria, from July 21 to 27.


8,675 people attended in person, leaving no empty seats in the hall.



ICML 2024 Conference Overview

Before the award ceremony, the organizing committee first presented an overview of this year's conference:

9 EXPO Talk Panels

12 Tutorials

6 invited lectures

2,610 papers (main conference)

30 workshops

12,345 authors and speakers

39% of participants were students

10 offline social activities

3 affinity events

52 volunteers

97 Senior Area Chairs (SACs), 492 Area Chairs (ACs), 7,473 reviewers

9,406 registered attendees (8,675 of whom attended on-site)


Based on the accepted papers, ICML compiled the highest-frequency keywords, which double as the year's hot topics:

"Large model" appeared most frequently, more than 600 times.

Next are reinforcement learning, deep learning, graph neural networks, machine learning, federated learning, diffusion models, Transformer, LLM, representation learning, generative models, etc.


By country/region of registration, the United States leads with 2,463 attendees, and China ranks second with more than 1,100.

Test of Time Award

Generally speaking, the Test of Time Award honors academic papers that have had a significant and lasting impact for ten years or more.


The paper is a classic completed by Jia Yangqing, the father of Caffe, together with his collaborators while he was a student at UC Berkeley and an intern at Google.

He once said in an interview that he drank too much coffee while interning at Google in 2013, so he named the framework DeCAF to urge himself to quit coffee.


Working overtime, he wrote DeCAF, which he described as "the foundation features and deep embeddings of the vision field, giving computer vision a generalizable feature representation..."

The impact of the DeCAF work is that it gave rise to the general object detection framework R-CNN and the high-performance heterogeneous computing framework Caffe; it also indirectly led to the Berkeley-NVIDIA collaboration on the first-generation acceleration library cuDNN and to Yahoo Labs' large-scale distributed training system CaffeOnSpark, a series of works that established Berkeley's leading position in the deep learning wave.


Title: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

Authors: Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell

Institution: University of California, Berkeley


Paper address: https://arxiv.org/abs/1310.1531

To express human behavior with a better probabilistic framework, the team wrote their first framework, DeCAF.

In this work, the authors evaluate whether features extracted from a deep convolutional network, trained in a fully supervised fashion on a large, fixed set of object recognition tasks, can be repurposed for new generic tasks.

These new tasks may differ significantly from the original training tasks, and labeled data may be scarce or entirely absent, making it impossible to train or fine-tune a deep network for them by conventional means.

In addition, the authors visualize the semantic clustering of deep convolutional features on tasks such as scene recognition, domain adaptation, and fine-grained recognition, and achieve new state-of-the-art results on several important vision challenges by comparing fixed features taken from different layers of the network.

Finally, the authors release DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters, to help vision researchers experiment with deep representations across a range of visual concept learning paradigms.
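To make the recipe concrete, here is a minimal PyTorch sketch of the transfer pattern DeCAF popularized: freeze a network pre-trained on a large supervised task, read off activations from a late layer, and train only a shallow model on the new task. The backbone and layer choice here are illustrative assumptions, not the paper's exact setup (DeCAF predates these libraries and used an AlexNet-style network trained on ImageNet).

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative stand-in for DeCAF's recipe: any ImageNet-pretrained
# backbone demonstrates the idea of reusing supervised features.
backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
backbone.eval()  # frozen: we only read activations, never update them

# Treat the penultimate fully connected activations as a generic feature,
# analogous to the "DeCAF6"/"DeCAF7" layers discussed in the paper.
feature_extractor = nn.Sequential(
    backbone.features,
    backbone.avgpool,
    nn.Flatten(),
    *list(backbone.classifier.children())[:-1],  # drop the ImageNet head
)

@torch.no_grad()
def decaf_features(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224) normalized batch -> (N, 4096) features."""
    return feature_extractor(images)

# A new task then only needs a shallow model on top of the fixed features,
# e.g. a linear classifier (10 = number of classes in the new task):
linear_head = nn.Linear(4096, 10)
```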


Ten Best Papers

This year, there are a total of ten best papers.



The papers below are listed in the order of their oral presentations.

Paper 1: Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Authors: Aaron Lou, Chenlin Meng, Stefano Ermon

Institutions: Stanford University, Pika Labs


Paper address: https://arxiv.org/abs/2310.16834

This study proposes SEDD (Score Entropy Discrete Diffusion), a new machine learning model aimed primarily at discrete data generation tasks.

Diffusion models have shown breakthrough performance in many generative modeling tasks, but they perform poorly in discrete domains such as natural language.

In the paper, the authors propose the concept of score entropy to fill this gap.

Score entropy is a novel loss function that naturally extends score matching to discrete spaces, integrates seamlessly into the construction of discrete diffusion models, and significantly improves performance.
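Schematically, and up to the paper's exact notation, the score entropy objective trains a network s_θ(x)_y to match the ratios p(y)/p(x) between a sequence x and its neighboring perturbations y, weighted by the transition weights w_{xy} of the forward corruption process:

```latex
\mathcal{L}_{\mathrm{SE}}
  = \mathbb{E}_{x \sim p}\Bigg[\sum_{y \neq x} w_{xy}
    \Big( s_\theta(x)_y - \frac{p(y)}{p(x)} \log s_\theta(x)_y
        + K\Big(\frac{p(y)}{p(x)}\Big) \Big)\Bigg],
\qquad K(a) := a \log a - a.
```

The constant K(·) keeps the loss nonnegative, and the minimum is attained exactly when s_θ(x)_y equals the true ratio, which is what makes this a discrete analogue of score matching.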

In experimental evaluations, SEDD outperformed existing language diffusion models, reducing perplexity by 25-75%.

Moreover, it surpasses autoregressive models such as GPT-2 in some aspects.


In summary, the advantages of SEDD are:

- Generates high-quality text without tricks such as temperature scaling (its generative perplexity is roughly 6-8x better than un-annealed GPT-2)

- Trades off compute against output quality flexibly (comparable performance with 32x fewer network evaluations)

- Supports controllable text infilling, offering extra flexibility (matching nucleus-sampling quality while supporting strategies beyond left-to-right prompting)

Paper 2: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Authors: Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach

Institution: Stability AI


Paper address: https://arxiv.org/abs/2403.03206

As mentioned at the beginning, this paper is a technical report on the popular Stable Diffusion 3.

Similar to Sora, SD3 uses an improved diffusion formulation and a new DiT-based text-to-image architecture.

Specifically, the authors use three different text encoders, two CLIP models and one T5, to process textual information, and a more advanced autoencoder to process image information.


The newly proposed Multimodal Diffusion Transformer (MMDiT) architecture uses independent sets of weights for the image and text representations, which significantly improves text understanding and spelling ability compared with earlier Stable Diffusion versions.
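As a rough illustration of what "independent weights with joint attention" means, here is a hedged PyTorch sketch of an MMDiT-style block. The sizes, single-head attention, and omitted pieces (timestep modulation, normalization layers) are simplifying assumptions, not SD3's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMDiTBlockSketch(nn.Module):
    """Sketch: image and text tokens keep separate projection and MLP
    weights but attend jointly over the concatenated sequence."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # One set of weights per modality ("independent sets of weights").
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.out_img = nn.Linear(dim, dim)
        self.out_txt = nn.Linear(dim, dim)
        self.mlp_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        n_img = img.shape[1]
        q_i, k_i, v_i = self.qkv_img(img).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_txt(txt).chunk(3, dim=-1)
        # Joint attention: every token attends over both modalities.
        q = torch.cat([q_i, q_t], dim=1)
        k = torch.cat([k_i, k_t], dim=1)
        v = torch.cat([v_i, v_t], dim=1)
        out = F.scaled_dot_product_attention(q, k, v)  # single-head for brevity
        img_attn, txt_attn = out[:, :n_img], out[:, n_img:]
        img = img + self.out_img(img_attn)
        txt = txt + self.out_txt(txt_attn)
        # Modality-specific feed-forward networks, with residuals.
        return img + self.mlp_img(img), txt + self.mlp_txt(txt)
```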

Evaluation results show that SD3 matches or exceeds the state of the art in text-to-image generation, whether in prompt-following accuracy, legible text rendering, or the visual appeal of its images.


Paper 3: Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo

Authors: Stephen Zhao, Rob Brekelmans, Alireza Makhzani, Roger Grosse

Institutions: University of Toronto, Vector Institute


Paper address: https://arxiv.org/abs/2404.17546

This research focuses on the problems of sampling and inference in large models.

Many LLM capability and safety techniques, such as RLHF, automated red-teaming, prompt engineering, and infilling, can be viewed as:

sampling from an unnormalized target distribution defined by a given reward or potential function, where the distribution ranges over complete sequences.

In the paper, the authors propose using Sequential Monte Carlo (SMC) methods to tackle these probabilistic sampling problems.

To that end, they propose twist functions that estimate the expected future value at each time step, thereby steering the sampling process toward promising partial sequences.
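A minimal sketch of the mechanics is below, assuming hypothetical lm_step and twist callables as stand-ins for the base model and the learned twist function (the paper also considers twist-informed proposals, which are omitted here):

```python
import torch

def twisted_smc_sketch(lm_step, twist, n_particles=16, max_len=32):
    """Sketch of twisted SMC over LLM token sequences.

    lm_step(seqs) -> (n_particles, vocab) next-token logits from the base LM.
    twist(seqs)   -> (n_particles,) log twist values estimating the future
                     value-to-go of each partial sequence (learned, per the paper).
    """
    seqs = torch.zeros(n_particles, 1, dtype=torch.long)  # BOS-only particles
    log_w = torch.zeros(n_particles)                      # importance log-weights

    for _ in range(max_len):
        prev = seqs
        next_tok = torch.distributions.Categorical(logits=lm_step(prev)).sample()
        seqs = torch.cat([prev, next_tok[:, None]], dim=1)

        # Incremental weight: change in the estimated value-to-go.
        log_w = log_w + twist(seqs) - twist(prev)

        # Multinomial resampling focuses particles on promising prefixes.
        idx = torch.multinomial(torch.softmax(log_w, dim=0),
                                n_particles, replacement=True)
        seqs, log_w = seqs[idx], torch.zeros(n_particles)
    return seqs
```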

In addition, they propose methods for evaluating the accuracy of language model inference techniques using novel bidirectional SMC bounds.

The final results show that twisted SMC is effective at sampling undesirable outputs from pre-trained models, generating reviews with varied sentiment, and performing infilling tasks.

Paper 4: Position: Measure Dataset Diversity, Don’t Just Claim It

Authors: Dora Zhao, Jerone T. A. Andrews, Orestis Papakyriakopoulos, Alice Xiang

Institutions: Stanford University, Technical University of Munich, Sony AI


Paper address: https://arxiv.org/abs/2407.08188

Many datasets today label themselves as "diverse", yet they rest on abstract and contested social concepts.

In this work, the authors explore this question by analyzing "diversity" in 135 image and text datasets.

Drawing on measurement theory from the social sciences, the authors identify considerations and offer recommendations for conceptualizing, operationalizing, and evaluating diversity in datasets.

The ultimate goal of this study is to call on AI researchers to treat value-laden properties of data with more nuance and precision in machine learning research, especially during dataset construction.


Paper 5: Stealing Part of a Production Language Model

Authors: Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr

Institutions: ETH Zurich, University of Washington, McGill University, Google DeepMind, OpenAI


Paper address: https://arxiv.org/abs/2403.06634

In this work, the authors present the first model-stealing attack that extracts precise, nontrivial information from black-box production language models such as OpenAI's ChatGPT or Google's PaLM-2.

Specifically, the attack reconstructs the embedding projection layer of a Transformer model (up to symmetries) through ordinary API access.

For less than $20, the attack extracted the entire projection matrices of OpenAI's Ada and Babbage language models, confirming for the first time that these black-box models have hidden dimensions of 1,024 and 2,048, respectively.

The authors also recovered the exact hidden-dimension size of the gpt-3.5-turbo model and estimate that extracting its entire projection matrix would cost only about $2,000.
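The core linear-algebra observation is simple enough to sketch: final-layer logits are W·h for a (vocab × hidden) projection W, so full logit vectors collected across many prompts span a subspace whose dimension is the hidden size. A hedged NumPy sketch, leaving out the paper's API tricks for recovering full logit vectors from restricted endpoints:

```python
import numpy as np

def estimate_hidden_dim(logit_rows: np.ndarray, tol: float = 1e-4) -> int:
    """logit_rows: (num_queries, vocab_size) array of full logit vectors,
    one per prompt. Each row lies in the column space of the (vocab x
    hidden) projection matrix, so the numerical rank of the stacked
    matrix reveals the hidden dimension."""
    s = np.linalg.svd(logit_rows, compute_uv=False)
    return int((s > tol * s[0]).sum())  # singular values above the noise floor

# The same SVD also recovers the projection matrix itself, but only up to
# an unknown invertible hidden-space transform: these are the
# "symmetries" mentioned above.
```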

Finally, the authors propose potential defenses and mitigations and discuss implications for future work.


Paper 6: Information Complexity of Stochastic Convex Optimization: Applications to Generalization and Memorization

Authors: Idan Attias, Gintare Karolina Dziugaite, Mahdi Haghifam, Roi Livni, Daniel M. Roy

Institutions: Ben-Gurion University, Northeastern University, Tel Aviv University, University of Toronto, Vector Institute, Google DeepMind


Paper address: https://arxiv.org/abs/2402.09327

In this work, the authors study the interplay between memorization and learning in the context of stochastic convex optimization (SCO).

The authors first define memorization as the information a learning algorithm reveals about its training data points, then quantify it using the conditional mutual information (CMI) framework. This makes it possible to precisely characterize the tradeoff between a learning algorithm's accuracy and its CMI.

The results show that in the L^2 Lipschitz-bounded setting and under strong convexity, the CMI of any learner with excess error ε is lower bounded by Ω(1/ε^2) and Ω(1/ε), respectively.
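For reference, the CMI quantity being bounded (in the Steinke-Zakynthinou formulation this line of work builds on, stated here up to notation) conditions on a "supersample" and asks how much the algorithm's output reveals about which half was actually used for training:

```latex
% \tilde{Z} is an n x 2 supersample of i.i.d. draws from the data
% distribution; S \in \{0,1\}^n selects one point per row as the actual
% training set \tilde{Z}_S. Memorization is then measured by:
\mathrm{CMI}_{\mathcal{D}}(A) \;=\; I\big(A(\tilde{Z}_S);\, S \,\big|\, \tilde{Z}\big)
```

The lower bounds above say that this quantity cannot stay small as the excess error ε shrinks.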

Furthermore, they demonstrate the essential role of memorization in SCO learning problems by designing an adversary that can accurately identify most of the training samples in specific SCO problems.

Finally, the authors draw out several important implications, such as limitations of CMI-based generalization bounds and the incompressibility of samples in SCO problems.

Paper 7: Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining

Authors: Florian Tramèr, Gautam Kamath, Nicholas Carlini

Institutions: ETH Zurich, University of Waterloo, Vector Institute, Google DeepMind


Paper address: https://arxiv.org/abs/2212.06470

The performance of differentially private machine learning can be significantly improved by leveraging transfer learning from non-private models pre-trained on large public datasets.

In this work, the authors question whether the use of large web-scraped datasets is consistent with protecting privacy. They warn that calling models pre-trained on web data "private" could cause many harms, such as undermining public trust in the concept of differential privacy.

In addition to the privacy concerns of using public data, the authors further question the practicality of this approach.

This concern is particularly pronounced for models too large for end users to run on their own devices: using them requires handing private data to a more computationally powerful third party, so deploying such models can amount to a net loss of privacy.

Finally, the authors discuss potential development paths in the field of private learning as public pre-training becomes more popular and powerful.

Paper 8: Debating with More Persuasive LLMs Leads to More Truthful Answers

Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez

Institutions: University College London, Speechmatics, MATS, Anthropic, FAR AI


Paper address: https://arxiv.org/abs/2402.06782

Commonly used LLM alignment methods rely heavily on human-labeled data.

However, as models grow more capable, they will surpass human expertise, and the role of human evaluators will shift to that of non-experts supervising experts.

Based on this, the authors ask: can a weaker model evaluate the correctness of a stronger model?

By definition, the stronger model (expert) has the necessary information to answer the question, while the weaker model (non-expert) lacks this information.

The evaluation takes the form of a debate: two LLM experts each defend a different answer, and a non-expert judge, rather than the experts, chooses the answer.
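In pseudocode, the protocol looks roughly like the sketch below, where expert_a, expert_b, and judge are hypothetical LLM-call wrappers rather than the paper's actual code:

```python
def debate_sketch(expert_a, expert_b, judge, question, answers, rounds=3):
    """Two information-privileged experts argue for opposing answers;
    a non-expert judge, who never sees the underlying source text,
    picks a winner based only on the exchanged arguments."""
    transcript = []
    for _ in range(rounds):
        # Each debater sees the transcript so far and defends its answer.
        transcript.append(("A", expert_a(question, answers[0], transcript)))
        transcript.append(("B", expert_b(question, answers[1], transcript)))
    return judge(question, answers, transcript)  # index of the chosen answer
```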


The results show that debate consistently helps both non-expert models and humans answer questions better, reaching 76% and 88% accuracy, respectively (against baselines of 48% and 60%).

Furthermore, optimizing the persuasiveness of expert debaters in an unsupervised manner improves the ability of non-experts to identify the truth in debates.


Paper 9: Genie: Generative Interactive Environments

Authors: Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, Tim Rocktäschel

Institutions: Google DeepMind, University of British Columbia


Paper address: https://arxiv.org/pdf/2402.15391

Genie is the foundation world model released by the Google DeepMind team.

From a single image, whether a photo or a sketch, it can generate an endless interactive world.


What is remarkable about Genie is that it learns from 200,000 hours of unlabeled internet video and can be trained without supervision.

Without any action annotations, it can identify the protagonist and let the user control that character in the generated world.

Specifically, this is achieved through three core components: a latent action model, a video tokenizer, and an autoregressive dynamics model.


The resulting learned latent action space not only enables user interaction but also helps train intelligent agents to imitate behaviors in unseen videos.
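How the three components fit together at inference time can be sketched as follows; the callables are placeholders for the paper's spatiotemporal transformers, and the interface is an assumption for illustration:

```python
def genie_rollout_sketch(tokenize, dynamics, first_frame, user_actions):
    """Controllable rollout from a single starting image.

    tokenize(frame)      -> discrete video tokens (the video tokenizer).
    dynamics(history, a) -> next frame's tokens, conditioned on a discrete
                            latent action a (the autoregressive dynamics model).
    The latent action model itself is only needed during training, where it
    infers actions between consecutive frames without any action labels.
    """
    history = [tokenize(first_frame)]
    for action in user_actions:      # the user drives the world by choosing
        history.append(dynamics(history, action))  # a latent action per step
    return history  # decode the tokens back to frames to render the world
```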

In summary, Genie opens up a new avenue for cultivating future generalist agents and reshapes the landscape of interactive generative environments.

Paper 10: VideoPoet: A Large Language Model for Zero-Shot Video Generation

Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang

Institutions: Carnegie Mellon University, Google


Paper address: https://arxiv.org/pdf/2312.14125

Before the release of Sora, the CMU and Google team had already launched VideoPoet, a Sora-like video generation technology, in December 2023.

VideoPoet can generate coherent 10-second videos with large motions in a single pass, and it does so without requiring task-specific training data.


Specifically, VideoPoet mainly comprises the following components (a minimal sketch of how they fit together follows the list):

- A pre-trained MAGVIT V2 video tokenizer and a SoundStream audio tokenizer, which convert images, videos, and audio clips of varying lengths into discrete code sequences in a unified vocabulary. These codes are compatible with text-based language models, making them easy to combine with other modalities such as text.

- An autoregressive language model that learns across the video, image, audio, and text modalities, predicting the next video or audio token in the sequence autoregressively.

- Multiple multimodal generative learning objectives introduced into the large language model training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting/outpainting, video stylization, and video-to-audio. Moreover, these tasks can be composed with one another for additional zero-shot capabilities (e.g., text-to-audio).
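A minimal sketch of the "everything becomes tokens" design described above; the token-id layout and separator are illustrative assumptions, not the paper's actual format:

```python
def build_videopoet_sequence(text_ids, video_ids, audio_ids, sep_id=0):
    """Flatten three modalities into one token sequence for an autoregressive
    transformer. The task (T2V, V2A, frame continuation, ...) is determined
    by which spans are given as the prefix and which spans the model must
    generate; the objective is always next-token prediction."""
    return list(text_ids) + [sep_id] + list(video_ids) + [sep_id] + list(audio_ids)

# Example: for text-to-video, the text tokens (plus separator) form the
# prompt prefix, and the model autoregressively emits the video tokens.
```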


Unlike most leading models, VideoPoet is not based on diffusion; it is a large multimodal model, which lets it cover capabilities such as T2V and V2A within a single system.

In short, VideoPoet has three major advantages: longer videos, more precise control, and powerful camera motion.


Best Reviewer Award

Finally, the Best Reviewer Awards were also announced at the ICML 2024 conference.


References:

https://x.com/icmlconf/status/1815646373791842545

https://x.com/icmlconf/status/1815646856241672211