news

Jia Yangqing's paper wins the ICML 2024 Test of Time Award; none of the ten Best Paper Awards went to teams from China

2024-07-23


Synced Editorial Department

ICML, the International Conference on Machine Learning, is organized by the International Machine Learning Society (IMLS) and is one of the top conferences in artificial intelligence and machine learning.

This year marks the 41st ICML, currently being held in Vienna, Austria. At the just-concluded opening ceremony, the conference, whose popularity keeps growing year over year, announced this year's submission statistics and awards.



The main conference received 9,473 valid submissions and accepted 2,610 papers, an acceptance rate of 27.5%, including 144 oral and 191 spotlight papers.



The keywords of the accepted papers include: large language models, reinforcement learning, deep learning, graph neural networks, machine learning, federated learning, diffusion models, Transformer, LLM, representation learning, and generative models. These keywords also map out the most active research directions in AI today.

In addition to these statistics, the conference announced this year's Test of Time Award and Best Paper Awards. DeCAF, the paper Jia Yangqing completed ten years ago at Berkeley, won this year's Test of Time Award. And where last year six papers were honored, this year ten papers received the Best Paper Award, including Google DeepMind's recently popular world model Genie and the video generation model VideoPoet.

Test of Time Award

Regarding the award, Jia Yangqing wrote on his WeChat Moments: "In today's terminology, DeCAF would be called foundation features, or deep embeddings, for vision; it gave computer vision a generalizable feature. The DeCAF work later gave rise to the general object detection framework R-CNN and the high-performance heterogeneous computing framework Caffe, indirectly led to the collaboration between Berkeley and NVIDIA on the first-generation acceleration library cuDNN, and to the large-scale distributed training system CaffeOnSpark built at Yahoo Labs. This series of works established Berkeley's leading position in the deep learning wave."



Paper: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

Authors: Jeffrey Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell

Institution: UC Berkeley & ICSI, Berkeley, CA, USA

Paper link: https://arxiv.org/pdf/1310.1531

The research team evaluated whether features extracted from the activations of deep convolutional networks trained in a fully supervised manner on a large, fixed set of object recognition tasks can be reused for new general tasks. These general tasks may be significantly different from the tasks they were originally trained on, and there may not be enough labeled or unlabeled data to routinely train or adapt deep architectures to new tasks. They studied and visualized the semantic clustering of deep convolutional features on a variety of tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. The researchers compared the effect of relying on different layers of the network to define fixed features and reported new results that significantly outperformed the state of the art on several important vision challenges. They released DeCAF, an open source implementation of deep convolutional activation features with all relevant network parameters so that vision researchers can experiment with deep representations in a range of visual concept learning paradigms.
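In today's tooling, the DeCAF recipe (freeze a network pretrained on supervised object recognition, read off an intermediate activation as a fixed feature, train only a shallow classifier on the new task) takes a few lines of code. Below is a minimal illustrative sketch in PyTorch, not the paper's original implementation; `new_task_loader` and `num_classes` are hypothetical stand-ins for a target dataset.

```python
# Sketch of the DeCAF recipe with modern tooling: freeze a CNN pretrained on
# a large supervised task, treat an intermediate activation as a fixed
# feature vector, and train only a simple classifier on the new task.
# torchvision's AlexNet stands in for the original DeCAF network;
# `new_task_loader` is a hypothetical DataLoader whose images already have
# standard ImageNet preprocessing applied.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
backbone.eval()                      # fixed features: no fine-tuning
for p in backbone.parameters():
    p.requires_grad = False

def decaf_features(images: torch.Tensor) -> torch.Tensor:
    """Roughly 'DeCAF6': activations after the first fully connected layer."""
    x = backbone.features(images)
    x = backbone.avgpool(x).flatten(1)
    x = backbone.classifier[:3](x)   # Dropout -> Linear(9216, 4096) -> ReLU
    return x

num_classes = 10                     # hypothetical number of target-task classes
clf = nn.Linear(4096, num_classes)   # only this layer is trained
opt = torch.optim.SGD(clf.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for images, labels in new_task_loader:
    logits = clf(decaf_features(images))
    loss = loss_fn(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
```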

Best Paper

Paper 1: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Authors: Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Robin Rombach

Organization: Stability AI

Paper address: https://proceedings.mlr.press/v235/esser24a.html

Synced report: The Stable Diffusion 3 paper is finally out, revealing the architectural details; will it help reproduce Sora?

This is the Stable Diffusion 3 paper. Compared with previous versions, Stable Diffusion 3 generates images of markedly higher quality, supports prompts with multiple subjects, and renders text within images more accurately.



Stable Diffusion 3 model architecture.

Diffusion models create data from noise by reversing the forward process that turns data into noise, and have become a powerful generative modeling technique for high-dimensional perceptual data such as images and videos. Rectified Flow (RF) is a more recent formulation that connects data and noise along a straight line. Despite its better theoretical properties and conceptual simplicity, it has not yet been established as standard practice.

This study improves existing noise-sampling techniques by biasing RF training toward perceptually relevant scales. In a large-scale study, the authors show that this approach outperforms established diffusion formulations for high-resolution text-to-image synthesis.
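Concretely, rectified-flow training regresses a constant velocity along the straight line between data and noise. The sketch below shows one training step with logit-normal timestep sampling, one of the biased schemes the paper studies; this is a minimal illustration assuming PyTorch, with `model` and `x0` as placeholders.

```python
# Sketch of one rectified-flow training step. The forward path is the
# straight line x_t = (1 - t) * x0 + t * noise, and the network regresses
# the constant velocity (noise - x0). Sampling t from a logit-normal
# distribution concentrates training on intermediate, perceptually
# important noise scales.
import torch

def rf_training_step(model, x0, m=0.0, s=1.0):
    """model(x_t, t) -> velocity prediction; x0: batch of clean images (B, C, H, W)."""
    b = x0.shape[0]
    u = torch.randn(b, device=x0.device) * s + m
    t = torch.sigmoid(u)                          # logit-normal timesteps in (0, 1)
    t_ = t.view(b, 1, 1, 1)                       # broadcast over C, H, W
    noise = torch.randn_like(x0)
    x_t = (1 - t_) * x0 + t_ * noise              # straight-line interpolation
    v_target = noise - x0                         # constant velocity along the line
    v_pred = model(x_t, t)
    return torch.mean((v_pred - v_target) ** 2)
```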

In addition, the study proposes a new Transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables bidirectional information flow between image and text tokens, improving text understanding, typography within generated images, and human preference ratings. The study demonstrates that the architecture follows predictable scaling trends, observing that validation loss decreases smoothly as model size and training steps increase.



Improved Multimodal Diffusion Transformer: MMDiT block.
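Schematically, an MMDiT block keeps separate parameter streams for the two modalities but mixes them in a single joint attention. The sketch below is a simplified illustration of that idea, not the released implementation; timestep (adaLN) modulation, normalization, and multi-head details are omitted.

```python
# Simplified sketch of the MMDiT idea: text and image tokens get separate
# projection/MLP weights, but attention runs jointly over the concatenated
# sequence, so information flows in both directions between modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMMDiTBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv_img = nn.Linear(dim, 3 * dim)   # image-stream weights
        self.qkv_txt = nn.Linear(dim, 3 * dim)   # text-stream weights
        self.mlp_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img, txt):                 # img: (B, N_img, D), txt: (B, N_txt, D)
        q_i, k_i, v_i = self.qkv_img(img).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_txt(txt).chunk(3, dim=-1)
        # Joint attention over the concatenated token sequence.
        q = torch.cat([q_i, q_t], dim=1)
        k = torch.cat([k_i, k_t], dim=1)
        v = torch.cat([v_i, v_t], dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        out_img, out_txt = out.split([img.shape[1], txt.shape[1]], dim=1)
        # Residual connections, then separate MLPs per modality.
        img = img + out_img
        txt = txt + out_txt
        return img + self.mlp_img(img), txt + self.mlp_txt(txt)
```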

Paper 2: Genie: Generative Interactive Environments

Authors: Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, et al.

Institution: Google DeepMind, University of British Columbia

Paper address: https://arxiv.org/pdf/2402.15391.pdf

The paper defines a new paradigm for generative AI: generative interactive environments, or Genie. Genie is an 11-billion-parameter foundation world model that can generate playable interactive environments from a single image prompt.

Synced report: Google releases a foundation world model: 11B parameters, capable of generating interactive virtual worlds

Several components of the Genie architecture are built on the Vision Transformer (ViT). Notably, because of the Transformer's quadratic memory cost and because a video can contain on the order of 10^4 tokens, Google uses the memory-efficient ST-transformer architecture in all model components to balance model capacity against computational constraints.
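The saving comes from factorizing attention over space and time rather than attending across all T x S video tokens at once. A rough single-head sketch of that idea (not the exact Genie implementation) follows; reusing the spatial output as queries, keys, and values for the temporal pass is a simplification.

```python
# Rough sketch of factorized spatiotemporal attention: instead of full
# attention over all T*S video tokens (quadratic in T*S), attend over the
# S tokens within each frame, then causally over the T timesteps at each
# spatial position. Multi-head details and projections per pass are omitted.
import torch
import torch.nn.functional as F

def st_attention(x, wq, wk, wv):
    """x: (B, T, S, D) video tokens; wq/wk/wv: (D, D) projection matrices."""
    B, T, S, D = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    # Spatial attention: each frame's S tokens attend to each other.
    xs = F.scaled_dot_product_attention(
        q.reshape(B * T, S, D), k.reshape(B * T, S, D), v.reshape(B * T, S, D)
    ).reshape(B, T, S, D)
    # Temporal attention: tokens at the same spatial position attend across
    # time, causally (a frame only sees the past).
    qt = xs.transpose(1, 2).reshape(B * S, T, D)
    xt = F.scaled_dot_product_attention(qt, qt, qt, is_causal=True)
    return xt.reshape(B, S, T, D).transpose(1, 2)  # back to (B, T, S, D)
```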



Genie consists of three key components (as shown in the following figure):

1) Latent Action Model (LAM), which is used to infer the latent action between each pair of frames;

2) Video Tokenizer, which converts raw video frames into discrete tokens;

3) Dynamics model, which, given the latent action and the tokens of past frames, predicts the next frame of the video.



To achieve controllable video generation, the prediction of each future frame is conditioned on the action taken in the previous frame. However, such action labels are rarely available for videos on the Internet, and obtaining action annotations would be expensive. Instead, Genie learns latent actions in a fully unsupervised manner.
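The core trick is to force a tiny discrete bottleneck between consecutive frames: an encoder must compress "what changed" into one of a handful of codes (Genie uses 8 latent actions), and those codes become the action space. The toy sketch below illustrates the idea with VQ-style quantization on flattened frames; all module names and sizes are illustrative, not Genie's.

```python
# Toy sketch of the unsupervised latent-action idea: an encoder sees a pair
# of consecutive frames and must summarize "what happened" as one of a small
# set of discrete codes; a decoder is trained to predict the next frame from
# the previous frame plus that code, which forces the code to carry
# action-like information. Uses VQ-VAE-style straight-through quantization.
import torch
import torch.nn as nn

class ToyLatentActionModel(nn.Module):
    def __init__(self, frame_dim: int, hidden: int = 256, num_actions: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.codebook = nn.Embedding(num_actions, hidden)   # discrete latent actions
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, frame_dim))

    def forward(self, frame_t, frame_t1):       # flattened frames: (B, frame_dim)
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=-1))
        # Nearest codebook entry = inferred latent action (straight-through gradient).
        dists = torch.cdist(z, self.codebook.weight)
        action = self.codebook(dists.argmin(dim=-1))
        z_q = z + (action - z).detach()
        pred_t1 = self.decoder(torch.cat([frame_t, z_q], dim=-1))
        # Reconstruction loss drives the codes to explain frame-to-frame change.
        return ((pred_t1 - frame_t1) ** 2).mean()
```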



Paper 3: Considerations for Differentially Private Learning with Large-Scale Public Pretraining

Authors: Florian Tramèr, Gautam Kamath, Nicholas Carlini

Institutions: ETH Zurich, University of Waterloo, Google DeepMind

Paper address: https://arxiv.org/abs/2212.06470

The performance of differentially private machine learning can be dramatically improved by leveraging transfer learning from non-private models pre-trained on large public datasets. The paper questions whether machine learning built on large web-scraped datasets should be considered differentially private at all.

The study argues that labeling models pre-trained on web data as "private" risks damaging and eroding public trust in differential privacy. Beyond the privacy considerations of using public data, the study also questions the practicality of the paradigm, carefully examining whether existing machine learning benchmarks are suitable for measuring how well pre-trained models generalize to sensitive domains that may be poorly represented in public web data.

Additionally, the study notes that deploying large models can cause a net loss of privacy, because private data must be outsourced to third parties with greater computing power.

Paper 4: Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Authors: Aaron Lou, Chenlin Meng, Stefano Ermon

Institution: Stanford University, Pika Labs

Paper address: https://proceedings.mlr.press/v235/lou24a.html

Although diffusion models excel at many generative modeling tasks, they have fallen short of expectations in discrete data domains such as natural language. Standard diffusion models rely on well-established score matching theory, but attempts to generalize it to discrete structures have not yielded the same empirical gains.

In this work, the research team bridges this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly into the construction of discrete diffusion models, and significantly boosts performance.
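Schematically, and following the paper's notation, the network s_θ(x)_y is trained to match the probability ratios p(y)/p(x) between a sequence x and its perturbations y; the (implicit) score entropy loss has roughly the form below, with the weights w_xy and the exact expectation as defined in the paper.

```latex
% Schematic form of the score entropy loss: s_theta(x)_y estimates the
% ratio p(y)/p(x) between a sequence x and its single-token perturbations y.
\mathcal{L}_{\mathrm{SE}}
  = \mathbb{E}_{x \sim p} \sum_{y \neq x} w_{xy}
    \Bigl( s_\theta(x)_y
         - \tfrac{p(y)}{p(x)} \log s_\theta(x)_y
         + K\!\bigl(\tfrac{p(y)}{p(x)}\bigr) \Bigr),
\qquad K(a) = a(\log a - 1),
```

where the constant K(·) only ensures the loss is nonnegative; in practice the paper optimizes a denoising variant in which the intractable ratios are replaced by tractable conditional ones.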

In experiments, they evaluated the Score Entropy Discrete Diffusion model (SEDD) on standard language modeling tasks. At comparable model scale, SEDD outperforms existing language diffusion paradigms (reducing perplexity by 25-75%) and is competitive with autoregressive models, notably surpassing GPT-2. Moreover, compared with autoregressive models, SEDD generates fluent text without distribution-annealing techniques such as temperature scaling (its generative perplexity is roughly 6-8x better than that of un-annealed GPT-2), can trade compute for quality (matching quality with 32x fewer network evaluations), and supports controllable infilling (matching nucleus-sampling quality while allowing strategies other than left-to-right prompting).

Paper 5: Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo

Authors: Stephen Zhao, Rob Brekelmans, Alireza Makhzani, Roger Grosse

Institution: University of Toronto, Vector Institute

Paper address: https://proceedings.mlr.press/v235/zhao24c.html

Many capability- and safety-oriented techniques for large language models (LLMs), including RLHF, automated red-teaming, prompt engineering, and infilling, can be viewed as sampling from an unnormalized target distribution defined by a given reward or potential function. In this work, the authors leverage the rich toolbox of Sequential Monte Carlo (SMC) for these probabilistic inference problems. In particular, they use learned twist functions to estimate the expected future value at each time step, allowing inference-time computation to focus on promising partial sequences.
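Mechanically, this looks like importance sampling with resampling, where the learned twist ψ_t reweights each partial sequence by its estimated future value. The sketch below is a minimal illustration under the simplifying assumption that tokens are proposed from the base LM itself (so its probability cancels in the weights); `lm_step` and `twist` are hypothetical stand-ins, not the paper's API.

```python
# Minimal sketch of twisted SMC over token sequences: propose tokens from
# the base LM, reweight each partial sequence by the ratio of learned twist
# values (an estimate of its expected future reward), and resample particles
# so compute concentrates on promising prefixes.
import torch

def twisted_smc(lm_step, twist, prompt, n_particles=16, horizon=32):
    """prompt: (1, prompt_len) token ids; lm_step(seqs) -> (K, vocab) next-token
    probabilities; twist(seqs, t) -> (K,) positive twist values psi_t."""
    seqs = prompt.expand(n_particles, -1).clone()     # K particles
    logw = torch.zeros(n_particles)                   # log importance weights
    for t in range(horizon):
        probs = lm_step(seqs)
        tok = torch.multinomial(probs, 1)             # sample next tokens from base LM
        prev_twist = twist(seqs, t)
        seqs = torch.cat([seqs, tok], dim=1)
        # Proposal = base LM, so the weight update reduces to the twist ratio.
        logw += torch.log(twist(seqs, t + 1)) - torch.log(prev_twist)
        # Multinomial resampling: clone promising particles, drop poor ones.
        idx = torch.multinomial(torch.softmax(logw, dim=0), n_particles, replacement=True)
        seqs, logw = seqs[idx], torch.zeros(n_particles)
    return seqs
```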

The researchers propose a novel contrastive method for learning twist functions, drawing connections to the rich literature on soft reinforcement learning. As a complementary application of the twisted SMC framework, they present methods for evaluating the accuracy of language model inference techniques using novel bidirectional SMC bounds on the log partition function. These bounds can be used to estimate the KL divergence between the inference and target distributions in both directions. Applying these evaluation techniques, they show that twisted SMC is effective for sampling undesirable outputs from a pre-trained model (useful for harmlessness training and automated red-teaming), generating reviews with varied sentiment, and performing infilling tasks.

Paper 6: Debating with More Persuasive LLMs Leads to More Truthful Answers

Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel Bowman, Tim Rocktäschel, Ethan Perez

Institutions: University College London, Speechmatics, MATS, Anthropic, FAR AI

Paper address: https://proceedings.mlr.press/v235/khan24a.html

Common approaches to aligning large language models (LLMs) with desired behaviors rely heavily on human-annotated data. But as models grow more capable, they will surpass human expertise, and the role of human evaluation will shift to non-experts supervising experts. In anticipation of this, the researchers ask: can weaker models assess the correctness of stronger models? They study the question in an analogous setting, in which the stronger model (the expert) has the background information needed to answer a question while the weaker model (the non-expert) lacks it. The method tested is debate: two LLM experts each argue for a different answer, and a non-expert picks the final answer.

The research team found that debate consistently helps both non-expert models and humans answer questions, reaching 76% and 88% accuracy respectively (against naive baselines of 48% and 60%).



In addition, optimizing expert debaters for persuasiveness in an unsupervised manner improves non-experts' ability to identify the truth in debates. These results suggest that debate is a feasible way to align models in the absence of ground-truth labels.

Paper 7: Information Complexity of Stochastic Convex Optimization: Applications to Generalization, Memorization, and Tracing

Authors: Idan Attias, Gintare Karolina Dziugaite, Mahdi Haghifam, Roi Livni, Daniel Roy

Institutions: Ben-Gurion University, University of Toronto, DeepMind, etc.

Paper address: https://proceedings.mlr.press/v235/attias24a.html

In this work, the authors study the interplay between memorization and learning in the context of stochastic convex optimization (SCO). They define memorization as the information a learning algorithm reveals about its training data points, and quantify it using the conditional mutual information (CMI) framework proposed by Steinke and Zakynthinou (2020).
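For reference, the CMI of an algorithm A in that framework is defined through a "supersample" of n pairs of i.i.d. points, of which a random half forms the actual training set:

```latex
% CMI framework of Steinke & Zakynthinou (2020): draw a supersample
% \tilde{Z} of n pairs of i.i.d. points and a uniform selector J picking
% one point from each pair as training data. The CMI of algorithm A is the
% mutual information between its output and the selector, conditioned on
% the supersample:
\mathrm{CMI}_{\mathcal{D}}(A) = I\bigl(A(\tilde{Z}_J);\, J \,\big|\, \tilde{Z}\bigr),
\qquad \tilde{Z} \in \mathcal{Z}^{n \times 2},\;\; J \sim \mathrm{Unif}(\{0,1\}^n).
```

Intuitively, an algorithm with low CMI cannot tell which half of the supersample it was trained on, so it has revealed little about its training points.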

The main result precisely characterizes the trade-off between the accuracy of a learning algorithm and its CMI, answering an open question posed by Livni (2023). The paper shows that, in the L² Lipschitz-bounded setting and under strong convexity, the CMI of every learner with excess error ε is lower bounded by Ω(1/ε²) and Ω(1/ε), respectively. The authors further design an adversary that demonstrates the indispensable role of memorization in SCO by accurately identifying a large number of training samples in specific SCO problems. Finally, they discuss several implications of these results, such as limitations of CMI-based generalization bounds and the incompressibility of samples in SCO.

Paper 8: Measure Dataset Diversity, Don't Just Claim It

Authors: Dora Zhao, Jerone Andrews, Orestis Papakyriakopoulos, Alice Xiang

Institutions: Stanford University, Sony AI (London, UK), Technical University of Munich, Sony AI (Seattle, USA)

Paper address: https://arxiv.org/html/2407.08188v1

Machine learning (ML) datasets are often treated as neutral, but they inherently embody abstract and contested social constructs. Dataset curators frequently use value-laden terms such as diversity, bias, and quality to describe datasets; despite their wide use, these terms lack clear definitions and validation. The study explores the consequences of this by analyzing "diversity" across 135 image and text datasets. Drawing on the social sciences, the authors apply principles from measurement theory to identify considerations and offer recommendations for conceptualizing, operationalizing, and evaluating diversity in datasets. The findings have broad implications for ML research, advocating more nuanced and precise approaches to value-laden attributes in dataset construction.

Paper 9: VideoPoet: A Large Language Model for Zero-Shot Video Generation

Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh N Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Joshua V Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David Ross, Bryan Seybold, Lu Jiang

Institution: Google, Carnegie Mellon University

Paper address: https://proceedings.mlr.press/v235/kondratyuk24a.html

Project link: http://sites.research.google/videopoet/

Synced report: Can video generation be infinitely long? Google's VideoPoet large model is online; netizens: revolutionary technology

The research team released VideoPoet, a language model that synthesizes high-quality videos from a variety of conditioning signals. VideoPoet uses a decoder-only Transformer architecture to process multimodal inputs, including images, video, text, and audio.



The training protocol follows that of large language models (LLMs) and consists of two stages: pre-training and task-specific adaptation. During pre-training, VideoPoet combines a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pre-trained LLM then serves as a foundation that can be adapted to a range of video generation tasks. The authors demonstrate state-of-the-art zero-shot video generation, in particular the ability to generate high-fidelity motion.

Paper 10: Stealing part of a production language model

Authors: Nicholas Carlini, Daniel Paleka, Krishnamurthy Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Eric Wallace, David Rolnick, Florian Tramèr

Institutions: OpenAI, Google DeepMind, ETH Zurich, University of Washington, McGill University

Paper address: https://arxiv.org/pdf/2403.06634

The paper proposes a new model-stealing attack that extracts precise, non-trivial information from black-box production language models such as OpenAI's ChatGPT or Google's PaLM-2. The attack recovers the embedding projection layer of a Transformer (the final layer that maps hidden states to vocabulary logits) given nothing but ordinary API access. Using this method, the researchers recovered the entire projection matrix of OpenAI's ada and babbage base models, and with it their hidden dimensions: 1,024 and 2,048 respectively. They also recovered the hidden dimension of gpt-3.5-turbo, and estimate that recovering its entire projection matrix would cost under $2,000 in queries. The authors conclude with a series of defenses and mitigations against such attacks.
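The core observation is linear-algebraic: every logit vector the API returns is the image of a hidden state under the final hidden-to-vocabulary projection, so logits collected from many prompts span a subspace whose dimension is exactly the hidden size. Below is a toy numpy sketch of recovering the hidden dimension; `query_logits` is a hypothetical stand-in for the API call, and real APIs expose only top-k logits, a restriction the paper works around.

```python
# Toy sketch of the paper's core observation: each logit vector is W @ h for
# the final (vocab x hidden) projection W, so logits from many prompts lie in
# a subspace of dimension equal to the hidden size. Counting the significant
# singular values of a stack of logit vectors reveals the hidden dimension;
# the top right-singular vectors recover the projection matrix up to an
# unknown linear transformation.
import numpy as np

def estimate_hidden_dim(query_logits, vocab_size, n_queries=4096, tol=1e-4):
    """query_logits(prompt_token_ids) -> full logit vector of shape (vocab_size,)."""
    Q = np.stack([query_logits(np.random.randint(0, vocab_size, size=16))
                  for _ in range(n_queries)])        # (n_queries, vocab_size)
    s = np.linalg.svd(Q, compute_uv=False)           # singular values, descending
    # Numerical rank ~= hidden dim: count values above a noise threshold.
    return int(np.sum(s > tol * s[0]))
```

For the estimate to pin down the rank, the number of queries must exceed the hidden dimension, which is why the attack needs thousands of queries even for a small model.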