2024-08-15
New Intelligence Report
Editor: Editorial Department
[New Intelligence Introduction] The annual NLP flagship conference ACL has announced its final award-winning papers. This year, a total of 7 papers won the Best Paper Award, and the Test of Time Awards went to Stanford's GloVe and Cornell University's work on distributional similarity measures. In addition, there were a Best Theme Paper Award, Best Social Impact Awards, Best Resource Paper Awards, SAC Awards, and Outstanding Paper Awards.
The ACL 2024 awards ceremony has finally taken place!
A total of 7 Best Papers and 35 Outstanding Papers were announced, along with the Test of Time Awards, SAC Awards, the Best Theme Paper Award, and the Best Resource Paper Awards.
It is worth mentioning that among the 7 best papers, Deciphering Oracle Bone Language with Diffusion Models was completed by an all-Chinese team.
This year's conference is the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), held in Bangkok, Thailand, from August 11 to 16.
ACL 2024 received approximately 5,000 submissions, roughly the same as in 2023, of which 940 papers were accepted to the main conference.
This year's ACL is the largest ever, with 72 SACs, 716 ACs, and 4,208 reviewers.
There were also 975 Findings papers, 6 CL journal papers, 31 TACL papers, 3 keynote speeches, and 1 panel discussion.
The entire conference also included 18 workshops, 6 tutorials, 38 demos, and 60 SRW papers.
Submission counts per author broke down as follows:
Most authors submitted one or two papers: 10,333 scholars submitted 1 paper, and 2,130 submitted 2 papers.
A small number submitted many more: 3 authors submitted 18 papers, 6 submitted 19 papers, and 18 submitted more than 20.
Let’s take a look at which teams won the grand prizes this year.
7 Best Papers
Paper 1: Deciphering Oracle Bone Language with Diffusion Models
Authors: Haisu Guan, Huanxin Yang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu
Institutions: Huazhong University of Science and Technology, University of Adelaide, Anyang Normal University, South China University of Technology
Paper address: https://arxiv.org/pdf/2406.00684
As the title suggests, the Chinese team used AI to do something very interesting and valuable: deciphering oracle bone script (OBS) with the help of diffusion models.
Oracle bone script originated in China's Shang Dynasty about 3,000 years ago and is a cornerstone in the history of language.
Although thousands of inscriptions have been discovered, large amounts of oracle bone script remain undeciphered, casting a veil of mystery over this ancient language.
In the paper, the authors introduce a new approach built on image-generating AI, specifically a model called Oracle Bone Script Decipher (OBSD).
Using a conditional diffusion strategy, OBSD generates important clues for decipherment, opening a new path for AI-assisted analysis of ancient languages.
To verify its effectiveness, the researchers conducted extensive experiments on an oracle bone script dataset, with quantitative results demonstrating that OBSD works.
Paper 2: Natural Language Satisfiability: Exploring the Problem Distribution and Evaluating Transformer-based Language Models
(Preprint not yet submitted)
Paper 3: Causal Estimation of Memorisation Profiles
Authors: Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel
Institutions: University of Cambridge, ETH Zurich
Paper address: https://arxiv.org/pdf/2406.04327
Understanding LLM memorisation has important practical and societal implications, for example for studying training dynamics or preventing copyright infringement.
Prior work defines memorisation as the causal effect of training on an instance on the model's ability to predict that instance.
This definition relies on a counterfactual: observing what would have happened had the model never seen that instance.
However, existing methods usually estimate memorisation for model architectures rather than for specific trained models, making it difficult to obtain computationally efficient and accurate counterfactual estimates.
This study fills that gap: the authors propose a principled and efficient new method for estimating memorisation, based on the difference-in-differences design from econometrics.
With this approach, one can describe a model's memorisation profile, i.e. its memorisation trend over the course of training, by observing only a small set of instances throughout training.
In experiments with the Pythia model suite, the researchers found that:
(1) memorisation is stronger and more persistent in larger models;
(2) memorisation is determined by data order and learning rate;
(3) memorisation trends are stable across model sizes, so memorisation in larger models is predictable from smaller models.
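The difference-in-differences idea can be illustrated with a minimal sketch (hypothetical data and function names, not the authors' exact estimator): compare how the log-likelihood gap between trained-on ("treated") and held-out ("control") instances changes after the treated instances are seen.

```python
import numpy as np

def did_estimate(treated_ll, control_ll, t_treat):
    """Difference-in-differences: the change, after training step
    `t_treat`, in the average log-likelihood gap between instances
    trained on at that step (treated) and instances never trained
    on (control). The pre-treatment gap controls for baseline
    differences between the two groups."""
    pre_gap = treated_ll[:t_treat].mean() - control_ll[:t_treat].mean()
    post_gap = treated_ll[t_treat:].mean() - control_ll[t_treat:].mean()
    return post_gap - pre_gap

# Synthetic example: treated instances gain 1.0 nat of log-likelihood
# after step 2, while control instances stay flat.
treated = np.array([-3.0, -3.0, -2.0, -2.0])
control = np.array([-3.0, -3.0, -3.0, -3.0])
did_estimate(treated, control, t_treat=2)  # 1.0
```

The pre-treatment subtraction is what makes the estimate causal in spirit: any constant difficulty difference between the two groups cancels out.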
Paper 4: Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
Authors: Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, Sara Hooker
Institutions: Cohere For AI, Brown University, Cohere, Cohere For AI Community, Carnegie Mellon University, Massachusetts Institute of Technology
Paper address: https://arxiv.org/pdf/2402.07827
In February this year, the startup Cohere released Aya, a new open-source large language model covering 101 languages.
Notably, Aya's language coverage is more than double that of existing open-source models such as mT0 and BLOOMZ.
Its human evaluation score reached 75%, and it scored 80-90% in various simulated win-rate tests.
The project brought together the efforts of more than 3,000 independent researchers from 119 countries.
The researchers also released the largest multilingual instruction fine-tuning dataset to date, containing 513 million data points covering 114 languages.
Paper 5: Mission: Impossible Language Models
Authors: Julie Kallini, Isabel Papadimitriou, Richard Futrell, Kyle Mahowald, Christopher Potts
Institutions: Stanford University, University of California, Irvine, University of Texas at Austin
Paper address: https://arxiv.org/pdf/2401.06416
Chomsky and others have claimed that LLMs are equally capable of learning languages that are possible and impossible for humans to learn.
However, there is little published experimental evidence to support this claim.
To this end, the researchers developed a set of synthetic "impossible languages" of varying complexity, each of which was designed by systematically changing English data and using unnatural word order and grammatical rules.
These languages lie on a continuum of impossible languages: at one end are completely impossible languages, such as randomly rearranged English; at the other end are languages that are considered linguistically impossible, such as those based on word position counting rules.
Across a series of evaluations, GPT-2 struggled to learn the impossible languages, challenging that core claim.
More importantly, the researchers hope that this approach will lead to more research on the LLM's ability to learn different types of language, in order to better understand the LLM's potential applications in cognitive and linguistic typology research.
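For intuition, a toy version of two such language transforms might look like the following (illustrative only; the paper's actual perturbation functions are more varied and include count-based verb-marking rules):

```python
import random

def reverse_language(sentence):
    """Fully reverse word order: a systematic transform of English
    that no attested human language uses."""
    return " ".join(reversed(sentence.split()))

def shuffle_language(sentence, seed=0):
    """Deterministically shuffle word order, destroying syntax while
    keeping the vocabulary intact."""
    tokens = sentence.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)

reverse_language("the cat sat on the mat")  # "mat the on sat cat the"
```

Training identical models on original versus transformed corpora and comparing learning curves is what lets the authors place each synthetic language on the possible-to-impossible continuum.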
Paper 6: Semisupervised Neural Proto-Language Reconstruction
Authors: Liang Lu, Peirong Xie, David R. Mortensen
Institutions: Carnegie Mellon University, University of Southern California
Paper address: https://arxiv.org/pdf/2406.05930
Existing work on comparative reconstruction of proto-languages usually requires full supervision.
However, historical reconstruction models are only of practical value when they can be trained with limited labeled data.
To this end, the researchers propose a semi-supervised historical reconstruction task.
In this task, the model is trained on only a small amount of labeled data (cognate sets with protoforms) together with a large amount of unlabeled data (cognate sets without protoforms).
The authors developed a neural architecture for comparative reconstruction, DPD-BiReconstructor, which incorporates an important insight from linguists' comparative method: a protoform can not only be reconstructed from its daughter words, but can also be deterministically transformed back into its daughter words.
We show that this architecture is able to leverage an unlabeled set of cognates and outperforms existing semi-supervised learning baselines on this new task.
Paper 7: Why are Sensitive Functions Hard for Transformers?
Authors: Michael Hahn, Mark Rofin
Institution: Saarland University
Paper address: https://arxiv.org/pdf/2402.09963
Empirical studies have revealed a range of learnability biases and limitations of the Transformer model, such as its persistent difficulty in learning computationally simple formal languages such as PARITY, and its tendency to favor low-order functions.
However, theoretical understanding remains limited, and existing theories of expressive ability either over-predict or underestimate actual learning ability.
The researchers demonstrated that under the Transformer architecture, the loss landscape is constrained by the input space sensitivity:
Transformer models whose outputs are sensitive to multiple parts of the input string occupy isolated points in parameter space, resulting in low-sensitivity biases in generalization.
The authors show, both theoretically and empirically, that this theory unifies a broad range of long-standing empirical observations about Transformer learning abilities and biases, such as their bias toward low-sensitivity and low-degree functions, and their difficulty with length generalization on PARITY.
This suggests that understanding a Transformer's inductive biases requires studying not just its in-principle expressivity, but also its loss landscape.
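The notion of input-space sensitivity at the center of this argument can be made concrete. The average sensitivity of a Boolean function is the expected number of single-bit flips that change its output, and PARITY is maximally sensitive. A brute-force sketch (for illustration; not code from the paper):

```python
from itertools import product

def avg_sensitivity(f, n):
    """Average sensitivity of f over {0,1}^n: the expected number of
    single-bit flips of a uniformly random input that change f."""
    total = 0
    for x in product([0, 1], repeat=n):
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(x) != f(flipped):
                total += 1
    return total / 2 ** n

parity = lambda x: sum(x) % 2            # every bit flip changes the output
majority = lambda x: int(sum(x) * 2 > len(x))

avg_sensitivity(parity, 5)    # 5.0: maximal for n = 5
avg_sensitivity(majority, 5)  # much lower (1.875)
```

Functions like PARITY, whose output depends on every bit of the input, are exactly the ones the theory predicts occupy isolated, hard-to-reach points in a Transformer's parameter space.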
2 Test of Time Awards
Paper 1: GloVe: Global Vectors for Word Representation (2014)
Author: Jeffrey Pennington, Richard Socher, Christopher Manning
Institution: Stanford University
Paper address: https://nlp.stanford.edu/pubs/glove.pdf
Word embeddings were the cornerstone of deep learning methods for NLP from 2013 to 2018 and continue to have significant influence. They not only improved performance on NLP tasks, but also had a major impact on computational semantics, e.g. word similarity and analogy.
The two most influential word embedding methods are probably skip-gram/CBOW and GloVe. GloVe was proposed later; its relative advantage lies in its conceptual simplicity: it directly optimizes word vectors to reflect the words' distributional similarity, rather than obtaining the vectors indirectly as parameters of a simplified language-modeling objective.
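Concretely, GloVe's weighted least-squares objective, as given in the paper, fits word-vector dot products to log co-occurrence counts:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

Here X_ij is the count of word j in the context of word i, w_i and w̃_j are word and context vectors, b_i and b̃_j are biases, and f is a weighting function that discounts both rare and extremely frequent co-occurrences.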
Paper 2: Measures of Distributional Similarity (1999)
Author: Lillian Lee
Institution: Cornell University
Paper address: https://aclanthology.org/P99-1004.pdf
Distributional similarity measures are studied in order to improve probability estimates for unseen co-occurrence events, which is equivalent to another way of characterizing word similarity.
The paper's contributions are threefold: an extensive empirical comparison of a wide range of measures; a classification of the measures according to the information they incorporate; and the introduction of a novel function that is superior at evaluating potential proxy distributions.
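One family of measures compared in the paper is information-theoretic, e.g. the Jensen-Shannon divergence between two words' co-occurrence distributions. A minimal sketch (dict-based distributions are an illustrative choice, not the paper's implementation):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions,
    given as dicts mapping co-occurring words to probabilities.
    Symmetric, and bounded in [0, 1] with log base 2."""
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}
    def kl(a):  # KL(a || m); m[w] > 0 wherever a[w] > 0
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

js_divergence({"purr": 0.5, "meow": 0.5},
              {"purr": 0.5, "meow": 0.5})  # 0.0: identical distributions
```

Unlike raw KL divergence, the Jensen-Shannon divergence stays finite even when one word's co-occurrence distribution assigns zero probability to events seen with the other, which matters for the sparse counts this line of work deals with.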
1 Best Theme Paper
Paper: OLMo: Accelerating the Science of Language Models
Authors: Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hannaneh Hajishirzi
Institutions: Allen Institute for Artificial Intelligence, University of Washington, Yale University, New York University, Carnegie Mellon University
Paper address: https://arxiv.org/abs/2402.00838
This work is a major step forward in improving transparency and reproducibility in training large language models, which is something the community desperately needs in order to make progress (or at least to enable contributors other than industry giants to contribute to progress).
3 Best Social Impact Awards
Paper 1: How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Authors: Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
Institutions: Virginia Tech, Renmin University of China, University of California, Davis, Stanford University
Paper address: https://arxiv.org/abs/2401.06373
This paper explores jailbreaking, i.e. bypassing the safety restrictions of AI systems, using an approach grounded in social science research on persuasion. The work is highly interesting and could have a significant impact on the community.
Paper 2: DIALECTBENCH: An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages
Authors: Fahim Faisal, Orevaoghene Ahia, Aarohi Srivastava, Kabir Ahuja, David Chiang, Yulia Tsvetkov, Antonios Anastasopoulos
Institutions: George Mason University, University of Washington, University of Notre Dame, RC Athena
Paper address: https://arxiv.org/abs/2403.11009
Dialect variation is an understudied phenomenon in natural language processing and artificial intelligence. However, its study is of great value, not only from a linguistic and social perspective, but also with important implications for applications. This paper proposes an innovative benchmark for studying this problem in the era of large language models.
Paper 3: Having Beer after Prayer? Measuring Cultural Bias in Large Language Models
Authors: Tarek Naous, Michael J. Ryan, Alan Ritter, Wei Xu
Institution: Georgia Institute of Technology
Paper address: https://arxiv.org/abs/2305.14456
This paper sheds light on an important issue in the era of large language models: cultural bias. Although the context of the study is Arabic culture and language, the results show that we need to consider cultural nuances when designing large language models. Therefore, similar studies can be conducted on other cultures to generalize and assess whether other cultures are also affected by this issue.
3 Best Resource Papers
Paper 1: Latxa: An Open Language Model and Evaluation Suite for Basque
Authors: Julen Etxaniz, Oscar Sainz, Naiara Perez, Itziar Aldabe, German Rigau, Eneko Agirre, Aitor Ormazabal, Mikel Artetxe, Aitor Soroa
Institution: University of the Basque Country
Paper address: https://arxiv.org/abs/2403.20266
The paper describes the corpus collection and the evaluation datasets in detail. Although the study targets Basque, the approach can be extended to building large language models for other low-resource languages.
Paper 2: Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Authors: Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo
Institutions: Allen Institute for Artificial Intelligence, UC Berkeley, Carnegie Mellon University, Spiffy AI, MIT, University of Washington
Paper address: https://arxiv.org/abs/2402.00159
This paper illustrates the importance of data curation when preparing datasets for large language models. It provides valuable insights that can benefit a wide audience within the community.
Paper 3: AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
Authors: Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, Niranjan Balasubramanian
Institutions: State University of New York at Stony Brook, Allen Institute for Artificial Intelligence, Saarland University
Paper address: https://arxiv.org/abs/2407.18901
This is a very impressive and important effort to build a simulation and evaluation environment for interactive coding agents. It should encourage the community to produce challenging, dynamic benchmarks.
21 SAC Awards
35 Outstanding Papers
References:
https://x.com/aclmeeting/status/1823664612677705762