
Google DeepMind accused of plagiarizing open-source work, and the paper was even accepted at a top conference

2024-07-15


Yu Yang, Xi Feng | Reporting from Aofeisi
Quantum Bit | WeChat official account QbitAI

The large-model community has another plagiarism scandal on its hands, and this time the "defendant" is the famous Google DeepMind.



The "plaintiff" bluntly blasted: they simply laundered our technical report.

Specifically, here is what happened:

A Google DeepMind paper accepted at CoLM 2024, a prestigious newly launched conference, has been called out: the poster who broke the news claims it plagiarizes a study posted on arXiv a year earlier, and an open-source one at that.



Both papers explore methods for constraining the structure of text generated by language models.

What's interesting is that the Google DeepMind paper explicitly cites the "plaintiff's" paper.



However, even with the citation in place, the two authors of the "plaintiff" paper, Brandon T. Willard and Rémi Louf, still insist that Google plagiarized, arguing that:

Google's description of the difference between the two is "absurd."



After reading the paper, many netizens raised the same question: how exactly does CoLM review submissions?



Is the only difference a change of terminology?



Let's take a quick look at how the two papers compare...

Comparison of the two papers

Start with the two papers' abstracts.

Google DeepMind's paper says that tokenization complicates constrained generation for language models, and that they apply automata theory to solve these problems. The core idea is to avoid traversing all logits at every decoding step.

The method only needs to access the logits of the tokens that can actually be decoded at each step; its computation is independent of the language model's size, and it is efficient and easy to apply to almost any language-model architecture.
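To make that core idea concrete, here is a minimal Python sketch of constrained decoding against a precomputed index. This is our illustration, not code from either paper; `logits_fn`, `index`, and `next_state` are hypothetical stand-ins for whatever structures the papers actually build.

```python
import math
import random

def constrained_decode(logits_fn, index, next_state, state, eos_id, max_len=32):
    """Sample a sequence that stays inside the formal language.

    index:      dict mapping automaton state -> list of allowed token ids
    next_state: dict mapping (state, token id) -> next automaton state
    logits_fn:  callable returning the model's logits for the current prefix
    """
    out = []
    for _ in range(max_len):
        allowed = index[state]                # precomputed: no vocabulary scan
        logits = logits_fn(out)               # one model forward pass
        mx = max(logits[i] for i in allowed)  # softmax over allowed ids only
        weights = [math.exp(logits[i] - mx) for i in allowed]
        tok = random.choices(allowed, weights=weights, k=1)[0]
        out.append(tok)
        if tok == eos_id:
            break
        state = next_state[(state, tok)]      # follow the automaton transition
    return out
```

The point of the sketch: the per-step cost scales with the number of allowed tokens in the current state, not with the vocabulary size.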

The "plaintiff's" abstract reads roughly as follows:

They propose an efficient framework that greatly speeds up constrained text generation by building an index over the language model's vocabulary; the index avoids traversing all logits at each step.

Also "not dependent on a specific model".



The direction is indeed similar. Let's look at more of the details.

We used Google Gemini 1.5 Pro to summarize the main contents of the two papers, and then used Gemini to compare the similarities and differences between the two.

Regarding the "defendant" Google's paper, Gemini summarizes its method as follows:Redefine detokenization as a finite state transducer (FST) operation



This FST is composed with an automaton representing the target formal language, which can be specified by a regular expression or a grammar.

The composition yields a token-level automaton that constrains the language model during decoding, ensuring the output conforms to the prescribed formal language.

In addition, the Google paper defines a series of regular-expression extensions, written as specially named capture groups, which significantly improve the system's efficiency and expressiveness when processing text.
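To illustrate what "extensions written as specially named capture groups" could look like in practice, here is a hedged Python sketch; the `ext_` prefix convention below is hypothetical, not the paper's actual syntax.

```python
import re

# A constraint compiler could treat any group whose name carries a reserved
# prefix (here "ext_") as a directive, e.g. marking a free-form wildcard
# region inside an otherwise rigid JSON template.
pattern = r'\{"name": "(?P<ext_wildcard>.*?)", "age": (?P<age>\d+)\}'

m = re.match(pattern, '{"name": "Ada Lovelace", "age": 36}')
print(m.group("ext_wildcard"), m.group("age"))  # Ada Lovelace 36
```

The advantage of this trick is that the pattern stays a valid ordinary regex, so existing parsers can read it while the compiler attaches special automaton fragments to the named spans.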

As for the "plaintiff" paper, Gemini summarizes the core of its approach as follows: reformulate text generation as transitions over a finite-state machine (FSM).

Specifically, the "plaintiff's" method:

  • An FSM is constructed from a regular expression or a context-free grammar and used to guide the text generation process.
  • A vocabulary index is built so the valid tokens at each step can be determined efficiently, avoiding a traversal of the entire vocabulary (a sketch of this idea follows below).
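A minimal sketch of that index-building idea, assuming a character-level DFA supplied as a `step` transition table. The names and API here are illustrative, not the Outlines paper's actual interface: every vocabulary token is walked through the DFA from every state, and a token is recorded as valid wherever it does not dead-end.

```python
from collections import defaultdict

def walk(step, state, token):
    """Run a token's characters through the DFA; None if it dead-ends."""
    for ch in token:
        state = step.get((state, ch))
        if state is None:
            return None
    return state

def build_index(states, step, vocab):
    """index[s] = ids of tokens that keep the output inside the language."""
    index = defaultdict(list)
    for s in states:
        for tok_id, tok_str in vocab.items():
            if walk(step, s, tok_str) is not None:
                index[s].append(tok_id)
    return index

# Toy example: the language [01]+ over a vocabulary of multi-char tokens.
step = {(0, "0"): 1, (0, "1"): 1, (1, "0"): 1, (1, "1"): 1}
vocab = {0: "0", 1: "1", 2: "01", 3: "a"}
print(dict(build_index([0, 1], step, vocab)))
# {0: [0, 1, 2], 1: [0, 1, 2]} -- token 3 ("a") is never allowed
```

The index is computed once per grammar, which is what makes the per-step lookup during decoding cheap.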



Gemini lists the commonalities between the two papers.



As for the differences, they are much as the netizen above put it. In short: Google defines the vocabulary as an FST.



As mentioned earlier, Google listed the plaintiff's paper as the "most related" work in its "Related work" section:

The most related work is Outlines (Willard & Louf, 2023), which also uses finite state automata (FSA) and pushdown automata (PDA) as constraints - our approach was developed independently in early 2023.

Google believes that the difference between the two is that the Outlines method is based on a special "index" operation that needs to be manually extended to new application scenarios. In contrast, Google completely redefined the entire process using automata theory, making it easier to apply FSA and generalize to PDA.

Another difference is that Google defined extensions to support wildcard matching and improve usability.



Google mentions Outlines again in two other pieces of related work.

One: Yin et al. (2024) extended Outlines by adding the ability to "compress" text segments for pre-filling.

The other: a recent system by Ugare et al. (2024) called SynCode, which also uses FSA but processes grammars with LALR and LR parsers instead of a PDA.

Similar to Outlines, this method relies on a custom algorithm.

Onlookers, though, are clearly unconvinced:

CoLM reviewers should note that I do not think this looks like separate "concurrent work".



Netizen: This is not uncommon...

As the incident fermented, many netizens grew angry: plagiarism is shameful enough, and this is hardly "the first time a tech giant has plagiarized a small team's work."

By the way, when they published the plaintiff's paper, Brandon and Rémi were both working remotely for Normal Computing, an AI infrastructure company founded in 2022.

Oh, and part of the founding team of Normal Computing came from Google Brain...



Brandon and Rémi have since started a new company called .txt. According to its official website, its goal is to provide fast and reliable information-extraction models, and the GitHub link posted on the site points to the Outlines repository.

Back to the netizens, what makes everyone even more angry is that "this situation has become common."

A postdoctoral fellow from Delft University of Technology in the Netherlands shared his experience:

We completed a piece of work last October; recently a paper using the same ideas and concepts was accepted, and it did not even cite ours.



Another researcher, from Northeastern University in the US, had it even worse: this happened to him twice, at the hands of the same group, and the other paper's first author had even starred his GitHub repo...



However, some netizens expressed different opinions:

If publishing a blog post or an unreviewed preprint counts as squatting on an idea, then wouldn't everyone be squatting?



In response, Rémi fired back angrily:

Oh, so publishing a preprint and open-sourcing the code = squatting;
writing a math paper without even any pseudocode = good work???



Brandon chimed in in agreement:

Open-sourcing code and writing the accompanying paper counts as "squatting", but copying someone else's work, claiming "I had this idea earlier", and submitting it to a conference doesn't? How disgusting.



We'll stop here. What do you think of this affair? Continue the discussion in the comments~

The two papers are here:
Google DeepMind paper: https://arxiv.org/abs/2407.08103v1
Plaintiff's paper: https://arxiv.org/abs/2307.09702

Reference Links:
[1]https://x.com/remilouf/status/1812164616362832287?s=46
[2]https://x.com/karan4d/status/1812172329268699467?s=46
[3]https://x.com/brandontwillard/status/1812163165767053772?s=46