
Still struggling to cast AI "spells"? Peking University and Baichuan have developed PAS, an automatic prompt engineering system

2024-09-10


AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

Co-first author Zheng Miao is a member of the Baichuan alignment team led by Zhou Zenan. He graduated from Peking University; his research interests include large language models, multimodal learning, and computer vision, and he has led open-source projects such as MMFlow. Co-first author Liang Hao is a doctoral student at the Institute of Frontier Interdisciplinary Studies of Peking University, supervised by Professor Zhang Wentao; his research focuses on the data side of large models. The Peking University-Baichuan Intelligent AI System Joint Laboratory was established in January 2024. It studies key problems across the full technical pipeline of AI model systems, such as scientific and systematic data generation, quality assessment strategies, and the acceleration of large-model training and inference. The joint laboratory is directed by Cui Bin, Boya Distinguished Professor at Peking University, and Chen Weipeng, co-founder of Baichuan Intelligence.

Large language models based on the Transformer architecture are achieving breakthrough results across many fields, and prompt engineering plays a crucial role in this.

With well-designed prompts, researchers and developers can guide models to perform better on specific tasks. This approach not only significantly improves model performance but also enhances adaptability, making models more flexible and efficient across a variety of complex tasks.

In addition, prompt engineering can optimize the model's learning process, improve efficiency on complex problems, and reduce training time and compute requirements.

Compared with traditional fine-tuning, prompt engineering can adapt a model to many downstream tasks at very low cost, greatly saving compute and data-collection expenses. However, designing effective prompts remains challenging for non-experts and often requires extensive learning and practice.

Directly asking a large language model to perform automatic prompt engineering usually falls short of ideal results: inappropriate prompts can distract the model and degrade performance. This makes it especially important to develop an automatic prompt engineering system that assists users and is easy to operate.

PAS: a groundbreaking automatic prompt engineering system

To meet this challenge, the Peking University-Baichuan joint laboratory proposed PAS, an automatic prompt engineering system. The innovations of PAS are:

1. Designing a high-quality dataset for automatic prompt complementation

2. Few-shot learning and data screening with GPT models

3. Automatically building a concise and efficient prompt dataset

4. Fine-tuning an effective automatic prompt engineering model

PAS provides a concise and effective complement to the user's input, enabling automatic prompt engineering that is fast, simple, and compatible with streaming output.
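The idea above can be sketched as a thin wrapper around inference: the user's query is left untouched and a short supplement from the PAS model is appended to it. This is a minimal illustration, not the paper's actual interface; `pas_generate` and `toy_pas` are hypothetical stand-ins.

```python
# Minimal sketch of wiring a PAS-style complementer into an inference
# pipeline. The real system fine-tunes a dedicated model; here a toy
# function stands in for it.

def complement_prompt(user_query: str, pas_generate) -> str:
    """Ask the PAS model for a short supplementary prompt and append it
    to the user's original query, which is never modified."""
    supplement = pas_generate(user_query)
    # PAS keeps supplements concise (per the paper, usually under
    # ~30 words), so added latency and token cost stay small.
    return f"{user_query}\n{supplement}"

# Hypothetical stand-in for the fine-tuned PAS model.
def toy_pas(query: str) -> str:
    return "Watch for logical traps; reason step by step before answering."

augmented = complement_prompt(
    "If 10 birds are in a tree and 1 is shot, how many are on the ground?",
    toy_pas,
)
```

Because the original question is preserved verbatim, the supplement can be streamed to the user alongside the model's answer.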

In multiple benchmarks, PAS outperformed existing SOTA models while requiring less data. Human evaluation likewise showed strong performance, highlighting its potential in practical applications.

This breakthrough not only advances prompt engineering but also paves the way for applying large language models in a wider range of fields.

  • Paper: https://arxiv.org/abs/2407.06027

  • PKU-Baichuan-MLSystemLab:

https://github.com/PKU-Baichuan-MLSystemLab

https://huggingface.co/PKU-Baichuan-MLSystemLab

Method

Training PAS involves three main steps:

Step 1: Build a high-quality question dataset

The first task in training PAS is to build a high-quality question dataset. As shown in figure (a), the researchers screened questions from the LMSYS-1M and WildChat datasets along three dimensions:

1. Data deduplication: embedding techniques combined with clustering algorithms effectively remove duplicate data.

2. Quality screening: the Baichuan large model evaluates and filters data quality.

3. Diversity guarantee: 9,000 high-quality questions covering more than 10 categories were ultimately selected.
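The deduplication idea in step 1 can be sketched as follows: embed every question, then drop any question whose embedding is too close to one already kept. The paper uses proper embedding models plus clustering; the bag-of-words "embedding" and greedy similarity threshold below are deliberately simple stand-ins for illustration.

```python
import math

def embed(text: str) -> dict:
    """Hypothetical stand-in embedding: a word-count vector."""
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(questions, threshold=0.9):
    """Greedily keep a question only if it is not near-identical
    to any question already kept."""
    kept, vecs = [], []
    for q in questions:
        v = embed(q)
        if all(cosine(v, u) < threshold for u in vecs):
            kept.append(q)
            vecs.append(v)
    return kept

qs = [
    "how do I sort a list in python?",
    "How do I sort a list in Python?",   # near duplicate, dropped
    "explain transformers in one line",
]
unique = deduplicate(qs)
```

A production pipeline would cluster embeddings instead of comparing pairwise, but the filtering principle is the same.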

Step 2: Build complementary prompt data

At this stage, the researchers combined 100 high-quality examples accumulated internally with the questions screened in step 1, and built automatic prompt engineering data with GPT models via few-shot learning:

1. Initial data generation: few-shot learning guides GPT to generate preliminary prompt engineering data.

2. Quality control: a critique step, again using few-shot learning, has GPT evaluate the quality of the generated data.

3. Iterative optimization: low-quality data is automatically filtered out and regenerated, ensuring quality through multiple rounds of iteration.

4. Final result: 9,000 high-quality automatic prompt engineering examples.
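The generate-critique-regenerate loop described above can be sketched as a small retry loop. Everything here is a hypothetical stand-in for the few-shot GPT calls in the paper: `generate` produces a candidate supplementary prompt and `critique` accepts or rejects it.

```python
def build_prompt_data(question, generate, critique, max_rounds=3):
    """Generate a candidate supplement, let a critic grade it, and
    regenerate until it passes or the retry budget runs out."""
    for _ in range(max_rounds):
        candidate = generate(question)
        if critique(question, candidate):  # critic accepts -> keep it
            return candidate
    return None  # still low-quality after all rounds -> discard sample

# Toy stand-ins: the "model" improves across retries; the critic
# rejects supplements that are empty or longer than 30 words.
attempts = iter(["", "x " * 50, "Think step by step."])

def toy_generate(q):
    return next(attempts)

def toy_critique(q, supplement):
    return 0 < len(supplement.split()) <= 30

result = build_prompt_data("a tricky question", toy_generate, toy_critique)
```

Returning `None` for exhausted samples mirrors the paper's filtering: only data that survives the critique loop enters the final 9,000-example set.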

Data distribution

The distribution of the 9,000 generated data points, shown in the figure above, ensures the diversity and representativeness of the data.

Step 3: Fine-tune the automatic prompt model

The final step uses the datasets from the first two stages to fine-tune a large language model:

1. Select a base model, such as Qwen2-7B.

2. Perform targeted fine-tuning on the high-quality dataset.

3. The result is a large language model specialized for automatic prompt engineering.
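One plausible layout for the fine-tuning data in step 3 is a standard chat-style SFT format, where each example maps a user question to its supplementary prompt so the base model (e.g. Qwen2-7B) learns the question-to-supplement mapping. The exact template below is an assumption for illustration, not the paper's actual format.

```python
def to_sft_example(question: str, supplement: str) -> dict:
    """Pack one (question, supplementary prompt) pair into a
    chat-style supervised fine-tuning record."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": supplement},
        ]
    }

# One record from the hypothetical training set: the target output is
# the short supplement, not an answer to the question itself.
dataset = [
    to_sft_example(
        "If 10 birds are in a tree and 1 is shot, how many are on the ground?",
        "Watch for logical traps; reason step by step.",
    )
]
```

Because the targets are short supplements rather than full answers, fine-tuning converges with relatively little data, consistent with the 9,000-example budget reported in the paper.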

Experiments and results

Human evaluation

According to human evaluators, PAS achieves a higher win rate than the previous SOTA (state-of-the-art) model in every domain. Its average win rate exceeds 50% in many domains, and the combined win-plus-tie rate is above 80%.

Machine evaluation benchmarks

To evaluate PAS comprehensively, the researchers selected three benchmarks: Arena-Hard, AlpacaEval 2.0, and AlpacaEval 2.0 (LC).

The researchers then applied PAS to six leading models:

  • GPT-4 (three versions)

  • GPT-3.5

  • Qwen2-72B-Instruct

  • LLaMA3-70B-Instruct

The evaluation results show:

  • PAS achieves significant improvements over both the no-prompt baseline and the previous SOTA automatic prompt engineering model.

  • Compared with the previous BPO model, PAS shows stronger adaptability: it is compatible with a variety of large models and achieves performance gains on each of them.

Computational efficiency analysis

PAS is not only strong but also computationally efficient. In terms of data efficiency, it achieves excellent performance with only 9,000 fine-tuning examples. In terms of output efficiency, it limits the length of its complementary prompts, usually to no more than 30 words.

In terms of user experience, PAS also benefits large-model deployments. Specifically:

  • Unlike previous models such as BPO, PAS does not modify the user's original question; it only appends a complementary automatic prompt.

  • It offers an excellent user experience with controllable response time.

  • It supports GPT-style streaming display, further enhancing interactivity.

Example: PAS helps large models avoid logical traps

"If there are 10 birds in a tree and one of them is shot dead, how many birds are left on the ground?"

This seemingly simple question hides a clever logical trap: it may take a moment to realize that 9 birds are left in the tree and only 1 is on the ground.

As shown in the figure, without PAS's assistance GPT gave the wrong answer. By adding a complementary prompt, the PAS system significantly improved the model's performance:

Guided by PAS, the model's new answer improved markedly: it avoided the logical trap, demonstrated a clear multi-step reasoning process, and, beyond giving the correct answer, walked the user through the entire chain of reasoning.

Interested readers can consult the original paper for further details of the research.