news

Zero-shot spatiotemporal prediction! HKU, South China University of Technology, and others release the spatiotemporal large model UrbanGPT | KDD 2024

2024-07-31



New Intelligence Report

Edited by: LRST

[New Intelligence Introduction] UrbanGPT is an innovative spatiotemporal large language model that combines a spatiotemporal dependency encoder with instruction fine-tuning to deliver excellent generalization and prediction accuracy across a variety of urban tasks. The technique breaks through traditional models' reliance on large amounts of labeled data and can provide accurate predictions even when data is scarce, offering strong support for urban management and planning.

Spatiotemporal prediction technology is dedicated to in-depth analysis and prediction of dynamic urban environments. It not only focuses on changes in time, but also considers spatial layout. The goal of this technology is to reveal future trends and patterns in various aspects of urban life, such as transportation, population migration, and crime rates. Although many studies have focused on using neural networks to improve the accuracy of spatiotemporal data prediction, these methods usually require a large amount of training data to generate reliable spatiotemporal features.

However, in real urban monitoring scenarios data is often insufficient, and in some cases collecting labeled data is extremely difficult, which further exacerbates the challenge. It is therefore critical to develop a model that can adapt to different spatiotemporal contexts and generalize well.

Inspired by the remarkable progress of large language models (LLMs) in multiple fields, researchers from the University of Hong Kong, South China University of Technology, and other institutions released UrbanGPT, a new spatiotemporal large language model that combines a spatiotemporal dependency encoder with instruction fine-tuning. The goal is a spatiotemporal large language model that can be applied broadly across urban tasks.


Project link: https://urban-gpt.github.io/

Code link: https://github.com/HKUDS/UrbanGPT

Paper link: https://arxiv.org/abs/2403.00813

Video display: https://www.bilibili.com/video/BV18K421v7ut

This combination enables the model to deeply understand complex relationships in time and space and provide more comprehensive and precise predictions when data is limited.

To test the effectiveness of this approach, we conducted extensive experiments on multiple public datasets covering a variety of spatiotemporal prediction tasks. The results show that UrbanGPT consistently outperforms existing state-of-the-art models, demonstrating the great potential of large language models for spatiotemporal learning when labeled data is scarce.

Overview

Current Challenges

C1. Scarcity of labeled data and high cost of retraining: Although existing spatiotemporal neural networks perform well in terms of prediction accuracy, they depend heavily on large amounts of labeled data.

In actual urban monitoring environments, data scarcity is a significant obstacle. For example, it is unrealistic to deploy sensors throughout a city to monitor traffic flow or air quality due to cost issues. In addition, existing models often lack sufficient generalization capabilities when faced with new regional or urban prediction tasks and need to be retrained to generate effective spatiotemporal features.

C2. Large language models and existing spatiotemporal models have insufficient generalization capabilities in zero-shot scenarios: As shown in Figure 1, the large language model LLaMA can infer traffic patterns from input text. However, it sometimes makes prediction errors when processing numerical time-series data with complex spatiotemporal dependencies.

At the same time, while pre-trained baseline models perform well at encoding spatiotemporal dependencies, they may perform poorly in zero-shot scenarios due to overfitting to the source dataset.

C3. Extending the reasoning capabilities of large language models to spatiotemporal prediction: There is a significant gap between the unique properties of spatiotemporal data and the knowledge encoded in large language models. How to bridge this gap and build a spatiotemporal large language model that generalizes well across a wide range of urban tasks is an important open problem.


Figure 1: Compared with LLM and spatiotemporal graph neural network, UrbanGPT has better prediction performance in zero-shot scenarios

Main Contributions

(1) To the best of our knowledge, this is the first attempt to create a spatiotemporal large language model capable of predicting various urban phenomena across multiple datasets, especially when training data is limited.

(2) This paper introduces a spatiotemporal prediction framework called UrbanGPT, which allows large language models to deeply understand the complex connections between time and space. By tightly combining the spatiotemporal dependency encoder with instruction fine-tuning technology, spatiotemporal information is effectively incorporated into the language model.

(3) Extensive experiments on real-world datasets validate UrbanGPT’s superior generalization capabilities in zero-shot spatiotemporal learning environments. These results not only demonstrate the model’s efficiency in predicting and understanding spatiotemporal patterns, but also prove its ability to provide accurate predictions even in the absence of samples.

Method



Figure 2: UrbanGPT overall framework

Spatiotemporal Dependency Encoder

Although large language models have made remarkable achievements in processing linguistic texts, they still face challenges in parsing temporal variations and dynamic patterns in spatiotemporal data.

To address this problem, this study proposes an innovative approach: integrating a spatiotemporal dependency encoder to enhance the ability of large language models to capture temporal dependencies in spatiotemporal contexts.

Specifically, the spatiotemporal encoder consists of two core components: a gated dilated convolution layer and a multi-level correlation injection layer.

The spatiotemporal embedding is first initialized from the raw spatiotemporal data. Er' is a slice of Er used as a residual connection to alleviate gradient vanishing.

We use one-dimensional dilated convolution to encode temporal correlations.

The sigmoid activation function δ controls how much information is retained across the multi-layer convolution operations.
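Putting these components together, here is a minimal sketch of the gated dilated convolution, assuming a sigmoid gate on one branch and a tanh branch combined multiplicatively, with the residual slice Er' added inside the gate (the exact formulation in the paper may differ):

$$
E_r \;=\; \delta\big(E_r' + W_1 \star \bar{E}_r\big) \,\odot\, \tanh\big(W_2 \star \bar{E}_r\big)
$$

where $\bar{E}_r$ is the initialized spatiotemporal embedding, $W_1$ and $W_2$ are one-dimensional dilated convolution kernels ($\star$ denotes the dilated convolution), and $\odot$ is element-wise multiplication.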

After processing by the gated dilated convolution layers, the model accurately captures time-series dependencies across multiple consecutive time steps and generates rich temporal feature representations. These representations cover multi-level temporal dependencies and reveal temporal evolution patterns at different levels of granularity.

To preserve the temporal information intact, we introduce a multi-level correlation injection layer, specially designed to capture and integrate the interrelationships between different levels.

Here, a convolution kernel fuses the representations across levels. After L layers of encoding, a simple linear layer integrates the outputs of the gated dilated convolution layer and the multi-level correlation injection layer, yielding the final spatiotemporal dependency representation.

To cope with the complex situations that may arise in various urban scenarios, the spatiotemporal encoder designed in this paper does not rely on a specific graph structure when dealing with spatial correlations. This is because in a zero-shot prediction environment, the spatial connections between entities are often unknown or difficult to predict. This design enables UrbanGPT to maintain its applicability and flexibility in a wide range of urban application scenarios.
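As a concrete illustration, the following is a minimal PyTorch sketch of such an encoder, covering the gated dilated convolution, the multi-level correlation injection, and the final linear fusion. Layer sizes, kernel shapes, and the exact fusion rule are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class STEncoder(nn.Module):
    """Illustrative gated dilated temporal encoder with a multi-level
    correlation injection path. Input shape: (batch, d_in, regions, time)."""

    def __init__(self, d_in: int = 1, d_hid: int = 32, num_layers: int = 2):
        super().__init__()
        self.input_proj = nn.Conv2d(d_in, d_hid, kernel_size=(1, 1))
        self.filters = nn.ModuleList()  # tanh branch of each gated conv
        self.gates = nn.ModuleList()    # sigmoid branch of each gated conv
        self.inject = nn.ModuleList()   # 1x1 convs fusing each level's output
        for l in range(num_layers):
            dilation = 2 ** l           # receptive field doubles per layer
            self.filters.append(
                nn.Conv2d(d_hid, d_hid, (1, 2), dilation=(1, dilation)))
            self.gates.append(
                nn.Conv2d(d_hid, d_hid, (1, 2), dilation=(1, dilation)))
            self.inject.append(nn.Conv2d(d_hid, d_hid, (1, 1)))
        self.out_proj = nn.Linear(2 * d_hid, d_hid)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.input_proj(x)                 # (B, d_hid, R, T)
        injected = torch.zeros_like(h[..., -1])
        for filt, gate, inj in zip(self.filters, self.gates, self.inject):
            res = h
            # Gated dilated convolution along the time axis only.
            h = torch.tanh(filt(h)) * torch.sigmoid(gate(h))
            # Residual slice: align time lengths, mitigating vanishing gradients.
            h = h + res[..., -h.size(-1):]
            # Multi-level injection: accumulate each level's last-step features.
            injected = injected + inj(h)[..., -1]
        feats = torch.cat([h[..., -1], injected], dim=1)  # (B, 2*d_hid, R)
        return self.out_proj(feats.transpose(1, 2))       # (B, R, d_hid)

# Toy usage: 12 historical steps for 50 regions, one input channel.
enc = STEncoder()
print(enc(torch.randn(4, 1, 50, 12)).shape)  # torch.Size([4, 50, 32])
```

Note that the convolutions operate purely along the time axis, consistent with the design choice above of not assuming any particular spatial graph structure.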

Spatiotemporal instruction fine-tuning framework

Spatiotemporal data-text alignment

In order for language models to deeply understand spatiotemporal dynamics, it is critical to ensure the consistency of text content and spatiotemporal data. This consistency enables the model to integrate multiple data types and generate richer data representations. By combining text content with contextual features in the spatiotemporal domain, the model can not only capture complementary information, but also extract higher-level, more expressive semantic features.

To achieve this, we adopt a lightweight alignment module to project spatiotemporal dependency representations.

The projection is performed using linear-layer parameters, where dL denotes the hidden dimension commonly used in large language models. The resulting projection is represented in the instruction by special tokens: a start marker, a sequence of spatiotemporal placeholder tokens, and an end marker. The start and end markers delimit the spatiotemporal information and can be incorporated into the large language model by expanding its vocabulary.

The placeholders represent spatiotemporal labels, which correspond to the vector H in the hidden layer. Using this technique, the model is able to identify spatiotemporal dependencies, which significantly enhances its ability to perform spatiotemporal prediction tasks in urban environments.
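A simplified sketch of this alignment step: a linear projector maps encoder features to the LLM hidden dimension dL, and the projected vectors are spliced into the instruction's embedding sequence at the placeholder positions. The token names and IDs (<ST_START>, <ST_PATCH>, <ST_END>) are hypothetical stand-ins for the special tokens added to the vocabulary:

```python
import torch
import torch.nn as nn

D_ST, D_LLM = 32, 4096               # encoder dim, hypothetical LLM dim d_L
projector = nn.Linear(D_ST, D_LLM)   # lightweight alignment module

# Hypothetical special-token IDs appended to the LLM vocabulary.
ST_START_ID, ST_END_ID, ST_PATCH_ID = 32000, 32001, 32002

def splice_st_features(token_ids, token_embeds, st_feats):
    """Replace each <ST_PATCH> placeholder embedding with a projected
    spatiotemporal feature vector, in order of appearance."""
    proj = projector(st_feats)                            # (num_patches, D_LLM)
    out = token_embeds.clone()
    slots = (token_ids == ST_PATCH_ID).nonzero(as_tuple=True)[0]
    assert len(slots) == proj.size(0), "one feature per placeholder"
    out[slots] = proj
    return out

# Toy usage: 6-token instruction with two spatiotemporal placeholders.
ids = torch.tensor([1, ST_START_ID, ST_PATCH_ID, ST_PATCH_ID, ST_END_ID, 2])
embeds = torch.randn(6, D_LLM)   # token embeddings from the LLM
st = torch.randn(2, D_ST)        # two spatiotemporal feature vectors
print(splice_st_features(ids, embeds, st).shape)  # torch.Size([6, 4096])
```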

Spatiotemporal prompt instructions

When making spatiotemporal predictions, both temporal and spatial data carry key semantic information, which is crucial for the model to capture spatiotemporal regularities in specific contexts.

For example, traffic flow differs significantly between morning and evening rush hours, and traffic patterns in commercial and residential areas also differ. Introducing temporal and spatial information as prompt text into spatiotemporal prediction tasks can therefore significantly improve the model's predictions. We leverage the text-understanding expertise of large language models to process this information.

In the UrbanGPT architecture, we integrate temporal data and spatial details of different granularities as instruction inputs for large language models. Temporal information covers the day of the week and specific time points, while spatial information includes urban areas, administrative divisions, and surrounding points of interest (POIs), as shown in Figure 3.

By integrating these diverse elements, UrbanGPT is able to identify and understand the spatiotemporal dynamics of different regions and time periods in complex spatiotemporal contexts, thereby improving its reasoning ability in zero-shot situations.


Figure 3: Spatiotemporal prompt instructions that encode time- and location-aware information
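For illustration, here is a hypothetical instruction template in the spirit of Figure 3, combining time-of-day, day-of-week, region description, and nearby POIs; the exact wording and fields used by the authors may differ:

```python
def build_instruction(city, region_desc, pois, weekday, start_time, history):
    """Assemble a spatiotemporal prompt instruction (illustrative template)."""
    return (
        f"Given the historical taxi flow over 12 time steps in a region of "
        f"{city}, {region_desc}, with nearby points of interest: "
        f"{', '.join(pois)}. The recording begins at {start_time} on "
        f"{weekday}, with data points recorded at 30-minute intervals. "
        f"Historical values: {history}. "
        f"Now predict the flow for the next 12 time steps."
    )

prompt = build_instruction(
    city="New York City",
    region_desc="located in the Manhattan borough",
    pois=["offices", "restaurants", "subway station"],
    weekday="Monday",
    start_time="8:00 AM",
    history="<ST_START><ST_PATCH>...<ST_END>",  # hypothetical ST tokens
)
print(prompt)
```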

Spatiotemporal Instruction Fine-tuning for Large Language Models

There are two major challenges in fine-tuning large language models (LLMs) using instructions to generate text descriptions of spatiotemporal predictions. On the one hand, spatiotemporal predictions are usually based on numerical data, which have structures and patterns that are different from the semantic and syntactic relations that language models in natural language processing are good at.

On the other hand, LLMs are typically pre-trained with a classification loss over the vocabulary, which yields a probability distribution over tokens, whereas spatiotemporal prediction tasks require continuous-valued output.

To overcome these problems, UrbanGPT takes an innovative approach. Instead of directly predicting future spatiotemporal values, it generates auxiliary prediction tokens, which are then processed by a regression layer that transforms the model's hidden representations into accurate numerical predictions. This approach enables UrbanGPT to make spatiotemporal predictions more effectively.

In this formulation, the regression layer operates on the hidden representation of the prediction token; the prediction token itself is introduced by expanding the LLM vocabulary. W1, W2, and W3 are the weight matrices of the regression layer, and [⋅, ⋅] denotes the concatenation operation.
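One plausible form consistent with this description, with three weight matrices and a concatenation (the placement and choice of activations is an assumption):

$$
\hat{y} \;=\; W_3\,\big[\,\sigma(W_1 h),\; \sigma(W_2 h)\,\big]
$$

where $h$ is the hidden representation of the prediction token taken from the LLM's final layer, $\sigma$ is an elementwise activation, and $\hat{y}$ is the continuous-valued forecast.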

Experiments

Zero-shot prediction performance

Predictions for unseen areas within the same city

In cross-region forecasting, we use data from certain regions of a city to predict future conditions in other regions of the same city that the model has never seen. Analyzing the model's performance on such cross-region forecasting tasks, we observe the following:

(1) Excellent zero-shot prediction capability. The data in Table 1 show that the proposed model outperforms the baseline model in regression and classification tasks on different datasets. The outstanding performance of UrbanGPT is mainly attributed to two core elements.

i) Spatiotemporal data-text alignment. Aligning spatiotemporal contextual signals with the text understanding capabilities of the language model is critical to the success of the model. This integration enables the model to fully exploit the urban dynamics information encoded in the spatiotemporal signals while combining the deep understanding of the text context of the large language model, thereby extending the model's prediction capabilities in zero-shot scenarios.

ii) Fine-tuning of spatiotemporal instructions. Through adaptive adjustment, LLMs can more effectively absorb key information in instructions and improve their understanding of the complex relationship between spatial and temporal factors. UrbanGPT successfully retains general and transferable spatiotemporal knowledge by combining spatiotemporal instruction fine-tuning with spatiotemporal dependency encoders, achieving accurate prediction in zero-shot scenarios.

(2) Deep understanding of urban semantics. Urban semantics provides deep insights into spatial and temporal characteristics. By training the model on a variety of datasets, its understanding of spatiotemporal dynamics in different time periods and geographic locations is enhanced.

In contrast, traditional baseline models usually focus more on encoding spatiotemporal dependencies while ignoring the semantic differences between regions, time periods, and data types. By incorporating rich semantic information into UrbanGPT, we significantly improve its ability to make accurate zero-shot predictions in unseen regions.

(3) Improving prediction performance in sparse data environments. Predicting spatiotemporal patterns in an environment with sparse data points is challenging, mainly because the model is prone to overfitting in this case. For example, in scenarios such as crime prediction, the data is often sparse, which makes the baseline model perform poorly in cross-region prediction tasks and has low recall, suggesting that there may be an overfitting problem.

To address this challenge, our model adopts an innovative strategy that combines spatiotemporal learning with a large language model and optimizes it through an effective spatiotemporal instruction fine-tuning method. This method enhances the model's understanding and representation of spatiotemporal data by incorporating rich semantic information, enabling it to handle sparse data more effectively and significantly improving prediction accuracy.


Table 1: Performance comparison for cross-region zero-shot prediction scenarios

Cross-city prediction tasks

To test the model's cross-city predictive ability, we selected the Chicago taxi dataset for evaluation. (Note that this dataset was not used in the training phase.) As shown in Figure 4, the model outperforms the comparison methods at all time points, demonstrating the effectiveness of UrbanGPT in cross-city knowledge transfer.

By combining the spatiotemporal encoder with spatiotemporal instruction fine-tuning techniques, the model is able to capture both universal and specific spatiotemporal regularities, allowing for more accurate predictions. In addition, by comprehensively considering different geographic locations, time factors, and learned knowledge transfer, the model is able to link spatiotemporal patterns across different functional regions and historical periods. This comprehensive spatiotemporal understanding provides key insights for accurate zero-shot predictions in cross-city scenarios.


Figure 4: Performance comparison of zero-shot prediction scenarios across cities

Typical supervised prediction tasks

This section focuses on the performance of UrbanGPT in a fully supervised prediction environment. Specifically, we evaluate the model's performance in long-term spatiotemporal prediction tasks by using a test dataset with a larger time span. For example, the model is trained using data from 2017 and tested using data from 2021.

The test results show that UrbanGPT significantly outperforms the baseline models on prediction tasks with long time spans, highlighting its generalization ability in long-term forecasting. This reduces the need for frequent retraining or incremental updates, making the model better suited to practical applications. The experiments also confirm that introducing additional text information does not degrade performance or inject noise, further supporting the use of large language models to enhance spatiotemporal prediction.


Table 2: Evaluation of prediction performance in an end-to-end supervised setting

Ablation experiment

(1) Importance of spatiotemporal context: STC. When spatiotemporal information is removed from the instruction text, the performance of the model decreases. This may be because the lack of temporal information forces the model to rely on the spatiotemporal encoder to process time-related features and perform prediction tasks. At the same time, the lack of spatial information also limits the model's ability to capture spatial correlations, making it more difficult to analyze spatiotemporal patterns in different regions.

(2) Effect of fine-tuning on multi-dataset instructions: Multi. We train the model only on the NYC-taxi dataset. Due to the lack of information on other urban indicators, this limits the model's ability to reveal the urban spatiotemporal dynamics. As a result, the model performs poorly. By integrating different spatiotemporal data from different cities, the model is able to more effectively capture the unique characteristics of different geographical locations and the evolution of spatiotemporal patterns.

(3) The role of the spatiotemporal encoder: STE. When the spatiotemporal encoder is removed from the model, the results show that this omission significantly reduces the predictive ability of large language models in spatiotemporal prediction tasks. This highlights the key role of the spatiotemporal encoder in improving the model's predictive performance.

(4) Regression layer in instruction fine-tuning: T2P. We instruct UrbanGPT to output its predictions directly in text format. The model performs poorly mainly because it relies on a multi-class loss function for optimization during training, which results in a mismatch between the probability distribution of the model output and the continuous value distribution required for spatiotemporal prediction tasks. To address this issue, we introduce a regression predictor into the model, which significantly improves the model's ability to generate more accurate numerical predictions in regression tasks.


Figure 5: UrbanGPT ablation experiment

Model robustness study

In this section, we evaluate the stability of UrbanGPT across scenarios with different spatiotemporal patterns. We distinguish regions based on the magnitude of value fluctuations (e.g., taxi traffic) within a specific time period. Smaller variance usually means a region has stable temporal patterns, while larger variance implies more diverse spatiotemporal patterns, common in commercially active or densely populated areas.
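A small sketch of one way to perform such a split: compute each region's variance over the evaluation window, rank-normalize it into (0, 1], and bucket regions into quartile intervals such as (0.75, 1.0]. The authors' exact grouping procedure is assumed:

```python
import numpy as np

def bucket_regions_by_variance(series: np.ndarray, num_buckets: int = 4):
    """series: (num_regions, num_steps) array of, e.g., taxi flows.
    Returns one bucket index per region; bucket 3 is the (0.75, 1.0] group."""
    variances = series.var(axis=1)
    # Rank-normalize variances into (0, 1]; 1.0 = most variable region.
    ranks = variances.argsort().argsort() + 1
    quantile = ranks / len(ranks)
    # Half-open quartile intervals: (0, 0.25], (0.25, 0.5], ..., (0.75, 1.0].
    return (np.ceil(quantile * num_buckets) - 1).astype(int)

# Toy usage: 8 regions, 48 half-hour steps with increasing mean flow.
rng = np.random.default_rng(0)
flows = rng.poisson(lam=np.linspace(5, 100, 8)[:, None], size=(8, 48))
print(bucket_regions_by_variance(flows))  # e.g. [0 0 1 1 2 2 3 3]
```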

As shown in Figure 6, most models perform well in areas with low variance because the spatiotemporal patterns in these areas are more consistent and predictable. However, the baseline model performs poorly in areas with high variance, especially in the (0.75, 1.0] interval, which may be because the baseline model has difficulty in accurately inferring the complex spatiotemporal patterns in these areas in zero-shot scenarios. In urban management, such as traffic signal control and safety scheduling, accurate prediction of densely populated or prosperous areas is crucial. UrbanGPT shows significant performance improvement in the (0.75, 1.0] interval, which demonstrates its strong ability in zero-shot prediction scenarios.


Figure 6: Model robustness study

Case study

The case study evaluates the effectiveness of different large language models in zero-shot spatiotemporal prediction scenarios, with results shown in Table 3. The results show that the various LLMs are able to generate predictions based on the provided instructions, which validates the effectiveness of the prompt design.

Specifically, ChatGPT mainly relies on historical averages when making predictions, without explicitly incorporating temporal or spatial data into its prediction model. Although Llama-2-70b is able to analyze specific time periods and regions, it encounters challenges in dealing with the dependencies of numerical time series, which affects the accuracy of its predictions.

In contrast, Claude-2.1 is able to summarize and analyze historical data more effectively, leveraging peak-hour patterns and points of interest to achieve more accurate traffic trend predictions.

Our proposed UrbanGPT combines spatiotemporal contextual signals with the reasoning capabilities of large language models through spatiotemporal instruction fine-tuning, significantly improving the accuracy of predicted values and spatiotemporal trends. These findings highlight the potential and effectiveness of UrbanGPT in capturing universal spatiotemporal patterns, making zero-shot spatiotemporal prediction possible.


Table 3: Zero-shot prediction examples of bicycle traffic in New York City using different LLMs

Summary and Outlook

This study proposes UrbanGPT, a spatiotemporal large language model with good generalization capabilities in diverse urban environments. To achieve seamless integration of spatiotemporal contextual signals with large language models (LLMs), this paper proposes an innovative spatiotemporal instruction fine-tuning method. This approach endows UrbanGPT with the ability to learn universal and transferable spatiotemporal patterns in various urban data. Through extensive experimental analysis, the efficiency and effectiveness of the UrbanGPT architecture and its core components are demonstrated.

Although the current results are exciting, some challenges remain for future research. First, we will actively collect more types of urban data to enhance UrbanGPT's applicability across the broader field of urban computing. Second, it is equally important to understand UrbanGPT's decision-making mechanism: although the model performs strongly, making its predictions interpretable is a key direction for future research. Future work will aim to enable UrbanGPT to explain its prediction results, thereby increasing its transparency and user trust.

References:

https://arxiv.org/abs/2403.00813