
ByteHouse builds a new-generation cloud-native data warehouse to help enterprises reduce costs and increase efficiency

2024-09-25


With the explosive growth of data volumes, the accelerating pace of enterprise cloud migration, and rising demand for real-time data, the cloud-native data warehouse market has an opportunity for rapid growth.
According to research from IDC and Gartner, by 2025, 50% of enterprise data is expected to be stored in the cloud, 75% of databases will run in the cloud, 30% of global data processing is expected to be real-time, and 80% of data is expected to be unstructured. These trends are expected to make cloud-native data warehouses increasingly popular among enterprises.
Recently, Li Qun, product manager of ByteHouse, Volcano Engine's cloud-native data warehouse, was invited to attend the CSDI Summit (China Software R&D Innovation Technology Summit). Speaking on the theme "Key Technologies and Best Practices of the New Generation of Cloud-Native Data Warehouse ByteHouse", he started from the history and frontiers of cloud data warehouses and then introduced ByteHouse's overall architecture, key highlights, performance breakthroughs, the key design decisions behind its separation of storage and compute, and its business practices in various scenarios inside and outside the Douyin Group.
Drawing on ByteHouse's experience in industries such as finance, gaming, and the internet, Li Qun first described the difficulties and challenges that cloud-native data warehouses currently face. High performance, high concurrency, and high-throughput writes are now baseline requirements for an enterprise cloud data warehouse. As the internet continues to develop, data is growing rapidly, especially event-tracking log data. Some of the more active apps generate tens of billions of events per day, and large-scale killer applications generate hundreds of billions. The data platform therefore has to support high-throughput writes and real-time deduplication while also responding to business requests within milliseconds.
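To make the real-time deduplication requirement concrete, here is a minimal sketch of keeping only the latest version of each event as records arrive. It is an illustration of the general idea, not ByteHouse's implementation; the `Event` fields and the `deduplicate` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    key: str       # hypothetical unique key, e.g. an event id
    version: int   # hypothetical version or timestamp; the larger value wins
    payload: str


def deduplicate(events):
    """Keep only the highest-version event seen for each key."""
    latest = {}
    for event in events:
        current = latest.get(event.key)
        if current is None or event.version >= current.version:
            latest[event.key] = event
    return list(latest.values())


# A small stream containing one update ("a") and one exact duplicate ("b").
stream = [
    Event("a", 1, "click"),
    Event("b", 1, "view"),
    Event("a", 2, "click-updated"),
    Event("b", 1, "view"),
]
deduped = deduplicate(stream)
for event in deduped:
    print(event.key, event.version, event.payload)
```

A production system would do this incrementally and at far larger scale, but the rule is the same: one surviving row per unique key.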
In addition, enterprises face problems such as complex data architectures, limited flexibility, and costs that are hard to control. For example, to implement a single data analysis function, an enterprise may need to introduce three, four, or even more components, which makes capacity expansion difficult, puts heavy pressure on operations and maintenance, and drives up staffing costs.
To address these problems, ByteHouse first focused on performance. For complex queries, ByteHouse introduced a self-developed optimizer covering RBO (rule-based optimization), CBO (cost-based optimization), and distributed plan generation. The optimizer is designed to select the most efficient execution path and substantially reduce query time. ByteHouse has also optimized data exchange, runtime filters, and parallelization. In addition, it has launched tailored solutions for six major scenarios: slow real-time throughput, slow BI reports, slow offline/online complex analysis, slow lake-and-warehouse federated analysis, slow audience selection, and slow image search. These solutions have delivered results in real customer deployments.
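As an illustration of one of the techniques named above, the following sketch shows the general idea of a runtime filter in a hash join: the keys collected while building the hash table on the small side are used to discard non-matching rows from the large side before they reach the join. This is a simplified sketch, not ByteHouse's optimizer; the table contents and the `scan` and `hash_join` helpers are hypothetical.

```python
def scan(rows, key, runtime_filter=None):
    """Yield rows, dropping those whose join key cannot match (if a filter is given)."""
    for row in rows:
        if runtime_filter is None or row[key] in runtime_filter:
            yield row


def hash_join(build_rows, probe_rows, key, use_runtime_filter=True):
    """Join two lists of dicts on `key`; return (joined rows, rows that reached the join)."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)

    # The runtime filter is simply the set of keys seen on the build side.
    runtime_filter = set(table) if use_runtime_filter else None

    joined, probed = [], 0
    for row in scan(probe_rows, key, runtime_filter):
        probed += 1
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined, probed


# Hypothetical data: 2 campaigns of interest, 100 impressions spread over 10 campaigns.
campaigns = [{"campaign_id": 1, "name": "spring"}, {"campaign_id": 2, "name": "summer"}]
impressions = [{"campaign_id": i % 10, "clicks": i} for i in range(100)]

with_filter, probed_with = hash_join(campaigns, impressions, "campaign_id")
without_filter, probed_without = hash_join(campaigns, impressions, "campaign_id", False)
print(len(with_filter), probed_with, probed_without)
```

Both runs return the same joined rows; the difference is how many large-side rows have to be carried into the join operator.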
Second, elasticity is another of ByteHouse's core capabilities. With elastic scaling, users can scale capacity out or in based on time, resource load, and other conditions, which reduces the burden of manual management and improves resource utilization. At the storage layer, ByteHouse adopts a serverless architecture that offers low cost and unlimited expansion. At the compute layer, ByteHouse follows the PaaS model and uses containerization to make compute stateless or weakly stateful. Entire compute groups are packaged by tenant and application and presented to users, so that tenants do not contend for resources or degrade one another's performance, and compute resources can be started and scaled out or in within seconds.
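The combination of time-based and load-based scaling described above can be sketched as a small decision function. This is a hypothetical policy written for illustration only; the `ScalingPolicy` fields, the thresholds, and the `desired_nodes` function are assumptions and do not reflect ByteHouse's actual scaling interface.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    min_nodes: int             # floor outside peak hours
    max_nodes: int             # hard ceiling
    target_cpu_percent: int    # utilization the group should settle at
    peak_hours: range          # hours of the day with a raised floor
    peak_min_nodes: int        # floor during peak hours


def desired_nodes(policy, current_nodes, cpu_percent, hour):
    """Return the node count suggested by load, clamped by the time-based floor and the ceiling."""
    # Load-based sizing: ceiling division keeps the arithmetic in integers.
    load_based = -(-(current_nodes * cpu_percent) // policy.target_cpu_percent)
    floor = policy.peak_min_nodes if hour in policy.peak_hours else policy.min_nodes
    return max(floor, min(policy.max_nodes, load_based))


policy = ScalingPolicy(min_nodes=2, max_nodes=16, target_cpu_percent=60,
                       peak_hours=range(19, 23), peak_min_nodes=8)

print(desired_nodes(policy, current_nodes=4, cpu_percent=90, hour=10))   # scale out on load
print(desired_nodes(policy, current_nodes=4, cpu_percent=30, hour=10))   # scale in when idle
print(desired_nodes(policy, current_nodes=4, cpu_percent=30, hour=20))   # time-based peak floor
print(desired_nodes(policy, current_nodes=10, cpu_percent=100, hour=10)) # capped at max_nodes
```

Because compute is stateless or weakly stateful, a decision like this can be applied quickly: adding or removing nodes does not require moving data.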
Finally, alongside efficiency, ByteHouse also focuses on helping users save costs. Its cloud-native architecture supports customized time-based elasticity, so users no longer need to pre-purchase resources for business peaks, which helps reduce costs by more than 30%. To help users simplify their architecture, ByteHouse offers richer analysis capabilities on a unified platform. It has launched a full-text search engine, a GIS engine, and a vector engine, so users get OLAP performance along with text retrieval, geospatial analysis, and vector retrieval without introducing additional systems. In terms of ecosystem compatibility, ByteHouse supports SQL ecosystems such as ClickHouse and MySQL as well as lakehouse integration, so that applications and data can be migrated at zero cost.
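The cost argument for time-based elasticity comes down to simple arithmetic: provisioning for the peak all day versus paying only for what each hour needs. The sketch below works through one made-up demand profile. The node counts and peak window are hypothetical and are not the basis of the 30% figure quoted above.

```python
# Hypothetical demand for one day: 10 nodes during a 6-hour evening peak, 5 nodes otherwise.
PEAK_NODES = 10
OFF_PEAK_NODES = 5
hourly_demand = [PEAK_NODES if 18 <= hour < 24 else OFF_PEAK_NODES for hour in range(24)]


def fixed_node_hours(demand):
    """Node-hours when capacity is pre-purchased for the peak and held all day."""
    return max(demand) * len(demand)


def elastic_node_hours(demand):
    """Node-hours when capacity follows demand hour by hour."""
    return sum(demand)


fixed = fixed_node_hours(hourly_demand)
elastic = elastic_node_hours(hourly_demand)
savings = 1 - elastic / fixed
print(f"fixed={fixed} elastic={elastic} savings={savings:.1%}")
```

The shorter and sharper the peak relative to the baseline, the larger the gap between the two numbers.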
In terms of application scenarios, Li Qun shared ByteHouse best practices in three scenarios: real-time data warehousing, an enterprise-level OLAP middle platform, and precision marketing for advertising.
Take precision marketing for advertising as an example. As the traffic dividend of the mobile internet fades, refined marketing has become mainstream. Selecting the most promising target audience from hundreds of millions of people is the essence of refined marketing, and it is also the challenge for the data warehouse that serves as the underlying engine.
Consider a short-drama advertising and marketing company that ByteHouse has served. On the one hand, the company needs to adjust its business strategy in real time, which requires data analysis and updates to complete within 3 seconds and concurrency to reach 2,000 QPS. On the other hand, real-time updates to massive data sets in marketing scenarios produce a large amount of data fragmentation, which reduces query performance and wastes storage space.
By adopting a joint solution from ByteHouse, Lianshan Cloud, and ByteDance, the company built a general-purpose solution for the short-drama industry featuring "one-click real-time synchronization, a minimalist architecture, and a low technical threshold", improving advertising data processing efficiency and delivery ROI.
In terms of results, ByteHouse uses multi-level indexing, including sort key indexes, partition key optimization, and skip indexes, to reduce the amount of data scanned by advertising and marketing queries. With tens of millions of queries per day, data is still returned within seconds, 5 times faster than before. For compute group isolation, ByteHouse provisions independent compute resources for reads and writes in advertising and marketing scenarios and, through a flexible SQL distribution mechanism, supports query concurrency above 2,000 QPS.
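To show how an index can cut the amount of data scanned, here is a minimal sketch of a min/max skip index over fixed-size blocks of rows: each block records its smallest and largest value, and a range query reads only the blocks whose range overlaps the predicate. This illustrates the general technique and is not ByteHouse's index format; the block size, data, and function names are hypothetical.

```python
def build_minmax_index(values, block_size):
    """Record (start offset, min, max) for each block of `block_size` values."""
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((start, min(block), max(block)))
    return index


def range_query(values, index, block_size, low, high):
    """Return (matching values, number of values actually scanned) for low <= v <= high."""
    matches, scanned = [], 0
    for start, block_min, block_max in index:
        if block_max < low or block_min > high:
            continue  # the whole block is outside the range, so skip it
        block = values[start:start + block_size]
        scanned += len(block)
        matches.extend(v for v in block if low <= v <= high)
    return matches, scanned


# Hypothetical column of 1,000 values stored in sort-key order, 100 values per block.
BLOCK_SIZE = 100
column = list(range(1000))
index = build_minmax_index(column, BLOCK_SIZE)

matches, scanned = range_query(column, index, BLOCK_SIZE, low=250, high=260)
print(len(matches), scanned, len(column))
```

The index is most effective when the data is ordered or clustered on the queried column, which is why it is usually paired with a sort key and partitioning.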
ByteHouse has reportedly also established in-depth cooperation with organizations across many industries, including the China Earthquake Networks Center, Lilith Games, and Geekbang Technology. With its new-generation cloud-native architecture, efficient and convenient operations and maintenance, and high-performance, more flexible real-time query capabilities, it provides a solid foundation for enterprises to seize digital opportunities and advance their digital transformation.