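Research on HBase Multi-Dimensional Index Construction and Query Optimization Based on Binary Blend Space-Filling Curve: master's thesis, short paper, thesis proposal, proposal defense slides, mid-term review form, mid-term defense slides, defense script, thesis defense slides, graduation project demo system, experiment records, English paper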
Source: www.biyezuopin.vip   Publisher: 毕业作品网站




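Next, based on the one-dimensional data produced by reducing dimensionality with the binary-mixed space-filling curve, the thesis proposes SFI-HBase, an HBase multi-dimensional index structure built on stream-batch processing. For insertion, SFI-HBase uses Spark Streaming to pre-aggregate and store the data bound for the full index layer, which supports concurrent inserts and achieves good insertion efficiency. For queries, the full index layer stores index information at several granularities, so the index granularity can be chosen according to the range or kNN query conditions, which improves query efficiency. In the experiments, insertion and query performance were compared against several HBase multi-dimensional indexing solutions, and the results show that the proposed index structure has the best overall time and space efficiency. Compared with MD-HBase, the classic linearization-based solution, SFI-HBase improves range query efficiency by about 10%, kNN query efficiency by more than five times, and insertion efficiency by more than ten times.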



Research on HBase Multi-Dimensional Index Construction and Query Optimization Based on Binary Blend Space-Filling Curve

Master Candidate: Hu Zhao

(Software Engineering)

Directed by Sun Qiao, Wang Xinyang and Sun Yueran


HBase is an open-source database built on the Hadoop Distributed File System (HDFS) that supports real-time reads and writes over massive data. To keep data insertion fast, HBase does not support multi-dimensional indexing. However, many HBase application scenarios need it, for example in e-commerce, the Internet of Things (IoT), and transportation, so numerous studies have proposed multi-dimensional indexing solutions for HBase. Among them, global indexing schemes based on linearization first map multi-dimensional data to a one-dimensional sequence and then build an index structure on that sequence, achieving good insertion and query efficiency. However, these schemes tend to be complex when ensuring data concurrency and consistency, and in most of them both the query efficiency and the linearization technique used still have room for improvement.

This thesis first improves linearization techniques at the theoretical level. Space-filling curves are a common linearization technique, and locality and clustering degree are two theoretical indicators of their quality. Among the various space-filling curves, the Z-curve is widely used because its encoding is simple and it supports binary partition pruning of multi-dimensional spaces. However, its locality and clustering degree are relatively poor. This thesis mixes the Z-curve, which supports binary partitioning, with other space-filling curves that have excellent theoretical properties, and proposes a new binary-mixed space-filling curve that supports binary partition pruning of multi-dimensional data spaces while offering better locality and clustering degree. Under the experimental conditions, the blend_z_hilbert curve, which mixes the Z-curve with the Hilbert curve, improves locality by more than 20% and clustering degree by more than 50% compared with the Z-curve alone.
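To make the Z-curve encoding mentioned above concrete, here is a minimal sketch of plain Z-order (Morton) encoding by bit interleaving in two dimensions. It is not the binary-mixed curve proposed in this thesis. The function names `z_encode` and `z_decode`, the `order` parameter, and the choice of placing the x bit ahead of the y bit in each pair are assumptions made for illustration.

```python
def z_encode(x: int, y: int, order: int) -> int:
    """Interleave the low `order` bits of x and y into one Z-value.

    Assumption: in each bit pair the x bit is the more significant one.
    """
    code = 0
    for i in range(order - 1, -1, -1):
        code = (code << 1) | ((x >> i) & 1)  # x bit of level i
        code = (code << 1) | ((y >> i) & 1)  # y bit of level i
    return code


def z_decode(code: int, order: int) -> tuple[int, int]:
    """Invert z_encode: split the interleaved bits back into (x, y)."""
    x = y = 0
    for i in range(order - 1, -1, -1):
        x = (x << 1) | ((code >> (2 * i + 1)) & 1)
        y = (y << 1) | ((code >> (2 * i)) & 1)
    return x, y


if __name__ == "__main__":
    print(z_encode(3, 5, 3))   # Z-value of cell (3, 5) on an 8 x 8 grid
    print(z_decode(27, 3))     # back to the cell coordinates
    # Two horizontally adjacent cells can land far apart on the curve:
    print(z_encode(3, 0, 3), z_encode(4, 0, 3))
```

Under these assumptions, the adjacent cells (3, 0) and (4, 0) on an 8 x 8 grid receive the Z-values 10 and 32, a small instance of the locality weakness of the Z-curve noted above.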

Next, based on the one-dimensional data produced by reducing dimensionality with the binary-mixed space-filling curve, the thesis proposes SFI-HBase, a multi-dimensional index structure for HBase built on stream-batch processing. For data insertion, SFI-HBase uses Spark Streaming to aggregate and store the data intended for the full index layer in advance, which supports concurrent insertions and achieves high insertion efficiency. For data querying, the full index layer of SFI-HBase stores index information at different granularities, so the index granularity can be selected according to the range or kNN query conditions, which improves query efficiency. In experimental comparisons of insertion and querying against several HBase multi-dimensional indexing solutions, the results show that the proposed index structure has the best overall time and space efficiency. Compared with MD-HBase, the classic linearization-based solution, SFI-HBase improves range query efficiency by about 10%, kNN query efficiency by more than five times, and insertion efficiency by more than ten times.
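The idea of answering a range query at a chosen granularity can be illustrated with a generic sketch on the plain Z-curve. This is not the SFI-HBase algorithm described in the thesis. The function `z_ranges`, its parameters, and the use of `max_depth` as a stand-in for index granularity are hypothetical, and the x bit is assumed to precede the y bit in each interleaved pair. The sketch recursively halves the space in each dimension, prunes quadrants that miss the query box, and returns the Z-value intervals that a scan over one-dimensional keys would need to read.

```python
def z_ranges(qx0, qy0, qx1, qy1, order, max_depth):
    """Cover the inclusive query box [qx0, qx1] x [qy0, qy1] with Z-value intervals.

    The grid is 2**order cells per side. Recursion stops at `max_depth`
    (hypothetical granularity knob), so a small value yields fewer,
    coarser intervals that may include cells outside the box.
    """
    out = []

    def visit(x0, y0, size, prefix, depth):
        x1, y1 = x0 + size - 1, y0 + size - 1
        if x1 < qx0 or x0 > qx1 or y1 < qy0 or y0 > qy1:
            return  # quadrant is disjoint from the query box: prune it
        inside = qx0 <= x0 and x1 <= qx1 and qy0 <= y0 and y1 <= qy1
        if inside or depth == max_depth or size == 1:
            lo = prefix * size * size  # all cells sharing this prefix are contiguous
            out.append((lo, lo + size * size - 1))
            return
        half = size // 2
        for q in range(4):  # quadrants in ascending Z order: q = 2 * x_bit + y_bit
            dx, dy = q >> 1, q & 1
            visit(x0 + dx * half, y0 + dy * half, half, prefix * 4 + q, depth + 1)

    visit(0, 0, 1 << order, 0, 0)

    merged = []  # `out` is already sorted, so only neighbours need merging
    for lo, hi in out:
        if merged and merged[-1][1] + 1 == lo:
            merged[-1] = (merged[-1][0], hi)
        else:
            merged.append((lo, hi))
    return merged


if __name__ == "__main__":
    print(z_ranges(1, 1, 2, 2, order=2, max_depth=2))  # fine granularity
    print(z_ranges(1, 1, 2, 2, order=2, max_depth=1))  # coarse granularity
```

On a 4 x 4 grid, the query box covering cells (1, 1) to (2, 2) yields four single-cell intervals at `max_depth=2` and one interval spanning the whole grid at `max_depth=1`. The trade-off is between the number of separate scans and the number of irrelevant cells read, which is the kind of choice a multi-granularity index makes per query.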

Lastly, to validate the usability of SFI-HBase in real scenarios, an offline analysis platform for ride-hailing services has been implemented on an open-source dataset. Through collection and computation, the platform automatically synchronizes data inserted into MySQL to the storage layer and full index layer of SFI-HBase, and efficiently queries ride-hailing order data on multi-dimensional conditions such as time and latitude/longitude.

Keywords: HBase, binary-mixed space-filling curve, multi-dimensional index, stream computing


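1 Introduction

1.1 Research Background and Significance

1.2 Current State of Research

1.2.1 Indexing Techniques

1.2.2 Space-Filling Curves

1.3 Main Work of the Thesis

1.4 Thesis Organization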

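2 Theoretical Background and Related Technologies

2.1 Big Data Storage and Computing Technologies

2.1.1 The Hadoop Ecosystem

2.1.2 HDFS

2.1.3 HBase

2.1.4 Spark Streaming

2.2 Indexing Techniques

2.2.1 Secondary Indexes

2.2.2 Clustered Indexes

2.2.3 Multi-Dimensional Indexes

2.3 Space-Filling Curves

2.4 Chapter Summary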

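3 Binary-Mixed Space-Filling Curve with Variable Order

3.1 Problem Description

3.2 Binary-Mixed Space-Filling Curve

3.2.1 Basic Concepts

3.2.2 Encoding Method

3.2.3 Decoding Method

3.2.4 Time and Space Complexity of Encoding and Decoding

3.2.5 Variable Order Settings

3.3 Definition and Explanation of Evaluation Metrics for Space-Filling Curves

3.3.1 Locality

3.3.2 Clustering Degree

3.4 Experimental Comparison and Analysis of Space-Filling Curves

3.5 Chapter Summary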

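4 Multi-Dimensional Streaming Full Index Based on the Binary-Mixed Space-Filling Curve

4.1 Problem Description

4.1.1 Data Skew and the Zipf Distribution

4.1.2 Problems with Linearization-Based Indexing Schemes


4.2.1 SFI-HBase Index Structure

4.2.2 Range Queries

4.2.3 kNN Queries


4.3.1 Insertion Efficiency Comparison

4.3.2 Range Query Efficiency Comparison

4.3.3 kNN Query Efficiency Comparison

4.4 Chapter Summary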

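5 Design and Implementation of a Massive Data Analysis Platform Based on SFI-HBase

5.1 System Requirements Analysis

5.1.1 Functional Requirements Analysis

5.1.2 Non-Functional Requirements Analysis

5.2 System Design

5.2.1 System Architecture Design

5.2.2 Functional Module Design

5.2.3 Database Design


5.3.1 Employee Management Module

5.3.2 Data Synchronization Module

5.3.3 Order Query Module

5.3.4 Statistics Module

5.4 System Testing

5.4.1 Functional Testing

5.4.2 Performance Testing

5.5 Chapter Summary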

6 Conclusion and Future Work

6.1 Conclusion

6.2 Future Work


