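The rapid growth of information technology has brought explosive growth in data. In the era of cloud computing and big data, running production, operations, and decision-making on data has become the norm, and a company's data storage and application systems act as the central nervous system of its business ecosystem.
Recently, Gartner, the third-party IT research and advisory firm, and Bingo Software (广州市品高软件股份有限公司) released their latest joint report, "Big data platform based on Data Lake Architecture" (《基于数据湖架构的大数据平台》), in which the two parties discuss the practical challenges, technical practice, and development trends of the data lake.
In the report, Gartner cites an article by its senior analyst Nick Heudecker, "Best Practices for Designing Your Data Lake". The report points out that a successful data lake architecture requires data managers to clearly separate the sourcing, mining, optimization, governance, and consumption of data, and it analyzes in turn how to design each of these modules so as to get the most out of the data.
This coincides with the design philosophy of BingoInsight, the Bingo Cloud data lake management platform recently released by Bingo Software. BingoInsight, a big data platform based on data lake architecture, is reported to be the first enterprise-grade private cloud data lake in China and a new-generation platform for data convergence, sharing, exchange, and opening.
The Bingo Cloud data lake provides shared storage for data of all forms and supports opening data across its whole life cycle, including the publishing, storage, cataloguing, use, and evaluation of data resources. Through federated data lakes it can also address data sovereignty and data trust across organizational boundaries, helping users realize value from their data quickly.
The English text of the report and its Chinese translation follow: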
基于数据湖架构的大数据平台
A Big Data Platform Based on Data Lake Architecture
引言
Introduction
进入大数据时代,数据碰撞比传统的数据分析激发更大价值,而数据碰撞的前提是先实现企业内各业务线条之间、跨企业组织之间以及跨行业的数据汇聚、共享和开放,这是大数据技术应用面临的一项现实挑战。品高数据湖通过深度融合云计算和大数据技术,提供分布式的面向多组织用户的大数据应用平台,帮助用户构建可持续交付的数据生态链,用户相互之间可以基于平台进行数据交换和数据碰撞,从而深入挖掘数据价值,促进各用户组织的数据应用创新,有效提升组织数据应用能力。
In the era of big data, data collision (the cross-matching of datasets from different sources) unlocks greater value than traditional data analysis. Its prerequisite is the convergence, sharing, and opening of data across business lines within an enterprise, across organizations, and across industries, which is a practical challenge for the application of big data technologies. Bingo Data Lake deeply integrates cloud computing and big data technology to provide a distributed big data application platform for users from multiple organizations, helping them build a sustainably delivered data ecological chain. Users can exchange and cross-match data on the platform, mining deeper value from their data, promoting innovation in data applications, and improving their organizations' data application capability.
挑战
Challenges
当前大数据技术曲线已发展至进入实际应用阶段的拐点,云计算技术的成熟和稳定保证了大数据技术的落地。大数据技术涵盖了从软硬件基础架构到具体应用的多层次,技术生态更为丰富,应用层面更是能带来全新的甚至是颠覆性的认知,因此大数据技术及应用具有巨大的价值探索空间,不确定性和更大的可能性并存,这是令人振奋的挑战与机遇。
The big data technology curve has reached the inflection point at which practical application begins, and the maturity and stability of cloud computing technologies make it possible to put big data technologies into practice. Big data technologies span many layers, from software and hardware infrastructure to specific applications. The technical ecosystem is richer, and at the application level these technologies can bring brand-new or even disruptive insights. There is therefore huge room to explore the value of big data technologies and their applications, with uncertainty and greater possibility coexisting, which presents both an exciting challenge and an opportunity.
从技术方面,大数据技术生态繁荣,发展日新月异, Hadoop、Spark,MPP、NoSQL、kafka、机器学习、深度学习不断发展,不同技术解决不同问题,企业的大数据平台必定是混合式的架构,如何有效融合异构的技术成为企业构建大数据平台必须面临的问题。
On the technical side, the big data ecosystem is booming. Technologies such as Hadoop, Spark, MPP, NoSQL, Kafka, machine learning, and deep learning continue to evolve, and each solves a different problem. An enterprise big data platform is therefore bound to be a hybrid architecture, and integrating heterogeneous technologies effectively has become a pressing issue for any enterprise building one.
从数据方面,跨部门、跨企业、跨行业的数据融合需求日趋明显,数据关联碰撞也是激发数据创新的基础,如何有效打破数据孤岛,解决数据主权,实现统一的数据汇聚和共享是企业面临的另外一个关键性问题。
On the data side, the demand for data fusion across departments, enterprises, and industries has become increasingly evident, and the association and cross-matching of data is the basis on which data innovation is sparked. Another critical issue facing enterprises is therefore how to break down data silos effectively, resolve questions of data sovereignty, and achieve unified data convergence and sharing.
正因如此,亚马逊、微软等大厂商凭借灵敏的市场嗅觉,顺应市场趋势,在2016年纷纷推出基于公有云的数据湖解决方案,以解决技术融合和数据融合的问题。另一方面,很多企业和组织因为存在内部数据融合以及有保护的对外数据交换等现实要求,开始对比借鉴公有云数据湖解决方案来规划组织内私有的数据湖平台建设。
Consequently, to address these technology and data integration problems, large vendors such as Amazon and Microsoft, following market trends, introduced data lake solutions based on public clouds in 2016. At the same time, because of practical requirements for internal data integration and protected external data exchange, many enterprises and organizations have begun to compare and draw on public cloud data lake solutions when planning their own private data lake platforms.
品高公司一直致力于耕耘企业级市场,在大数据概念兴起阶段逐步洞察到大数据技术在企业落地的挑战,经过两年研发在2017年初推出了基于私有云的数据湖整体解决方案,以帮助企业和组织构建私有的大数据平台,使组织级的大数据应用及价值创新成为可能。
Bingo has long focused on the enterprise market and, as the concept of big data took off, came to recognize the challenges of putting big data technologies into practice in enterprises. After two years of development, Bingo introduced its overall data lake solution based on private clouds in early 2017, aiming to help enterprises and organizations build private big data platforms and make organization-level big data applications and value innovation possible.
品高数据湖方案
Bingo Data Lake Solutions
品高数据湖依托BingoCloudOS(品高基础云产品),基于对象存储S3帮助企业构建数据湖,为广泛的政企客户组织内部门或分支机构之间、跨组织之间以及跨行业对接数据资源和进行数据应用创新提供了普适性的基础数据支撑环境。具体而言,品高数据湖提供涵盖数据存储、数据集成、数据处理、数据管理、数据消费等一站的数据服务,是可服务于数据全生命周期的解决方案。
Relying on BingoCloudOS (Bingo's base cloud product), Bingo Data Lake helps enterprises build data lakes on S3 object storage. It provides a general-purpose data support environment in which a wide range of government and enterprise customers can connect data resources and innovate in data applications across internal departments or branches, across organizations, and across industries. Specifically, Bingo Data Lake offers one-stop data services covering the storage, integration, processing, management, and consumption of data, serving the whole data life cycle.
品高数据湖解决方案包括5部分,分别为数据湖存储、数据集成、数据处理、数据管理和数据消费。同时,Gartner数据湖最佳设计实践报告指出,保障数据湖成功落地需要重点考虑数据集成、数据探索和开发、数据治理、数据消费等四个方面,可以说,品高数据湖解决方案与Gartner观点不谋而合。
Bingo Data Lake solutions comprise five parts: data lake storage, data integration, data processing, data management, and data consumption. Gartner's best-practices report on data lake design, meanwhile, points out that keeping data lake efforts on track requires attention to four areas: data acquisition, insight discovery and development, data governance, and analytics consumption. The Bingo Data Lake solutions line up closely with this view.
数据湖存储
Data Lake Storage
数据湖存储基于品高云对象存储技术实现,能够存储全数据类型(结构化数据、文本、图片、音视频等)的存储,数据湖存储提供以下特性保障数据湖的存储管理,
Data lake storage is based on BingoCloudOS object-based storage technology, and is able to realize the storage of all data types (structured data, texts, images, audio and video files, etc.). It has characteristics including:
高可用:可以实现99.999999999%的高可用性,支持大规模节点部署,单集群可以支持1024台服务器,单云16000台服务器,可以支撑海量数据存储、汇聚、共享
High availability: Availability as high as 99.999999999%; supporting large-scale node deployment with a single cluster supporting 1024 servers and a single cloud supporting 16000 servers, thus achieving massive data storage, convergence and sharing
良好的兼容性:兼容AWS S3协议,可与Hadoop、Spark、Greenplum等主流大数据计算技术无缝集成,快速支撑数据的开发、处理
Good compatibility: Compatible with the AWS S3 protocol and able to integrate seamlessly with mainstream big data computing technologies such as Hadoop, Spark, and Greenplum, so that data development and processing can be supported quickly
安全性:可以实现多个租户的数据隔离和共享,基于存储桶隔离多个租户的数据,并通过权限策略授权实现数据共享,支持服务端加密,实现敏感性数据的自动加密
Security: Supports both data isolation and data sharing among multiple tenants. Each tenant's data is isolated in its own buckets, and data is shared by granting access through permission policies. Server-side encryption is supported, so sensitive data is encrypted automatically
高性能:支持大文件切片、多节点并发传输,提升数据传输效率
High performance: Supports large-file slicing and concurrent multi-node transmission, improving data transmission efficiency
Cross-data-center support: Supports automatic replication and synchronization across data centers, so data is not confined to a single data center; supports global namespace management across data centers; and allows federated data lakes to be built.
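The bucket-based isolation and policy-based sharing described in the Security item above can be sketched as follows. This is a minimal illustration only: it uses the AWS S3 bucket policy grammar, and it assumes that the S3-compatible store accepts policies in this form, which the report does not confirm. The function name, bucket name, prefix, and principal ARN are all hypothetical.

```python
import json


def tenant_read_policy(bucket: str, prefix: str, reader_arn: str) -> dict:
    """Build an AWS-style bucket policy that lets one principal read one prefix.

    Everything outside the prefix stays private to the bucket owner, so the
    bucket remains the isolation boundary and the policy is the sharing grant.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing, but only for keys under the shared prefix.
                "Sid": "AllowListOfSharedPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": reader_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
            {
                # Allow reading objects under the shared prefix.
                "Sid": "AllowReadOfSharedPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": reader_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical tenant A sharing its "shared/" prefix with a tenant B user.
    policy = tenant_read_policy(
        "tenant-a-lake", "shared/", "arn:aws:iam::111122223333:user/tenant-b-analyst"
    )
    print(json.dumps(policy, indent=2))
```

The data owner keeps control in this model: sharing is a grant on the owner's bucket rather than a copy of the data, so revoking the statement ends the access.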
数据集成
Data Integration
数据集成是将数据提取、转换和加载的过程,以自动化的形式从源系统中提取数据,转换成一致的格式,并加载到数据湖中。品高数据湖提供数据湖集成工具,保障异构数据源能够快速、鲜活的流入数据湖。
Data integration is the process of extracting, transforming, and loading data: data is extracted automatically from source systems, converted into a consistent format, and loaded into the data lake. Bingo Data Lake provides integration tools that allow data from heterogeneous sources to flow into the data lake quickly and stay fresh.
易用:无需编码,通过可视化配置即可将数据发布至数据湖;
Ease of use: No coding required; data can be published to the data lake through visual configuration
异构数据源支持:支持与各种关系型数据库、Hadoop、NoSQL数据库、MPP等主流大数据技术无缝对接,自动获取数据至数据湖。
Heterogeneous data source support: Connects seamlessly to relational databases and to mainstream big data technologies such as Hadoop, NoSQL databases, and MPP, acquiring data into the data lake automatically
任务调度:采用分布式的集成任务调度,并支持分钟、小时、日、周、月等多种时间调度周期,提升数据湖的数据集成效率
Task scheduling: Uses distributed scheduling of integration tasks and supports scheduling cycles of minutes, hours, days, weeks, and months, improving the data integration efficiency of the data lake
多种控制策略:支持集成作业重试、作业依赖、人工重跑等多种作业控制策略,保障数据集成作业的SLA
Multiple control policies: Job control policies such as job retry, job dependence, and manual re-run supported, ensuring SLA of data integration jobs
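The job control policies listed above (retry and job dependence) can be illustrated with a small sketch. It is not Bingo's scheduler: the function `run_jobs` and the job names are hypothetical, the retry loop has no back-off, and time-based triggering and distribution across nodes are left out.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Iterable, List


def run_jobs(
    jobs: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Iterable[str]],
    max_retries: int = 2,
) -> List[str]:
    """Run jobs in dependency order, retrying a failed job up to max_retries times.

    Returns the names of the jobs that completed. A job is skipped (and counted
    as failed) when any job it depends on has failed.
    """
    graph = {name: depends_on.get(name, ()) for name in jobs}
    completed: List[str] = []
    failed = set()
    # static_order() yields every job after all of the jobs it depends on.
    for name in TopologicalSorter(graph).static_order():
        if any(dep in failed for dep in depends_on.get(name, ())):
            failed.add(name)
            continue
        for attempt in range(max_retries + 1):
            try:
                jobs[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    failed.add(name)
    return completed


if __name__ == "__main__":
    steps = {
        "extract": lambda: print("extracting"),
        "load": lambda: print("loading into the lake"),
    }
    print(run_jobs(steps, {"load": ["extract"]}))
```

A manual re-run, the third policy mentioned, would amount to calling `run_jobs` again with only the failed jobs and their downstream dependents.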
数据探索和开发
Data Discovery and Development
通过数据集成完成数据湖的数据集中后,品高提供内置的Hadoop套件,帮助用户快速探索、分析和处理数据湖的数据。
Once data has been gathered into the data lake through data integration, Bingo offers a built-in Hadoop suite that helps users quickly explore, analyze, and process it.
内置Hadoop套件运行在品高云LXC(Linux container)上,性能损耗接近物理机,实现Hadoop集群的云托管,一方面,使得大数据处理集群的运维能够交给云平台管理,另外一方面,使得大数据技术能够与云计算技术进行深度的融合
The built-in Hadoop suite runs on BingoCloudOS LXC (Linux containers), with performance close to that of a physical machine, and provides cloud hosting of Hadoop clusters. On the one hand, the operation and maintenance of big data processing clusters can be handed over to the cloud platform; on the other, big data technologies can be deeply integrated with cloud computing technologies
支持多租户使用统一Hadoop集群,多个部门、多个应用通过资源分配、资源隔离共享计算资源有效提升资源利用率
Multiple tenants using unified Hadoop clusters supported. Departments and applications can share the computing resources through resource allocation and isolation, thus effectively raising the level of resources utilization
支持Hadoop外部表直连数据湖的数据,可实现与本地数据碰撞关联计算,计算完后的数据可存储回数据湖
Hadoop external tables can connect directly to data in the data lake, so that lake data can be joined and cross-matched with local data, and the results can be stored back into the data lake
多种计算方法支持,除品高内置Hadoop外,其它Hadoop、CDH、Greenplum均可连接和使用数据湖的数据
Multiple compute engines supported. In addition to Bingo's built-in Hadoop, other Hadoop distributions, CDH, and Greenplum can also connect to and use the data in the data lake.
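The external-table access described above can be sketched by generating a Hive `CREATE EXTERNAL TABLE` statement whose `LOCATION` points at an object-store path. This is an illustration under stated assumptions: it assumes the lake is reached through Hadoop's S3A connector (with `fs.s3a.endpoint` pointing at the S3-compatible service) and that the data is stored as Parquet, neither of which the report specifies. The helper `external_table_ddl` and the table, bucket, and path names are hypothetical.

```python
import re
from typing import Dict

# Only plain identifiers are accepted, because they are spliced into SQL text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def external_table_ddl(table: str, columns: Dict[str, str], bucket: str, prefix: str) -> str:
    """Return HiveQL that defines an external table over files kept in the lake.

    The table only references the data: dropping an external table in Hive
    removes the table definition but leaves the underlying objects in place.
    """
    for name in (table, *columns):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    location = f"s3a://{bucket}/{prefix.strip('/')}/"
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    print(external_table_ddl(
        "lake_orders", {"order_id": "BIGINT", "amount": "DOUBLE"},
        "tenant-a-lake", "sales/orders",
    ))
```

Once such a table exists, a query can join it with tables held in the cluster's local storage, and the result can be written to another external table so that it lands back in the lake.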
数据管理
Data Management
数据湖的数据如果无有效的数据治理手段和优化措施,必将成为数据沼泽,为此,数据管理是数据湖建设非常重要的一环,品高通过元数据管理、数据目录、数据监控统计、数据质量等手段,实现数据湖数据的可读、可检索、可管理和可用性。
Without effective governance and optimization, a data lake is bound to turn into a data swamp. Data management is therefore a critical part of building a data lake. Through metadata management, a data catalog, data monitoring and statistics, and data quality management, Bingo keeps the data in the lake readable, retrievable, manageable, and usable.
支持通过元数据描述、注册数据湖数据样的元数据,包括数据资源名称、数据资源业务描述、数据资源字段信息、关联数据资源等信息,保障数据的可读性,并且能够自动从数据所属的数据源捕获相关元数据,减少元数据的维护工作
Metadata for the data in the data lake can be described and registered, including the name, business description, and field information of each data resource and its related data resources, which keeps the data readable. Relevant metadata can also be captured automatically from the data source that the data belongs to, reducing metadata maintenance work
数据湖的数据资源支持按主题、组织、专题等维度编目数据,保障数据的可检索性
Data resources of the data lake can be catalogued according to subjects, organizations and features, ensuring data’s findability
可通过数据及时性、数据完整性、数据一致性、数据准确性等多个维度监控和分析数据湖的数据质量,并能够实现数据质量监控、分析、检查、报告的闭环管理,此外,还支持数据消费者对数据资源的质量进行评价评论,持续提升数据湖的数据质量
Data quality can be monitored and analyzed in terms of data’s timeliness, integrity, consistency, and accuracy, and it’s possible to perform a closed-loop management of the monitoring, analysis, inspection and report of the data quality. Moreover, data consumers can also evaluate and comment on the quality of the data resources, which will continuously improve the data quality of the data lake
能够实现从数据集成、数据存储、数据处理、数据消费的全过程性能指标的监控分析,实时监控分析各个环节的处理情况,帮助管理人员第一时间掌握数据湖的整体运行状况,对于数据湖的运营、可持续发展具有指导意义
Performance indicators can be monitored and analyzed across the whole process of data integration, storage, processing, and consumption. Each stage is monitored and analyzed in real time, helping managers see the overall running state of the data lake promptly and guiding its operation and sustainable development
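Two of the quality dimensions named above, completeness and timeliness, can be made concrete with a short sketch. The definitions used here are assumptions chosen for illustration (the share of required fields that are filled, and the share of records updated within a freshness window); the report does not say how Bingo computes these measures. The names `QualityReport`, `assess_quality`, and `updated_at` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Dict, List, Sequence


@dataclass
class QualityReport:
    completeness: float  # share of required fields that are filled in
    timeliness: float    # share of records updated within the freshness window


def assess_quality(
    records: List[Dict[str, Any]],
    required: Sequence[str],
    now: datetime,
    max_age: timedelta,
    ts_field: str = "updated_at",
) -> QualityReport:
    """Score a batch of records on completeness and timeliness (0.0 to 1.0)."""
    if not records or not required:
        raise ValueError("need at least one record and one required field")
    # A field counts as filled when it is present and neither None nor "".
    filled = sum(
        1 for rec in records for field in required if rec.get(field) not in (None, "")
    )
    # A record counts as fresh when its timestamp is within max_age of now.
    fresh = sum(
        1
        for rec in records
        if isinstance(rec.get(ts_field), datetime) and now - rec[ts_field] <= max_age
    )
    return QualityReport(
        completeness=filled / (len(records) * len(required)),
        timeliness=fresh / len(records),
    )
```

Scores like these could be stored alongside each data resource's catalog entry and tracked over time, which is one way to feed the closed loop of monitoring, analysis, inspection, and reporting.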
数据分析与消费
Data Analysis and Consumption
当大量数据被采集到数据湖中,经过开发处理,再将处理后的可用数据存入回数据湖,为各类大数据分析应用提供数据支撑。
Massive data can be collected into the data lake and then developed and processed. Processed available data can then be stored back into the data lake, providing data support for various big data analysis applications.
品高数据湖方案中提供大数据分析平台,通过自助分析、数据可视化等多种方式让用户进行数据消费,自由发掘数据的潜能和价值。平台中内置仪表盘、数据源管理、数据报表、数据报告以及与地理位置信息结合的数据运算和展示等多种分析组件,同时还可以支持第三方的数据分析工具、以及用户自己开发的分析工具等。
Bingo Data Lake solutions provide a big data analytics platform that lets users consume data through self-service analysis, data visualization, and other means, and freely explore the potential and value of their data. The platform's built-in analysis components include dashboards, data source management, data reports, and data computation and display combined with geographic location information. Third-party data analysis tools and tools developed by users themselves are also supported.
提供内置的自助查询工具,可直接通过图形化界面建立数据分析,用户可通过配置数据模型、过滤条件、结果字段等查询条件,即可获得相应的数据分析结果报表
Built-in query tools can help to perform data analysis with graphic interfaces. Users can set query conditions such as data model, filter condition and result field, and acquire relevant result reports of the data analysis
提供多样化的数据分析呈现图表,如地图工具、数据报表、 数据脑图、数据报告等,依据数据可视化的科学方法以合理的方式为用户呈现分析结果,极大提升分析结论的可读性
A variety of charts for presenting analysis is provided, such as maps, data tables, data mind maps, and data reports. Results are presented according to sound data visualization methods, which greatly improves the readability of the conclusions
支持数据分析过程的协作共享,从源数据到得出分析结果的过程中,可分别由不同的用户分工协作,其中可能包含数据管理员、分析人员、一线业务人员等等,让各类用户均能够参与到数据分析的过程中来,并以社交化的方式分享数据分析报告
Collaboration and sharing is allowed during data analysis. In the process of getting a result from source data, users can coordinate and distribute responsibilities. Persons involved might include data managers, analysts, first-line business personnel, etc., which allows participation of various users in the process of data analysis and enables the sharing of data analysis reports in a socialized manner
应用场景
Application Scenarios
基于上文中介绍的品高数据方案的功能特性和创新点,以下列举三个适合于应用数据湖方案的应用场景。
Based on the features and innovations of the Bingo Data Lake solutions described above, three scenarios suited to a data lake solution are listed below.
场景1:跨组织边界的数据共享
Scenario 1: Data Sharing Across Organizational Boundaries
随着大数据的深入发展,各企业、政府纷纷建设了大数据平台,对于提升企业生产效率、销售模式以及政府治理水平等起到了有效的推动,数据应用不再局限于自身拥有的数据,要求通过多方数据共享后的汇聚分析实现更大力度的数据创新,进而促进企业或政府组织的治理质量提升。
As big data develops further, enterprises and governments have built their own big data platforms, which have effectively improved enterprises' production efficiency and sales models and governments' level of governance. Data applications are no longer confined to the data an organization owns itself. Converged analysis of data shared by multiple parties is needed to achieve greater data innovation and, in turn, to improve the quality of governance in enterprises and government organizations.
传统解决方案存在的问题
Problems with the Traditional Solutions
难实现异构技术融合
Difficulties in Achieving Heterogeneous Technology Convergence
组织机构产生的数据复杂多样,数据汇聚难度大。Hadoop 技术仅能够解决单个部门的数据存储和处理,但无法解决跨组织边界的技术融合和共享权限问题。跨组织边界的大数据技术路线不一,技术融合难度大。
The data that organizations generate is complex and diverse, which makes data convergence very difficult. Hadoop can handle the data storage and processing of a single department, but it cannot solve the problems of technology integration and sharing permissions across organizational boundaries. Organizations follow different big data technology routes, which makes integrating their technologies very difficult.
数据共享模式存在不足
Defects of Data Sharing Modes
跨组织边界的数据共享开放常见模式有数据查询接口、FTP 文件交换、大数据交易所等。
Common modes of data sharing across organization boundaries include data query interface, FTP file exchange, big data exchange, etc.
FTP 文件交换存在安全性弱、交换性能差、数据主权难界定、需拷贝数据等问题。
FTP file exchange suffers from weak security and poor exchange performance. It also makes data sovereignty hard to define and requires data to be copied.
大数据交易所缺乏数据汇聚基础,难以满足大量数据的关联碰撞。
Big data exchanges lack a foundation for data convergence and struggle to support the association and cross-matching of large volumes of data.
缺乏对运营体系的支持
Lack of Support for Operation Systems
大数据平台往往重技术、轻运营、轻质量,导致大数据平台无法可持续发展,有必要从数据评价、数据质量和数据开放指数建立全面的数据运营体系,保障数据共享的可持续发展。
Big data platforms often emphasize technology at the expense of operations and quality, which leaves them unable to develop sustainably. A comprehensive data operations framework, built on data evaluation, data quality, and a data openness index, is needed to keep data sharing sustainable.
应对与解决
Coping Solutions
针对以上问题和需求,品高数据湖方案通过深度融合云计算和大数据技术,以数据存储为基础,通过在本文所述的数据集成、数据开发、数据管理、数据消费四个方面的创新能力,解决组织部门之间、跨组织、跨行业的数据共享和开放,帮助组织构建可持续、健康的数据生态链,通过数据关联进一步挖掘数据价值,推动数据创新。
To address these problems and demands, Bingo Data Lake solutions deeply integrate cloud computing and big data technology. Building on data storage, and drawing on the capabilities described in this document for data integration, development, management, and consumption, they enable data sharing and opening across departments, organizations, and industries. This helps organizations build a healthy, sustainable data ecological chain and mine further value through data association, promoting data innovation.
场景2:促进基于数据的产学研的合作
Scenario 2: Promoting Production-Study-Research Cooperation Based on Data
行业生产数据与科研之间的矛盾
Contradiction Between Production Data and Research
政府机构、大型企业拥有大量生产数据,但技术储备和算法模型较弱,而高校、科研机构有技术、有算法模型,苦于没数据。
Government agencies and large enterprises hold large volumes of production data but are relatively weak in technical reserves and algorithm models, while universities and research institutions have the technology and the algorithm models but lack data.
利用数据湖建立生产和科研的桥梁
Building a Bridge Between Production and Research with a Data Lake
基于上述问题,可通过数据湖将行业生产数据脱敏后存储到数据湖,开放给科研机构、高校进行研究性探索,同时,研究成果可反馈应用于企业,从而有效促进基于数据的产学研合作。
On account of the problems above, production data can be desensitized through the data lake, stored in it, and opened to research institutions and universities for research purposes. Meanwhile, research results can in turn be applied by enterprises, which may effectively promote the Production-Study-Research Cooperation based on data.
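The desensitization step mentioned above can be sketched as keyed pseudonymization of identifying fields before records are published. This is only one possible technique, chosen here for illustration: the report does not say which desensitization method the platform applies, and replacing identifiers does not by itself guarantee that records cannot be re-identified. The function `desensitize` and the field names are hypothetical.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable


def desensitize(
    record: Dict[str, Any],
    secret_key: bytes,
    pseudonymize: Iterable[str],
    drop: Iterable[str] = (),
) -> Dict[str, Any]:
    """Return a copy of record that is safer to share with outside researchers.

    Fields in `drop` are removed. Fields in `pseudonymize` are replaced by a
    keyed hash (HMAC-SHA256), so equal inputs map to equal tokens and records
    can still be joined, but the original value is not exposed.
    """
    pseudo_fields = set(pseudonymize)
    drop_fields = set(drop)
    out: Dict[str, Any] = {}
    for field, value in record.items():
        if field in drop_fields:
            continue
        if field in pseudo_fields and value is not None:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()[:16]  # shortened token for readability
        else:
            out[field] = value
    return out
```

The key stays with the data owner. Because the mapping is stable for a given key, researchers can still count and join on the pseudonymized field without seeing the real identifier.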
场景3:联邦数据湖
Scenario 3: Federated Data Lake
跨组织的数据集中存在安全和信任问题
Security and Trust Issues in Cross-organizational Data Collection
在数据湖的建设过程中,会常常遇到跨企业间、不同政府部门间的跨组织数据湖建设。如果通过统一的数据湖来集中管理所有数据,数据的采集将会变得比较困难,包括组织间的数据互信、数据主权、数据安全等一系列问题。
When building data lakes, we frequently encounter cross-organizational projects that span enterprises or different government departments. If a single unified data lake manages all the data centrally, data collection becomes difficult, raising a series of issues such as mutual trust between organizations, data sovereignty, and data security.
利用联邦数据湖构建开放的数据生态
Data Ecology Based on Federated Data Lakes
应对上述情况,品高数据湖方案提供去中心化的联邦数据湖,平台基于联邦数据湖实现跨部门、跨组织的数据共享,并通过数据开放平台,将数据相关的目录、工具、服务、模型开放出来,各组织和数据模型相关软件开发商均可在上面进行数据协作,帮助企业、政府构建可持续发展的数据生态链。
To address the situation, Bingo Data Lake solutions offer federated data lakes that are decentralized. The platform based on federated data lakes can realize data sharing across departments and organizations. Relevant catalogs, tools, services and models can be opened for all organizations and relevant software developers to collaborate, thus helping enterprises and governments to establish a healthy and sustainable data ecological chain.
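One way to picture the decentralized model described above is a federated catalog: each organization keeps its data in its own lake and publishes only catalog entries, and a search runs across all published catalogs. The sketch below is a simplified illustration of that idea, not a description of how BingoInsight implements federation. `CatalogEntry`, `federated_search`, and the sample organizations are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    owner: str                 # organization that holds the data and controls access
    name: str                  # name of the data resource
    location: str              # where the data lives inside the owner's own lake
    keywords: Tuple[str, ...]  # subject terms used for discovery


def federated_search(
    catalogs: Dict[str, List[CatalogEntry]], keyword: str
) -> List[CatalogEntry]:
    """Search every member's published catalog without moving any data.

    Only metadata is consulted. Reading the data itself would still require
    a grant from the owner named in the matching entry.
    """
    needle = keyword.lower()
    hits = [
        entry
        for entries in catalogs.values()
        for entry in entries
        if needle in (k.lower() for k in entry.keywords)
    ]
    return sorted(hits, key=lambda e: (e.owner, e.name))


if __name__ == "__main__":
    catalogs = {
        "transport-bureau": [
            CatalogEntry("transport-bureau", "bus_trips", "s3://tb-lake/trips/", ("Traffic", "bus")),
        ],
        "weather-service": [
            CatalogEntry("weather-service", "road_ice_alerts", "s3://ws-lake/alerts/", ("weather", "traffic")),
        ],
    }
    for hit in federated_search(catalogs, "traffic"):
        print(hit.owner, hit.name, hit.location)
```

Keeping discovery (the shared catalog) separate from access (the owner's grant) is what lets each organization retain sovereignty over its data while still making it findable.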