Abstract (English)

Data Lakes have risen as a foundation for Big Data analytics and storage, largely as a response to the exponential increase in data volume and velocity, which became an insurmountable challenge for traditional analytics solutions based on relational databases. Data Lakes act as a central repository for the efficient storage and processing of Big Data in near real-time, built on distributed systems that, by design, provide support for scalability and parallelism. Moreover, as data evolved from mostly structured, through semi-structured, towards mostly unstructured, analytics solutions had to be adapted to these trends. By following the philosophy of storing data in their raw, unprocessed format, Data Lakes reduced processing overhead and thus enabled near-real-time data processing and analytics. This made Data Lakes, as a versatile analytics platform, the choice of data-driven organizations looking to gain insights from the Big Data collected and stored in them.

While Data Lakes have technically evolved into near-real-time processing and analytics solutions for heterogeneous data, the evolution of their formal definitions and models did not follow the same pace. This resulted in a certain discrepancy between the various proposed models, each focusing on a different area of definitions and modelling and, in the end, lacking a unified approach towards Data Lake architectures. As a consequence, many organizations struggle to leverage their implemented Data Lakes and realize the full potential of the stored data.

Spatio-temporal data streams have become widely prevalent data sources over time. This is the result of the growth of the devices commonly referred to as the Internet of Things (IoT), especially moving sensors, which are characterized by providing both a geolocation and a time component in the data they generate. The existing formal frameworks that describe spatio-temporal data types and their operations were not adapted to Big Data sources, especially those whose data are stored in the unstructured blob data type.

The contributions of this thesis are as follows: (1) a proposal of a unified data access layer for Data Lakes that supports spatio-temporal data types; (2) a proposal of an enhanced Data Lake model with data types for spatio-temporal data streams, realized through the unified data access layer; and (3) a proposal of algorithms supporting operations on spatio-temporal data types and their integration into the unified data access layer of the Data Lake.

Data Lakes were initially based on horizontally and vertically scalable distributed analytical systems that could handle the storage and processing of unstructured data, as Big Data come from heterogeneous data sources that generate high-volume, high-velocity, and high-variety data. These features, commonly abbreviated as the 3V, directly drive the need for system scalability, as they can vary over time. As mentioned previously, unlike Data Lakes, traditional data warehouses, which were previously the primary analytics solutions, do not have the necessary scalability (at least not to the full extent) to handle Big Data. Data Lakes follow the extract-load-transform (ELT) paradigm, or its extended form, extract-load-transform-analysis (ELTA). Unlike data warehouses, Data Lakes rely on distributed systems that allow fast data retrieval and transformation through parallel processing tasks.
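To make the ELT ordering concrete, the following minimal Python sketch (an editorial illustration, not part of the thesis) loads a record into a raw zone first and applies the transformation only in a later pipeline step, so the raw record remains available for reprocessing; all names (raw_zone, adjusted_zone, ingest, transform_step) are illustrative.

```python
# Minimal sketch of the ELT ordering: the record is persisted unmodified
# first (extract + load); the transformation runs as a later pipeline step,
# so the raw record remains available for reprocessing.
# All names here are illustrative and not part of the proposed model.
raw_zone: list[dict] = []
adjusted_zone: list[dict] = []

def ingest(record: dict) -> None:
    raw_zone.append(record)                      # load raw, unmodified data

def transform_step() -> None:
    for record in raw_zone:                      # transform after storage
        adjusted_zone.append({k.lower(): v for k, v in record.items()})

ingest({"SensorId": "s-17", "Lat": 45.81, "Lon": 15.98})
transform_step()
print(adjusted_zone)   # [{'sensorid': 's-17', 'lat': 45.81, 'lon': 15.98}]
```

Under ETL, the equivalent of the transformation step would have to complete before any data reached storage, which is exactly the bottleneck discussed below.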
As data warehouses use the extract-transform-load (ETL) paradigm, the integration process usually becomes a bottleneck that prevents near-real-time capabilities. The ELT paradigm is closer to the data ingestion process, a subpart of integration that does not place transformation, as the most intensive phase, before data storage. Additionally, an increase in the volume or rate of incoming data does not necessarily degrade real-time analytics support, as it can be addressed by scaling the entire system. The data pipelines used in Data Lakes therefore do not end with the data being stored; after data storage, the data can continue to be used in subsequently defined processes, which link complex data processing steps into a single sequence.

Data Lakes were originally defined with a two-zone architecture model, consisting of a landing zone and a raw data zone. Their initial definition could therefore be summarized as: a Data Lake is a heterogeneous data repository with data stored in their raw, unmodified format. The two-zone architecture provides Data Lakes with their basic features: raw data storage and scalable access to the stored data. However, as the usage of Data Lakes became more prominent, it became clear that the existing architecture models had to be adapted into ones oriented towards business support. The initial definition was therefore extended as follows: a Data Lake is a heterogeneous data repository with data stored in their raw, unmodified format and in other formats adjusted for further usage. Under this definition, Data Lakes are formally enabled to step away from the two-zone architecture towards multi-zone and multi-layered architectures.

Through a review of the existing literature in the Data Lake architecture domain, several prominent models were analyzed and compared: Inmon's data pond model, the layered Data Lake model, the Zaloni zone-based model and its derivative, the Data Vault based Data Lake model, the Lambda architecture, and the Data Lakehouse. The analysis showed that Inmon's model has a significant shortcoming in that stored data cannot be reprocessed because there is no raw data zone (the data cannot change their storage type), but it was the first to formally split hot and cold storage zones. The layered Data Lake model, besides storage, also defines the data pipelines, metadata storage and extraction, and the layers of data access. Zonal Data Lake models, on the other hand, define various zones with data stored in formats and models adjusted for further analysis and application access. The Lambda architecture defines data processing by splitting batch and fast (real-time) data in order to apply analysis over the complete dataset. The Data Lakehouse model combines the approaches of using a Data Lake for storing raw data and a Data Warehouse for processed data adjusted to a well-known structured analytical format.

In order to determine the types of data sources and the types of data streams as their subsets, a systematization was needed. It was carried out on two levels: the data source level and the data stream level. Structured data are data with a known structure that is defined in advance; both the data and their structure are known before the data appear in a system.
Structured data can be delivered through interfaces that support structured data, and the delivery is processed using embedded procedures or ETL processes. Structured data can be mapped to semi-structured data through automated metadata-based processes, and operators can be applied to structured data to support the processing of spatio-temporal data streams.

Semi-structured data have a given structure that is not as strict as in the case of structured data. Semi-structured data can be stored in their raw form through data ingestion processes, but they can also be included in data pipelines that perform data integration using object-relational mapping (ORM) procedures. These processes can result in the creation of new documents or of flattened tuples in 1NF, and the flattened tuples can further be normalized to 3NF using advanced metadata-based methods. Metadata for semi-structured data include information such as the data sources, entities, attributes, and mapping functions.

Unstructured data are data that have no fixed structure at the time of delivery; their structure is determined only after the data are received or stored in persistent memory. Unstructured data sources are minimally described with metadata about the source itself and the entities within the source. Mapping from unstructured data into semi-structured or structured data is done through dedicated mapping functions, and it is easier towards semi-structured than towards structured data. Moreover, unstructured data can be integrated with semi-structured and structured data in data pipelines using appropriate processes.

Data streams are a special type of data source characterized by the continuous delivery of time-variant data. They are likewise divided into structured, semi-structured, and unstructured data streams. Structured data streams contain all supported data types, have a previously known structure with an explicitly or implicitly defined time dimension, and connect to structured data interfaces that support the CQL query language and its associated operators, predominantly stream-to-relation operators. Semi-structured data streams are easier to connect to heterogeneous schemas within data pipelines than structured data streams and can be concatenated to create a new schema; when working with semi-structured data streams, stream-to-document operators are mainly used within data ingestion processes. Unstructured data streams do not have a previously defined structure, or their structure does not correspond to relational or semi-structured schemas; their features have to be extracted from metadata in order to define the temporal and spatial domain of the records.

Regarding the storage of spatio-temporal data in Data Lakes, structured data are stored as tuples in post-relational databases and require support for spatio-temporal data types and operations. Unstructured spatio-temporal data are stored as blobs, which in the case of HDFS can be exploited using systems such as SpatialHadoop and ST-Hadoop. Semi-structured spatio-temporal data, on the other hand, are stored in document and graph databases.
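As an illustration of the mapping of semi-structured records into flattened 1NF tuples described above, the following is a minimal Python sketch (an editorial illustration, with hypothetical attribute names) that flattens a nested, JSON-like record from a moving sensor into a single attribute-value row with atomic values:

```python
# Sketch: flattening a nested (semi-structured) record into a 1NF tuple.
# Nested objects are prefixed with their parent attribute name, so every
# resulting attribute holds a single atomic value (first normal form).
def flatten_to_1nf(document: dict, prefix: str = "") -> dict:
    flat = {}
    for attribute, value in document.items():
        name = f"{prefix}{attribute}"
        if isinstance(value, dict):
            flat.update(flatten_to_1nf(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

# Hypothetical spatio-temporal reading from a moving sensor.
reading = {"sensor": {"id": "s-17", "type": "gps"},
           "position": {"lat": 45.81, "lon": 15.98},
           "timestamp": "2023-05-04T12:00:00Z"}
print(flatten_to_1nf(reading))
# {'sensor_id': 's-17', 'sensor_type': 'gps', 'position_lat': 45.81,
#  'position_lon': 15.98, 'timestamp': '2023-05-04T12:00:00Z'}
```

Normalizing such flattened tuples further towards 3NF would additionally require the metadata (entities, attributes, and mapping functions) mentioned above.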
Research in the Data Lake domain included a review of existing definitions, an analysis of known Data Lake models, and the definition of a Data Lake feature set that covers aspects such as data quality, access control and restriction, data organization, data types, and scalability. Further research in this domain focused on data management in Data Lakes, on defining the areas of knowledge, and on the Lambda and Kappa architectures. Research in the domain of metadata in Data Lakes covered different types of metadata, such as structural, inherent, semantic, technical, operational, and business metadata, as well as metadata classes such as intra-object, inter-object, and global metadata. Additionally, the use of metadata in edge computing and the integration of Data Lakes with edge computing components were analyzed.

The research in the domain of semi-structured data focused on formally defining semi-structured data as an attribute-value pair list (AVPL), defined as (a ∈ A) ∧ (v ∈ V) → {(a, v)} ∈ D. In this definition, an attribute is an ordered set of one or more variables, and a value is an instance or a set of instances associated with the corresponding attribute and its variables. An object of the AVPL type can be a component of another AVPL object. Moreover, the schema construction operations are defined, as well as the construction of a schema from one or more AVPLs using a separate set of operators. Finally, the semi-structured schema operators are defined as follows: creation of a class by instance collection, object merging, class merging, object composition, class composition, and inclusion operators.

In the domain of spatio-temporal data types and operations, the focus was on covering the existing research and frameworks. Spatio-temporal data types and operations include base, spatial, and time data types, as well as range and temporal types. They also include the relational types identifier, tuple, and relation, as well as the stuple and stream types. Operators on these data types include stream-to-relation, relation-to-stream, and relation-to-relation operators, as well as sliding window operators. The analyzed formal framework for working with geospatial data streams is based on many-sorted algebra and second-order signatures, using the aforementioned data types and their operators. In this framework, a moving point is observed in three or four dimensions, with a mapping from a continuous time domain to a continuous spatial domain.

The initial step in adjusting the Data Lake model for spatio-temporal data streams was extending the existing set of data types and operations mentioned in the previous paragraph. The set of data types is extended with the blob (Binary Large Object) data type in order to support unstructured data. The blob data type was defined in the SQL:1999 standard, but its full impact was seen with the rise of unstructured data. Even though most of the existing database systems technically support it, Data Lakes are largely dependent on data stored as blobs, which makes the blob an indispensable part of the data type set. Moreover, it was identified that the basis of the adjusted Data Lake model should be semi-structured data, in order to support mappings between unstructured and semi-structured data, as well as between semi-structured and structured data.
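The following minimal Python sketch (an editorial illustration) shows one possible in-memory representation of an AVPL object as a list of attribute-value pairs, in which a value may itself be another AVPL (object composition), together with a simplified object-merging operator; the concrete representation and the conflict-resolution rule are assumptions made for illustration only:

```python
# Sketch: an AVPL (attribute-value pair list) object as a list of
# (attribute, value) pairs, where a value may itself be another AVPL
# (object composition), plus an illustrative object-merging operator.
AVPL = list[tuple[str, object]]

def merge_objects(left: AVPL, right: AVPL) -> AVPL:
    """Object merging: pairs from both objects, with the right-hand
    object taking precedence when the same attribute appears in both."""
    merged = dict(left)
    merged.update(dict(right))
    return list(merged.items())

position: AVPL = [("lat", 45.81), ("lon", 15.98)]
obj_a: AVPL = [("id", "s-17"), ("position", position)]   # composed object
obj_b: AVPL = [("id", "s-17"), ("speed", 4.2)]
print(merge_objects(obj_a, obj_b))
# [('id', 's-17'), ('position', [('lat', 45.81), ('lon', 15.98)]), ('speed', 4.2)]
```

The remaining schema operators (class creation by instance collection, class merging, composition, and inclusion) would operate on such objects, and on sets of them, in an analogous way.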
In addition, the operators supporting the usage of semi-structured data based on the defined AVPL, as well as the stream-related operators that support the usage of semi-structured data streams, were defined. There are two sets of operators: the first are document-to-document operators (selection, projection, and join), similar to relation-to-relation operators, and the second are stream-to-document operators. The second set is closely related to the existing sliding window operators defined for structured data streams; it consists of the operators NOW, LAST, UNBOUNDED, PAST, SINCE, and RANGE, together with the definition of spatial windows and partitioned windows based on AVPL.

The proposed Data Lake model is divided into two layers: the data processing layer and the data storage layer. The data processing layer has two sets of interfaces: external and internal. The external interfaces are responsible for collecting incoming data and transforming them into a form that can be accessed by external systems. The internal interfaces, on the other hand, are responsible for exposing data to the data storage systems and accepting data from them. Moreover, the proposed model integrates the two main approaches to Data Lake architecture, the zonal and the layered approach; by integrating them, the model provides a two-dimensional view of the Data Lake architecture, giving both approaches equal importance. The data storage layer has interfaces that communicate with data storage systems and support the storage of structured, semi-structured, and unstructured data. Each form of data storage requires a separate interface, and data are sent to and received from these interfaces in a form adjusted to that particular interface. This is the responsibility of the data access layer, which uses its internal interfaces to expose data in the right form for each data storage system.

In addition to the data processing and data storage layers, the proposed model also defines Data Lake zones, which are defined by metadata and represent a logical division of the Data Lake based on the organization of data extraction, storage, and presentation. The zones themselves affect the tasks and complexity of individual data pipelines, and they are designed to allow data to be consumed in different forms depending on the needs of a particular zone. Overall, the proposed Data Lake model is presented to external systems as a single unit, with all supported data types and operations specified in a uniform way. This provides a solid foundation for unified data access and processing and, to a certain extent, limits the technological solutions that can be used to implement the Data Lake.

The proposed Data Lake model is divided into six zones, each with a specific purpose and function: the data landing zone, the raw data zone, the adjusted data zone, the adjusted access zone, the research zone, and the application access zone. The data landing zone is where data are first received from external systems via the appropriate interfaces; data are temporarily stored in this zone until they are processed and moved into the raw data zone. The raw data zone contains persistently stored data in their original, unaltered form. The adjusted data zone contains data that have been adjusted through integration or refinement processes in data pipelines, while the adjusted access zone is built on top of the adjusted data zone and introduces an additional layer of access-level control. The research zone contains all data available for user analysis, and the application access zone contains data adapted for application usage.
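Returning to the window operators for semi-structured data streams defined earlier in this summary (NOW, LAST, UNBOUNDED, PAST, SINCE, RANGE), the following minimal Python sketch (an editorial illustration) shows a time-based RANGE window over a stream of AVPL-like records; the field names and the window length are illustrative assumptions:

```python
# Sketch: a time-based RANGE window over a stream of AVPL-like records,
# in the spirit of the stream-to-document operators described above.
from datetime import datetime, timedelta

def range_window(stream: list[dict], now: datetime, length: timedelta) -> list[dict]:
    """Return the records whose timestamp falls within the last `length`
    of stream time (RANGE window); NOW would correspond to a zero-length
    window and UNBOUNDED to an infinite one."""
    return [r for r in stream if now - length <= r["timestamp"] <= now]

now = datetime(2023, 5, 4, 12, 0, 0)
stream = [
    {"id": "s-17", "lat": 45.81, "lon": 15.98, "timestamp": now - timedelta(seconds=90)},
    {"id": "s-17", "lat": 45.82, "lon": 15.99, "timestamp": now - timedelta(seconds=30)},
]
print(range_window(stream, now, timedelta(seconds=60)))  # only the newer record
```

A spatial window would restrict the records by their spatial attributes in an analogous way, and a partitioned window would first group the records by a chosen attribute before applying the window.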
The data processing layer in a Data Lake is responsible for collecting, processing, and presenting structured, semi-structured, and unstructured data in a unified way. It consists of two main components: a metadata repository and a unified data access layer. The metadata repository enables high-quality data monitoring, while the unified data access layer adapts the form of incoming or outgoing data and directs it to the data storage interfaces. The data processing layer is implemented through a series of data pipelines, while communication with external systems is achieved through access interfaces. The data processing layer also serves as a virtualization layer, hiding the complexity of the Data Lake model and enabling data from multiple storage systems to be presented in a single structure.

The data storage layer of the Data Lake model consists of three storage system groups that support the storage of structured, semi-structured, and unstructured data. Data are routed to the appropriate storage system through the data processing layer based on the data in the metadata repository. This enables the storage of data in their original form, as well as their replication and adaptation for storage in other forms. The storage of the same data in different forms is particularly important in the research zone, where access to all data in the Data Lake is available. The use of NoSQL databases, such as document and graph databases, enables the storage and analysis of semi-structured data.

The Unified Data Access Layer (UDAL) is the middleware of the Data Lake model. Its tasks include collecting data from external sources using interfaces, creating data pipelines for transferring data between interfaces, and exposing data to data storage systems. The UDAL is based on data pipelines, which consist of at least one input interface (S_u) and one output interface (S_i), as well as a manager (M) with an associated procedure (p). Together, the data pipelines that work with the individual data structures form the UDAL. The input and output interfaces enable data to enter the data pipeline and to be exposed to the data storage layer, respectively.

The real-time interface manager in the UDAL is responsible for communication between the input interfaces where data are collected and the other managers within the data pipelines. It uses polling to periodically query the input interfaces for new data. There are two types of communication: the push method, where the data source communicates directly with the input interface, and the pull method, where the manager queries the input interface for new data. In both cases, the real-time interface manager passes the data into the data pipeline to begin the data ingestion process. Other managers within the data pipeline include the stream-to-batch mapping manager, the data structure mapping manager, and the data merging manager. These managers are responsible for converting data streams to data series, applying mapping operators to records, and integrating and merging structured and semi-structured data. The metadata repository manager is responsible for maintaining communication between the metadata repository and the managers in the UDAL; this allows for the central storage and retrieval of metadata about data structures and processes within the Data Lake. The subscription manager manages subscriptions to data and controls the correctness of data delivery to the output interfaces. The temporary raw data container manager provides temporary storage for data in the fast data layer of the Lambda architecture before they are written to persistent memory; this ensures the availability of the data through specialized output interfaces until they are processed and written to persistent memory. The manager also takes care of removing data from the temporary container when necessary.
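The following minimal Python sketch (an editorial illustration) puts the pipeline building blocks named above together: an input interface (S_u), a manager (M) with an associated procedure (p), and an output interface (S_i), with the manager pulling new records from the input interface by polling; all class and method names are illustrative and not part of the formal model:

```python
# Sketch of the pipeline building blocks named above: an input interface
# (S_u), a manager (M) with an associated procedure (p), and an output
# interface (S_i). The manager pulls new records from the input interface
# (polling) and passes the processed result to the output interface.
class InputInterface:                       # S_u
    def __init__(self) -> None:
        self._buffer: list[dict] = []
    def push(self, record: dict) -> None:   # push method: source writes directly
        self._buffer.append(record)
    def poll(self) -> list[dict]:           # pull method: manager queries for new data
        records, self._buffer = self._buffer, []
        return records

class OutputInterface:                      # S_i
    def expose(self, record: dict) -> None:
        print("stored:", record)

class Manager:                              # M with its procedure p
    def __init__(self, source: InputInterface, sink: OutputInterface, procedure):
        self.source, self.sink, self.procedure = source, sink, procedure
    def refresh(self) -> None:
        for record in self.source.poll():
            self.sink.expose(self.procedure(record))

s_u, s_i = InputInterface(), OutputInterface()
pipeline = Manager(s_u, s_i, procedure=lambda r: {**r, "ingested": True})
s_u.push({"id": "s-17", "lat": 45.81, "lon": 15.98})
pipeline.refresh()
```

In the push variant, the data source would call the input interface directly, while the manager's refresh would still move the collected records into the pipeline.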
The metadata repository is an important part of the Data Lake architecture. It enables the tracking of data structure and evolution, as well as unified data processing. The metadata repository is divided into structural, characteristic, and semantic metadata, as well as intra-object and inter-object metadata. Additionally, the metadata repository can be used as a basis for building a system for data processing and storage; it can be extended with additional features and classes to support data management processes. This makes the metadata repository a central place for storing and describing data processing, enabling dynamic data management. The metadata repository defines three groups of classes: intra-object data classes that define a particular object and its associated attributes, inter-object data classes that specify relationships between data sets and attributes, and data processing classes that specify the monitoring of data processing at batch granularity.

The unified data access layer (UDAL) contains algorithms related to merging data streams in different formats, such as merging an AVPL with AVPL streams and merging AVPL streams with relational data streams. These algorithms support the UDAL managers in performing tasks such as mapping records from tuples to AVPL, mapping blob files to AVPL, and managing records in the temporary container. Additionally, the UDAL includes a refresh algorithm for the real-time interface manager. These algorithms enable the UDAL to effectively process and manage incoming data.

In the review of advanced analysis tools, a qualitative analysis was performed on tools that are currently among the most widely used or significant in their respective domains. The tools were divided into several categories based on their features, and for each category the ability to perform spatial and spatio-temporal analyses was examined. The analyzed categories are general-purpose data science tools, framework extensions, extensions of programming languages with libraries, extensions within the Big Data ecosystem, and visualization tools in the context of spatio-temporal data.

The prototype of the Data Lake model was developed using the Docker container platform. The use of containers allows for interoperability between different systems and independence from the infrastructure. The Data Lake prototype includes several containers that are used to implement the data processing and data storage layers. The data storage layer contains containers for structured, semi-structured, and unstructured data. The data processing layer contains containers with applications and interfaces, as well as the metadata repository, including containers for Apache Kafka, Apache Zookeeper, and Apache Spark, among others, which form the UDAL. The prototype also includes containers for the real-time interface manager, the subscription manager, and the temporary container manager. Moreover, the UDAL implementation includes support for spatio-temporal data and the algorithms for merging data streams and mapping data structures. The metadata repository is implemented as a post-relational database in a SQL Server container. Overall, the use of containers in the prototype enables easy expansion and integration with additional systems and applications.
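As an illustration of the mapping and merging algorithms mentioned above, the following minimal Python sketch (an editorial illustration) maps a relational tuple to an AVPL-like record and merges the result with a semi-structured stream on a shared key; the column names and the join key are illustrative assumptions:

```python
# Sketch: mapping relational tuples to AVPL-like records and merging the
# result with a semi-structured (AVPL) stream on a shared key.
def tuple_to_avpl(columns: list[str], row: tuple) -> dict:
    """Map a relational tuple to an attribute-value pair record."""
    return dict(zip(columns, row))

def merge_streams(avpl_stream: list[dict], relational_stream: list[dict], key: str) -> list[dict]:
    """Merge records from both streams that share the same key value."""
    by_key = {r[key]: r for r in relational_stream}
    return [{**r, **by_key.get(r[key], {})} for r in avpl_stream]

columns = ["sensor_id", "owner"]
relational = [tuple_to_avpl(columns, ("s-17", "city-transport"))]
avpl = [{"sensor_id": "s-17", "lat": 45.81, "lon": 15.98}]
print(merge_streams(avpl, relational, key="sensor_id"))
# [{'sensor_id': 's-17', 'lat': 45.81, 'lon': 15.98, 'owner': 'city-transport'}]
```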
The Data Lake models have evolved from two-zone architectures to more complex multi-zone architectures with adjusted data, with mandatory support for the analytics and visualization of semi-structured and unstructured data. Data processing in the proposed Data Lake model is performed through a specialized middleware named the Unified Data Access Layer (UDAL), which has several tasks, including data integration and ingestion, retrieval and storage of data in the data storage layer, support for external applications, and support for near-real-time analysis. The UDAL is based on data pipelines and managers, and the metadata repository is a mandatory part of the data processing layer. Future research includes the automatic extraction of metadata from raw data and its incorporation into the UDAL, the integration of a system for entity matching and merging into the UDAL, and the development of custom Data Lake models depending on their deployment type.