Sponsor
Portland State University. Department of Computer Science
First Advisor
David Maier
Term of Graduation
Spring 2021
Date of Publication
5-3-2021
Document Type
Dissertation
Degree Name
Doctor of Philosophy (Ph.D.) in Computer Science
Department
Computer Science
Language
English
Subjects
Information storage and retrieval systems, Quantitative research
DOI
10.15760/etd.7566
Physical Description
1 online resource (xix, 373 pages)
Abstract
With the advancement of data-collection technology, and with more data available for data-intensive decision making, many analysts use client-based environments to analyze that data. Client-based environments, in which only a personal computer or a laptop is used to perform data-analysis tasks, are common. In such environments, multiple tools and systems are typically needed to accomplish data-analysis tasks. Stand-alone systems such as spreadsheets, R, MATLAB, and Tableau are usually easy to use, and they are designed for the typical, non-technical data analyst. However, these systems are limited in their data-analysis capabilities. More complex systems, such as database management systems (DBMSs), provide more powerful capabilities. However, these systems are complex for the typical data analyst to use, and each specializes in handling a specific category of tasks. For example, DBMSs specialize in data manipulation and storage, but they do not handle data visualization. As a consequence, the data analyst is usually forced to use multiple tools and systems to accomplish a single data-analysis task.
The more complex and demanding the data-analysis task is, the more tools and systems are typically needed to complete the task. One monolithic data-analysis system cannot satisfy all data-analysis needs. Embracing diversity, where each tool and system specializes in a specific area, allows us to satisfy more needs than a monolithic system could. For example, some tools can handle data manipulation, while others handle different types of visualizations. However, these tools typically do not interoperate, requiring the user to move data back and forth between them. The result is a significant amount of time wasted on extracting, converting, reformatting, and moving data. It would help to have a common client-side data platform that the data-analysis tools can all use to share their results, final and intermediate. Sharing intermediate results is especially important to allow the individual data-analysis steps to be inspected by a variety of tools. Moreover, sharing intermediate results can eliminate wasted computations by building on top of previous results instead of recomputing them, which can speed up the analysis process.
In this research, we explore a new data paradigm and data model that allow us to build a shared data-manipulation system for a client-based data-analysis environment. In this shared system, we factor out the data-manipulation process from data-analysis systems and tools (the front-end applications) into the shared system, leaving the front-end systems and tools to handle the unique tasks for which they are designed (e.g., visualizations). The shared system allows front-end applications to keep all or most of the intermediate results of their data-manipulation processes in main memory. These intermediate results can then be accessed and inspected by other front-end applications. This new data paradigm eliminates data movement between systems and significantly reduces unnecessary computations and repeated data-processing tasks, allowing the user to focus on the data-analysis task at hand. However, there are significant challenges to implementing such a shared system.
Keeping all or most intermediate results in main memory is extremely expensive in terms of space. We present two novel concepts, which we call SQL Graphs and block referencing, that allow us to take advantage of two dimensions, space (main memory) and time (CPU), to store intermediate results efficiently. SQL Graphs are the data structure we use to organize intermediate results, while block referencing is the mechanism we use to store the data of those results. Together, SQL Graphs and block referencing significantly reduce the space needed to store intermediate results and make it possible for our new data paradigm to operate in a client-based environment with limited capabilities (e.g., 8 GB of RAM).
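To make these two concepts concrete, here is a minimal sketch in Python based only on the description above; all names (Block, BlockRef, Node, resolve) are hypothetical and are not taken from jSQLe. A node in an SQL Graph holds an intermediate result either as materialized blocks or as references to blocks owned by an ancestor node, so a step that preserves whole blocks of its input pays a small dereferencing (time) cost instead of duplicating data (space):

    # Sketch: SQL Graph nodes with block referencing (hypothetical names).
    from dataclasses import dataclass, field
    from typing import List, Union

    @dataclass
    class Block:
        rows: list  # materialized tuples, stored once in main memory

    @dataclass
    class BlockRef:
        source: "Node"  # node that owns the underlying block
        index: int      # which of that node's blocks is referenced

    @dataclass
    class Node:
        query: str      # the SQL step that produced this intermediate result
        blocks: List[Union["Block", "BlockRef"]] = field(default_factory=list)
        parents: List["Node"] = field(default_factory=list)

        def resolve(self, i: int) -> Block:
            """Follow block references until a materialized block is reached."""
            b = self.blocks[i]
            while isinstance(b, BlockRef):
                b = b.source.blocks[b.index]
            return b

    # A filter that keeps both blocks of its input intact can reference them
    # rather than copy them, so this intermediate result costs almost no space.
    base = Node("SELECT * FROM trips",
                blocks=[Block(rows=["r1", "r2"]), Block(rows=["r3"])])
    kept = Node("SELECT * FROM trips WHERE speed > 0", parents=[base])
    kept.blocks = [BlockRef(base, 0), BlockRef(base, 1)]
    print(kept.resolve(0).rows)  # ['r1', 'r2'], read through the reference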
The main contributions of this research are as follows. We first describe and explore the problem that data analysts face. We then introduce a new data paradigm to solve this problem and explore the challenges that arise from implementing it. We then present the two new concepts, SQL Graphs and block referencing, which solve the space-cost problem, and introduce another new structure, which we call a dereferencing layout index (DLI), to solve the time-cost problem. We evaluate these new techniques and concepts experimentally using a prototype system that we implemented, called the jSQL environment (jSQLe), and report on the effectiveness of the system. Finally, we discuss future work arising from this research and conclude the dissertation.
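The abstract introduces the DLI only by name and purpose (reducing the time cost of following block references), so the following is a speculative continuation of the sketch above rather than jSQLe's actual design: a per-node index that resolves every block reference once, so later reads go straight to materialized data.

    # Hypothetical dereferencing layout index (DLI): precompute, for one node,
    # the materialized block behind each position, so repeated reads avoid
    # walking reference chains (continues the Node/Block sketch above).
    def build_dli(node: "Node") -> list:
        return [node.resolve(i) for i in range(len(node.blocks))]

    dli = build_dli(kept)   # pay the dereferencing cost once up front...
    rows = dli[1].rows      # ...then read materialized rows directly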
Rights
© 2021 Basem Ibrahim Elazzabi
In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
Persistent Identifier
https://archives.pdx.edu/ds/psu/35921
Recommended Citation
Elazzabi, Basem Ibrahim, "Storing Intermediate Results in Space and Time: SQL Graphs and Block Referencing" (2021). Dissertations and Theses. Paper 5693.
https://doi.org/10.15760/etd.7566
Comments
This work was supported by funding from TransPort (the Portland, Oregon regional coordinating committee for intelligent transportation systems), the Southwest Washington Regional Transportation Council (RTC), the Transportation Research and Education Center (TREC) at Portland State University, and the Intel Science and Technology Center for Big Data (ISTC-BD).