On this documentation page, we provide a technical description of the (cloud-based) tools, infrastructure and programming models used in the application.
To enable Big Data processing in our project, we used Google Cloud Platform (GCP) - one of the world's most popular providers of public cloud computing services.
Google Cloud Storage is an online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing capabilities.
Usage: Storage of raw and processed data (CSV tables and PNG plots) in GCS buckets.
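As a minimal sketch (the bucket and object names below, such as marketquake-data, are placeholders rather than the project's actual paths), data can be uploaded to and downloaded from a bucket with the google-cloud-storage client:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("marketquake-data")  # placeholder bucket name

# Upload a processed CSV table and a PNG plot to the bucket
bucket.blob("processed/summary.csv").upload_from_filename("summary.csv")
bucket.blob("plots/summary.png").upload_from_filename("summary.png")

# Download a raw CSV file for local inspection
bucket.blob("raw/sales.csv").download_to_filename("sales.csv")
```

In the PySpark jobs themselves, the same objects are simply addressed by their gs:// URIs.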
Dataproc is a managed Spark and Hadoop service for submitting jobs for batch processing, querying, streaming, and machine learning. It works fastest when the cluster is allocated in the same region as the GCS bucket it reads from and writes to.
Usage: Running a Dataproc cluster to submit PySpark jobs for data processing and analysis.
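As an illustrative sketch only (the project ID, region, cluster name and script path are placeholders, and in practice jobs can equally be submitted through the Cloud Console or the gcloud CLI), a PySpark job can be submitted to the cluster with the google-cloud-dataproc client:

```python
from google.cloud import dataproc_v1

project_id = "marketquake-project"    # placeholder project ID
region = "europe-west1"               # placeholder region (matching the GCS bucket)
cluster_name = "marketquake-cluster"  # placeholder cluster name

# The client must point at the regional Dataproc endpoint of the cluster
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Describe a PySpark job that runs a script stored in the GCS bucket
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://marketquake-data/scripts/analysis.py"},
}

# Submit the job and wait for it to finish
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
print(f"Driver output available at: {response.driver_output_resource_uri}")
```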
In particular, we allocated a cluster with the following configuration (see screenshot):
Google Cloud Monitoring is an integrated service for collecting metrics, events, and metadata, with uptime monitoring, dashboards, and alerts.
Usage: Monitoring the performance and cost efficiency of our cloud-based app to improve the overall scalability.
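A minimal sketch, assuming the google-cloud-monitoring client and a placeholder project ID, of how the cluster VMs' CPU utilization over the last hour could be pulled programmatically (the same metrics are also available in the Cloud Console dashboards):

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/marketquake-project"  # placeholder project ID

# Look at the last hour of data
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# Query CPU utilization of the Compute Engine VMs backing the cluster
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    peak = max((p.value.double_value for p in series.points), default=0.0)
    print(series.resource.labels["instance_id"], f"peak CPU: {peak:.2%}")
```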
As described on the Home page, conventional programming cannot handle the amount of data we need for the analysis.
That is why we used the PySpark framework for our Big Data analysis.
PySpark is the Python API for Apache Spark - an open source, distributed computing framework and set of libraries for real-time, large-scale data processing.
Usage: Reading CSV files into PySpark DataFrames to filter, merge and analyze tabular data, while delegating memory management and the distribution of work across cluster nodes to Spark.
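As a simplified sketch (the file paths, column names and aggregation below are placeholders, not the project's actual schema), a PySpark job in this setup looks roughly like this:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("MarketQuake").getOrCreate()

# Read raw CSV tables from the GCS bucket into DataFrames (placeholder paths/columns)
sales = spark.read.csv("gs://marketquake-data/raw/sales.csv", header=True, inferSchema=True)
markets = spark.read.csv("gs://marketquake-data/raw/markets.csv", header=True, inferSchema=True)

# Filter, merge and aggregate; Spark distributes the work across the cluster nodes
result = (
    sales.filter(F.col("date") >= "2020-01-01")
         .join(markets, on="market_id", how="inner")
         .groupBy("market_id")
         .agg(F.avg("price").alias("avg_price"))
)

# Write the processed table back to the bucket
result.write.csv("gs://marketquake-data/processed/avg_price", header=True, mode="overwrite")
```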
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. It remains one of the most popular choices for data analysis due to its flexibility, scalability, and impressive range of libraries.
Usage: Implementing the programming logic in the PySpark framework, performing preprocessing and intermediate data operations (Pandas), and plotting to visualize the results (matplotlib).
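For example (a sketch with placeholder file names and columns), an intermediate result produced by a PySpark job can be post-processed with Pandas and turned into a PNG plot that is then stored in the GCS bucket:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load an intermediate CSV result (placeholder path and columns)
df = pd.read_csv("avg_price.csv").sort_values("avg_price", ascending=False)

# Plot and save as PNG, ready to be uploaded to the bucket
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(df["market_id"].astype(str), df["avg_price"])
ax.set_xlabel("Market")
ax.set_ylabel("Average price")
ax.set_title("Average price per market")
fig.savefig("avg_price.png", dpi=150, bbox_inches="tight")
```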
To complete our project, we also made use of other tools needed for team collaboration, testing and presenting our results.
GitHub is a platform and cloud-based service for software development and version control, allowing multiple developers to store and manage their code.
Usage: Code base, version control, asynchronous collaboration, backup management.
Google Sites is a web-based application provided by Google that allows users to create and publish their own websites without the need for extensive coding or web development skills.
Usage: Hosting and publishing this project documentation.
Google Slides is a cloud-based presentation software that provides a platform for collaboratively creating, editing, and delivering presentations online.
Usage: Presenting the progress and the results of the MarketQuake study.
Visual Studio Code is a lightweight, open-source code editor developed by Microsoft. It is a highly customizable and extensible editor that supports a wide range of programming languages.
Features:
IntelliSense for smart code completion.
Built-in Git integration.
Debugging support.
Extension support for additional languages and features.
Cross-platform compatibility (Windows, macOS, Linux).
Integrated terminal for command-line interaction.
PyCharm is a dedicated integrated development environment specifically designed for Python development by JetBrains.
Features:
Intelligent code completion and analysis.
Integrated testing tools.
Powerful debugging capabilities.
Version control integration (Git).