Greenplum Database Architecture
Greenplum Database
Built to support Big Data Analytics, Greenplum Database manages, stores, and analyzes terabytes to petabytes of data. Users can see 10 to 100 times better performance than with traditional RDBMS products, a result of Greenplum’s shared-nothing MPP architecture, high-performance parallel dataflow engine, and advanced gNet software interconnect technology.
Greenplum Database was conceived, designed, and engineered to allow customers to take advantage of large clusters of increasingly powerful, increasingly inexpensive commodity servers, storage, and Ethernet switches. Greenplum customers can gain immediate benefit from deploying the latest commodity hardware innovations.
1. Massively Parallel Processing Architecture for Loading and Query Processing
The Greenplum Database architecture provides automatic parallelization of data loading and queries. High-performance loading uses Scatter/Gather Streaming technology, supporting loading speeds greater than 10 terabytes per hour per rack, with linear scalability. All data is automatically partitioned across all nodes of the system, and queries are planned and executed using all nodes working together in a highly coordinated fashion.
2. Polymorphic Data Storage-MultiStorage/SSD Support
Greenplum Database introduced Polymorphic Data Storage™, including tunable compression and support for both row- and column-oriented storage within a database. With Greenplum Database, this capability is extended to allow the placement of data on specific storage types, such as SSD media or NAS archival stores. Customers can easily leverage multiple storage technologies to enable the ideal balance between performance and cost.
3. Multi-level Partitioning with Dynamic Partition Elimination
Tables can be flexibly partitioned by date, range, or value. Partitioning is specified using DDL and supports an arbitrary number of levels. Dynamic Partition Elimination skips partitions that are irrelevant to a query, which significantly reduces the amount of data scanned and results in faster query execution.
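The idea behind partition elimination can be sketched in a few lines of Python. This is an illustration only, not Greenplum's planner or executor: the partition names, the monthly ranges, and the function `partitions_to_scan` are all hypothetical.

```python
from datetime import date

# Hypothetical monthly range partitions: name -> (start inclusive, end exclusive).
PARTITIONS = {
    "sales_2012_01": (date(2012, 1, 1), date(2012, 2, 1)),
    "sales_2012_02": (date(2012, 2, 1), date(2012, 3, 1)),
    "sales_2012_03": (date(2012, 3, 1), date(2012, 4, 1)),
}


def partitions_to_scan(query_start, query_end):
    """Return the partitions whose range overlaps [query_start, query_end)."""
    return [
        name
        for name, (start, end) in PARTITIONS.items()
        if start < query_end and query_start < end
    ]


# A query restricted to mid-February touches one partition out of three.
selected = partitions_to_scan(date(2012, 2, 10), date(2012, 2, 20))
print(selected)
```

Partitions that cannot contain matching rows are never read, which is where the reduction in scanned data comes from.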
4. Out-of-the-Box Support for Big Data Analytics
Greenplum Database delivers an agile, extensible platform for in-database analytics, leveraging the system’s massively parallel architecture. It natively runs MapReduce programs within its parallel engine and ensures automatic installation and updates of functional extensions, such as in-database GeoSpatial functions, PL/R, PL/Java, PL/Python, and PL/Perl.
5. High Performance gNet™ for Hadoop
Greenplum Database enables high performance parallel import and export of compressed and uncompressed data from Hadoop clusters using gNet for Hadoop, a parallel communications transport with the industry's first direct query interoperability between Greenplum Database nodes and corresponding Hadoop nodes. To further streamline resource consumption during load times, custom-format data (binary, Pig, Hive, etc.) in Hadoop can be converted to GPDB Format via MapReduce, and then imported into Greenplum Database. This is a high-speed direct integration option that provides an efficient and flexible data exchange between Greenplum Database and Hadoop. gNet for Hadoop is available for both Greenplum HD Community Edition and Enterprise Edition.
6. Analytics and Language Support
Greenplum Database provides analytical functions (t-statistics, p-values, and Naïve Bayes) for advanced in-database analytics. These functions provide the metrics needed for variable selection to improve the quality of a regression model, and they make it easier to understand and reason about edge cases. Greenplum Database also offers parallel analysis capabilities for mathematicians and statisticians, with support for R, linear algebra, and machine-learning primitives.
7. Dynamic Query Prioritization
Greenplum’s Advanced Workload Management is extended with patent-pending technology that provides continuous real-time balancing of the resources of the entire cluster across all running queries. This gives DBAs the controls they need to meet workload service-level agreements in complex, mixed-workload environments.
8. Self-Healing Fault Tolerance and Online Segment Rebalancing
Greenplum's fault-tolerance capabilities provide intelligent fault detection and fast online differential recovery, lowering TCO and allowing cloud-scale systems with the highest levels of availability. Greenplum Database can also perform post-recovery segment rebalancing without taking the database offline. All client sessions remain connected and the database remains functional, with no downtime, while the system is recovered back into an optimal state.
9. Simpler, Scalable Backup with Data Domain Boost
Greenplum Database includes advanced integration with EMC Data Domain deduplication storage systems via EMC Data Domain Boost for faster, more efficient backup. This integration distributes parts of the deduplication process to Greenplum database servers, enabling them to send only unique data to the Data Domain system. This dramatically increases aggregate throughput, reduces the amount of data transferred over the network and eliminates the need for NFS mount management.
10. Health Monitoring and Alerting
The Greenplum Database provides integrated email and SNMP notification in the case of any event needing IT attention. The system can also be configured to call home to EMC support for automatic event notification and advanced support capabilities.
Components that comprise a Greenplum Database system, and how they work together
The Greenplum Master
The Greenplum Segments
The Greenplum Interconnect
Redundancy and Failover in Greenplum Database
Parallel Data Loading
Management and Monitoring
Greenplum Master
The master is the entry point to the Greenplum Database system. It is the database process that accepts client connections and processes the SQL commands issued by the users of the system.
Since Greenplum Database is based on PostgreSQL, end-users interact with Greenplum Database (through the master) as they would a typical PostgreSQL database. They can connect to the database using client programs such as psql or application programming interfaces (APIs) such as JDBC or ODBC.
The master is where the global system catalog resides (the set of system tables that contain metadata about the Greenplum Database system itself). However, the master does not contain any user data; data resides only on the segments. The master authenticates client connections, processes incoming SQL commands, distributes the workload among the segments, coordinates the results returned by each segment, and presents the final results to the client program.
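The master's dispatch-and-combine role can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Greenplum code: `SEGMENT_DATA`, `run_on_segment`, and `master_sum` are hypothetical names, and threads stand in for separate segment hosts.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for segments: each holds a distinct slice of one table.
# In this model the "master" function below never stores rows itself.
SEGMENT_DATA = {
    "seg0": [120, 80],
    "seg1": [200],
    "seg2": [50, 50, 100],
}


def run_on_segment(segment_id):
    """Work done by one segment: a partial SUM over its local rows."""
    return sum(SEGMENT_DATA[segment_id])


def master_sum():
    """Master role: dispatch to every segment, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(SEGMENT_DATA)) as pool:
        partials = list(pool.map(run_on_segment, SEGMENT_DATA))
    return sum(partials)


print(master_sum())
```

Each segment computes over its own rows in parallel, and only the small partial results travel back to be combined.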
Greenplum Segments
In Greenplum Database, the segments are where the data is stored and where the majority of query processing takes place. User-defined tables and their indexes are distributed across the available number of segments in the Greenplum Database system, each segment containing a distinct portion of the data. Segment instances are the database server processes that serve segments. Users do not interact directly with the segments in a Greenplum Database system, but do so through the master.
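As a rough illustration of how rows can end up on distinct segments, the following Python sketch assigns each row to a segment by hashing a key column. Hash-based placement and the CRC32 function are assumptions made for the example, as are the names `segment_for` and `distribute`; the hash Greenplum actually uses is not shown here.

```python
import zlib


def segment_for(key, num_segments):
    """Map a distribution key to a segment index with a stable hash (CRC32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments):
    """Place each row on exactly one segment, so segments hold disjoint portions."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_column], num_segments)].append(row)
    return segments


rows = [{"id": i, "amount": i * 10} for i in range(8)]
for index, portion in enumerate(distribute(rows, "id", 4)):
    print(index, [row["id"] for row in portion])
```

Because the hash is deterministic, the same key always maps to the same segment, and every row appears on exactly one segment.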
In the recommended Greenplum Database hardware configuration, there is one active segment per effective CPU or CPU core. For example, if your segment hosts have two dual-core processors, you would have four primary segments per host.
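The sizing rule above is simple arithmetic; a minimal sketch follows, where the function name `primary_segments_per_host` is made up for illustration.

```python
def primary_segments_per_host(processors, cores_per_processor):
    """One active (primary) segment per CPU core, per the recommendation above."""
    return processors * cores_per_processor


# Two dual-core processors per host, as in the example above.
print(primary_segments_per_host(2, 2))  # 4
```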
Greenplum Interconnect
The interconnect is the networking layer of Greenplum Database. When a user connects to a database and issues a query, processes are created on each of the segments to handle the work of that query (see “Understanding Parallel Query Execution”). The interconnect refers to the inter-process communication between the segments, as well as the network infrastructure on which this communication relies. The interconnect uses a standard Gigabit Ethernet switching fabric.
By default, the interconnect uses UDP (User Datagram Protocol) to send messages over the network. The Greenplum software performs the additional packet verification and checking that UDP does not, so reliability is equivalent to TCP (Transmission Control Protocol) while performance and scalability exceed those of TCP. With TCP, Greenplum has a scalability limit of 1000 segment instances. To remove this limit, UDP is now the default protocol for the interconnect.
Redundancy and Failover
Greenplum Database has deployment options to provide for a system without a single point of failure. This can be achieved using the following redundancy components of Greenplum Database:
Segment Mirroring
Master Mirroring
Interconnect Redundancy
Segment Mirroring
When you deploy your Greenplum Database system, you have the option to configure mirror segments. Mirror segments allow database queries to fail over to a backup segment if the primary segment becomes unavailable. To configure mirroring, you must have enough hosts in your Greenplum Database system so that the mirror segment always resides on a different host than its primary segment.
Segment Failover and Recovery
When mirroring is enabled in a Greenplum Database system, the system will automatically fail over to the mirror copy whenever a primary copy becomes unavailable. A Greenplum Database system can remain operational if a segment instance or host goes down as long as all portions of data are available on the remaining active segments.
Whenever the master cannot connect to a segment instance, it marks that segment instance as down in the Greenplum Database system catalog and brings up the mirror segment in its place. A failed segment instance will remain out of operation until steps are taken to bring that segment back online. A failed segment can be recovered while the system is up and running. The recovery process only copies over the changes that were missed while the segment was out of operation.
If you do not have mirroring enabled, the system will automatically shut down if a segment instance becomes invalid. You must recover all failed segments before operations can continue.
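The failover behavior described in this section can be modeled in a short Python sketch. It is a simplified illustration, not the actual catalog schema or fault-detection code: the `SegmentInstance` fields, the host names `sdw1` and `sdw2`, and the `handle_failure` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SegmentInstance:
    content_id: int   # which portion of the data this instance serves
    host: str
    role: str         # "primary" or "mirror"
    up: bool = True


class SystemDown(Exception):
    """Raised when some portion of the data has no live instance left."""


def handle_failure(catalog, failed):
    """Mark the failed instance down and promote its mirror, if one is up."""
    failed.up = False
    for inst in catalog:
        if inst.content_id == failed.content_id and inst is not failed and inst.up:
            inst.role = "primary"
            return inst
    raise SystemDown(f"no live copy of content {failed.content_id}")


catalog = [
    SegmentInstance(0, "sdw1", "primary"),
    SegmentInstance(0, "sdw2", "mirror"),  # mirror lives on a different host
]
acting = handle_failure(catalog, catalog[0])
print(acting.host, acting.role)
```

With a mirror present, the surviving copy takes over; without one, the sketch raises `SystemDown`, mirroring the shutdown behavior of an unmirrored system.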
Master Mirroring
You can also optionally deploy a backup or mirror of the master instance on a separate host from the master node. A backup master host serves as a warm standby in the event that the primary master host becomes inoperable. The standby master is kept up to date by a transaction log replication process, which runs on the standby master host and keeps the data between the primary and standby master hosts synchronized.
If the primary master fails, the log replication process is shut down, and the standby master can be activated in its place. Upon activation of the standby master, the replicated logs are used to reconstruct the state of the master host at the time of the last successfully committed transaction. The activated standby master effectively becomes the Greenplum Database master, accepting client connections on the master port (which must be set to the same port number on the master host and the backup master host).
Since the master does not contain any user data, only the system catalog tables need to be synchronized between the primary and backup copies. These tables are not updated frequently, but when they are, changes are automatically copied over to the standby master so that it is always kept current with the primary.
Interconnect Redundancy
The interconnect refers to the inter-process communication between the segments, as well as the network infrastructure on which this communication relies. A highly available interconnect can be achieved by deploying dual Gigabit Ethernet switches on your network, and redundant Gigabit connections to the Greenplum Database host servers.
Parallel Data Loading
One challenge of large scale, multi-terabyte data warehouses is getting large amounts of data loaded within a given maintenance window. Greenplum supports fast, parallel data loading with its external tables feature. External tables can also be accessed in ‘single row error isolation’ mode, allowing administrators to filter out bad rows during a load operation into a separate error table, while still loading properly formatted rows. Administrators can control the acceptable error threshold for a load operation, giving them control over the quality and flow of data into the database.
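The single row error isolation idea can be sketched in plain Python. This is a conceptual model only: the two-column `id,amount` format, the `reject_limit` parameter, and the choice to abort once the number of bad rows reaches the limit are assumptions made for illustration, not the external table implementation.

```python
class RejectLimitReached(Exception):
    """Raised when the number of malformed rows reaches the allowed threshold."""


def load_rows(lines, reject_limit):
    """Parse 'id,amount' lines; divert malformed ones to a separate error list."""
    loaded, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        try:
            raw_id, raw_amount = line.split(",")
            loaded.append((int(raw_id), float(raw_amount)))
        except ValueError as exc:
            # Bad row: record it instead of failing the whole load.
            errors.append((line_number, line, str(exc)))
            if len(errors) >= reject_limit:
                raise RejectLimitReached(f"{len(errors)} bad rows") from exc
    return loaded, errors


good, bad = load_rows(["1,9.50", "oops", "3,12.00"], reject_limit=2)
print(good)
print([entry[:2] for entry in bad])
```

Well-formed rows are kept, the malformed row is captured with its line number for later inspection, and the threshold bounds how much bad data a load will tolerate.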
By using external tables in conjunction with Greenplum Database’s parallel file server (gpfdist), administrators can achieve maximum parallelism and load bandwidth from their Greenplum Database system.
Management and Monitoring
Management of a Greenplum Database system is performed using a series of command-line utilities, which are located in $GPHOME/bin. Greenplum provides utilities for the following Greenplum Database administration tasks:
Installing Greenplum Database on an Array
Initializing a Greenplum Database System
Starting and Stopping Greenplum Database
Adding or Removing a Host
Expanding the Array and Redistributing Tables among New Segments
Managing Recovery for Failed Segment Instances
Managing Failover and Recovery for a Failed Master Instance
Backing Up and Restoring a Database (in Parallel)
Loading Data in Parallel
System State Reporting
Greenplum also provides an optional system monitoring and management tool that administrators can install and enable with Greenplum Database. Greenplum Command Center uses data collection agents on each segment host to collect and store Greenplum system metrics in a dedicated database. The segment data collection agents send their data to the Greenplum master at regular intervals (typically every 15 seconds). Users can query the Command Center database to see query and system metrics. Greenplum Command Center also has a graphical web-based user interface for viewing these system metrics, which can be installed separately from Greenplum Database.