What's New in Greenplum Database 4.2
1. Greenplum Database Extension Framework and Turnkey In-Database Analytics
Greenplum Database delivers an agile, extensible platform for in-database analytics, leveraging the system’s massively parallel architecture. With this release, Greenplum enables turnkey in-database analytics through Greenplum Extensions, which can be downloaded from EMC Subscribenet and installed using the new Greenplum Package Manager. This new Greenplum Database utility ensures automatic installation and updates of functional extensions like in-database Geospatial functions, PL/R, PL/Java, PL/Python, and PL/Perl.
Greenplum Extensions dramatically simplify the task of enabling and managing advanced in-database functionality across a cluster. For example, extensions are automatically deployed on new nodes when a Greenplum cluster is expanded.
2. High-Performance gNet for Hadoop
Greenplum Database enables high-performance parallel import and export of compressed and uncompressed data from Hadoop clusters using gNet for Hadoop, a parallel communications transport with the industry's first direct-query interoperability between Greenplum Database nodes and corresponding Hadoop nodes.
To further streamline resource consumption during load times, custom-format data (binary, Pig, Hive, etc.) in Hadoop can be converted to GPDB format via MapReduce, and then imported into Greenplum Database. This is a high-speed direct integration option that provides an efficient and flexible data exchange between Greenplum Database and Hadoop.
gNet for Hadoop is available for both Greenplum HD Community Edition and Enterprise Edition.
3. Language and Compatibility Enhancements for Faster Migrations to Greenplum
In this release, Greenplum Database offers enhanced SQL support, including native support for more than 20 Oracle functions, correlated subqueries, the non-recursive WITH clause, and a fixed-format loader. These enhancements streamline support for third-party tools that generate such queries and make migration from Oracle faster and simpler.
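To make the two query shapes named above concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in engine (it is not Greenplum Database), and the `sales` table and its columns are hypothetical names chosen for illustration.

```python
import sqlite3

# SQLite is only a convenient stand-in engine here; the table and
# column names are hypothetical and not taken from Greenplum.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 'ann', 100), ('east', 'bob', 300),
        ('west', 'cat', 200), ('west', 'dan', 50);
""")

# Non-recursive WITH clause: name a subquery once, then select from it.
with_rows = conn.execute("""
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total FROM region_totals ORDER BY region
""").fetchall()

# Correlated subquery: the inner query refers to the outer row (s.region).
top_reps = conn.execute("""
    SELECT s.region, s.rep
    FROM sales AS s
    WHERE s.amount = (
        SELECT MAX(s2.amount) FROM sales AS s2 WHERE s2.region = s.region
    )
    ORDER BY s.region
""").fetchall()

print(with_rows)
print(top_reps)
conn.close()
```

Tools that generate SQL often emit exactly these forms, which is why native support for them matters when moving an existing workload.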
Greenplum Database 4.2 also adds support for XML, enabling high-performance parallel loading of XML documents into the database, along with support for the XML data type and the XML Path Language (XPath).
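For readers unfamiliar with XPath, the following sketch shows what a path expression selects from a document. It uses Python's standard xml.etree.ElementTree module, which implements only a subset of XPath, and it does not represent Greenplum's own XML functions. The document content and element names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are invented.
doc = """
<orders>
  <order id="1"><item sku="A1">widget</item></order>
  <order id="2"><item sku="B2">gadget</item></order>
</orders>
""".strip()

root = ET.fromstring(doc)

# Select every <item> below the root whose sku attribute equals 'B2'.
matches = [el.text for el in root.findall(".//item[@sku='B2']")]
print(matches)
```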
4. Simpler, Scalable Backup with Data Domain Boost
Greenplum Database now includes advanced integration with EMC Data Domain deduplication storage systems through Data Domain Boost for faster, more efficient backups. This integration distributes parts of the deduplication process to Greenplum Database servers, enabling them to send only unique data to the Data Domain system. This dramatically increases aggregate throughput, reduces the amount of data transferred over the network, and eliminates the need for NFS mount management.
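The idea of sending only unique data can be illustrated with a toy sketch. This is not the Data Domain Boost protocol, which the text above does not describe. Fixed-size chunks, SHA-256 digests, and the function name `unique_chunks_to_send` are all assumptions made for illustration.

```python
import hashlib

def unique_chunks_to_send(data: bytes, known_digests: set, chunk_size: int = 4):
    """Return only the chunks whose digests the backup target has not seen.

    Assumption: fixed-size chunking and SHA-256 fingerprints; a real
    deduplication system may chunk and fingerprint very differently.
    """
    to_send = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in known_digests:
            # New content: remember its fingerprint and queue it for transfer.
            known_digests.add(digest)
            to_send.append(chunk)
    return to_send

known = set()  # fingerprints already stored on the (hypothetical) target
first = unique_chunks_to_send(b"AAAABBBBAAAACCCC", known)
second = unique_chunks_to_send(b"AAAABBBBDDDD", known)
print(first)
print(second)
```

In the second call only the previously unseen chunk is queued, which is the effect that reduces network traffic when the check runs on the database server side.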
Note: In Greenplum Database 4.2, Data Domain Boost is supported with Data Domain collection replication.
5. Targeted Performance Optimization
Greenplum Database 4.2 supports dynamic partition elimination and query memory optimization. Dynamic partition elimination disregards irrelevant partitions in a table, significantly reducing the amount of data scanned and resulting in faster query execution times. The query memory optimization feature intelligently frees and reallocates memory to different operators during query processing, allowing for better memory utilization, higher throughput, and higher concurrency.
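A toy model may help show why dynamic partition elimination reduces scanning. The sketch below is not Greenplum's planner or executor; it assumes a fact table partitioned by month and a small dimension table whose filter result is known only at run time. All names and data are hypothetical.

```python
# Hypothetical fact table partitioned by month: partition key -> rows.
partitions = {
    "2012-01": [("2012-01", 10), ("2012-01", 20)],
    "2012-02": [("2012-02", 5)],
    "2012-03": [("2012-03", 7), ("2012-03", 8)],
}

# Hypothetical dimension table rows: (month, is_promo).
months_dim = [
    ("2012-01", False),
    ("2012-02", True),
    ("2012-03", True),
]

def scan_with_dynamic_elimination(partitions, dim_rows):
    # Step 1: evaluate the dimension-side filter at run time to learn
    # which partition keys can possibly produce join matches.
    wanted = {month for month, is_promo in dim_rows if is_promo}
    # Step 2: scan only those partitions; the others are never read.
    scanned = [key for key in partitions if key in wanted]
    total = sum(amount for key in scanned for _, amount in partitions[key])
    return scanned, total

scanned, total = scan_with_dynamic_elimination(partitions, months_dim)
print(scanned, total)
```

The partition for the non-promotional month is skipped entirely, even though the set of relevant months could not have been determined before the query started running.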
Business Drivers
Enterprises in which the Greenplum Database can be confidently positioned often exhibit several explicit business drivers:
· The need to easily deploy and manage in-database analytics extensions
· The need for integration with Hadoop to process unstructured data (e.g., weblogs, Twitter feeds, AVR messages) for use in advanced analytics
· The need for a simple and scalable backup solution that leverages existing EMC investments
· The need for enterprise-level high availability, storage, and disaster recovery using existing EMC investments
· Issues loading data into the data warehouse during the batch processing window
· Analysts waiting too long for queries to finish, or queries timing out in the current data warehousing/business intelligence environment
· Proliferating data marts and shadow databases because the current data warehousing environment lacks loading and query performance
· The need for analyst collaboration, which is driving consolidation of multiple data marts and shadow databases
· Performance and scale issues, which are forcing re-engineering of the existing data warehouse environment