What are the Common Causes of Performance Issues in Greenplum Database?
Post date: Sep 21, 2012 3:2:35 PM
The following issues are the most common cause of performance problems in Greenplum Database:
Hardware failures and invalid segments
Multiple workloads competing for system resources
Contention between concurrent transactions
Inaccurate database statistics
Unbalanced data distribution across the segments
Unoptimized database design
Hardware failures and invalid segments
The performance of Greenplum Database is dependant upon the hardware and IT infrastructure on which it is running. Greenplum Database is comprised of several servers (or hosts) acting together as one cohesive system. Greenplum Database’s performance will be as fast as the slowest host in the array. Problems with CPU utilization, memory management, I/O processing, or network load will affect performance. Common hardware-related issues are:
Disk Failure – Although a single disk failure should not dramatically effect database performance if you are using RAID, there is some impact caused by disk resynching consuming resources on the host with failed disks. The gpcheckperf utility can help identify segment hosts that have disk I/O issues.
Host Failure – When a host is offline the segments on that host are out of operation. This means that other hosts in the array are doing double duty as they are running both the primary segments and a number of mirrors. If mirrors are not enabled – service is interrupted. There is also a temporary interruption of service to recover failed segments. The gpstate utility can help identify failed segments.
Network Failure – Failure of a network interface card, a switch, or DNS server can bring down segments. If host names or IP addresses cannot be resolved within your Greenplum array, these manifest themselves as interconnect errors in Greenplum Database. The gpcheckperf utility can help identify segment hosts that have network issues.
Disk Capacity – Disk capacity on your segment hosts should never exceed 70 percent full. Greenplum needs some free space for runtime processing. You can reclaim disk space occupied by deleted rows by running VACUUM after loads or updates. The gp_toolkit administrative schema has a number of views for checking the size of distributed database objects.
Multiple workloads
A database system has a limited capacity of CPU, memory, and disk I/O resources. When multiple workloads compete for access to these resources, database performance suffers. Workload management can be used to maximize system throughput while still meeting varied business requirements. Greenplum Database workload management limits the number of active queries in the system at any given time in order to avoid exhausting system resources. This is accomplished by creating role-based resource queues. A resource queue has attributes that limit the size and/or total number of queries that can be executed by the users (or roles) in that queue. By assigning all of your database roles to the appropriate resource queue, administrators can control concurrent user queries and prevent the system from being overloaded.
As the Greenplum Database administrator, run maintenance workloads such as data loads and VACUUM ANALYZE operations after regular business hours. Do not compete with database users for system resources by performing administrative tasks at peak usage times.
Contention
Contention arises when two or more users or workloads try to use the system in a conflicting way. For example, if two transactions are trying to update the same table at once. A transaction seeking either a table-level or row-level lock will wait indefinitely for conflicting locks to be released. This means it is a bad idea for applications to hold transactions open for long periods of time (e.g., while waiting for user input).
Database statistics
Greenplum Database uses a cost-based query planner that relies on database statistics. Accurate statistics allow the query planner to better estimate the number of rows retrieved by a query in order to choose the most efficient query plan. Without database statistics, the query planner can not estimate how many records might be returned, and therefore cannot assume it has sufficient memory to perform certain operations such as aggregations. In this case, the planner always takes the safe route and does aggregations by reading/writing from disk, which is significantly slower than doing them in memory. The ANALYZE command collects statistics about the database needed by the query planner.
Identifying Statistics Problems in Query Plans
When looking at the query plan for a query using EXPLAIN or EXPLAIN ANALYZE, it helps to know your data in order to identify possible statistics problems. Check the plan for the following indicators of inaccurate statistics:
1. Are the planner’s estimates close to reality? Run an EXPLAIN ANALYZE and see if the number of rows estimated by the planner is close to the number of rows actually returned by the query operation.
2. Are selective predicates applied early in the plan? The most selective filters should be applied early in the plan so that less rows move up the plan tree.
3. Is the planner choosing the best join order? When you have a query that joins multiple tables, make sure that the planner is choosing the most selective join order. Joins that eliminate the largest number of rows should be done earlier in the plan so that less rows move up the plan tree.
Tuning Statistics Collection
The following configuration parameters control the amount of data sampled for statistics collection:
1. default_statistics_target
2. gp_analyze_relative_error
These parameters control statistics sampling at the system level. It is probably better to only sample increased statistics for the columns used most frequently in query predicates. You can adjust statistics for a particular column using the
ALTER TABLE...SET STATISTICS command. For example:
ALTER TABLE sales ALTER COLUMN region SET STATISTICS 50;
This is equivalent to increasing default_statistics_target for a particular column. Subsequent ANALYZE operations will then gather more statistics data for that column, and hopefully produce better query plans as a result.
Data distribution
When you create a table in Greenplum Database, it is important to declare a distribution key that allows for even data distribution across all segments in the system. Because the segments work on a query in parallel, Greenplum Database will always be as fast as the slowest segment. If the data is unbalanced, the segments that have more data will return their results slower.
Database design
Many performance issues can be improved by database design. Examine your database design and ask yourself the following:
Does the schema reflect the way the data is accessed?
Can larger tables be broken down into partitions?
Are you using the smallest data type possible to store column values?
Are columns used to join tables of the same datatype?
Are your indexes being used?