As our app implements different logic for generating data for markets, sectors, and extreme-performing stocks, we measured the speed-up for these three cases by dividing the sequential execution time by the parallel execution time (with 1 or 2 executors). The results give us interesting insights into the overheads in our program and possible optimization techniques. Let us explain those in the form of "Why?" and "How?" questions.
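Concretely, the figure reported for each case is just the ratio of wall-clock times. A minimal sketch follows; the helper and argument names are ours, introduced only for illustration.

    # Speed-up = sequential wall-clock time / parallel wall-clock time.
    # Illustrative helper only; the timing values come from the measured runs.
    def speedup(t_sequential: float, t_parallel: float) -> float:
        return t_sequential / t_parallel

    # Evaluated once per case (markets, sectors, extremes) and per executor count.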
Why does the execution take so long, especially for sectors and extremes?
Unlike the markets analysis, the analysis of sectors and extreme-performing stocks requires filtering all stocks individually in certain steps, which takes longer than simply merging all of them by a column at the beginning.
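As a rough illustration of the difference, here is a minimal PySpark sketch of the two access patterns. The column names ("Market", "Ticker", "Close"), the bucket path, and the variable names are assumptions made for the example, not our actual schema.

    # Hypothetical schema and paths, for illustration only.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("access-patterns").getOrCreate()
    stocks = spark.read.option("header", True).csv("gs://<bucket>/stocks/*.csv")

    # Markets: one grouped aggregation over the merged data, a single wide pass.
    per_market = stocks.groupBy("Market").agg(F.avg("Close").alias("avg_close"))

    # Sectors / extremes: certain steps filter each stock individually,
    # i.e. many small jobs instead of one aggregation.
    tickers = [r["Ticker"] for r in stocks.select("Ticker").distinct().collect()]
    per_stock = [stocks.filter(F.col("Ticker") == t) for t in tickers]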
Apart from that, our team allocated a cluster with only 2 cores per executor, which is certainly slower than a cluster with 4 or more executor cores. This was necessary to fit into the limited budget we had received for this project, but it was still enough to complete our study within the set deadlines.
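For context, the executor sizing described above corresponds to a configuration along these lines. This is a sketch, not our exact submission command, and the application name is made up; only the two config keys matter here.

    # Sketch of the 2-core executor setup described in the text.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("stock-analysis")                  # hypothetical name
             .config("spark.executor.instances", "2")    # 1 or 2 in our runs
             .config("spark.executor.cores", "2")        # the budget-limited setting
             .getOrCreate())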
Why is there nearly no speed-up for markets and sectors?
Writing CSV files to Google Cloud Storage creates a significant overhead. In the sectors analysis, the program writes tables with over 200 rows to the bucket 3 times. When analyzing all markets at once, it writes such tables 5 times (once for each of the 4 markets and once for the merged data). A quick look at our program's logs shows that, in the latter case, writing to GCS takes considerably more time than the analysis itself. However, the decision to write files for the individual markets was made on purpose: it avoids running the program 5 times with different arguments just to process the same data.
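To make the cost concrete, each write in the pipeline looks roughly like the helper below; the helper name, bucket path, and DataFrame names are placeholders. Every call triggers a full Spark job plus the object uploads to the bucket, which is where the overhead comes from.

    from pyspark.sql import DataFrame

    def write_csv(df: DataFrame, path: str) -> None:
        # One Spark job plus GCS object uploads per call; this is the cost
        # that dominates the all-markets run.
        df.write.mode("overwrite").option("header", True).csv(path)

    # In the all-markets case this happens five times, e.g.:
    # write_csv(merged_df, "gs://<bucket>/output/all_markets")
    # write_csv(market_df, "gs://<bucket>/output/<market>")  # once per market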
What optimizations could be made in the future if the requirements for execution speed and speed-up change?
Firstly, allocate a more powerful (and more expensive) cluster with at least 4 cores per executor, and then increase the number of executors until the desired speed is reached. Secondly, decrease the number of writes to Google Cloud Storage: write only the output of the final step for the current configuration, even though this might reduce usability.
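A minimal sketch of the second idea, assuming a PySpark pipeline: keep intermediate results cached in memory and write only the final table. The filter and aggregation below are stand-ins for the real analysis steps, and the column names are assumptions.

    from pyspark.sql import DataFrame

    def analyze_and_write_final(stocks: DataFrame, out_path: str) -> None:
        # Cache intermediates instead of writing each step to the bucket.
        intermediate = stocks.filter("Close IS NOT NULL").cache()
        final = intermediate.groupBy("Market").count()   # stand-in for the last step
        # One write at the very end replaces the 3-5 writes per run described above.
        final.write.mode("overwrite").option("header", True).csv(out_path)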
The graph displaying CPU usage over time shows a fluctuating pattern with peaks and troughs, indicating that the workload has alternating periods of high and low CPU demand. This typically suggests that there are computationally intensive operations followed by less intensive ones or perhaps idle time when the process might be waiting for input/output operations.
In the memory usage graph, there is a steady increase with occasional spikes. This increase implies that, as the process continues, it accumulates data in memory that might not be fully released. The spikes could be due to temporary activities that require more memory, such as reading large files. The graph does not show a decrease in memory usage; this may be normal for a process that accumulates data over time without releasing it, but it could also indicate a memory leak if the pattern continues and the memory is never freed.
The graphs show the CPU and memory usage over time during the execution of a process or task. The CPU usage graph is highly variable, with frequent spikes to high usage levels, indicating periods of intensive processing. These could correspond to computationally heavy tasks being executed. Between the spikes, the CPU usage drops, which could indicate waiting times or less intensive processing periods.
The memory usage graph starts with a sharp increase, indicating a significant allocation of memory resources early in the process. Following this, the memory usage levels off with some minor fluctuations but generally maintains an upward trend. This suggests that the process is accumulating memory over time, possibly due to data being loaded into memory and not fully released.
The absence of a decreasing trend in memory usage might imply that the application does not release memory back to the system, which could be typical for certain workloads or indicative of a memory leak if the memory is not expected to be retained. The fluctuations in memory usage might represent garbage collection events or other memory management activities where memory is temporarily freed and then reallocated.
These graphs show CPU and memory usage over time, each plotted as a separate line. The CPU graph is characterized by frequent spikes in usage, suggesting the process involves intermittent bursts of activity. This could be typical for applications that process data in batches or have varying computational loads. The periods of relatively low usage indicate less activity, which could be due to waiting for I/O operations or simply less demanding tasks.
The memory graph shows a staircase pattern with plateaus, which suggests that the process has phases where memory consumption is stable, interspersed with steps where it increases significantly. These steps may correspond to the process allocating more memory for new tasks or loading additional data into memory. Once allocated, the memory does not appear to be released, as indicated by the lack of downward trends.
Over time, the memory does not decrease, implying that the process may retain memory once it's allocated or that there is a gradual accumulation of data that's not being released. The overall upward trend in memory suggests that the process's memory footprint grows as it continues to run. If this pattern were to continue indefinitely, especially in a long-running process, it could potentially lead to memory exhaustion.
The final plateau in memory usage indicates a period of stability where the process's memory demand remains constant. This might be a state where the process is performing steady work without additional memory requirements, or it could be in a waiting state with no further memory allocations.
The trio of graphs presents a consistent story of a process that places intermittent yet significant demands on the CPU while gradually increasing its memory footprint over time. The CPU usage suggests a process that alternates between intensive computation and periods of less activity. The spikes could align with specific tasks that require heavy processing, such as data analysis or complex calculations. In contrast, the valleys might correspond to times when the process is either idle or performing less demanding tasks.
Memory usage across the snapshots shows a rising trend with different patterns. One exhibits a sharp initial increase, another a stepwise ascent, and the last a steady climb. This upward trend in memory usage suggests data is being accumulated and held in memory throughout the process's execution. The absence of any substantial downward adjustments implies that the memory, once allocated, is not being released. This could be by design if the process requires access to growing datasets, or it might signal inefficiency in memory handling, potentially leading to a memory leak.
Overall, while the system appears capable of handling the process's demands without immediate issues, the persistent increase in memory consumption merits attention to ensure the sustainability of the process in the long term. Ensuring that memory and CPU usage align with the expected behavior of the application is key. If the memory trend continues unchecked, it could eventually lead to issues with resource depletion, particularly in scenarios where the process is expected to run continuously. Therefore, it might be beneficial to delve deeper into the process's behavior to optimize resource usage, smooth out CPU spikes for better predictability, and address the continuous growth in memory usage.