Kuka Robotics - Early Fault Finding + PLC - Journey
FFE Journey:
Downloaded and dumped the project information while still finalizing and testing the LLRP protocol, which had to go live on the production line and at suppliers within a month, with around 20-25 units.
Continued testing and bedding in LLRP for about the first month of this project before starting to focus full-time on FFE, initially mainly gaining access and dumping information.
The initial task was to finish the program separation and migration to the EDC, which had yet to be completed, as FFE was still partially bundled into an existing application on the same server.
I started off inheriting 5 code bases and 3 servers: 1 old server, plus partially ported new Prod and INT servers in the EDC running off a shared database that was not coping.
Let's just say the EDC box was a 2-processor machine with the antivirus eating 50% of one core all the time; not a lot of capacity to do much except try to keep the project afloat, as it was already falling over and everything was timing out. I validated that all the code provided had no missing parts; there was some fun with the C# projects for a little while, but we had it all eventually.
Finding the truth of the code: 2 weeks spent going through all of git and all of the code on the different servers, comparing the differences, formulating notes and then making some calls to pick a base, then re-stitching a new source of truth from git + INT (partial features not in source) + Prod, the old server, and some scheduled tasks from the old and new servers, from which future deployments could take place. Had some fun figuring out which scripts were actually running, as the old server had about 3-4 copies scattered all over the place too, all connecting to the same database.
Finished the separation of the shared product from the existing application onto the EDC INT and Prod servers and attempted to shut down the old dinosaur. Lots of fun with service accounts and figuring out all the hidden wiring, services and databases in use, to confirm everything had been migrated and the old server could be shut down, because it was not just 1 service account and not just 1 database in use; the split to the EDC had been left partial and incomplete.
Re-wrote the whole morning pre-production robot scan, because it neither worked nor completed, for various reasons: too many robots now and too many bugs. Used PowerShell and modified PHP to run 16-32 threads per plant, so 32-64 threads on a machine, to process all 11,000 robots in 15-20 minutes.
The task would check whether robots were online and in the ready-to-go state before the production line started up, and also check the backup status of the robot programs, confirming that all archives were up to date. Many months of bugs to fix here and plenty of just plain wrong code. I also fixed up the remaining broken code around the reference Stammdaten files, and everything else I could spot.
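As a rough, PHP-only sketch of the fan-out idea (the real setup orchestrated the PHP workers from PowerShell; robots.txt, scan_worker.php and the log file names here are assumptions, not the original code):

```php
<?php
// Hypothetical sketch: split the robot list into N chunks and run one PHP
// worker per chunk, so the full plant scan completes in parallel.

$robots  = file('robots.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$workers = 16;                                   // 16-32 per plant in the original setup
$chunks  = array_chunk($robots, (int)ceil(count($robots) / $workers));

$procs = [];
foreach ($chunks as $i => $chunk) {
    $list = tempnam(sys_get_temp_dir(), 'scan');
    file_put_contents($list, implode("\n", $chunk));

    // scan_worker.php (assumed) pings each robot, checks the ready-to-go state
    // and verifies the program archive/backup is up to date.
    $procs[$i] = proc_open(
        'php scan_worker.php ' . escapeshellarg($list),
        [1 => ['file', "scan_$i.log", 'w'], 2 => ['file', "scan_$i.err", 'w']],
        $pipes
    );
}

foreach ($procs as $proc) {
    proc_close($proc);                           // blocks until that worker has finished
}
```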
All reports seemed to be timing out and MQ was falling over; we were 3 months in already.
Set up the whole Azure DevOps and GitHub projects, separated the code into 4 repos, and wrote build scripts for deploying to multiple machines and multiple PHP versions.
Used remoting protocols and jumped through hoops to get onto the machines, working indirectly and at a distance through the security layers and abstracted build machines.
Figured out which versions of the SAP DLLs were being used and looked at compiling new DLLs from scratch for PHP 7/8. It took some time to identify the version of the SAP DLL in use and to locate the missing dependencies that were preventing the project from running on other servers; Depends (the dependency walker) was not working. Got the dev machine working with SAP. It seemed to come right despite all the locked-down security on the old Windows server, where I finally got a DLL error and identified the missing DLL for the product. I did find an attempt to dump a few hundred DLLs into a generic folder, possibly a previous attempt by others to solve this issue before me.
We could finally start to look at migrating from the EDC to the Next Generation Data Centre (NGDC); however, new challenges presented themselves. Mercedes-Benz had decided to outsource their IT to Infosys, who refused to build or take anything over in the NGDC set up and built by Mercedes. Instead they would rebuild and migrate all the applications to a Future Mode of Operation (FMO): in effect, new servers in the same data centre.
This presented new technical challenges, because they had not yet built out all the MQ infrastructure needed to support the FFE migration out of the EDC data centre, which was due to be bulldozed at the end of the year (later postponed by a year).
I then worked with my German colleague Ralf (Germany), who was overseeing the project migration from Germany, and 2 additional support colleagues, Rodrigues (UK) and Sandeep (India/Germany), to assist me with these challenges and drive our migration.
We started off by getting a plan together for how to work with, and compromise between, the new FMO, NGDC and IBM infrastructure. I created a list of tickets with multiple colleagues, round-robining them and following them up, until a few months later we had our queues; but our issues with this migration had only just started.
During this time I had set up all the servers and got all the other code and firewall clearances into all the plants and divisions, in place, tested and working for the final switch-over, and we were on track to migrate 10 months in, around June/July.
The timeline for the period up to the failed migration at the end of the year gets blurry in my memory for about 4 months, as there were so many things to chase up and get built out from various departments and infrastructure teams (Germany/India/MQ team, Infosys); it just took time. At any one time there were around 10-15 third-party tickets in play, on top of the development work I was doing, fixing and re-working, with the focus on sorting the product out for migration, any bugs that came in, and the feature improvements we needed to assist us. Hunting down managers, finding that divisions no longer had personnel in them to process a ticket or a firewall clearance, then identifying the new person who had taken over, or who could step in and assist temporarily until the position was filled. I continued to fix all sorts of clear issues with robot configurations, SAP pattern matching and setup in the current data, and investigated issues while starting to order all the new databases.
Attempted to tackle the performance: re-worked all reports with a hierarchy structure into parent-child form, where each child is a background fetch/AJAX request that is then inlined into a placeholder on the page, to reduce the load on the server, speed things up, and avoid re-running the whole parent report on every interaction.
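A minimal sketch of that parent/child pattern, assuming a hypothetical report_child.php endpoint and section ids; the parent page only renders placeholders, and each child report is fetched in the background and inlined:

```php
<?php
// Parent report: emit placeholders only, so no slow child query can block the page.
// Section ids and report_child.php are assumptions for illustration.
$sections = ['robots_offline', 'backup_age', 'mq_vs_tcp'];
?>
<?php foreach ($sections as $id): ?>
  <div class="child-report" id="<?= htmlspecialchars($id) ?>">Loading...</div>
<?php endforeach; ?>
<script>
// Each child is a background fetch; the HTML it returns is inlined into its placeholder.
document.querySelectorAll('.child-report').forEach(function (el) {
  fetch('report_child.php?section=' + encodeURIComponent(el.id))
    .then(function (r) { return r.text(); })
    .then(function (html) { el.innerHTML = html; });
});
</script>
```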
LAN Status page: additional fixes and many improvements around what is displayed to determine the state of the factory, including which robots are using MQ and which are still running on legacy TCP/UDP.
Started re-working the MQ POC message-processing client to use an algorithm that finds the turning point of a parabola to optimise the processing, instead of just fetching a fixed batch of x and writing it to the DB. It now processes in batches as fast as possible and finds the optimal batch size itself, with upper and lower limits on all settings in case the algorithm fails catastrophically; think PID control with PLCs.
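A hedged sketch of that batch-size search, assuming hypothetical fetch_batch() and write_to_db() helpers: throughput is treated as a curve over batch size, the step direction reverses once throughput drops, and hard clamps act as the safety net:

```php
<?php
// Hypothetical sketch only. Throughput is measured per loop and the batch size
// is nudged up or down, reversing direction when throughput drops
// (the "turning point" of the parabola), clamped to hard limits.

const MIN_BATCH = 50;
const MAX_BATCH = 5000;

$batch    = 200;      // starting batch size
$step     = 50;       // how far the batch size moves per iteration
$lastRate = 0.0;

while (true) {
    $t0   = microtime(true);
    $msgs = fetch_batch($batch);                       // assumed helper: pull up to $batch messages
    write_to_db($msgs);                                // assumed helper: bulk insert
    $rate = count($msgs) / max(microtime(true) - $t0, 1e-6);

    if ($rate < $lastRate) {
        $step = -$step;                                // passed the optimum: reverse direction
    }
    $batch    = max(MIN_BATCH, min(MAX_BATCH, $batch + $step));
    $lastRate = $rate;

    if (count($msgs) === 0) {
        usleep(200000);                                // queue empty: back off briefly
    }
}
```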
Implemented an MQ C# client hook to redirect and debug the traffic for a specific topic, specified in the browser, back to the client's browser.
This was because production was missing traffic and I had to prove it to the supporting team: messages missing in the Prod queue but visible in the INT queue. It took a good 4 weeks before that got fixed from their side; they could have just double-checked their sheets, but obviously FFE had to be the problem. Amazingly, they had messed up the message routing for production traffic, or turned it off because it was falling over. The hook is of course all protected and locked down, and browsing is strictly limited by the application to prevent excessive logging from crashing it; I had thought of that already. In future, if traffic goes missing, we have a simple way to debug the issue and show the MQ routing teams within 5 minutes that there is a problem with their routing.

March/April: re-architected the slow application, as all reports were basically timing out at 30 seconds, and re-wrote all reports as v2 running in parallel to the old ones. The speed-up was enormous: milliseconds for reports now, instead of 90% of them timing out. Staff could use the program and get value from it once again. I continued to re-write and rip queries apart all over the place, because they all did table scans and caused issues with the database.
Fixed all the robot backup-file locations and the once-off checks, as the code needed massive clean-ups and performance improvements; it was doing too many things multiple times, inefficiently.
Implemented support for the Log Details algorithm, knowing I would need data schema changes later and designing it so I wouldn't have to redo too much. On top of this there was limited storage capacity, and the data structures in the DB were not great, so we weren't able to simply add indexes to speed up reports and address the timeouts and table scans. I tried adding the required indexes on a smaller DB to get an idea of the overhead and indexing time, but the database size grew by nearly 50%; that wouldn't be sustainable, as we would have needed something like 10 different compound indexes to address all the issues.
The final database after migration had grown to 300 GB; with all the indexes required it would have needed to be about 5x that size, around 1.5 TB.
At the time we had a small shared database in the EDC, where we were already consuming most of the resources of the roughly 15 projects on that DB, using about 30-40 GB.
I therefore devised and built a custom indexing algorithm, not using any plugins or built-in database functionality, which could then assist me in retrieving the incremental index information for Log Details and many other things in a sustainable, cost-effective way. I continued to optimise and re-write all the other queries and removed those causing table scans and locking on the system.
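The source does not spell the algorithm out, but the general shape is something like the following heavily hedged sketch, with assumed table and column names (msg_index, messages): ingest writes a narrow lookup row alongside each message, and Log Details resolves ids through that narrow table instead of scanning, or adding wide compound indexes to, the large message table:

```php
<?php
// Hypothetical sketch of an application-side index maintained incrementally at ingest.

function index_message(PDO $db, array $msg): void {
    // Narrow lookup row: key columns + message id only.
    $db->prepare(
        'INSERT INTO msg_index (robot_id, msg_type, created_at, msg_id)
         VALUES (:robot, :type, :ts, :id)'
    )->execute([
        ':robot' => $msg['robot_id'],
        ':type'  => $msg['type'],
        ':ts'    => $msg['created_at'],
        ':id'    => $msg['id'],
    ]);
}

// Report side: resolve candidate ids through the narrow table first,
// then fetch only those rows from the wide message table.
function log_details(PDO $db, string $robot, string $type): array {
    $stmt = $db->prepare(
        'SELECT m.* FROM messages m
         JOIN msg_index i ON i.msg_id = m.id
         WHERE i.robot_id = :robot AND i.msg_type = :type'
    );
    $stmt->execute([':robot' => $robot, ':type' => $type]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```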
Rebuilt SAP NetWeaver and the required DLLs for PHP 7/8 from source, to be able to keep the SAP functionality, as no SAP API endpoints had been developed yet; those would only be realised months later.
Implemented support for decoding further information from MQ and TCP (PLC, robot head, etc.) and other information relevant for debugging.
Implemented a parallel data structure that would be re-worked into V3 upon migration to the Next Generation Data Centre, with all data re-projected. Implemented all the additional pages to support PLCs, robot heads and their relationships, and enhanced the Log Details pages and all other reporting pages.
Then started implementing blacklist filters to reduce the amount of data ingested from MQ down to the more specific messages we wanted, as some of the bad ones were not worth saving.
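A minimal sketch of the ingest-side blacklist, assuming a rule shape of field => regex (the real filter fields and rules are not shown in the source): if every field of any one rule matches, the message is dropped before it reaches the database:

```php
<?php
// Hypothetical sketch: drop messages matching any active blacklist rule.

function is_blacklisted(array $msg, array $rules): bool {
    foreach ($rules as $rule) {
        $hit = true;
        foreach ($rule as $field => $pattern) {
            if (!isset($msg[$field]) || !preg_match($pattern, (string)$msg[$field])) {
                $hit = false;
                break;
            }
        }
        if ($hit) {
            return true;                      // every field of one rule matched: drop it
        }
    }
    return false;
}

// Example rules only, not the production filter set.
$rules = [
    ['msg_type' => '/^HEARTBEAT$/'],
    ['topic'    => '/test/i', 'plant' => '/^050$/'],
];
```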
Re-worked all of the SAP integration, because nothing actually worked. We started from scratch and implemented new fields to assist, and got it all working perfectly for the whole of Sindelfingen between Joost Kroen on their side and myself, working off the discrepancies we still found over the next few months.
Migrated to the Next Generation Data Centre, after all the servers and scheduled tasks were set up, along with the databases, firewall rules, etc. This meant many tickets and much ticket chasing for MQ, because 3 months into the project the company had outsourced IT to Infosys, who would build a new data centre; I had delays because the MQ infrastructure all still had to be built, but had to be built in the future data centre, with MQ back-haul capacity into the NGDC, to complicate matters.
The migration failed because of latency issues: round-trip times went from 1 ms to 30-70 ms for API calls on IBM MQ, on the latest hardware, 10-13 years newer.
Got a ton of logging in place, then upgraded the disks to fibre disks, which improved latency, then started re-working all of the IBM MQ code to get off the offending function call that was taking 70 ms to return and slowing things down. That was still not enough, as the minimum latency was around 15 ms. We benchmarked the systems and continued to look for more issues before attempting another migration.
Benchmarked the database by pushing crazy traffic at it from memory, and we discovered the database would hang for random periods of up to 3 minutes with no writes, once or twice in every 30-minute window.
We pulled in personnel from the DB, networking and VM teams over the course of months to figure this problem out. The network team repeatedly couldn't confirm any problems on their side.
Thanks to Cornel from Microsoft and a colleague of his for all the assistance, hunting down locks and dealing with many technical database configuration issues to identify the source of the problem (write-ahead log analysis, the Tiger toolbox tools, etc.).
Eventually we resorted to convincing IT to set the benchmark program up on the DB server itself and run it there, to eliminate network issues, as we could find nothing.
We then concluded it had to be the network, as there were no issues running locally after multiple attempts.
A year later the data centre networking team came back to us, and Ralf informed me that another team had run into the same issue and had found the offending switch, which had capacity problems. We presented the situation to Ralf and Sandeep; another big factor was that half the MQ infrastructure was split between 2 regions due to the IT outsourcing being mid-flight.
A call was then made by Ralf's superior, who was overseeing all migrations, after we presented the situation: we would move directly to the future data centre. We figured it should take about a month. I then started again from scratch with ordering the servers and getting the firewall rules in place; we figured this could be done in around 2 weeks, as we had everything at our fingertips and ready already. It was mid-October, so we could hopefully get set up before the 16 December IT shutdown that year.
Sadly this was not to be. New firewall processes and security had been put in place, with many hoops and processes that had yet to be defined, which I then had to get them to pioneer into place before the new future data centre teams at Infosys would process any firewall rules, with sign-offs.
On top of that, a few of the people in departments I had to hunt down had resigned and left the company, and it took some weeks to escalate and track down the responsible person from S.A. and get a new person to take on the approvals. Eventually some rhythm started to form and we were getting somewhere, but by this time it was into December/January, when most of the people who could approve things had taken leave. We were urgently trying to squeeze things in before the 16 December freeze date so they would be approved first thing in January: one big round-robin effort. It became apparent that things would only realistically materialise for a potential migration by the start or middle of January.
Meanwhile everything else continued: I re-worked more of the product and kept re-writing the whole MQ layer and working with IBM, trying to do things they said nobody had thought of doing before, to overcome the latency problem and be able to process enough messages, as I was assured latency would improve in the next data centre. This took a lot of testing and checking of different approaches to find the most optimal way to pull messages off using the IBM API as efficiently as possible.
I also attempted to do it in such a way that I could support other teams, and that it could form the base for their projects, which they were also having issues with; think reusable library, where they just put the parts they need into a harness. I also started to design and code a chainable 3-way transactional pipeline, for which I built 7 different test configurations to determine the best way to do that with IBM MQ.
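A hedged sketch of the chaining idea only, not the IBM MQ API: each stage does its work uncommitted, and only once every stage has succeeded are they all committed; otherwise everything already done is rolled back in reverse order:

```php
<?php
// Hypothetical sketch of a chainable transactional pipeline harness.
// Stage names and the interface are assumptions.

interface Stage {
    public function process(array $batch): void; // do the work, keep it uncommitted
    public function commit(): void;
    public function rollback(): void;
}

function run_pipeline(array $stages, array $batch): bool {
    $done = [];
    try {
        foreach ($stages as $stage) {
            $stage->process($batch);
            $done[] = $stage;
        }
        foreach ($done as $stage) {
            $stage->commit();                 // e.g. MQ get under syncpoint, DB insert, MQ put
        }
        return true;
    } catch (Throwable $e) {
        foreach (array_reverse($done) as $stage) {
            $stage->rollback();               // undo in reverse order so nothing is half-applied
        }
        return false;
    }
}
```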
I made use of our already-set-up NGDC-INT and NGDC-Prod boxes for both testing and development of MQ and TCP/UDP while building out the FMO, because those servers were already there and configured, and we were running into December/January already.
I deployed all the MQ enhancements to our old EDC data centre as well as to the NGDC, testing things live for latency issues. This allowed us to push processing on the old servers, at 1 ms latency, to new levels of 6,000-8,000 messages, where the database was the bottleneck.
Whereas in the NGDC and FMO MQ, the IBM side was the bottleneck.
During this time additional challenges presented themselves: I found myself having to sit where our internet pipe came in, because I had connectivity issues with the remote sessions, which would prevent me from testing the MQ infrastructure, or mid-test I could lose manual remote control.
Eventually everything was set up, after having to re-work more code and configuration for further security enhancements, which let us push the security bar even higher for everyone on the next data centre move. We migrated, and I ran out of log disk space during the re-projection of the database into V3.
Finally, after some urgent support from the Infosys and VM disk teams, I could continue with the migration, which went off without a hitch this time.
After the migration all was good. I continued with further upgrades to the system to support what V3 needed, plus additional fixes to the TCP/UDP protocols with the enhanced messaging support module.
Continued to enhance the program, adding support for further blacklist filter dimensions and fields and enhancing the filtering algorithm, as there was now so much traffic that the hit policy was not cutting it anymore. The more rules that were added, the slower things were getting, so the filtering itself had to keep being improved.
My supervisor from Germany then retired in April, and he wanted me to take over from him.
I continued to fix all remaining issues with FFE and make further enhancements.
Revamped all the UDP and TCP protocols to support the new versions with the module, and fixed all the issues around validation, connection pooling, and robots not closing connections, which resulted in ports timing out, as the robots would try to establish new connections each time for batches. The protocol upgrades were backwards compatible, as the state of the plant was not known and couldn't be assumed; instead we detect a lot of things that are then made visible in LAN Status, to spot misconfigured robots.
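A rough sketch of the idle-connection handling on the listener side, with an assumed port and timeout; robots that keep opening new connections per batch without closing old ones are dropped after an idle period so the socket pool is not exhausted:

```php
<?php
// Hypothetical sketch of a TCP listener that drops idle robot connections.

$server   = stream_socket_server('tcp://0.0.0.0:7000', $errno, $errstr);
$clients  = [];   // resource id => socket
$lastSeen = [];   // resource id => last-activity timestamp
$idleMax  = 60;   // seconds before an idle connection is closed (assumption)

while (true) {
    $read   = array_values($clients);
    $read[] = $server;
    $write  = $except = null;

    if (stream_select($read, $write, $except, 1) > 0) {
        foreach ($read as $sock) {
            if ($sock === $server) {                       // new robot connection
                $c = stream_socket_accept($server);
                $clients[(int)$c]  = $c;
                $lastSeen[(int)$c] = time();
            } else {
                $data = fread($sock, 8192);
                if ($data === '' || $data === false) {     // peer closed the connection
                    fclose($sock);
                    unset($clients[(int)$sock], $lastSeen[(int)$sock]);
                } else {
                    $lastSeen[(int)$sock] = time();        // handle the message here
                }
            }
        }
    }

    foreach ($clients as $id => $sock) {                   // drop connections left hanging
        if (time() - $lastSeen[$id] > $idleMax) {
            fclose($sock);
            unset($clients[$id], $lastSeen[$id]);
        }
    }
}
```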
Security hardening from Infosys; some issues remain, as I have yet to get to look at everything.
Re-worked priority handling.
Re-worked the filters to support a second filter.
Implemented duration reporting for how long events stay raised, for analysis. A lot more complicated to do than one would think; it is not just simple summing of times.
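A small sketch of why this is more than summing: overlapping raise/clear intervals for the same event have to be merged before their lengths are added, otherwise overlapping raises are counted twice (the timestamp/interval shape here is an assumption):

```php
<?php
// Hypothetical sketch: total raised duration from possibly overlapping intervals.

function total_duration(array $intervals): int {
    // $intervals: list of [startTimestamp, endTimestamp]
    usort($intervals, fn(array $a, array $b) => $a[0] <=> $b[0]);

    $total = 0;
    $curStart = $curEnd = null;
    foreach ($intervals as [$start, $end]) {
        if ($curEnd !== null && $start <= $curEnd) {
            $curEnd = max($curEnd, $end);          // overlaps: extend the merged interval
        } else {
            if ($curEnd !== null) {
                $total += $curEnd - $curStart;     // close the previous merged interval
            }
            [$curStart, $curEnd] = [$start, $end];
        }
    }
    if ($curEnd !== null) {
        $total += $curEnd - $curStart;
    }
    return $total;                                 // seconds, if inputs are Unix timestamps
}
```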
Implemented a new, working, revamped notification system that doesn't just run off the message text but, for example, uses the context of a module.
Chronological browsing: debugging and finding the messages leading up to crashes.
Re-worked the whole application to avoid deadlocks on the PHP session files.
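The underlying issue is a real PHP behaviour: the default file session handler holds an exclusive lock on the session file for the whole request, so parallel AJAX child-report requests from the same user stall behind each other. A minimal sketch of the kind of fix, releasing the lock as soon as the session has been read:

```php
<?php
// Read what the report needs from the session, then release the session file
// lock immediately so other requests from the same user can run in parallel.

session_start();
$user = $_SESSION['user'] ?? null;   // example of session data a report might need
session_write_close();               // releases the session file lock

// ... long-running report work continues without holding the session lock ...
```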
Deployed to all the other plants around the globe to see how efficient we were and how much traffic we could process, as the only cost was the paperwork for clearance for the USA, and to push and challenge KUKA IoT and be the competitor.
An additional Black/White/Blue Belt list of items to address in the program, following the courses.
Generic reference and Stammdaten file formats that any plant could download, fill in, upload, and then use the product for their plant.
The final database size for message storage in the new production system was about 450 GB. I recommended they double it, along with the retention period, as we approached having a full clean year of data captured in the latest data centre; that would be 2 years at about 1 TB of data.