Mayo Clinic, Rochester, MN, USA, 2019 to Present
Senior Analyst and Programmer on GCP/MCCP AIF (ML/AI)
· Selected to join the MCCP (Mayo Clinic Cloud Platform) Data Environment Services team of AI Factory Services (2019/9)
· Under the direction of the Mayo Enterprise Architect and the AIFS Section Head, explored the native capabilities of GCP (Google Cloud Platform) for ML/AI/data science: data storage (Cloud Storage, BigQuery, Bigtable, Cloud Filestore, Cloud Spanner, Cloud SQL – MySQL/PostgreSQL, Cloud Memorystore, Firebase RTDB, Firestore and Datastore – Firestore in Datastore Mode), Cloud App/Compute/Kubernetes Engine, Cloud AI Platform (ML Engine: data labeling services, model training services, prediction services – online or batch predictions, on-demand notebooks – Jupyter Python 2/3 or R notebooks), and custom-trained ML/AI model transformation and deployment on GCP AI Platform and its prediction/inference services (Note: covering the AI/ML/DS processes across discovery, translation and production). Built working Docker container images of Jupyter Python 3 and R notebooks configured to work with GCP data stores (Cloud Storage, BigQuery, Bigtable, Firestore, Datastore, Firebase RTDB, etc.), GCP AI Platform, and Mayo RDBMS and Big Data systems (e.g., DB2, SQL Server, PostgreSQL, HDFS, Hive, etc.)
· Successfully completed 12 Coursera (Google) training courses for data scientists in 1.5 months, earning a certificate and a grade of 100% in each course. These included two consecutive DS Specializations: (i) Machine Learning with TensorFlow on Google Cloud Platform (Note: aligned with the Mayo Clinic ML/DL/DS Discovery phase); and (ii) Advanced Machine Learning with TensorFlow on Google Cloud Platform (Note: aligned with the Mayo Clinic ML/DL/DS Translation and Production phases)
· Participated in Mayo DSP (Data Science Program) training: Building Predictive Models in Health Care, and Translation of the Predictive Model to Production
· Developed extensive GCP-compatible Python and R code for both generic ML/AI and deep learning, covering data analytics (data preparation, transformation, visualization, etc.), model training, hyperparameter tuning, model optimization, model exporting and deployment, batch or online inference/prediction using trained/saved models, and container construction for ML/AI
· Assigned to the MCCP Core Build (IaC) team to speed up its Phase 1 go-live, collaborating with Google GCP PSO engineers to develop and test Terraform modules that dynamically provision on-demand GCP resources (GCS, BQ, CPU/GPU/TPU, GCE, GKE, Dataflow, Pub/Sub, AI VMs and Notebooks, Containers, …) for Mayo data scientists, researchers and physicians doing ML/DL on MCCP
· Developed automated testing for MCCP AIF services: DLVM automated SQA testing, AutoML-Tables automated functional testing, AutoML-Vision automated functional testing, and large-container building, scanning and model training automated functional testing, in support of Mayo AIF platform provisioning and customer/production support. Most of the code was written in Python, with some in R as required by the testing design (especially SQA testing), in addition to the base languages of Bash and Markdown. The DLVM SQA automated testing included CNN (convolutional neural network) model training on GPU and TPU as well as CPU, with the matching TensorFlow version, giving customers confidence in Mayo AIF (DES) when all DLVMs provisioned using TFE pass the tests
· Helped the AIF team identify and solve issues, small and large, in the development of AIF platforms (DLVMs and their Notebooks), and contributed to decisions on adopting appropriate technologies (e.g., a separate single-region GCS bucket for AutoML and a multi-region GCS bucket for other AIF services; GCA instead of Twistlock for container vulnerability scanning in the Phase 1 release), improving customer-oriented KBs (knowledge base articles), shaping production/customer support policies, etc.
Mayo Clinic, Rochester, MN, USA, 2014 to 2019
Senior Analyst and Programmer on Big Data + ML/AI
· Promoted to Senior Analyst and Programmer to design, implement, test and deploy architectures for a variety of Big Data solution projects, drawing on solid expertise in lifecycle management from conception to completion. Major technologies adopted: Oracle, DB2, SQL Server, Sybase, PostgreSQL, MySQL, IBM MQ/ESB, IBM WebSphere Application Server/Apache Tomcat Server, Unix, Linux, Windows, ElasticSearch/Kibana, Storm, Flume, HDFS, HBase, WebHDFS, WebHBase, Hive, HCatalog (Templeton), WebHCat, MapReduce, YARN, Zookeeper, Sqoop, Kafka, Solr, Oozie, Phoenix, Spark/Livy (v1/v2), Zeppelin, Ambari, Kerberos, Ranger, Ranger KMS, Knox, F5 Balancer, Traefik, Active Directory, LDAP, Pig, Hue, Nagios, Ganglia, Ambari Metrics, Falcon, DataFlow (NiFi)/HDF/CDF, Talend, DataStage, SmartSense, Grafana, Atlas, Eclipse, Jupyter, R/RStudio, CDSW, RapidSQL, Tableau, AppDynamics, IBM Streams (Studio)/DSI/DSR, SVN, GitHub, Docker, Kubernetes, and TFS/ServiceNow/TAYS/HwxSP ticketing
· Programming languages used daily or frequently for routine tasks: Java, Unix/Linux shell scripting, Python, Scala (Spark), SPL, R, and Markdown (Hugo/Zeppelin %md)
· Contributed code and documentation to Apache Knox, HBase and other projects that are critical components of the Big Data ecosystem
· Developed and deployed Ambari Alerts (written in Python and JSON), including vital Big Data functional testing alerts on HDFS/WebHDFS, HBase/WebHBase, Hive and ElasticSearch via the Knox Gateway service, a Node Usage alert, an HDF (NiFi) cluster node disconnection alert, etc., which run on all Mayo Clinic HDP (CDP) and HDF (CDF) clusters, including Normal and DR (Disaster Recovery) clusters
· Developed and deployed critical Ambari mpacks for easily managing Hadoop ecosystem components or their functionality on Mayo Clinic Big Data clusters (e.g., WebHBase mpack, BDTS Dashboard mpack, Kafka Topics Inter-cluster Mirroring mpack), which run as Ambari services on Mayo Clinic Normal (3) and DR (Disaster Recovery, 2) Hadoop clusters
· Developed best-practice example code and documentation for WebHDFS, WebHBase, Hive and Zeppelin-Livy (Note: Livy is the REST API for Spark; v1/v2) working with HDFS, Hive, HBase, RDBMS (PostgreSQL, MS SQL Server, DB2, etc.), ElasticSearch and Kafka, used for onboarding and training new Big Data users/clients
· Developed Machine Learning (ML)/Deep Learning (DL) example code for 10 use cases covering all types and all major categories of ML/DL model algorithms in the following 5 language and library combinations: Spark (Scala), PySpark (Python v2/v3), pure Python (Python v2/v3; scikit-learn, TensorFlow, Keras, Matplotlib, pandas, NumPy, etc.), sparklyr (R), and pure R (R; tensorflow, keras, tidyr, ggplot2, neuralnet, etc.). Explored the ML capabilities of Zeppelin, CDSW, ElasticSearch, R/RStudio, etc.
· Architected and implemented critical Big Data solutions to two issues: (i) the large number of small HDFS files generated by >60 MayoTopology instances running continuously on 3 Mayo Clinic (MC) Hadoop clusters, reducing NameNode heap stress and maintaining cluster health until the retirement of the Storm topologies in 2018; and (ii) data corruption caused by NameNode or JournalNode failures on 5 MC Hadoop clusters, ensuring data integrity on MC Big Data platforms
· Initiated and programmed both unsecured and secured versions of the Automated Enterprise-Readiness Hadoop Certification Testing (A-ERHCT) Suite (in Java). Reduced cost by ~70% and time by ~90% for new installation, expansion, conversion, upgrading, OS hardening, enterprise-security enabling, SSL & upgrading, and configuration changes on 6 MC Hadoop clusters by running the A-ERHCT suite to identify and quickly resolve critical issues (>10 P1 and >50 P2 and P3 TAYS issues) and to maintain the full functionality of all >25 usable components of the Hadoop stacks during the first 4 years of the Mayo Big Data program
· Decreased the outage and unavailability time of all 7 past Hadoop clusters by >90% by identifying and resolving >1200 critical Hadoop ecosystem issues; standardizing Hadoop administration methods/protocols and ticketing; training >7 Hadoop administrators in root-cause identification and fixing and in keytab and work-account password updating to keep the Hadoop clusters healthy and highly available; and establishing Ranger policies for HDF, Hive, HBase, Storm, Kafka, Knox, YARN and Solr for bdts, bdsupportedadmin and other specific AD groups or single users
· Improved the usability of Hadoop stack components and expanded their range of enterprise use cases by learning them and pioneering reusable sample code (commands, scripts, and code written in Java, Python, Scala, XML or .NET) and SOPs shared with members of Big Data client teams
· Re-architected the installation and configuration of Storm, Kafka and Spark (v1 and v2) to improve the storage capacity, data processing performance, and the scope and capabilities of data analytics and generic machine learning (Spark + PySpark) available to Mayo Clinic users and NiFi applications on MC Hadoop clusters
· Ensured the 100% success of the MayoTopology (Storm) architecture development and deployment on MC Hadoop clusters by collaborating with other programmers across Mayo IT on its design, coding and testing; by programming and deploying MC ESB queue monitoring and draining software; and by creating and performing end-to-end Stress and Trickle testing using test data generated after a comparative analysis of the differences between the Amalga (SQL Server) database and the EDT MCLSS (DB2) database
· Archived ~13 TB of Amalga historical HL7-message data by ETL and ELT from RDBMS into MC Hadoop clusters using Sqoop, shell scripting and Java programs
· Migrated ~20 TB of HDFS, HBase and Hive data from the Version #1 production Hadoop cluster to the Version #2 production Hadoop cluster by designing and implementing custom migration algorithms; cp, distcp, Falcon/Oozie and snapshots could not be used between the 2 clusters because of their different HDP versions and KDCs (local vs. enterprise) and because MC security did not allow cross-realm trust
· Provided production, on-call and customer support for the HDP/HDF/ElasticSearch/IBM Streams platforms for up to 5 years, maintaining good customer relationships
· Collaborated with Teradata, Hortonworks and Cloudera engineers to resolve complex data and data ingestion platform issues, e.g., the HDF-Kafka-SSL two-way authentication and authorization issue, the Oozie-Sqoop-Hive issue, the Zeppelin-Spark2-Livy2-HBase issue, and the ongoing Oozie-Spark2 ClassNotFound issue; collaborated extensively with Mayo Clinic employees in IT and non-IT departments to solve their critical data access, data processing and analytics issues
Additional Experience & Skills
· Showcased the MC Big Data program by leading a team of 7 authors to publish the work in the high-impact IEEE CS professional journal IEEE TII, serving as first and corresponding author; presented the Charter #1 work and the enterprise security work of the Mayo Clinic Big Data platforms at Hadoop Summit 2015 San Jose and DataWorks Summit 2017 San Jose, respectively, as a speaker selected by the organizing committee of each international conference
· Reviewed manuscripts for professional computer science journals and conferences such as IEEE Transactions on Industrial Informatics, Recent Patents on Computer Science, and the 2018 International Workshop on Computer Science and Technology (2018IWCST)
· Improved leadership, management, communication, collaboration and presentation skills by continuously attending online training classes and studying online materials, and applied what was learned in daily work
· Successfully led the CDSW Sandbox POC project: managing, coordinating, training and helping interested users at Mayo Clinic evaluate CDSW as a secure enterprise ML/data science platform
· Continuously improved ML/AI/data science knowledge and skills by actively attending (as a member) and presenting at the weekly meetings of the Mayo Clinic Machine Learning-Deep Learning Journal Club; self-motivated to attend ML/AI/data science webinars hosted by Cloudera, Amazon, Google, Elastic, etc., and to study key online ML/AI/data science tutorials and related public code and datasets on GitHub
Mayo Clinic, Rochester, MN, USA, 2012 to 2014
Analyst and Programmer on DataStage EDW + Big Data Platform Construction
• Selected from company-wide IT professionals as one of the nine founding members of Mayo Clinic's first (pioneering) Big Data project, based on the criteria of outstanding performance history, excellent productivity record, strong programming capability, remarkable technical skills, and strong teamwork skills
• Designed and constructed the first Hadoop cluster (BDDev) at Mayo Clinic by installing, configuring, tuning and troubleshooting CentOS 6.5 inside the Xen hypervisor, Hortonworks Hadoop Stack HDP 1.3.2 (15 components: HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Flume, HCatalog, WebHCat, Zookeeper, Ganglia, Nagios, Oozie, Hue, and Ambari), ElasticSearch 1.0.0, and Storm 0.9.0.1 for development use by the initial (Charter #1) Big Data project
• Initiated and programmed 3 software tools (the Fully Automated DataStage Job (Application) Unit Testing program, the S2T Guide Error Finding & Correction program, and the DataStage SQL Query Plus program) using Java, shell scripting, Visual Basic and SQL, reducing the time for development, defect correction and QA testing of DataStage ETL jobs by >90%
• Achieved the highest EDW developer productivity by developing (designing, implementing, and QC testing) > 30 new DataStage jobs (Staging, ADS and Mart) and fixing defects for >130 existing DataStage jobs (ADS and Mart jobs) in less than one year for SCA and P3 projects
• Authored the CS paper on DataStage testing - "Development of Fully Automated DataStage Application Unit Testing Software Tool" - accepted by the 11th International Conference on Fuzzy Systems and Knowledge Discovery
• Published the paper on DataStage performance tuning - "Comparative Analysis of DataStage Types and Parameters on the Performance of a DataStage Application (Job)" in the 9th International Conference on Natural Computation
Tanson Corp, Bloomington, MN, USA, 2011 to 2012
Bioinformatics System Design and Development Consultant (Contract)
(Contracted to Mayo Clinic BORA project)
• Boosted bioinformatics software development efficiency by standardizing the software and service pack installation and upgrading, configuration, SVN setup, and database cataloging of DDQB studio, DB2 server, and Tomcat server for the BORA_DDQB Java project, and those of IBM RAD (v8), DB2 server, and WebSphere application servers for the BORA Control Center projects
• Resolved long-standing software issues by debugging and fixing >10 P1 and >10 P2 issues in the Java program BORA_DDQB, its data warehouse model, and its implementation algorithm
• Sped up the release of the BORA software system (v1.0) by designing, programming and performing the version 1.0 and 1.1 QA testing plans and suites using HP Quick Test Professional to identify and fix the 13 failures or incidents
IBM, Rochester, MN, USA, 2011
Software Engineer
• Revitalized the software development collaboration between IBM and Mayo Clinic IT/Bioinformatics teams by contributing coding, debugging, testing, software analysis, database remodeling, and process standardization. Major technology used: Java/JavaEE/Servlet/JSP/SQL&PL/SQL/JavaScript/HTML plus IBM RAD, DB2, DDQB, ILOG, WebSphere, Tomcat, DB Solo, Cytoscape
• Strengthened the IBM-Mayo Clinic collaborative software DDQB_BORA by developing a data transformation plugin, "Gene Expression Data Format Transformation", which transforms gene expression data based on Entrez Gene ID or Gene Symbol and saves the results to a CSV file or displays them in the web browser for further processing by Cytoscape or other graph analysis software
• Expanded the IBM DDQB software use cases by creating a generic graphing plugin that incorporated free and/or commercial graph and analysis software programs such as Cytoscape, JFreeChart, DOJO JS, ManyEyes, COGNOS and ILOG for easily generating graphs from numeric data retrieved by DDQB from an RDBMS database or data warehouse