Big Data Expertise

§ Serving as a reviewer on Big Data for the high-impact-factor CS journal IEEE Transactions on Industrial Informatics (since 2016), for other CS journals such as Recent Patents on Computer Science, and for international CS conferences such as the 2018 International Workshop on Computer Science and Technology (2018IWCST)

§ Speaker at the 2017 DataWorks Summit/Hadoop Summit (June 13-15, 2017, San Jose), selected by the 2017 Hadoop/DataWorks Summit Conference Committee. Presentation title: “Securing Enterprise Healthcare Big Data by the Combination of Knox/F5, Ranger, TFA and Kerberos Coupled with Enterprise Active Directory and LDAP”; well received and praised (View SLIDES)

§ Speaker at the 2015 Hadoop Summit, San Jose, selected by the 2015 Hadoop Summit Conference Committee. Presentation title: “Hadoop Platform Coupled with ElasticSearch, DataStage and Relational Database/Web Applications Processes Daily Healthcare Data for Clinic Use at Mayo Clinic” (shortened title: “Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic”); well received and praised (View SLIDES)

§ Invited Hadoop speaker for a recorded video interview about Mayo Clinic (June 10, 2015, ~30-40 min), for publication on YouTube (watch the YouTube Video By Hortonworks) and on the Hortonworks Customer Page (Video Carousel)

§ First and corresponding author of a full Computer Science (CS) paper on Big Data titled "Real-Time or Near Real-Time Persisting Daily Healthcare Data into HDFS and ElasticSearch Index Inside a Big Data Platform", which is in the process of publication in a high-impact-factor IEEE journal (View the Final Press-Proof Version)

§ Hands-on Big Data/Hadoop expertise (DevOps and agile environment) at Mayo Clinic: Hadoop administration (new installation, expansion, upgrading, and issue identification and fixing for Hadoop clusters; cluster administration and customer support; production on-call), Big Data development (design, coding and testing), Hadoop testing (unsecured and enterprise-secured Automated Enterprise-Readiness Hadoop Certification Testing (A-ERHCT) suite), and Big Data solution architecture design and implementation (MC BD Charter 1 Solution Architecture; unsecured and enterprise-secured HDFS-ConcV Solution Architectures; secure Data Migration/Backup Solution Architecture [data in HDFS, Hive and HBase (~20 TB) from BDProd1 (local KDC and local realm) to BDProd2/BDSdbx (AD KDC and realm)]; unsecured ESB Queue Monitoring and Draining Solution Architecture). Major achievements:

Successfully and independently designed and implemented the Mayo Clinic (MC) Charter-1 Big Data Development Cluster (BDDev) from scratch, using CentOS 6.5 and the Hortonworks Hadoop stack HDP1.3.2 (15 components), for Hadoop development by the MC Big Data Core Team. Successfully resolved the issues that arose in various situations
A major contributor in designing, coding and testing the Mayo Clinic production Storm topology (MayoTopology) and its HL7 parsers for real-time or near-real-time processing of MC daily healthcare data (56 HL7 document types) from MC ESB (enterprise service bus, implemented by IBM MQ) queues fed by the individual EMR source systems (3 EMR instances). The topology indexes HL7-V2-derived JSON documents into the index of an ElasticSearch cluster deployed on the edge nodes of the MC production Hadoop cluster for enterprise clinical and non-clinical usage; stores the original HL7 messages and their metadata in HDFS, using Flume and a /doctype/year/month/day/ folder structure, and in an HBase table of the Hadoop cluster for enterprise clinical and non-clinical usage; and forwards HL7 messages to NLP (natural language processing) input queues on the MC ESB for annotation by the downstream NLP pipeline deployed on the MC DataStage production server, with the annotated data directly consumed by Amalga/EASE RDB/Web applications in support of MC colorectal surgery
Successfully and independently designed and implemented an Automated Enterprise-Readiness Hadoop Certification Testing (A-ERHCT) Suite (View One-Time Testing Report on MC BDProd2) in three versions: Un-Kerberized, Locally-Kerberized and Enterprise-Secured (Enterprise-Secured: a Hadoop cluster protected by enterprise Kerberos, AD (Active Directory), LDAP, Knox, Ranger and OS hardening). The Un-Kerberized version has been used in the Cloudera Sandbox, the Hortonworks Sandbox, and 4 MC Big Data Hadoop clusters (BDDev, BDInt, BDProd and BDSdbx) (2014-2016). The Locally-Kerberized version has been used in 3 MC Big Data Hadoop clusters (BDDev, BDInt and BDProd) (2015-2016), while the Enterprise-Secured version has been successfully used in the recently upgraded and security-architecture-remodeled Big Data Hadoop clusters (BDDev, BDSdbx, BDInt2/BDTest2 and BDProd2; HDP/TDH2.3.4 version) (2016):

The suite covers all possible enterprise-level use cases and test scenarios for the actual functions of Hadoop ecosystem components (HDFS, MapReduce, YARN, Sqoop, Hive, Pig, HBase/HCatalog, Flume, Storm, Kafka, Solr, Oozie, Phoenix, Spark… (+ ElasticSearch) ... WebHDFS, WebHCat, Knox (secured WebHDFS, WebHCat, WebHBase, Oozie...)) on any entry node of each Hadoop cluster (all edge nodes, the secondary master node, and the Knox server nodes that a user or client may use). For example, the 6 test scenarios for HBase certification testing all test the functions that manipulate an HBase table (disable, drop, create, load and query); they differ only in how the table is loaded: HBase Shell, Sqoop-Importing, Hive-HBase-Integration, Hive-Generating-HFiles, Pig-Through-HCatalog, or HBase Java Client.

The suite has proven very useful for identifying (diagnosing) and fixing Hadoop component problems caused by new Hadoop cluster installation, expansion, upgrading, or settings changes.

It continues to evolve to accommodate new Big Data processing components as they are incorporated into a Hadoop cluster. It has been successfully and extensively used by MC for new installation, expansion and upgrading of MC Big Data Appliance Hadoop clusters (TDH1.3.2 to TDH2.1.2 to TDH2.1.11 ... HDP/TDH2.3.4), during which > 10 P1 (priority #1) issues and > 50 P2 (priority #2) issues were found and fixed

Successfully and independently designed and implemented the architecture to measure the daily HL7-persisting capacity of the MC Big Data Appliance (both V1.0 and V2.0) (Note: Integration [BDInt] or Production [BDProd] Hadoop cluster coupled with ElasticSearch for enterprise clinical and non-clinical usage), and the daily HL7-processing capacity of the MC Big Data Appliance (both V1.0 and V2.0) coupled with DataStage (NLP Annotators) - Amalga/EASE (RDB/Web) in support of colorectal surgery. The results were used to decide whether to scale up and/or scale out the MC Hadoop clusters within the MC Big Data Appliance
Successfully and independently designed and implemented the first version of the MC Automated Queue Monitoring and Draining Program, currently running on the MC integration/test (BDInt) and production (BDProd) Hadoop clusters, for monitoring MC ESB queues and conditionally draining and saving their HL7 messages (Note: saved into an RDB as the 1st layer and into Linux local file(s) as the 2nd layer). It has proven very useful whenever MayoTopology instances are down or a Hadoop cluster is undergoing hardware and/or software changes such as expansion and upgrading (Note: This program has replaced my previously developed DrainerTopology)
Successfully and independently developed the Historical HL7 Back-Loading Program, which back-loaded (ingested) and verified all Mayo Clinic historical (prior to 12/31/2014) HL7 messages from an RDBMS (Amalga MsSql database tables) into the HDFS of the BDProd and BDInt Hadoop clusters within the MC Big Data Appliance: >13 TB of data (~1.83 billion rows of Amalga HL7 and non-HL7 message data) with a replication factor of 3 on each cluster. The data ingestion used 2 approaches, ETL and dump, both based on Sqoop
Successfully and independently designed and implemented the MC Enterprise-Level Production Persisted Healthcare Data Validation Program for validating all the data persisted in the HDFS, ElasticSearch index and HBase tables on the BDProd Hadoop cluster and its backup BDInt Hadoop cluster, and for generating relevant metadata on the persisted production data for batch analytics usage
Successfully and independently designed and implemented the MC Stress Testing and Trickle Testing Programs for integration and system testing of MayoTopology prior to its production run. Stress testing runs 5-7 consecutive times, while trickle testing runs for 1-2 consecutive weeks. Both were successfully and extensively used by MC for the development and evolution of MayoTopology
A critical or major contributor in designing and implementing the architectures (solution and data-flow architectures) that involve Hadoop technology and handle Big Data challenges for the following enterprise business needs:

(a) enterprise-level clinical and non-clinical usage of real-time and partial or all historical HL7 messages generated by daily MC healthcare (a lambda architecture, completed);

(b) daily processing or annotating of relevant HL7 messages in support of colorectal surgery (completed); and

(c) enterprise-level transformation of real-time and/or historical HL7 messages into FHIR (Fast Healthcare Interoperability Resources) objects (ongoing)

Independently and successfully designed and implemented the solution architecture for migrating and backing up the MC production Hadoop cluster's HDFS, Hive and HBase data from BDProd1 (local KDC) to BDProd2 (enterprise-secured), even though the two clusters could not communicate with each other and were not in the same realm because of MC security policy

Successfully analyzed and identified the difference between MC historical HL7 messages in the Amalga Database (MsSQL DB) and MC EDW non-HL7 EMR (normalized structured data) in the EDT MCLSS Database (BD2 DB). This enabled the MC Big Data Core Team to make the right decision to use the historical HL7 message data in the Amalga DB for Charter 1 development etc.
Successfully fixed, independently or in collaboration with Teradata/Hortonworks engineers or analysts, all the P1 and P2 issues that occurred during new MC Hadoop cluster installation, existing Big Data cluster expansion and/or upgrading from TDH1.3.2 to TDH2.1
Successfully generated and shared example/sample commands, scripts, Java program code and SOPs for Hadoop ecosystem components (Storm, Kafka, Ranger, Hive ODBC/beeline/JDBC, Sqoop, HBase/WebHBase, HDFS/WebHDFS, Knox…) with the BDTS, DAAS/FHIR, NLP, Registry and Plumber teams at MC. Successfully trained MC Big Data developers on enterprise-secured Storm and Kafka for all MC Hadoop clusters
Successfully configured on all enterprise-secured MC Hadoop clusters: Storm and Kafka for non-root users; auth_to_local rules for HDFS, Oozie, Falcon, Storm, Kafka, Ranger KMS and Atlas; Knox topologies name node HA for Oozie; proxy user for HTTP/HTTPS for Hue, Falcon, Oozie, WebHBase and other Hadoop ecosystem components; Ranger policies of HDF, Hive, HBase, Storm, Kafka, Knox, YARN and Solr for bdts, bdsupportedadmin and other specific AD groups or single users
Successfully disabled the less secure Hive Shell and Hive JDBC while enforcing Hive Beeline and Knox-Gateway Hive JDBC on all MC Hadoop clusters
Successfully installed the latest JDBC drivers for SQL Server, DB2, Oracle, MySQL, PostgreSQL and Sybase for Sqoop on all MC Hadoop clusters
Successfully conducted > 2 years of Big Data production on-call support, Hadoop cluster health support and customer support (for all 4 Hadoop clusters: BDDev, BDInt1/BDInt2, BDProd1/BDProd2 and BDSdbx), and successfully identified and fixed the root causes of hundreds of on-call or customer issues on MC Hadoop clusters (P1, P2, P2 or S2-S4 TAYS tickets of Teradata & P1-P4 TFS tickets of MC), independently or in collaboration with Teradata/Hortonworks engineers and on-call support personnel
Successfully designed and implemented a batch HDFS concatenation and validation program to reduce the large number of historical small HDFS files, in various data formats, generated by the MayoTopology HdfsFlumeBolt on the MC BDProd Hadoop cluster. The validation uses an RDBMS database for fast storage and retrieval of large amounts of analyzed metadata. For example, a successful batch concatenation of the HL7-derived-JSONV1 format produced the following validation report:

Independently and successfully designed and implemented the solution architecture to concatenate and validate the HDFS data generated by MC production Storm topology instances as a large number of small HDFS files, which are detrimental to a Hadoop cluster, especially a production one. The implemented tool (ConcV) has two versions: an Unsecured version (Note: It has concatenated and validated ~40-45 million small HDFS files) and an Enterprise-Secured version (it is undergoing full testing on the Mayo Clinic BDSdbx and BDInt2/BDTest2 Hadoop clusters and will soon migrate to the BDProd2 production Hadoop cluster). Each version contains a daily (deltaDay=3) concatenation and in-memory validation program and a Saturday concatenation and in-memory validation program; the latter handles all late-arrival data landing in the day folders before the day specified by deltaDay.
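
As an illustration only, the day-folder selection described above might be sketched in Python as follows. The function names, the base path, the zero-padded folder layout, the reading of deltaDay as "days before the run date", and the lookback parameter for the Saturday run are all assumptions made for this sketch; they are not taken from the actual ConcV programs.

```python
from datetime import date, timedelta


def target_day_folder(base, doc_type, run_date, delta_day=3):
    """Return the day folder a daily run would concatenate.

    Assumption: the daily program targets the folder delta_day days
    before the run date, laid out as base/doc-type/year/month/day.
    """
    target = run_date - timedelta(days=delta_day)
    return f"{base}/{doc_type}/{target:%Y/%m/%d}"


def late_arrival_days(run_date, delta_day=3, lookback=7):
    """Return the days a Saturday run would revisit for late arrivals.

    Assumption: the Saturday program re-checks a fixed number of days
    (the hypothetical `lookback`) before the daily target day.
    """
    target = run_date - timedelta(days=delta_day)
    return [target - timedelta(days=n) for n in range(1, lookback + 1)]


if __name__ == "__main__":
    # Hypothetical base path and doc-type, for illustration only.
    print(target_day_folder("/data/hl7", "adt", date(2016, 3, 2)))
    print(late_arrival_days(date(2016, 3, 5), lookback=2))
```

Keeping the offset as a parameter rather than a constant mirrors the way the text describes deltaDay as a configurable value.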

The unsecured versions of the programs were scheduled to run by Linux cron jobs (Note: Oozie failed to work on the old version of HDP/TDH), but that approach lacks monitoring and logs. The enterprise-secured versions of the programs are scheduled to run by Oozie coordinator jobs using the Oozie Java workflow action. Testing of the secured version on the MC BDSdbx Hadoop cluster was very successful, while testing on BDTest2 is ongoing. In addition, integration/system testing of the unsecured version produced the following report:

HDFS concatenation is required by the current MayoTopology implementation for the BDProd Hadoop cluster to function properly at MC. Without concatenation, the MC BDProd Hadoop cluster that runs the production Storm topology becomes more and more stressed, and at some point the whole cluster may fail (Note: Hadoop was not originally designed for large numbers of small HDFS files, because of the possible failure of the name node that manages the cluster's HDFS namespaces...). The Yesterday Concatenation & Validation Program is now being turned into an Any Day Concatenation & Validation Program that is more flexible to deploy: it takes 9 input arguments that update the parameter values in the program properties file, and the program uses those values at run time. The data validation is an in-memory data analysis. It analyzes all qualified source HDFS files in each day folder (…/doc-type/year/month/day/*) and the corresponding single concatenated destination HDFS file (… /doc-type/year/month/day/part-m-00000) to obtain the lists of uuid and uuid-size from both the source files and the destination file for each day per doc-type, and then compares the source and destination lists to determine whether the concatenation for a day of a doc-type is a success or a failure. A success must simultaneously meet the following 4 criteria (Note: If one or more of the 4 criteria is not satisfied, the concatenation for that day of that doc-type is considered a failure):

(1) The total number of JSON documents (JsonItem#) in the source files equals the total number of JSON documents (JsonItem#) in the concatenated destination file;

(2) The number of corrupted payloads (HDFS Corrupted Payload#) in the concatenated destination file is 0;

(3) The number of lost payloads (HDFS Data Loss Payload#) in the concatenated destination file is 0;

(4) The number of added payloads (HDFS Data Increase Payload#) in the concatenated destination file is 0
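
The four criteria can be illustrated with a minimal Python sketch that compares the uuid and uuid-size lists of the source files and the destination file. It is not the production Java program: the names `ValidationReport` and `validate_day` are hypothetical, and the readings of "corrupted" (same uuid, different size), "data loss" (uuid missing from the destination) and "data increase" (uuid absent from the sources) are assumptions made for this sketch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ValidationReport:
    source_items: int  # JsonItem# in the source files
    dest_items: int    # JsonItem# in the concatenated file
    corrupted: int     # assumed: uuid matches, size differs
    lost: int          # assumed: uuid in sources, not in destination
    added: int         # assumed: uuid in destination, not in sources

    @property
    def success(self):
        # All four criteria must hold at the same time.
        return (self.source_items == self.dest_items
                and self.corrupted == 0
                and self.lost == 0
                and self.added == 0)


def validate_day(source, dest):
    """Compare (uuid, size) lists for one day of one doc-type."""
    src_ids = Counter(uuid for uuid, _ in source)
    dst_ids = Counter(uuid for uuid, _ in dest)

    # Sizes seen in the sources for each uuid.
    src_sizes = {}
    for uuid, size in source:
        src_sizes.setdefault(uuid, set()).add(size)

    corrupted = sum(1 for uuid, size in dest
                    if uuid in src_sizes and size not in src_sizes[uuid])
    lost = sum((src_ids - dst_ids).values())
    added = sum((dst_ids - src_ids).values())
    return ValidationReport(len(source), len(dest), corrupted, lost, added)


if __name__ == "__main__":
    src = [("a", 10), ("b", 20), ("c", 30)]
    print(validate_day(src, list(src)).success)
    print(validate_day(src, [("a", 10), ("b", 21), ("d", 5)]))
```

The second call shows why the item-count check alone is not enough: both sides hold three documents, yet one payload is changed, one is lost and one is new, so the day is reported as a failure.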

In addition, data validation reports the number of HDFS files and the duplicates present in the MayoTopology-generated HDFS source files and the concatenated destination files, that is, duplicates caused by MayoTopology or by Flume warnings/errors (UUID Duplicates JsonItem#). The program also totals (summarizes) all of the above information across all doc-types for each target-day concatenation and validation.
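
A small Python sketch of the duplicate count and the per-day totals across doc-types, under stated assumptions: the function names, the metric keys and the doc-type names are hypothetical, and a duplicate is taken to mean every extra occurrence of a uuid after the first.

```python
from collections import Counter


def uuid_duplicates(uuids):
    """Count extra occurrences of each uuid (assumed meaning of
    'UUID Duplicates JsonItem#')."""
    counts = Counter(uuids)
    return sum(n - 1 for n in counts.values() if n > 1)


def totalize(per_doc_type):
    """Sum each metric over all doc-types for one target day."""
    totals = Counter()
    for metrics in per_doc_type.values():
        totals.update(metrics)  # adds values key by key
    return dict(totals)


if __name__ == "__main__":
    print(uuid_duplicates(["a", "b", "a", "a"]))
    # Hypothetical doc-types and metric names, for illustration only.
    print(totalize({
        "adt": {"json_items": 3, "duplicates": 1},
        "oru": {"json_items": 2, "duplicates": 0},
    }))
```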

The Java program has the following architecture and work flow:

§ Hands-on DataStage skills and expertise in an agile environment:

Successfully developed (designed, implemented, and QC tested) > 30 new DataStage jobs (Staging, ADS and Mart) and fixed defects for >130 existing DataStage jobs (ADS and Mart jobs) in less than one year
A conference paper titled “Development of Fully Automated DataStage Application Unit Testing Software Tool” was published in the 11th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2014)
A conference paper titled “Comparative Analysis of DataStage Types and Parameters on the Performance of a DataStage Application (Job)” was published in The 9th International Conference on Natural Computation, 2013 (ICNC 2013)
Independently initiated and developed (designed, implemented, and tested) 3 software tools for Mayo Clinic DataStage Job (application) development and testing:
- Fully Automated DataStage Job (Application) Unit Testing Tool: ~37,000 lines of Java code + hundreds of lines of Unix shell-scripting code + hundreds of lines of Visual Basic code & user interfaces. Prior to this tool, no software tools on the market were available for EDW DataStage job unit testing
- S2T Guide Error Finding & Correction Tool: helps data analysts write a correct S2T specification for each EDW target table by finding and correcting any errors in a complete or incomplete S2T guide
- DataStage SQL Query Plus Tool: helps ETL job developers quickly and accurately develop ETL applications using the SQL queries, lookup/join information and possible sparse lookup orchestrate columns that the tool generates from an existing target table S2T guide. It has the potential to significantly decrease EDW ETL job development time while increasing the data accuracy of EDW work at Mayo Clinic
