Big Data ETL
Our client is a major international pharmaceutical company conducting research and development related to a wide range of human medical disorders, including mental illness, neurological disorders, anaesthesia and analgesia, gastrointestinal disorders, fungal infection, allergies, and cancer.
The amount of data healthcare researchers need to do their work grows daily. Our customer’s databases contain terabytes of data.
This makes it a challenge to perform operations on data — loading new datasets, transforming data, and querying resulting databases — in manageable time.
It is also critical to ensure that the data is of adequate quality, as the results of the medical research depend on it.
Another challenge is to lower the cost of the ETL process by automating as much of the work as possible.
The team performed database and application design, development, and quality assurance; data extraction, loading, and transformation; database administration and query optimization; and user support and issue tracking.
To operate on huge volumes of data, we used advanced SQL Server Enterprise edition features such as page-level data compression, horizontal range partitioning, and partition switching.
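To illustrate why partition switching speeds up bulk loads, here is a toy Python model of horizontal range partitioning. It is only a conceptual sketch, not the actual T-SQL (which uses a partition function and ALTER TABLE ... SWITCH); the class and field names are hypothetical. The point it demonstrates is that switching a partition moves a reference rather than copying rows.

```python
from bisect import bisect_right
from datetime import date

class PartitionedTable:
    """Toy model of horizontal range partitioning.

    Sorted boundary values split rows into ranges by partition key,
    the way a RANGE RIGHT partition function does in SQL Server.
    """
    def __init__(self, boundaries):
        self.boundaries = boundaries
        # one row list per range: (-inf, b0), [b0, b1), ..., [bn, +inf)
        self.partitions = [[] for _ in range(len(boundaries) + 1)]

    def partition_index(self, key):
        return bisect_right(self.boundaries, key)

    def insert(self, key, row):
        self.partitions[self.partition_index(key)].append((key, row))

    def switch_partition_to(self, index, target):
        """Move a whole partition to another table.

        Like ALTER TABLE ... SWITCH, this is a metadata-only swap:
        a list reference moves; no row is copied.
        """
        target.partitions[index], self.partitions[index] = (
            self.partitions[index], [])

# Load one month of data into a staging table, then switch it
# into the fact table in O(1), independent of the row count.
boundaries = [date(2014, 1, 1), date(2014, 2, 1)]
fact = PartitionedTable(boundaries)
staging = PartitionedTable(boundaries)
staging.insert(date(2014, 1, 15), {"measurement": 42})

idx = staging.partition_index(date(2014, 1, 15))
staging.switch_partition_to(idx, fact)
```

This constant-time hand-off is what makes the load-into-staging-then-switch pattern attractive for ETL on very large tables.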
Some databases were migrated to a massively parallel database system, the Analytic Platform System, which improved query performance significantly.
Data quality was enforced by the Validation Analysis System, a utility that collects statistics and detects anomalies in loaded data, together with a website that displays the results.
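The internals of the Validation Analysis System are not described here, so the following is only a minimal Python sketch of the general approach such a utility can take: score each value in a loaded column against robust per-column statistics and flag strong deviations. The median/MAD-based modified z-score used below is a standard choice because, unlike mean and standard deviation, it is not skewed by the very outliers it is hunting for.

```python
from statistics import median

def find_anomalies(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The score is based on the median and the median absolute
    deviation (MAD), which stay stable even when outliers are present.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []          # column is (almost) constant; nothing to flag
    return [v for v in values
            if 0.6745 * abs(v - med) / mad > threshold]

# e.g. patient ages in a loaded batch; 542 is a likely data-entry error
ages = [34, 41, 29, 55, 47, 38, 61, 542, 44, 52]
print(find_anomalies(ages))   # -> [542]
```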
The majority of ETL operations were automated with SQL Server Integration Services packages and the CDM Builder application.
The CDM Builder is a multithreaded, distributed application that transforms a number of popular databases into the Common Data Model (CDM). The purpose of the CDM is to standardize the format and content of observational data.
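The CDM Builder's own implementation is not shown here; the Python sketch below only illustrates the general pattern it embodies, mapping source records to CDM-style rows in parallel worker threads. The source field names and chunking are illustrative assumptions; the gender concept ids are the standard OMOP concepts (8507 for male, 8532 for female).

```python
from concurrent.futures import ThreadPoolExecutor

# Standard OMOP gender concept ids; 0 means "no matching concept".
GENDER_CONCEPTS = {"M": 8507, "F": 8532}

def to_person(src):
    """Map one source patient record to a CDM-style PERSON row.

    Field names on the source side are hypothetical; a real builder
    drives this mapping from per-source configuration.
    """
    return {
        "person_id": src["patient_id"],
        "gender_concept_id": GENDER_CONCEPTS.get(src["sex"], 0),
        "year_of_birth": src["birth_year"],
    }

source_rows = [
    {"patient_id": 1, "sex": "F", "birth_year": 1975},
    {"patient_id": 2, "sex": "M", "birth_year": 1960},
]

# Transform source records in parallel, as a multithreaded builder
# would; a real run would partition the work by patient-id ranges.
with ThreadPoolExecutor(max_workers=4) as pool:
    persons = list(pool.map(to_person, source_rows))
```

Because `pool.map` preserves input order, the resulting rows line up with the source records regardless of which thread processed each one.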
Data load time improved significantly, in some cases from days or weeks down to hours.
The quality of the data was improved with the help of the Validation Analysis System.
The data integration process is now largely automated, requiring little to no human attention.