ETL – Apple XML file
I've been learning about how data moves around, and I started using Azure Data Factory, a tool for building data pipelines. This project is aimed at demonstrating and consolidating my knowledge of ADF.
I found a feature in my iPhone's Health app that lets me download the raw data collected by my iPhone and Apple Watch in XML format. Curious about what's in there, I decided to build an ETL (Extract, Transform, Load) pipeline in Azure Data Factory.
The data goes back to 2019. When I downloaded and unzipped it, it turned out to be a large file of 777 MB containing 18,843,340 rows of data.
The export folder contains:
- a complex XML file with all the data
- GPX files, which can be used to recreate the routes taken during outdoor trips and opened directly in Google Earth Pro
- a CSV file containing ECG readings
I created a resource group under my pay-as-you-go subscription containing:
- Azure Data Factory
- Data Lake Storage Gen2
- Azure SQL Database
- Azure SQL Server
- Logic App
- Gmail API connection
- Log Analytics workspace (for the Logic App)
- Metric alert rule
- Action group
I've set up a server in VMware Workstation Pro, allowing me to work locally in SQL Server Management Studio (SSMS) on my on-premises virtual machine (VM). The connection is stable because I've configured a static IP address for the VM and added it to the Azure SQL Server firewall rules.
Description of the process
Fetching Data from the iPhone Health App:
- Open the Health app on your iPhone.
- Go to the Summary tab and tap your profile picture in the top-right corner.
- Export the data from the app. You can download it onto your PC, Dropbox, OneDrive or, as in my case, onto a NAS server (personal cloud).
- The downloaded file is zipped, with a size of approximately 36 MB in my case. Inside, you'll find the Health data file in XML format, which has a complex structure.
Setting Up Data Lake Storage:
- Create a storage account with locally-redundant storage (LRS) and enable the hierarchical namespace, making it Data Lake Storage Gen2.
- Configure the access tier as Hot (frequently accessed).
- Establish containers/directories with a hierarchical structure within the Data Lake Storage.
First Step in Master Pipeline – Copy Data Activity (Unzip):
- Use the Copy Data activity to unzip the downloaded file.
- Configure the activity with a Binary dataset whose compression type is set to ZipDeflate (.zip), pointing to the ZIP directory within the Data Lake container (see the sketch below).
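For illustration, the source Binary dataset might look roughly like the sketch below. The dataset, linked service, container and folder names are my own placeholders, not necessarily the ones used in the project:
{
    "name": "Zip_Binary_Source",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "ls_datalake",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "apple-health",
                "folderPath": "ZIP",
                "fileName": "export.zip"
            },
            "compression": {
                "type": "ZipDeflate"
            }
        }
    }
}
In the actual pipeline the file name is not hard-coded; it comes from the Master_Pipeline's FileName parameter described later.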
Creating Storage Event Trigger in Data Factory:
- Set up a Storage Event Trigger within Azure Data Factory, directing it to monitor the specific container created earlier.
- Configure this trigger to initiate the entire process on the Master_pipeline whenever a file named “export.zip” appears in the ZIP directory within the container.
- This trigger automatically kickstarts the data processing pipeline whenever new data is detected (a sketch of the trigger definition follows).
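A rough sketch of what the storage event trigger definition can look like; the trigger name, container, paths and scope below are placeholders:
{
    "name": "trg_export_zip_arrived",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/apple-health/blobs/ZIP/",
            "blobPathEndsWith": "export.zip",
            "ignoreEmptyBlobs": true,
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>",
            "events": [ "Microsoft.Storage.BlobCreated" ]
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "MASTER_PIPELINE",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}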
Creating Integration Runtime:
- For the purposes of this project I used the default AutoResolveIntegrationRuntime.
- AutoResolveIntegrationRuntime is a dynamic integration runtime provided by Azure Data Factory.
- This runtime automatically handles infrastructure provisioning and scaling based on workload demands, ensuring optimal performance and resource utilization for the project’s data integration tasks.
Connecting Copy Data Activities:
- Establish a connection between the first Copy Data Activity (Unzip) and the second Copy Data Activity (Backup-reverse).
- Configure the second Copy Data Activity to copy the entire unzipped data from the first activity and transfer it into the BackupUnzipped directory.
- This activity copies the unzipped data into the designated BackupUnzipped directory, keeping a copy in storage for further processing or archival purposes (a sketch of the activity dependency follows).
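Assuming hypothetical activity names, the dependency between the two Copy Data activities is expressed in the pipeline JSON roughly like this:
{
    "name": "Copy_Backup_Reverse",
    "type": "Copy",
    "dependsOn": [
        {
            "activity": "Copy_Unzip",
            "dependencyConditions": [ "Succeeded" ]
        }
    ]
}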
Adding Wait Activity:
- Add a Wait activity to the pipeline to delay the process by 30 seconds (sketch below).
- This Wait activity serves as a pause mechanism, allowing for controlled timing within the pipeline execution.
- The 30-second delay gives the preceding copy operations time to settle before subsequent activities run, keeping the workflow properly sequenced.
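The Wait activity itself is trivial in JSON (the activity name is illustrative):
{
    "name": "Wait30s",
    "type": "Wait",
    "typeProperties": {
        "waitTimeInSeconds": 30
    }
}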
Delete Activity:
- Implement a Delete Activity within the pipeline to clear out the contents of the apple_health_export/ZIP directory.
- This activity ensures the directory remains empty at all times, aligning with the requirement that the StorageEventTrigger should only trigger the pipeline when the export.zip file appears in this directory.
- By removing any existing files within the directory, the pipeline is primed to respond effectively to new data arrivals, maintaining a streamlined and efficient data processing flow.
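A sketch of the Delete activity, assuming a dataset (here called "ds_zip_directory") that points at the apple_health_export/ZIP directory; names and store settings are placeholders:
{
    "name": "Delete_ZIP_contents",
    "type": "Delete",
    "typeProperties": {
        "dataset": {
            "referenceName": "ds_zip_directory",
            "type": "DatasetReference"
        },
        "enableLogging": false,
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "recursive": true
        }
    }
}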
Creating Azure SQL Server:
- Set up an Azure SQL Database logical server.
- Use SQL authentication for easier connectivity with the on-premises VM.
- SQL authentication provides straightforward access to the database.
- This choice simplifies the connection process between your Azure environment and on-premises resources.
Creating SQL Database:
- Create an Azure SQL Database using the serverless compute tier.
- Choose the Standard-series (Gen5) configuration, with flexibility for up to 80 vCores and 240 GB of memory (the minimum memory allocation in my setup is 3 GB).
- Enable the auto-pause feature, configured to activate after one hour of inactivity.
- With auto-pause enabled, the database automatically suspends after one hour of idle time, optimizing resource usage and minimizing costs during periods of inactivity.
Configuring SQL Server Firewall:
- Access the SQL Server management interface and navigate to the firewall settings.
- Add the static IP address of the VM to the firewall rule exceptions.
- By adding the VM’s static IP address to the firewall exceptions, you ensure that the VM can establish a secure connection with the SQL Server.
- This configuration enhances security by allowing only authorized connections from the specified VM while blocking unauthorized access attempts.
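The same rule can also be captured as an ARM template resource. This is only a sketch: the server name, rule name, IP address and API version below are placeholders and may need adjusting:
{
    "type": "Microsoft.Sql/servers/firewallRules",
    "apiVersion": "2021-11-01",
    "name": "<sql-server-name>/Allow-OnPrem-VM",
    "properties": {
        "startIpAddress": "203.0.113.10",
        "endIpAddress": "203.0.113.10"
    }
}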
Open SQL Server Management Studio:
- Log into your VM.
- Open SQL Server Management Studio.
Connect to SQL Server:
- Enter the server name and credentials to connect to your SQL Server instance.
Create Tables:
-- Individual health samples (the HealthData records in export.xml)
CREATE TABLE Apple_Health (
    [Type]         NVARCHAR(150),
    [sourceName]   NVARCHAR(50),
    [unit]         NVARCHAR(20),
    [creationDate] DATETIMEOFFSET,
    [startDate]    DATETIMEOFFSET,
    [endDate]      DATETIMEOFFSET,
    [value]        NVARCHAR(150)
);

-- Workout entries from export.xml
CREATE TABLE Workout (
    [workoutActivityType] NVARCHAR(100),
    [duration]            DECIMAL(10,5),
    [durationUnit]        CHAR(5),
    [sourceName]          NVARCHAR(30),
    [creationDate]        DATETIMEOFFSET,
    [startDate]           DATETIMEOFFSET,
    [endDate]             DATETIMEOFFSET,
    [sum]                 DECIMAL(10,5),
    [unit]                CHAR(10)
);

-- Daily activity summaries (activity rings) from export.xml
CREATE TABLE Activity (
    [dateComponents]         DATE,
    [activeEnergyBurned]     DECIMAL(10,3),
    [activeEnergyBurnedGoal] INT,
    [activeEnergyBurnedUnit] CHAR(10),
    [appleMoveTime]          INT,
    [appleMoveTimeGoal]      INT,
    [appleExerciseTime]      INT,
    [appleExerciseTimeGoal]  INT,
    [appleStandHours]        INT,
    [appleStandHoursGoal]    INT
);
After executing the queries, you can verify that the tables were created successfully by refreshing Object Explorer in SSMS or by logging into the Azure SQL Database with the Query editor in the Azure portal.
Create Child Pipeline XML_pipeline
- Navigate to your Azure Data Factory instance.
- Create a new pipeline and name it as desired, e.g., "XML_pipeline".
1. Add Copy Data Activities:
- Within the ChildPipeline, add three Copy Data activities.
- Configure each Copy Data activity to use the export.xml file from the Data Lake container as the source.
- Set the destination of each Copy Data activity to the respective SQL table.
2. Mapping XML Data:
- In the mapping tab of each Copy Data activity, adjust the array mappings according to the columns in the destination SQL table.
- Since the XML file has three main arrays ([HealthData], [Workout], and [ActivitySummary]), map each array to the corresponding table columns.
3. Exclude Unnecessary Data:
- Exclude unnecessary data such as sourceVersion, device, TypeIdentifier, manufacturer, model, hardware, software, etc., from the mapping. Only map the required data to the SQL tables (a mapping sketch follows).
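As an illustration, the translator section of the Copy Data activity that loads the Apple_Health table could look roughly like the sketch below. The exact attribute paths and the collection reference depend on how the XML dataset exposes attributes, so treat this as an assumption-laden sketch rather than the exact mapping used:
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "path": "$['@type']" },         "sink": { "name": "Type" } },
        { "source": { "path": "$['@sourceName']" },   "sink": { "name": "sourceName" } },
        { "source": { "path": "$['@unit']" },         "sink": { "name": "unit" } },
        { "source": { "path": "$['@creationDate']" }, "sink": { "name": "creationDate" } },
        { "source": { "path": "$['@startDate']" },    "sink": { "name": "startDate" } },
        { "source": { "path": "$['@endDate']" },      "sink": { "name": "endDate" } },
        { "source": { "path": "$['@value']" },        "sink": { "name": "value" } }
    ],
    "collectionReference": "$['HealthData']['Record']"
}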
4. Testing the Child Pipeline:
- Test the child pipeline to ensure it operates as expected.
- Be mindful of the time it may take for the pipeline to complete.
5. Insert Child Pipeline into Master Pipeline:
- Once the child pipeline is tested and working correctly, insert it into your main MASTER_PIPELINE.
Create Data Flow: SQL_to_CSV_data_flow
- Navigate to your Azure Data Factory instance.
- Create a new Data Flow and name it as desired, e.g., "SplitDataFlow".
Set Source Dataset:
- Configure the source dataset to point to the SQL table containing your data.
Conditional Split Transformation:
- Add a Conditional Split transformation.
- Split the data on a column value, for example Type == 'HKQuantityTypeIdentifierStepCount' for the step-count branch.
Sink Configuration:
- Add a Sink transformation for each branch of the Conditional Split.
- Configure each Sink to point to the correct container in the Data Lake.
- Define the filename for the output file using an expression, for example concat('FileName', '.csv').
Repeat the Process:
- Repeat the process for each data set you want to separate.
- Create separate containers in the Data Lake for each type of data if necessary (e.g., ActiveEnergyBurned, BasalEnergyBurned, DistanceWalkingRunning, HeartRate, StepCount).
- Configure the sinks accordingly to write the data into the appropriate containers with the desired filenames.
Finalize and Validate:
- Finalize the Data Flow by ensuring all configurations are correctly set.
- Validate the Data Flow to ensure there are no errors or issues.
“This process could have been skipped, but I will use this data in my future projects”
Insert Data Flow into Child Pipeline (DataFlowPipeline):
- Open the DataFlowPipeline in your Azure Data Factory instance.
- Add an activity to the pipeline.
- Select the “Data Flow” activity type.
- Choose the previously created Data Flow (SplitDataFlow) from the dropdown menu.
Insert Child Pipeline into Master Pipeline (MASTER_PIPELINE):
- Open the MASTER_PIPELINE in your Azure Data Factory instance.
- Add an activity to the pipeline.
- Select the "Execute Pipeline" activity type.
- Choose the previously created Child Pipeline (DataFlowPipeline) from the dropdown menu.
Validate and Publish:
- Validate the changes made to the pipelines to ensure there are no errors.
- Once validated, publish the changes to make them live.
Create a new Logic App:
Design Workflow:
- Add an HTTP trigger for receiving requests.
Add Gmail Action:
- Add “Gmail – Send Email” action.
Configure Email Details:
- Specify recipient email, subject, and body.
Save and Test:
- Save the Logic App.
- Test it by triggering the HTTP request.
The JSON code, with an explanation, is available on the official Microsoft Learn website: https://learn.microsoft.com/en-us/azure/data-factory/how-to-send-email. The first block below is the request body JSON schema for the Logic App's HTTP trigger; the second is the body the pipeline sends to it.
{
    "properties": {
        "dataFactoryName": {
            "type": "string"
        },
        "message": {
            "type": "string"
        },
        "pipelineName": {
            "type": "string"
        },
        "receiver": {
            "type": "string"
        }
    },
    "type": "object"
}
{
    "message": "This is a custom dynamic message from your pipeline with run ID @{pipeline().RunId}.",
    "dataFactoryName": "@{pipeline().DataFactory}",
    "pipelineName": "@{pipeline().Pipeline}",
    "receiver": "@{pipeline().parameters.receiver}"
}
Web activity
- Open the MASTER_PIPELINE.
- Add a Web activity.
- Configure the activity with the Logic App’s URL.
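A sketch of the Web activity configuration; the activity name is mine and the URL placeholder stands for the HTTP POST URL generated by the Logic App trigger:
{
    "name": "Send_Email_Notification",
    "type": "WebActivity",
    "typeProperties": {
        "url": "<Logic App HTTP POST URL>",
        "method": "POST",
        "headers": {
            "Content-Type": "application/json"
        },
        "body": {
            "message": "MASTER_PIPELINE finished with run ID @{pipeline().RunId}.",
            "dataFactoryName": "@{pipeline().DataFactory}",
            "pipelineName": "@{pipeline().Pipeline}",
            "receiver": "@{pipeline().parameters.receiver}"
        }
    }
}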
Create alerts and metrics
- Access Azure Monitor.
- Define metrics for monitoring.
- Set up alerts based on predefined conditions.
- Configure notification actions for alerts.
Here’s the description of the error handling setup for the pipeline:
Retry Policy:
- Configure the retry settings to allow 3 attempts with a 30-second interval between each attempt (see the sketch below).
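In the pipeline JSON this corresponds to the policy block on an activity, for example (the activity name is illustrative and the timeout is left at a typical default):
{
    "name": "Copy_Unzip",
    "type": "Copy",
    "policy": {
        "timeout": "0.12:00:00",
        "retry": 3,
        "retryIntervalInSeconds": 30,
        "secureOutput": false,
        "secureInput": false
    }
}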
Alerts & Metrics:
- Set up alerts to trigger whenever the pipeline fails to run.
- Configure the alerts to push notifications to an Action Group, specifically sending SMS alerts.
Log Analytics Workspace:
- Create a Log Analytics workspace.
- Set up diagnostic logging for Azure Data Factory to include all log categories.
- Send the diagnostic logs to the Log Analytics workspace for monitoring and analysis.
The Master_Pipeline accepts two parameters.
FileName Parameter for Unzipping:
- Add a parameter to the Master_Pipeline named "FileName" to specify the file to unzip in the ZIP container, typically "export.zip".
Email Web Activity/Logic App Parameter:
- Add another parameter to the Master_Pipeline (the "receiver" email address) that is passed to the email Web activity and on to the Logic App. A sketch of both parameters follows.
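A sketch of the parameters block of the Master_Pipeline; the default values are illustrative. The storage event trigger can pass the name of the arriving blob into FileName via the @triggerBody().fileName expression:
"parameters": {
    "FileName": {
        "type": "string",
        "defaultValue": "export.zip"
    },
    "receiver": {
        "type": "string",
        "defaultValue": "your.address@example.com"
    }
}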
A detailed description of this pipeline would also cover its 16 datasets and 4 linked services, but I have omitted that to keep this write-up as short as possible. I will use the output of this pipeline for further development in Azure and for subsequent projects.
What does the main Apple_Health table contain?
For a full list with descriptions, please visit the Apple developer website: https://developer.apple.com/documentation/healthkit/data_types
Here are a few examples of the values found in the [Type] column:
'HKQuantityTypeIdentifierStepCount' – A quantity sample type that measures the number of steps the user has taken.
'HKQuantityTypeIdentifierBasalEnergyBurned' – A quantity sample type that measures the resting energy burned by the user.
'HKQuantityTypeIdentifierWalkingSpeed' – A quantity sample type that measures the user's average speed when walking steadily over flat ground.
'HKQuantityTypeIdentifierActiveEnergyBurned' – A quantity sample type that measures the amount of active energy the user has burned.
'HKCategoryTypeIdentifierHeadphoneAudioExposureEvent' – A category sample type that records exposure to potentially damaging sounds from headphones.
'HKCategoryTypeIdentifierRapidPoundingOrFlutteringHeartbeat' – A category type that records a rapid, pounding, or fluttering heartbeat as a symptom.
'HKQuantityTypeIdentifierHeartRateVariabilitySDNN' – A quantity sample type that measures the standard deviation of heartbeat intervals.
'HKQuantityTypeIdentifierHeartRate' – A quantity sample type that measures the user's heart rate.
'HKQuantityTypeIdentifierHeight' – A quantity sample type that measures the user's height.
'HKQuantityTypeIdentifierBodyMass' – A quantity sample type that measures the user's weight.
'HKQuantityTypeIdentifierDistanceWalkingRunning' – A quantity sample type that measures the distance the user has moved by walking or running.
'HKQuantityTypeIdentifierHeadphoneAudioExposure' – A quantity sample type that measures audio exposure from headphones.
'HKCategoryTypeIdentifierDizziness' – A category type that records dizziness as a symptom.
'HKQuantityTypeIdentifierSwimmingStrokeCount' – A quantity sample type that measures the number of strokes performed while swimming.
'HKQuantityTypeIdentifierAppleExerciseTime' – A quantity sample type that measures the amount of time the user spent exercising.
'HKQuantityTypeIdentifierDistanceSwimming' – A quantity sample type that measures the distance the user has moved while swimming.
'HKCategoryTypeIdentifierAppleStandHour' – A category sample type that counts the number of hours in the day during which the user has stood and moved for at least one minute per hour.
'HKCategoryTypeIdentifierHandwashingEvent' – A category sample type for handwashing events.
'HKCategoryTypeIdentifierMindfulSession' – A category sample type for recording a mindful session.
'HKQuantityTypeIdentifierWalkingHeartRateAverage' – A quantity sample type that measures the user's heart rate while walking.
'HKQuantityTypeIdentifierWalkingDoubleSupportPercentage' – A quantity sample type that measures the percentage of time when both of the user's feet touch the ground while walking steadily over flat ground.
'HKQuantityTypeIdentifierFlightsClimbed' – A quantity sample type that measures the number of flights of stairs that the user has climbed.
'HKQuantityTypeIdentifierAppleStandTime' – A quantity sample type that measures the amount of time the user has spent standing.
'HKCategoryTypeIdentifierChestTightnessOrPain' – A category type that records chest tightness or pain as a symptom.
'HKQuantityTypeIdentifierRestingHeartRate' – A quantity sample type that measures the user's resting heart rate.
'HKQuantityTypeIdentifierWalkingAsymmetryPercentage' – A quantity sample type that measures the percentage of steps in which one foot moves at a different speed than the other when walking on flat ground.
'HKQuantityTypeIdentifierVO2Max' – A quantity sample that measures the maximal oxygen consumption during exercise.
'HKQuantityTypeIdentifierWalkingStepLength' – A quantity sample type that measures the average length of the user's step when walking steadily over flat ground.
'HKQuantityTypeIdentifierHeartRateRecoveryOneMinute' – A quantity sample that records the reduction in heart rate from the peak exercise rate to the rate one minute after exercising ended.
'HKCategoryTypeIdentifierAudioExposureEvent' – A category sample type for audio exposure events.
'HKQuantityTypeIdentifierEnvironmentalAudioExposure' – A quantity sample type that measures audio exposure to sounds in the environment.
'HKQuantityTypeIdentifierSixMinuteWalkTestDistance' – A quantity sample type that stores the distance a user can walk during a six-minute walk test.
'HKQuantityTypeIdentifierAppleWalkingSteadiness' – A quantity sample type that measures the steadiness of the user's gait.
How to use the GPX files in the workout-routes folder?
- Download Google Earth Pro.
- Open Google Earth Pro.
- Go to “File” and select “Open” or directly drag and drop the GPX file into the software to load it.
At this stage, the data is sorted, cleaned, and inserted into the appropriate containers and SQL tables. Everything happens with a single upload of the export file.
In addition to the obvious readings such as heart rate or calories, the range and types of data my device collects are also interesting. For example: "A quantity sample type that measures the percentage of steps in which one foot moves at a different speed than the other when walking on flat ground."
The exported template for this project is available at https://github.com/jacenty79/Azure-Data-Factory-