
Thursday, June 24, 2021

How to Move Data From XML to CSV in Databricks?

In this post, we use PySpark to process XML files, extract the required records, transform them into a DataFrame, and write the result out as CSV files (or any other format) to the destination.
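A minimal sketch of that flow is below. It assumes the spark-xml library (com.databricks:spark-xml) is attached to the cluster; the row tag, column names, filter condition, and paths are illustrative placeholders, not values from this post.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in Databricks notebooks

# Read the XML files. "record" is an assumed row tag -- replace it with the
# element that wraps each record in your files. Requires the spark-xml
# library (com.databricks:spark-xml) on the cluster.
df = (spark.read
      .format("xml")
      .option("rowTag", "record")
      .load("/mnt/raw/input/*.xml"))

# Keep only the required records and columns (illustrative filter).
result = (df
          .filter(F.col("status") == "active")
          .select("id", "name", "status"))

# Write the transformed DataFrame to the destination as CSV
# (swap .csv for .parquet, .json, etc. for other formats).
(result.write
 .mode("overwrite")
 .option("header", "true")
 .csv("/mnt/curated/output"))
```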

Monday, June 29, 2020

How to Parameterize Connections to Your Data Stores in Azure Data Factory?

Azure Data Factory (ADF) enables hybrid data movement from more than 70 data stores in a serverless fashion. Users often want to connect to multiple data stores of the same type. For example, you might want to connect to 10 different databases on your Azure SQL server, where the only difference between those 10 databases is the database name. You can now parameterize the linked service in your Azure Data Factory: instead of creating 10 separate linked services for the 10 Azure SQL databases, you parameterize the database name in a single ADF linked service and pass the database name dynamically at runtime. This reduces overhead and improves manageability for your data factories.


Simply create a new linked service and click Add Dynamic Content underneath the property that you want to parameterize in your linked service.
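For concreteness, here is a hedged sketch of what a parameterized Azure SQL Database linked service definition can look like, shown as a Python dict mirroring the JSON payload that ADF stores. The server, linked service, and parameter names are assumptions.

```python
# Illustrative sketch of a parameterized linked service definition;
# all names here are placeholders.
parameterized_linked_service = {
    "name": "AzureSqlDatabase_Parameterized",
    "properties": {
        "type": "AzureSqlDatabase",
        # Declare the parameter once on the linked service...
        "parameters": {
            "DBName": {"type": "String"}
        },
        "typeProperties": {
            # ...then reference it as dynamic content in the connection string.
            "connectionString": (
                "Server=tcp:myserver.database.windows.net,1433;"
                "Database=@{linkedService().DBName};"
            )
        }
    }
}
```

A dataset or activity that uses this linked service then supplies DBName at runtime, so a single definition covers all 10 databases.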

You can also parameterize other properties of your linked service, such as the server name and username. We recommend not parameterizing passwords or secrets; store connection secrets in Azure Key Vault and parameterize the secret name instead. The user experience also flags incorrect syntax when you parameterize linked service properties.
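As a sketch of that pattern, a password property can point at Key Vault with a parameterized secret name. The Key Vault linked service name and parameter name below are assumptions.

```python
# Illustrative sketch: the password resolves from Azure Key Vault, and only
# the secret name is parameterized. "MyKeyVault" and "SecretName" are
# placeholders.
password_reference = {
    "type": "AzureKeyVaultSecret",
    "store": {
        "referenceName": "MyKeyVault",  # a Key Vault linked service
        "type": "LinkedServiceReference"
    },
    "secretName": "@{linkedService().SecretName}"
}
```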

Data Analyst, Data Engineer and Data Scientist Skill Sets


Data Analyst vs Data Engineer vs Data Scientist Skill Sets

| Data Analyst | Data Engineer | Data Scientist |
| --- | --- | --- |
| Data Warehousing | Data Warehousing & ETL | Statistical & Analytical skills |
| Adobe & Google Analytics | Advanced programming knowledge | Data Mining |
| Programming knowledge | Hadoop-based Analytics | Machine Learning & Deep learning principles |
| Scripting & Statistical skills | In-depth knowledge of SQL/database | In-depth programming knowledge (SAS/R/Python coding) |
| Reporting & data visualization | Data architecture & pipelining | Hadoop-based analytics |
| SQL/database knowledge | Machine learning concept knowledge | Data optimization |
| Spreadsheet knowledge | Scripting, reporting & data visualization | Decision making and soft skills |


Data Engineer Roles and Responsibilities


Responsibilities for Data Engineer
·         Create and maintain optimal data pipeline architecture.
·         Assemble large, complex data sets that meet functional / non-functional business requirements.
·         Identify, design, and implement internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, etc.
·         Build the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and AWS ‘big data’ technologies.
·         Build analytics tools that utilize the data pipeline to provide actionable insights into customer acquisition, operational efficiency and other key business performance metrics.
·         Work with stakeholders including the Executive, Product, Data and Design teams to assist with data-related technical issues and support their data infrastructure needs.
·         Keep our data separated and secure across national boundaries through multiple data centers and AWS regions.
·         Create data tools for analytics and data scientist team members that assist them in building and optimizing our product into an innovative industry leader.
·         Work with data and analytics experts to strive for greater functionality in our data systems.

How can I schedule a pipeline in Azure?

You can use the schedule trigger or the tumbling window trigger to schedule a pipeline. The schedule trigger uses a wall-clock calendar schedule, which can run pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).
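As an illustration, a schedule trigger definition for the Monday 6:00 PM case might look like the sketch below, shown as a Python dict mirroring the trigger JSON. The trigger name, pipeline name, start time, and time zone are placeholders; the Thursday 9:00 PM run would typically be a second trigger, since the listed hours and minutes apply to every listed weekday.

```python
# Illustrative sketch of a schedule trigger definition; all names and
# timestamps are placeholders.
schedule_trigger = {
    "name": "MondayEveningTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Week",
                "interval": 1,
                "startTime": "2021-06-01T00:00:00Z",
                "timeZone": "UTC",
                "schedule": {
                    "weekDays": ["Monday"],
                    "hours": [18],   # 6:00 PM
                    "minutes": [0]
                }
            }
        },
        "pipelines": [
            {"pipelineReference": {
                "referenceName": "MyPipeline",
                "type": "PipelineReference"
            }}
        ]
    }
}
```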



Pipeline execution and triggers in Azure Data Factory