Getting Data from an AWS Data Lake into Azure Data Lake

With organizations becoming more cloud agnostic, a need often arises to migrate data from one cloud platform to another, for example to take advantage of the ETL and machine learning services offered by a different cloud vendor.


In this article we will cover how to migrate data from an AWS S3 data lake to an Azure Data Lake Storage account, using AzCopy for the transfer.

Step 1: Authenticate to the AWS account where the data lake resides. For this purpose we use an IAM user's access key ID and secret access key, exported as environment variables:

set AWS_ACCESS_KEY_ID=<value>

set AWS_SECRET_ACCESS_KEY=<value>

Step 2: Authenticate to the Azure Data Lake Storage account. For this purpose we use a shared access signature (SAS) token generated from the storage account where the data lake resides, appended to the URL of the destination container, i.e. https://<storage-account>.blob.core.windows.net/<container>?<SAS token>.
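As an illustration of how the SAS token and destination URL can be produced programmatically, here is a minimal sketch assuming the azure-storage-blob Python SDK; the account and container names are taken from the AzCopy command in Step 3, and the account key is a placeholder:

# Sketch: generate a container-level SAS token and build the destination URL for AzCopy
from datetime import datetime, timedelta
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

account_name = "mohahdi1hdistorage"                    # storage account used in Step 3
container_name = "mohahdi1-2021-11-05t06-51-24-789z"   # destination container used in Step 3
account_key = "<storage account access key>"           # placeholder, from the Access keys blade in the portal

sas_token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True, write=True, list=True, create=True),
    expiry=datetime.utcnow() + timedelta(hours=2),      # token validity window, adjust as needed
)

destination_url = "https://%s.blob.core.windows.net/%s?%s" % (account_name, container_name, sas_token)
print(destination_url)

The resulting URL, the container URL plus ?<SAS token>, is exactly what AzCopy expects as the destination in Step 3.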

Step 3: Initiate the data transfer with AzCopy:


azcopy copy "https://azuremstnks.s3.ap-south-1.amazonaws.com/weather/" "https://mohahdi1hdistorage.blob.core.windows.net/mohahdi1-2021-11-05t06-51-24-789z?<SAS token goes here>" --recursive
INFO: Scanning...

INFO: Authenticating to source using S3AccessKey
INFO: Failed to create one or more destination container(s). Your transfers may still succeed if the container already exists.
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

Job ef9b6cef-1514-f543-7e9b-6554c027533c has started
Log file is located at: C:\Users\Mohanish\.azcopy\ef9b6cef-1514-f543-7e9b-6554c027533c.log

99.6 %, 115 Done, 0 Failed, 1 Pending, 0 Skipped, 116 Total,


Job ef9b6cef-1514-f543-7e9b-6554c027533c summary
Elapsed Time (Minutes): 0.2673
Number of File Transfers: 116
Number of Folder Property Transfers: 0
Total Number of Transfers: 116
Number of Transfers Completed: 116
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 2314043185
Final Job Status: Completed


For Azure HDInsight, the HDFS-compatible storage is the storage account associated with the cluster at creation time, so you can read a Parquet file from the storage account container directly into the cluster:

# Path to the weather data in the cluster's default storage account
parquet_name = 'wasb:///weather/weather'

# Select the precipitation records directly from the Parquet file
query = """SELECT station, measurement, year
FROM parquet.`%s.parquet`
WHERE measurement = 'PRCP' """ % parquet_name
print(query)

df2 = sqlContext.sql(query)
# print('number of rows =', df2.count())
df2.show(5)
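Equivalently, the same data can be read with the DataFrame API instead of an inline SQL query (a sketch assuming the same wasb:/// path as above):

# Read the same Parquet data with the DataFrame API instead of Spark SQL
df3 = sqlContext.read.parquet(parquet_name + '.parquet')
df3.filter(df3.measurement == 'PRCP').select('station', 'measurement', 'year').show(5)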

We then convert the Spark DataFrame to a pandas DataFrame, which is easier to work with and resides on the head node. Note that the method is toPandas(); the older to_Pandas spelling is no longer correct.
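For example, a minimal sketch continuing from df2 above; toPandas() collects all rows onto the driver, so it is worth limiting large results first:

# Pull a small slice of the Spark DataFrame onto the head node as a pandas DataFrame
pdf = df2.limit(1000).toPandas()
print(pdf.head())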
