Getting Data from an AWS Data Lake into Azure Data Lake

With organizations becoming more cloud agnostic, a need often arises to migrate data from one cloud platform to another, for example to take advantage of the ETL and machine learning services offered by a different cloud vendor.


In this article we will cover how to migrate data from an AWS S3 data lake to an Azure Data Lake Storage account, using AzCopy for the transfer.

Step 1: Authenticate to the AWS account where the data lake resides. For this purpose we use an IAM user's access key ID and secret access key, exported as environment variables:

set AWS_ACCESS_KEY_ID=<value>

set AWS_SECRET_ACCESS_KEY=<value>

Step 2: Authenticate to the Azure Data Lake Storage account. For this purpose we use a shared access signature (SAS) token generated from the storage account where the data lake resides, appended to the URL of the destination container, i.e. https://<storage-account>.blob.core.windows.net/<container>?<SAS token>.
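As an illustration of how the SAS token and destination URL can be produced programmatically, here is a minimal sketch assuming the azure-storage-blob Python SDK; the account and container names are taken from the AzCopy command in Step 3, and the account key is a placeholder:

# Sketch: generate a container-level SAS token and build the destination URL for AzCopy
from datetime import datetime, timedelta
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

account_name = "mohahdi1hdistorage"                    # storage account used in Step 3
container_name = "mohahdi1-2021-11-05t06-51-24-789z"   # destination container used in Step 3
account_key = "<storage account access key>"           # placeholder, from the Access keys blade in the portal

sas_token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True, write=True, list=True, create=True),
    expiry=datetime.utcnow() + timedelta(hours=2),      # token validity window, adjust as needed
)

destination_url = "https://%s.blob.core.windows.net/%s?%s" % (account_name, container_name, sas_token)
print(destination_url)

The resulting URL, the container URL plus ?<SAS token>, is exactly what AzCopy expects as the destination in Step 3.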

Step 3: Initiate the data transfer with AzCopy:


azcopy copy "https://azuremstnks.s3.ap-south-1.amazonaws.com/weather/" "https://mohahdi1hdistorage.blob.core.windows.net/mohahdi1-2021-11-05t06-51-24-789z?<SAS token goes here>" --recursive
INFO: Scanning...

INFO: Authenticating to source using S3AccessKey
INFO: Failed to create one or more destination container(s). Your transfers may still succeed if the container already exists.
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

Job ef9b6cef-1514-f543-7e9b-6554c027533c has started
Log file is located at: C:\Users\Mohanish\.azcopy\ef9b6cef-1514-f543-7e9b-6554c027533c.log

99.6 %, 115 Done, 0 Failed, 1 Pending, 0 Skipped, 116 Total,


Job ef9b6cef-1514-f543-7e9b-6554c027533c summary
Elapsed Time (Minutes): 0.2673
Number of File Transfers: 116
Number of Folder Property Transfers: 0
Total Number of Transfers: 116
Number of Transfers Completed: 116
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 2314043185
Final Job Status: Completed


For Azure HDInsight, the HDFS-compatible storage is the storage account associated with the cluster at creation time, so you can read a Parquet file from the storage account container directly into the cluster:

# Path to the weather data in the cluster's default storage account
parquet_name = 'wasb:///weather/weather'

# Select the precipitation records directly from the Parquet file
query = """SELECT station, measurement, year
FROM parquet.`%s.parquet`
WHERE measurement = 'PRCP' """ % parquet_name
print(query)

df2 = sqlContext.sql(query)
# print('number of rows =', df2.count())
df2.show(5)
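Equivalently, the same data can be read with the DataFrame API instead of an inline SQL query (a sketch assuming the same wasb:/// path as above):

# Read the same Parquet data with the DataFrame API instead of Spark SQL
df3 = sqlContext.read.parquet(parquet_name + '.parquet')
df3.filter(df3.measurement == 'PRCP').select('station', 'measurement', 'year').show(5)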

We then convert the Spark DataFrame to a pandas DataFrame, which is easier to work with and resides on the head node. Note that the method is toPandas(); the older to_Pandas spelling is no longer correct.
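For example, a minimal sketch continuing from df2 above; toPandas() collects all rows onto the driver, so it is worth limiting large results first:

# Pull a small slice of the Spark DataFrame onto the head node as a pandas DataFrame
pdf = df2.limit(1000).toPandas()
print(pdf.head())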
