Getting Data from an AWS Data Lake into Azure Data Lake
With organizations becoming more cloud agnostic, the need often arises to migrate data from one cloud platform to another in order to use the ETL and machine learning offerings of a different cloud vendor.
In this article we cover how to migrate data from an AWS S3 data lake to an Azure Data Lake Storage account, using AzCopy for the transfer.
Step 1: Authenticate to the AWS account where the data lake resides. For this we use an IAM user's access key ID and secret access key, set as environment variables:
set AWS_ACCESS_KEY_ID=<value>
set AWS_SECRET_ACCESS_KEY=<value>
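Before starting the copy, you can optionally confirm that these credentials are picked up correctly. The snippet below is a minimal sketch using boto3 (not part of the original walkthrough); the bucket name azuremstnks and the weather/ prefix are taken from the AzCopy command later in this article.

import boto3

# boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
s3 = boto3.client("s3", region_name="ap-south-1")

# List a few objects under the source prefix to confirm access
resp = s3.list_objects_v2(Bucket="azuremstnks", Prefix="weather/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])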
Step 2: Authenticate to the Azure Data Lake storage account. For this we generate a SAS (shared access signature) token on the storage account that hosts the data lake and embed it in the URL of the container we want to copy the data into.
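The SAS token can be generated from the Azure portal; as an alternative, the sketch below shows roughly how one could be produced with the azure-storage-blob Python SDK. The storage account and container names are the ones used in the AzCopy command below, while the account key is a placeholder you would substitute yourself.

from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

# Storage account and destination container (account key is a placeholder)
account_name = "mohahdi1hdistorage"
container_name = "mohahdi1-2021-11-05t06-51-24-789z"
account_key = "<storage account key>"

# SAS valid for 24 hours, with enough permissions for AzCopy to create blobs
sas_token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True, write=True, list=True, create=True, add=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=24),
)
print(sas_token)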
Step 3: Initiate the data transfer via AzCopy:
azcopy copy "https://azuremstnks.s3.ap-south-1.amazonaws.com/weather/" "https://mohahdi1hdistorage.blob.core.windows.net/mohahdi1-2021-11-05t06-51-24-789z?<SAS token goes here>" --recursive
INFO: Scanning...
INFO: Authenticating to source using S3AccessKey
INFO: Failed to create one or more destination container(s). Your transfers may still succeed if the container already exists.
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support
Job ef9b6cef-1514-f543-7e9b-6554c027533c has started
Log file is located at: C:\Users\Mohanish\.azcopy\ef9b6cef-1514-f543-7e9b-6554c027533c.log
99.6 %, 115 Done, 0 Failed, 1 Pending, 0 Skipped, 116 Total,
Job ef9b6cef-1514-f543-7e9b-6554c027533c summary
Elapsed Time (Minutes): 0.2673
Number of File Transfers: 116
Number of Folder Property Transfers: 0
Total Number of Transfers: 116
Number of Transfers Completed: 116
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 2314043185
Final Job Status: Completed
For Azure HDInsight, the actual HDFS-compatible storage is the storage account associated with the cluster at creation time, so you can read a Parquet file from the storage account container directly in the cluster, as shown below:
# Path of the Parquet data copied into the cluster's default (WASB) storage
parquet_name = 'wasb:///weather/weather'

# Query the Parquet file directly with Spark SQL
query = """SELECT station, measurement, year
FROM parquet.`%s.parquet`
WHERE measurement = "PRCP" """ % parquet_name
print(query)

df2 = sqlContext.sql(query)
# print('number of rows =', df2.count())
df2.show(5)
Finally, we convert the Spark DataFrame to a pandas DataFrame, which is easier to work with and resides on the head node. Note that to_Pandas is no longer valid; the correct method is toPandas().
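A minimal sketch of that conversion, continuing from the df2 DataFrame above (the variable name pdf is just an example):

# Collect the filtered Spark DataFrame to the head node as a pandas DataFrame
pdf = df2.toPandas()
print(pdf.shape)
print(pdf.head())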