Batch data ingestion on AWS

Creating a batch data ingestion solution on AWS

 

In this blog we will build a batch data transfer solution on AWS that moves data stored locally on a laptop into a data lake (S3) hosted on AWS.

Batch data ingestion is ideal for scenarios where the data has already been produced and is sitting in another location, ready to be moved in bulk.

AWS components used:

1. S3 bucket
2. AWS Transfer Family server (SFTP)
3. SFTP client on the local machine (FileZilla)
4. IAM role with the required permissions to ingest files into S3
5. IAM role assigned to the SFTP server created in step 2

 

Creating the SFTP server:

1. From the AWS console, select AWS Transfer Family.
2. Select Create server.
3. In the protocols window, select SFTP and click Next.
4. In the identity provider window, select Service managed and click Next.
5. In the endpoint configuration window, select Publicly accessible.
6. In the domain window, select Amazon S3.
7. In the CloudWatch logging window, select Create a new role and click Next.
8. Verify the details and click Create server.
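If you prefer scripting over clicking through the console, the same server can be created with the AWS CLI. A minimal sketch mirroring the console choices above (SFTP protocol, service-managed identity, public endpoint, S3 domain):

aws transfer create-server \
    --protocols SFTP \
    --identity-provider-type SERVICE_MANAGED \
    --endpoint-type PUBLIC \
    --domain S3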

Create an S3 bucket (to ingest the data into)

1. Select S3 from the list of services in the AWS console.
2. Create an S3 bucket in the respective region.
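For reference, the CLI equivalent, assuming the mstnkssftp bucket name used in the IAM policy below and a placeholder region:

aws s3 mb s3://mstnkssftp --region us-east-1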

Create an IAM role to give list access to the bucket created above

1. Select IAM from the list of services in the AWS console.
2. Go to the Roles tab.
3. Click Create role.
4. Select Transfer as the use case and continue to the permissions step.
5. Click Create policy.
6. Select the JSON tab and enter the policy below.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListingOfUserFolder",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::mstnkssftp"
            ]
        }
    ]
}
7. Click Create policy, then attach the new policy to the role.
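Note that s3:ListBucket and s3:GetBucketLocation only let the user browse the bucket; to actually ingest files, the role also needs object-level permissions on the bucket's contents. A sketch of an additional statement (reusing the mstnkssftp bucket name from above; trim the actions to what your use case needs):

        {
            "Sid": "AllowObjectAccessInBucket",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:GetObjectVersion"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::mstnkssftp/*"
            ]
        }

Add this statement to the "Statement" array of the policy above; note the /* suffix, since object actions apply to the keys inside the bucket rather than to the bucket itself.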

Create an SFTP folder on your local machine which you will use to stage the batch data for ingestion into AWS

1. mkdir test

2. cd test

Create a key pair on your local machine to be able to use SFTP

1. Run ssh-keygen.

2. Hit Enter to accept the default values.

3. The key pair is written to the chosen location (~/.ssh by default).

4. Copy the contents of the public key (the .pub file).
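For reference, a sketch of the same flow with an explicit file name (sftp_key is illustrative; accepting the defaults instead writes the pair to ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub):

# generate a 4096-bit RSA key pair in the current directory
ssh-keygen -t rsa -b 4096 -f ./sftp_key
# print the public key so it can be pasted into the AWS console
cat ./sftp_key.pub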


Create a user for the SFTP server:

1. Click on the SFTP server created above.

2. Under the Users section, click Add user.

3. Enter the username and select the IAM role created earlier so the user can access the bucket.
4. In the SSH public keys section, paste the public key we copied in the step above.
5. Click Add.
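The same user can also be created from the CLI; the server ID, account ID, role name, and username below are placeholders:

aws transfer create-user \
    --server-id s-1234567890abcdef0 \
    --user-name batchuser \
    --role arn:aws:iam::123456789012:role/sftp-s3-role \
    --ssh-public-key-body "$(cat ./sftp_key.pub)"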

Download the FileZilla client on your local machine.

Create a new connection in FileZilla pointing to the endpoint of the SFTP server created in AWS, using the username created above and the private key file for authentication.

Start uploading your batch data.
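If you would rather skip FileZilla, the stock sftp client works too; the endpoint below follows the standard Transfer Family format, with a placeholder server ID and region:

# connect using the private key generated earlier
sftp -i ./sftp_key batchuser@s-1234567890abcdef0.server.transfer.us-east-1.amazonaws.com
# at the sftp> prompt, upload the batch files
put test/*.csv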



 

