Spiderless, Web Spider on Serverless

A web spider / scraper / website change detector built with Lambda, API Gateway, DynamoDB and SNS

Web spider on Serverless!

About Spiderless

Spiderless is the backend layer of KMPPP, a web-spider-as-a-service application that lets you monitor nearly anything on the web and get notified when it changes. It is built on top of these technologies:

| Technology | Used For |
| ------------- | ------------- |
| Bulma, Buefy | UI |
| Vue.js | Front-end logic |
| AWS S3 | Website hosting |
| AWS Lambda | Backend API |
| AWS SNS | Message queue |
| AWS DynamoDB | Database |
| AWS API Gateway | API gateway |
| AWS CloudFront | CDN |
| AWS Route 53 | DNS |

Architecture

[Figure: serverless application architecture]

API Endpoints

GET subscriptions

Description

Get a list of subscriptions. The response holds at most 1 MB of data, a limit imposed by DynamoDB.

Parameters

None

Request

curl /api/subscriptions

Response

[
  {
    "createdAt": 1544833435070,
    "targets": [
      {
        "selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span",
        "label":"ratingCount"
      }
    ],
    "id": "b4d98de0-ffff-11e8-a4c9-9b9ee9089058",
    "url": "https://www.imdb.com/title/tt0111161/",
    "interval": 60
  }
]

POST subscriptions

Description

Create a new subscription to feed the spider.

Parameters

  • url (required) - Target website URL
  • targets (required) - List of targets, each with a label and a CSS selector whose text content is extracted
  • interval (required) - Interval (in minutes) between scrapes
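
The parameter rules above can be checked before a subscription is stored. The following is a minimal sketch, not the actual Spiderless handler: `validateSubscription` is a hypothetical helper, and it assumes that `targets` may arrive either as an array or as a JSON-encoded string, and that `interval` may arrive as a numeric string.

```javascript
// Hypothetical helper: validates and normalizes a POST /api/subscriptions body.
// Returns { ok: true, value } on success or { ok: false, errors } on failure.
function validateSubscription(body) {
  const errors = [];
  const input = body || {};

  // url: required, must parse as an http(s) URL.
  let url = null;
  try {
    const parsed = new URL(String(input.url));
    if (parsed.protocol === "http:" || parsed.protocol === "https:") {
      url = String(input.url);
    }
  } catch (e) {
    // Not parseable as a URL; reported below.
  }
  if (url === null) errors.push("url must be a valid http(s) URL");

  // targets: required, a non-empty list of { label, selector } objects.
  // Assumption: a JSON-encoded string is accepted and decoded first.
  let targets = input.targets;
  if (typeof targets === "string") {
    try {
      targets = JSON.parse(targets);
    } catch (e) {
      targets = null;
    }
  }
  const targetsOk =
    Array.isArray(targets) &&
    targets.length > 0 &&
    targets.every(
      (t) => t && typeof t.label === "string" && typeof t.selector === "string"
    );
  if (!targetsOk) {
    errors.push("targets must be a non-empty list of {label, selector}");
  }

  // interval: required, a positive whole number of minutes.
  const interval = Number(input.interval);
  if (!Number.isInteger(interval) || interval <= 0) {
    errors.push("interval must be a positive integer (minutes)");
  }

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: { url, targets, interval } };
}
```

On success the normalized `value` holds `targets` as an array and `interval` as a number, which matches the shape of the response shown for this endpoint.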

Request

curl -X POST /api/subscriptions -d '{"url":"https://www.imdb.com/title/tt0111161/","targets":"[{\"label\":\"ratingCount\",\"selector\":\"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span\"}]","interval":"60"}' -H "Content-Type: application/json"

Response

{
  "id": "ef417d30-ffff-11e8-a4c9-9b9ee9089058",
  "url": "https://www.imdb.com/title/tt0111161/",
  "targets": [
    {
      "label":"ratingCount",
      "selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span"
    }
  ],
  "interval": 60,
  "createdAt": 1544833533059,
  "updatedAt": 1544833533059
}

DELETE subscriptions

Description

Delete a subscription.

Parameters

  • id (required) - Subscription id

Request

curl -X DELETE /api/subscriptions/:id

Response

{
  "id": "d72c05d0-ffff-11e8-a4c9-9b9ee9089058"
}
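
The three endpoints above can be wrapped in a small client. This is a sketch under assumptions: `createClient` and its method names are hypothetical, `baseUrl` stands for wherever the API is deployed (the curl examples omit the host), and `fetchImpl` is any function compatible with the standard `fetch` API, which is global in Node.js 18 and later. Error handling is limited to rejecting on a non-2xx status.

```javascript
// Hypothetical client for the subscriptions endpoints.
// baseUrl: API origin, for example the local server started by `yarn start`.
// fetchImpl: a fetch-compatible function (defaults to the global fetch).
function createClient(baseUrl, fetchImpl = globalThis.fetch) {
  // Strip trailing slashes so the path is joined with exactly one "/".
  const endpoint = baseUrl.replace(/\/+$/, "") + "/api/subscriptions";

  // Sends one request and parses the JSON body; rejects on non-2xx status.
  const request = (url, options) =>
    fetchImpl(url, options).then((res) => {
      if (!res.ok) throw new Error("Request failed with status " + res.status);
      return res.json();
    });

  return {
    // GET /api/subscriptions
    list: () => request(endpoint, { method: "GET" }),
    // POST /api/subscriptions with a JSON body
    create: (subscription) =>
      request(endpoint, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(subscription),
      }),
    // DELETE /api/subscriptions/:id
    remove: (id) =>
      request(endpoint + "/" + encodeURIComponent(id), { method: "DELETE" }),
  };
}
```

Against the local server, `createClient("http://localhost:8090").list()` would resolve to the array shown in the GET response, assuming the server is running.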

Functions List

scrape

Description

Scrape target websites and extract target contents.

Invoke

yarn invoke:local scrape -d '{"createdAt":1544833435070,"updatedAt":1544833435070,"targets":[{"selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span","label":"ratingCount"}],"id":"b4d98de0-ffff-11e8-a4c9-9b9ee9089058","url":"https://www.imdb.com/title/tt0111161/","interval":60}'

Response

[
  {
    "label": "ratingCount",
    "content": "2,025,796"
  }
]
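
The input and output shapes above suggest a mapping from each target to a `{label, content}` pair. The following is a minimal sketch of that mapping only, not the actual scrape function. It assumes a hypothetical `queryText(selector)` callback that returns the text of the matched element; in practice this would be backed by an HTML parser, which the sketch leaves out. How Spiderless treats a selector that matches nothing is not documented here, so the sketch returns `null` content in that case.

```javascript
// Hypothetical helper: turns a subscription's targets into scrape results.
// queryText(selector) should return the matched element's text, or
// null/undefined when nothing matches.
function extractTargets(targets, queryText) {
  return targets.map(({ label, selector }) => {
    const text = queryText(selector);
    return {
      label,
      // Assumption: surrounding whitespace is trimmed; null means "no match".
      content: typeof text === "string" ? text.trim() : null,
    };
  });
}

// Usage with a stubbed lookup table instead of a real page:
const page = new Map([["#rating > span", " 2,025,796 "]]);
const results = extractTargets(
  [{ label: "ratingCount", selector: "#rating > span" }],
  (selector) => page.get(selector)
);
// results is [{ label: "ratingCount", content: "2,025,796" }]
```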

cron

Description

Fetch subscriptions from the database and select the ones that are due to be executed.

Invoke

yarn invoke:local cron

Response

None
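
One way to select the subscriptions that are due is to compare the time since each subscription's last run with its `interval`. This README does not say how Spiderless tracks the last run, so the sketch below assumes a hypothetical `lastRunAt` field (epoch milliseconds) and treats subscriptions that have never run as due. `selectDue` is likewise a hypothetical name.

```javascript
const MS_PER_MINUTE = 60 * 1000;

// Hypothetical helper: returns the subscriptions whose interval has elapsed.
// subscriptions: items with `interval` in minutes and an optional `lastRunAt`.
// now: current time in epoch milliseconds.
function selectDue(subscriptions, now) {
  return subscriptions.filter((sub) => {
    // Assumption: lastRunAt is absent until the first scrape has happened.
    if (typeof sub.lastRunAt !== "number") return true;
    return now - sub.lastRunAt >= sub.interval * MS_PER_MINUTE;
  });
}
```

Passing `now` as an argument, rather than reading the clock inside the function, keeps the selection deterministic and easy to test.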

Development

# install dependencies
yarn install

# start API server on port 8090
yarn start

# invoke function locally
yarn invoke:local function_name

# invoke remote function
yarn invoke function_name

Deploy

# first set up your AWS credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
yarn deploy