Oncrawl

The Oncrawl plugin for Dataiku provides a recipe that lets you easily export URLs or aggregated data from crawls and log monitoring events.

Plugin information

Version: 1.0.0
Author: Oncrawl
Released: 2021-06
Last updated: 2021-06
License: Apache Software License
Source code: GitHub
Reporting issues: GitHub

Description

Export data from the Oncrawl platform to bring data science into your SEO management: use machine learning to detect anomalies when crawling your website, predict how your traffic will evolve, or anticipate the behavior of search engine crawlers, among many other projects to improve your SEO.

How to use

Prerequisites

  • API access in your subscription.
  • Live crawls or logs.

Create a new dataset using the plugin.

  1. Install the plugin in your Dataiku DSS instance.
  2. In your DSS flow, create an “Oncrawl – Data queries” recipe.
  3. Set an API access token.
  4. Then adjust the default configuration as needed:
    • Choose the project source: all projects or a specific one
    • Choose the data kind: pages, links or logs
    • Define a crawl or logs timeframe
  5. If you selected pages or links:
    • Choose a crawl config.
      An empty crawl config list means that no crawl is available for the selected project or timeframe. In that case, adjust these parameters.
    • Choose the crawl source: all crawls, a specific one, or the last one within the selected date range
  6. Finally, choose the output kind: aggregations or URL export
    • Aggregations: edit the JSON object to define your own aggregations by writing one or more OQL queries in the array of aggregate queries. The OQL syntax for aggregations is described in the Oncrawl documentation.
    • URL export: edit the JSON object to write your own OQL query, which filters the output URL list. The OQL syntax for URL export is described in the Oncrawl documentation. Optionally, you can list the specific fields you want to export instead of exporting all available fields. For example, if you are only interested in titles, your query could look like this:
      {
        "oql": {
          "field": [
            "fetched",
            "equals",
            "true"
          ]
        },
        "fields": [
          "title",
          "title_evaluation"
        ]
      }

      A list of available fields is provided in the Oncrawl documentation. With the query above, the recipe stores the columns “title” and “title_evaluation” in the output dataset:

        Output dataset only with the fields title and title_evaluation
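
The URL export object from step 6 can also be generated rather than typed by hand. The following is a minimal Python sketch, not part of the plugin: the helper names build_url_export_query and is_valid_json are hypothetical, and the code only reproduces the object shape shown in the example above (one OQL condition plus an optional list of fields). It also checks that the text is valid JSON, because typographic quotes picked up when copying from a web page are rejected by JSON parsers.

```python
import json


def build_url_export_query(field, operator, value, fields=None):
    """Return a URL export object: one OQL condition plus an optional field list.

    Hypothetical helper; it mirrors the shape of the example in this page only.
    """
    query = {"oql": {"field": [field, operator, value]}}
    if fields:
        # Restrict the export to the listed columns instead of all fields.
        query["fields"] = list(fields)
    return query


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Same query as the example above: fetched pages, title columns only.
query = build_url_export_query(
    "fetched", "equals", "true", fields=["title", "title_evaluation"]
)
recipe_json = json.dumps(query, indent=2)
print(recipe_json)  # text to paste into the recipe's JSON field
```

Omitting the fields argument produces an object with only the "oql" key, which corresponds to exporting all available fields.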
