Extract Data
Description
This Activity extracts data across multiple web pages, filters it, and transforms it into the specified format.
Properties
Input
- Delay (in MS) – Specifies the delay in milliseconds to wait after the current Activity completes and before the next Activity begins its operations. This wait gives the next page enough time to load.
- Max Records – Specifies the maximum number of records to extract. Set the value to -1 to fetch all the records.
- Selector – Specifies the element selectors in JSON format using a specific query syntax. Encode any XPath embedded in the JSON if it contains special characters. For more information, refer to XPath Encoding.
- Web Page – It specifies the webpage object of the currently opened web page.
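As a minimal illustration of the encoding step for the Selector property, the Python sketch below lets a JSON serializer escape an XPath that contains double quotes. It assumes that the special characters in question include at least those that JSON itself reserves (double quotes and backslashes); the XPath Encoding page may describe further rules. The XPath and the column definition are made up for this example.

```python
import json

# Hypothetical XPath that uses double quotes, which JSON reserves.
xpath = '//div[@class="result"]//h3'

# json.dumps escapes the embedded quotes so the value stays a valid JSON string.
selector_json = json.dumps({"name": "Title", "selector": xpath, "fetchUrl": False})
print(selector_json)

# Decoding returns the original XPath unchanged.
decoded = json.loads(selector_json)["selector"]
```

Writing the XPath with single quotes, as in //div[@class='result'], avoids the need for escaping in the first place.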
Misc
- DisplayName – Add a display name to your Activity.
- Private – By default, the Activity logs the values of your properties inside your workflow. If Private is selected, it stops logging them.
Optional
- Continue On Error – Specifies whether the automation should continue even when the Activity throws an error. If True, the Activity continues without throwing any exceptions. If False, the Activity throws an exception. The default value is False.
If this Activity is inside a Try-Catch block and the value of this property is True, the Catch block does not catch any error.
- Timeout – An argument of type TimeSpan that specifies how long the Activity may run before it raises an error and aborts the workflow. The default Timeout value is 10 minutes.
Output
- Result – Represents the output data table containing the extracted information from the webpage.
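How the Delay and Max Records inputs might interact across pages can be sketched in Python. This is not the Activity's implementation: it assumes the delay is applied before each following page is read and that the record cap counts rows across all pages. The collect_records function and the pages list are hypothetical stand-ins for the Activity and the website.

```python
import time


def collect_records(pages, max_records=-1, delay_ms=0):
    """Gather rows page by page until the cap is reached or the pages run out.

    `pages` is a hypothetical stand-in for the site: each item is the list
    of rows found on one page.
    """
    records = []
    for index, page_rows in enumerate(pages):
        if max_records != -1 and len(records) >= max_records:
            break  # cap reached, no need to load another page
        if index > 0:
            # Give the next page time to load before reading it.
            time.sleep(delay_ms / 1000)
        if max_records == -1:
            records.extend(page_rows)  # -1 means "fetch everything"
        else:
            remaining = max_records - len(records)
            records.extend(page_rows[:remaining])
    return records


pages = [["row 1", "row 2"], ["row 3", "row 4"], ["row 5"]]
first_three = collect_records(pages, max_records=3, delay_ms=10)
everything = collect_records(pages, max_records=-1, delay_ms=10)
```

With a cap of 3, the sketch stops partway through the second page; with -1, it reads all three pages.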
Example
Download Example
Selector JSON format
This Activity takes the selectors in a JSON format. You can download the template JSON file from here:
Download Selector JSON Template
Several properties in the JSON must be provided correctly for our Buddy to extract the required data. The table below describes the properties inside the JSON template.
Property | Description |
---|---|
rowSelector | Specifies the XPath that selects each row of data in the result data set. |
columnsSelector | Specifies one or more column selection criteria. For each column, you must also specify the name, selector, and fetchUrl properties. |
nextLinkSelector | Specifies the XPath used to navigate to the next page or the next set of results. |
The example uses https://www.google.co.in to search for specific text and then extracts the 'Title' and 'Link' data from the search results. The results might contain one or more rows of data, with each row containing multiple columns, and might span multiple pages. The JSON used in the example is shown below.
{
  "rowSelector": "//*[@id='rso']/div",
  "columnsSelector": [
    {
      "name": "Title",
      "selector": ".//h3",
      "fetchUrl": false
    },
    {
      "name": "Link",
      "selector": ".//a",
      "fetchUrl": true
    }
  ],
  "nextLinkSelector": "//*[@id='pnnext']/span[2]"
}
Here is a summary of the components in the JSON above that are used for data extraction:
- rowSelector: //*[@id='rso']/div. This XPath selects every div node that is a direct child of the element whose id is 'rso'. Each matching div is one row of data.
- columnsSelector: An array of column definitions used to extract the Title (.//h3) and Link (.//a) data from the elements within each row selected above.
  - name: The column name you see in the data grid, for example, Title.
  - selector: The XPath for the column, relative to the row element. Here, .//h3 and .//a.
  - fetchUrl: Set it to true to fetch the href value.
- nextLinkSelector: //*[@id='pnnext']/span[2]. The Activity uses this XPath to locate the element that navigates to the next page.
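The way these three selectors cooperate can be approximated outside the product. The sketch below is not the Activity's implementation; it applies the same JSON to a small hand-written page using Python's standard xml.etree.ElementTree module, which supports only a subset of XPath. The sample markup, the relative and extract_rows helpers, and the rule that a column returns the element's text when fetchUrl is false are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

SELECTORS = json.loads("""
{
  "rowSelector": "//*[@id='rso']/div",
  "columnsSelector": [
    {"name": "Title", "selector": ".//h3", "fetchUrl": false},
    {"name": "Link", "selector": ".//a", "fetchUrl": true}
  ],
  "nextLinkSelector": "//*[@id='pnnext']/span[2]"
}
""")

# Hand-written, well-formed stand-in for a results page (not real search markup).
SAMPLE_PAGE = """
<html><body>
  <div id="rso">
    <div><a href="https://example.com/one"><h3>First result</h3></a></div>
    <div><a href="https://example.com/two"><h3>Second result</h3></a></div>
  </div>
  <a id="pnnext" href="/page2"><span>icon</span><span>Next</span></a>
</body></html>
""".strip()


def relative(xpath):
    # ElementTree rejects absolute paths, so anchor "//..." at the root element.
    return "." + xpath if xpath.startswith("/") else xpath


def extract_rows(root, selectors):
    rows = []
    # rowSelector yields one element per row of data.
    for row in root.findall(relative(selectors["rowSelector"])):
        record = {}
        # Each column selector is evaluated relative to the current row.
        for column in selectors["columnsSelector"]:
            node = row.find(relative(column["selector"]))
            if node is None:
                value = None
            elif column["fetchUrl"]:
                value = node.get("href")  # link target
            else:
                value = "".join(node.itertext()).strip()  # element text (assumed)
            record[column["name"]] = value
        rows.append(record)
    return rows


root = ET.fromstring(SAMPLE_PAGE)
rows = extract_rows(root, SELECTORS)
# nextLinkSelector only needs to locate an element; finding one means more pages exist.
has_next_page = root.find(relative(SELECTORS["nextLinkSelector"])) is not None

for record in rows:
    print(record)
print("next page available:", has_next_page)
```

Each dictionary in rows corresponds to one row of the Result data table, with one key per configured column name.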