
HTML Functions#

CONTENTS_BY_TAG_NAME#

Syntax#

CONTENTS_BY_TAG_NAME(<string>;<tag name>)

Description#

Returns the content of all elements with the specified tag name as a list.

Example

Download the example file: HTML_File_Example.html

Given the following excerpt from the HTML file:

<tfoot>
    <tr>
        <td>
            <i>affected:<br>4 Million People</i>
        </td>
        <td>
            <i>affected:<br>2 Million People</i>
        </td>
        <td>
            <i>affected:<br>1 Million People</i>
        </td>
    </tr>
</tfoot>
<tbody>
    <tr>
        <td>New York</td>
        <td>San Francisco</td>
        <td>Atlanta</td>
    </tr>
    <tr>
        <td>Bread</td>
        <td>Biscuits</td>
        <td>Rolls</td>
    </tr>
    <tr>
        <td>Sandwich</td>
        <td>Soup</td>
        <td>Salad</td>
    </tr>
</tbody>

To return the content of the table data tag <td>, select the column with your HTML data as the HTML argument and use td as the tag name argument.

The result is all of the content in the <td> tag displayed as a list:

[affected: 4 Million People, affected: 2 Million People, affected: 1 Million People, New York, San Francisco, Atlanta, Bread, Biscuits, Rolls, Sandwich, Soup, Salad]
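The behavior described above can be approximated outside the product. The following is a minimal Python sketch using only the standard library; the class `TagContentCollector` and the helper `contents_by_tag_name` are hypothetical names chosen for illustration, and the sketch assumes matching elements are not nested inside each other. It is not the product's implementation.

```python
from html.parser import HTMLParser


class TagContentCollector(HTMLParser):
    """Collects the text content of every element with a given tag name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.current = None   # text pieces of the open matching element, or None
        self.contents = []    # finished text contents, in document order

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.current = []
        elif tag == "br" and self.current is not None:
            # Assumption: a line break is rendered as a single space.
            self.current.append(" ")

    def handle_endtag(self, tag):
        if tag == self.tag_name and self.current is not None:
            # Collapse runs of whitespace (indentation, newlines) into one space.
            self.contents.append(" ".join("".join(self.current).split()))
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)


def contents_by_tag_name(html, tag_name):
    parser = TagContentCollector(tag_name)
    parser.feed(html)
    parser.close()
    return parser.contents


sample = "<tr><td><i>affected:<br>4 Million People</i></td><td>New York</td></tr>"
print(contents_by_tag_name(sample, "td"))
```

Inner tags such as <i> are dropped and only their text is kept, which mirrors the list shown above.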

ELEMENTS_BY_SELECTOR_QUERY#

Syntax#

ELEMENTS_BY_SELECTOR_QUERY(<string [containing HTML elements]>;<selector query>)

Description#

Returns all elements that match the selector query as a list.

For more information on selector queries, see jsoup.org.

Example

Download the example file: HTML_File_Example.html

Given the following excerpt from the HTML file:

<table border="1" rules="groups">
    <thead>
        <tr>
            <th>Association 1</th>
            <th>Association 2</th>
            <th>Association 3</th>
        </tr>
    </thead>
    <tfoot>
        <tr>
            <td><i>affected:<br>4 Million People</i></td>
            <td><i>affected:<br>2 Million People</i></td>
            <td><i>affected:<br>1 Million People</i></td>
        </tr>
    </tfoot>
    <tbody>
        <tr>
            <td>New York</td>
            <td>San Francisco</td>
            <td>Atlanta</td>
        </tr>
        <tr>
            <td>Bread</td>
            <td>Biscuits</td>
            <td>Rolls</td>
        </tr>
        <tr>
            <td>Sandwich</td>
            <td>Soup</td>
            <td>Salad</td>
        </tr>
    </tbody>
</table>

The goal is to extract only the table data content that is located in the table body. Looking at the jsoup documentation on defining queries, a possible query to use is:

ancestor child: child elements that descend from ancestor

In this case, the ancestor is the table body <tbody> and the child is the table data <td>, so the query is:

tbody td

The results are the table data <td> elements that are located in the table body <tbody> tag:

[<td>New York</td>, <td>San Francisco</td>, <td>Atlanta</td>, <td>Bread</td>, <td>Biscuits</td>, <td>Rolls</td>, <td>Sandwich</td>, <td>Soup</td>, <td>Salad</td>]
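To make the ancestor child rule concrete, here is a rough Python sketch that handles only that one form of query with plain tag names. It does not implement the full jsoup selector syntax, and the names `DescendantSelector` and `elements_by_descendant_query` are hypothetical. It assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag and so are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DescendantSelector(HTMLParser):
    """Collects the markup of `child` elements nested inside an `ancestor`."""

    def __init__(self, ancestor, child):
        super().__init__()
        self.ancestor = ancestor
        self.child = child
        self.open_tags = []      # stack of currently open (non-void) tags
        self.capture = None      # markup pieces of the child being captured
        self.capture_depth = 0   # stack depth at which the captured child opened
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if self.capture is not None:
            self.capture.append(self.get_starttag_text())
        elif tag == self.child and self.ancestor in self.open_tags:
            self.capture = [self.get_starttag_text()]
            self.capture_depth = len(self.open_tags)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.open_tags:
            return
        # Pop up to and including the matching open tag.
        while self.open_tags.pop() != tag:
            pass
        if self.capture is not None:
            self.capture.append(f"</{tag}>")
            if len(self.open_tags) <= self.capture_depth:
                self.matches.append("".join(self.capture))
                self.capture = None

    def handle_data(self, data):
        if self.capture is not None:
            self.capture.append(data)


def elements_by_descendant_query(html, ancestor, child):
    parser = DescendantSelector(ancestor, child)
    parser.feed(html)
    parser.close()
    return parser.matches


sample = (
    "<table><tfoot><tr><td><i>x</i></td></tr></tfoot>"
    "<tbody><tr><td>New York</td><td>Atlanta</td></tr></tbody></table>"
)
print(elements_by_descendant_query(sample, "tbody", "td"))
```

The <td> in <tfoot> is skipped because no <tbody> is open when it starts, which is the same filtering the query tbody td performs in the example.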

ELEMENTS_BY_TAG_NAME#

Syntax#

ELEMENTS_BY_TAG_NAME(<string [containing HTML elements]>;<tag name>)

Description#

Returns all elements with a specified tag name as a list.

Example

Download the example file: HTML_File_Example.html

Given the following excerpt from the HTML file:

<body>
<h1>Affected People</h1>
    <table border="1" rules="groups">
        <thead>
            <tr>
                <th>Association 1</th>
                <th>Association 2</th>
                <th>Association 3</th>
            </tr>
        </thead>
        <tfoot>
            <tr>
                <td><i>affected:<br>4 Million People</i></td>
                <td><i>affected:<br>2 Million People</i></td>
                <td><i>affected:<br>1 Million People</i></td>
            </tr>
        </tfoot>
        <tbody>
            <tr>
                <td>New York</td>
                <td>San Francisco</td>
                <td>Atlanta</td>
            </tr>
            <tr>
                <td>Bread</td>
                <td>Biscuits</td>
                <td>Rolls</td>
            </tr>
            <tr>
                <td>Sandwich</td>
                <td>Soup</td>
                <td>Salad</td>
            </tr>
        </tbody>
    </table>
</body>

This example returns all of the italic <i> elements in the HTML.

Select the column with your HTML data as the HTML argument and use i as the tag name argument.

The result is all of the <i> elements displayed as a list:

[[<i>affected:<br />4 Million People</i>, <i>affected:<br />2 Million People</i>, <i>affected:<br />1 Million People</i>]]
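As an illustration of returning whole elements rather than only their text, the following Python sketch collects the raw markup of each matching element. The names `ElementCollector` and `elements_by_tag_name` are hypothetical. Unlike the result above, this sketch keeps the source markup unchanged, so <br> is not rewritten as <br />.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects the raw markup of every element with a given tag name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.depth = 0       # nesting depth of open matching elements
        self.pieces = []     # markup of the element currently being captured
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.depth += 1
        if self.depth:
            self.pieces.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> have no separate end tag.
        if self.depth:
            self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        self.pieces.append(f"</{tag}>")
        if tag == self.tag_name:
            self.depth -= 1
            if self.depth == 0:
                # Nested matches are returned as part of the outermost element.
                self.elements.append("".join(self.pieces))
                self.pieces = []

    def handle_data(self, data):
        if self.depth:
            self.pieces.append(data)


def elements_by_tag_name(html, tag_name):
    parser = ElementCollector(tag_name)
    parser.feed(html)
    parser.close()
    return parser.elements


sample = "<td><i>affected:<br>4 Million People</i></td><td>New York</td>"
print(elements_by_tag_name(sample, "i"))
```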

PROPERTY_VALUE_BY_TAG_NAME#

Syntax#

PROPERTY_VALUE_BY_TAG_NAME(<string [containing HTML elements]>;<tag name>;<property name>)

Description#

Returns the values of the specified property for all elements with the specified tag name as a list.

Example

Download the example file: HTML_File_Example.html

Given the following excerpt from the HTML file:

<h1>Affected People</h1>
<div id="div1">The cities in which we live
<div id="div2">The food we eat for dinner
<table border="1" rules="groups">
    <thead>
        <tr>
            <th>Association 1</th>
            <th>Association 2</th>
            <th>Association 3</th>
        </tr>
    </thead>
    <tfoot>
        <tr>
            <td><i>affected:<br>4 Million People</i></td>
            <td><i>affected:<br>2 Million People</i></td>
            <td><i>affected:<br>1 Million People</i></td>
        </tr>
   </tfoot>

This example returns the values assigned to a property of a tag, in this case the id values of the div elements.

Select the column with your HTML data as the HTML argument, use div as the tag name argument, and use id as the property name in which to search for the value.

The result is all of the values from the id property in the div tag:

[[div1, div2]]
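The lookup described here, a tag name plus a property (attribute) name, can be sketched in a few lines of Python. The names `PropertyCollector` and `property_value_by_tag_name` are hypothetical and the code is only an approximation of the behavior documented above.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collects the value of one property from every element with a given tag name."""

    def __init__(self, tag_name, property_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.property_name = property_name.lower()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this method.
        if tag == self.tag_name:
            for name, value in attrs:
                if name == self.property_name:
                    self.values.append(value)


def property_value_by_tag_name(html, tag_name, property_name):
    parser = PropertyCollector(tag_name, property_name)
    parser.feed(html)
    parser.close()
    return parser.values


sample = (
    '<div id="div1">The cities in which we live'
    '<div id="div2">The food we eat for dinner'
    '<table border="1" rules="groups"></table>'
)
print(property_value_by_tag_name(sample, "div", "id"))
```

Because only start tags are inspected, the unclosed <div> elements in the excerpt do not affect the result.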

REMOVE_ELEMENTS_BY_TAG_NAME#

Syntax#

REMOVE_ELEMENTS_BY_TAG_NAME(<string [containing HTML elements]>;<tag name>)

Description#

Removes the specified HTML elements (tags and content) from a string. This removes inner elements as well.

Example

Download the example file: HTML_File_Example.html

Given the following excerpt from the HTML file:

<head>
    <title>HTML Example</title>
</head>
<body>
    <h1>Affected People</h1>
    <div id="div1">The cities in which we live
    <div id="div2">The food we eat for dinner
    <table border="1" rules="groups">
        <thead>
            <tr>
                <th>Association 1</th>
                <th>Association 2</th>
                <th>Association 3</th>
            </tr>
        </thead>
        <tfoot>
            <tr>
                <td><i>affected:<br>4 Million People</i></td>
                <td><i>affected:<br>2 Million People</i></td>
                <td><i>affected:<br>1 Million People</i></td>
            </tr>
        </tfoot>
        <tbody>
            <tr>
                <td>New York</td>
                <td>San Francisco</td>
                <td>Atlanta</td>
            </tr>
            <tr>
                <td>Bread</td>
                <td>Biscuits</td>
                <td>Rolls</td>
            </tr>
            <tr>
                <td>Sandwich</td>
                <td>Soup</td>
                <td>Salad</td>
            </tr>
        </tbody>
    </table>
</body>

This example removes all table <table> elements from the HTML.

Select the column with your HTML data as the HTML argument and use table as the tag name argument.

The result is the HTML data without the <table> element:

<head>
    <title>HTML Example</title>
</head>
<body>
    <h1>Affected People</h1>
    <div id="div1"> The cities in which we live
    <div id="div2"> The food we eat for dinner
</body>
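The removal of an element together with its inner elements can be illustrated with the following Python sketch. The names `ElementRemover` and `remove_elements_by_tag_name` are hypothetical. As a simplification, the sketch drops comments and doctype declarations and decodes character entities in the text it keeps, which the product may handle differently.

```python
from html.parser import HTMLParser


class ElementRemover(HTMLParser):
    """Re-emits HTML while skipping every element with a given tag name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.skip_depth = 0   # > 0 while inside an element that is being removed
        self.kept = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.kept.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are kept or dropped as a single unit.
        if tag != self.tag_name and not self.skip_depth:
            self.kept.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag_name and self.skip_depth:
            self.skip_depth -= 1
        elif not self.skip_depth:
            self.kept.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.kept.append(data)


def remove_elements_by_tag_name(html, tag_name):
    parser = ElementRemover(tag_name)
    parser.feed(html)
    parser.close()
    return "".join(parser.kept)


sample = (
    "<body><h1>Affected People</h1>"
    '<table border="1"><tr><td><i>a<br>b</i></td></tr></table>'
    "<p>after</p></body>"
)
print(remove_elements_by_tag_name(sample, "table"))
```

Everything between the opening and closing <table> tags is skipped, including the nested <tr>, <td>, and <i> elements.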

REMOVE_HTML_TAGS#

Syntax#

REMOVE_HTML_TAGS(<string [containing HTML elements]>)

Description#

Removes all HTML tags from a string and returns the remaining content.

Example

Download the example file: HTML_File_Example.html

Given the following excerpt from the HTML file:

<tbody>
    <tr>
        <td>New York</td>
        <td>San Francisco</td>
        <td>Atlanta</td>
    </tr>
    <tr>
        <td>Bread</td>
        <td>Biscuits</td>
        <td>Rolls</td>
    </tr>
    <tr>
        <td>Sandwich</td>
        <td>Soup</td>
        <td>Salad</td>
    </tr>
</tbody>

This example removes all of the HTML tags and displays the content of the HTML object as a string.

Select the column with your HTML data as the HTML argument.

The result is the content as a string type with the HTML tags removed:

[New York San Francisco Atlanta Bread Biscuits Rolls Sandwich Soup Salad]
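A comparable tag-stripping step can be sketched in Python as follows. The names `TextExtractor` and `remove_html_tags` are hypothetical. The sketch makes one simplifying assumption: every tag boundary is treated as a word break, so text split by an inline tag inside a word would gain a space.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps only the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def remove_html_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with a space, then collapse all runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


sample = (
    "<tbody><tr><td>New York</td><td>San Francisco</td></tr>"
    "<tr><td>Bread</td><td>Biscuits</td></tr></tbody>"
)
print(remove_html_tags(sample))
```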