TEXT_DATASET

Load the 20 newsgroups sample dataset from scikit-learn.

The data is returned as a dataframe with two columns: "Text" contains the text of each post and "Label" contains its category.

Params:

subset : "train" | "test" | "all", default="train"
    Select the dataset to load: "train" for the training set, "test" for the test set, "all" for both.
categories : list of str
    Select the categories to load. By default, all categories are loaded. The list of all categories is: 'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'
remove_headers : bool, default=False
    Remove the headers from the data.
remove_footers : bool, default=False
    Remove the footers from the data.
remove_quotes : bool, default=False
    Remove the quotes from the data.

Returns:

out : DataFrame
Python Code
from flojoy import flojoy, DataFrame, Array
from sklearn.datasets import fetch_20newsgroups
from sklearn.utils import Bunch
import pandas as pd
from typing import cast, Literal, Optional


# TODO: Add more datasets to this node.
@flojoy
def TEXT_DATASET(
    subset: Literal["train", "test", "all"] = "train",
    categories: Optional[Array] = None,
    remove_headers: bool = False,
    remove_footers: bool = False,
    remove_quotes: bool = False,
) -> DataFrame:
    """Load the 20 newsgroups sample dataset from scikit-learn.

    The data is returned as a dataframe with two columns: "Text" contains the text of each post and "Label" contains its category.

    Parameters
    ----------
    subset : "train" | "test" | "all", default="train"
        Select the dataset to load: "train" for the training set, "test" for the test set, "all" for both.
    categories : list of str, optional
        Select the categories to load. By default, all categories are loaded.
        The list of all categories is:
        'alt.atheism',
        'comp.graphics',
        'comp.os.ms-windows.misc',
        'comp.sys.ibm.pc.hardware',
        'comp.sys.mac.hardware',
        'comp.windows.x',
        'misc.forsale',
        'rec.autos',
        'rec.motorcycles',
        'rec.sport.baseball',
        'rec.sport.hockey',
        'sci.crypt',
        'sci.electronics',
        'sci.med',
        'sci.space',
        'soc.religion.christian',
        'talk.politics.guns',
        'talk.politics.mideast',
        'talk.politics.misc',
        'talk.religion.misc'
    remove_headers : bool, default=False
        Remove the headers from the data.
    remove_footers : bool, default=False
        Remove the footers from the data.
    remove_quotes : bool, default=False
        Remove the quotes from the data.

    Returns
    -------
    DataFrame
        A dataframe with a "Text" column and a "Label" column.
    """

    # Collect the names of the post sections that scikit-learn should strip.
    to_remove = tuple(
        name
        for name, enabled in (
            ("headers", remove_headers),
            ("footers", remove_footers),
            ("quotes", remove_quotes),
        )
        if enabled
    )

    newsgroups = fetch_20newsgroups(
        subset=subset,
        categories=categories.unwrap() if categories else None,
        remove=to_remove,
    )

    newsgroups = cast(Bunch, newsgroups)
    data = newsgroups.data
    labels = [newsgroups.target_names[i] for i in newsgroups.target]

    df = pd.DataFrame({"Text": data, "Label": labels})
    return DataFrame(df=df)
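
The block does two small transformations around the scikit-learn call: it turns the three boolean flags into the `remove` tuple, and it maps each integer target to its category name. The sketch below illustrates both steps in plain Python without downloading anything. The sample posts and target values are made-up stand-ins for the fetched data, and `build_remove` is a hypothetical helper name, not part of Flojoy.

```python
# Stand-in for the data returned by fetch_20newsgroups (hypothetical sample values).
target_names = ["alt.atheism", "comp.graphics"]
data = ["first post", "second post", "third post"]
target = [1, 0, 1]  # integer indices into target_names


def build_remove(remove_headers=False, remove_footers=False, remove_quotes=False):
    """Translate the three boolean flags into the tuple passed as `remove`."""
    flags = (
        ("headers", remove_headers),
        ("footers", remove_footers),
        ("quotes", remove_quotes),
    )
    return tuple(name for name, enabled in flags if enabled)


# Map each integer target to its category name, as the block does for "Label".
labels = [target_names[i] for i in target]

# Pair every post with its label; the block stores these as the two columns.
rows = list(zip(data, labels))

print(build_remove(remove_headers=True, remove_quotes=True))
print(rows)
```

With no flags set, `build_remove()` returns an empty tuple, which leaves the posts untouched.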


Example


In this example, the TEXT_DATASET node loads the 20 newsgroups dataset. Only the training subset is selected, and only two categories are loaded: comp.graphics and alt.atheism.

The remove_headers, remove_footers, and remove_quotes parameters are also set to true, so the headers, footers, and quotes are removed from the data.
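
For readers who want to reproduce these settings outside the Flojoy app, the following is a rough sketch of the equivalent direct scikit-learn call. It assumes scikit-learn and pandas are installed; `EXAMPLE_KWARGS` and `load_example` are hypothetical names chosen for illustration, and the function returns a plain pandas dataframe rather than Flojoy's `DataFrame` container.

```python
# Settings from the example app, expressed as fetch_20newsgroups keyword arguments.
EXAMPLE_KWARGS = {
    "subset": "train",
    "categories": ["comp.graphics", "alt.atheism"],
    "remove": ("headers", "footers", "quotes"),
}


def load_example():
    """Fetch the example subset and return it as a pandas DataFrame."""
    # Imported inside the function so the settings above can be inspected
    # even in an environment where these packages are not installed.
    import pandas as pd
    from sklearn.datasets import fetch_20newsgroups

    newsgroups = fetch_20newsgroups(**EXAMPLE_KWARGS)
    # Convert integer targets to category names, mirroring the block's output.
    labels = [newsgroups.target_names[i] for i in newsgroups.target]
    return pd.DataFrame({"Text": newsgroups.data, "Label": labels})
```

Calling `load_example()` downloads the dataset the first time it runs if it is not already cached locally, so the first call needs network access.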