Absum - Abstractive Summarization for Data Augmentation

Imbalanced class distribution is a common problem in machine learning. Undersampling and oversampling are two methods of addressing this issue. A technique such as SMOTE can be effective for oversampling, although the problem becomes more difficult with multilabel datasets. MLSMOTE has been proposed, but the high-dimensional nature of the numerical vectors created from text can sometimes make other forms of data augmentation more appealing.

Absum is an NLP library that uses abstractive summarization to perform data augmentation in order to oversample under-represented classes in datasets. Recent developments in abstractive summarization make this approach well suited to producing realistic data for the augmentation process.

It uses the latest Huggingface T5 model by default, but it is designed in a modular way, allowing you to use any pre-trained or out-of-the-box Transformers model capable of abstractive summarization. Absum is format agnostic, expecting only a dataframe containing text and all features. It also uses multiprocessing to achieve optimal performance. Singular summarization calls are also possible.

Algorithm

Append counts, i.e. the number of rows to add for each feature, are first calculated with a ceiling threshold. Namely, if a given feature has 1000 rows and the ceiling is 100, its append count will be 0.

For each feature, it then completes a loop from a stored append index up to the append count specified for that feature. An abstractive summarization is calculated for a subset, of a specified size, of all rows that uniquely have the given feature. Each summarization is appended to a new dataframe with the respective features one-hot encoded.

If multiprocessing is set, each call to abstractive summarization is stored in a task array that is later passed to a sub-routine which runs the calls in parallel using the multiprocessing library, vastly reducing runtime.

Absum expects a DataFrame containing a text column, which defaults to 'text', with the remaining columns representing one-hot encoded features. If additional columns are present that you do not wish to be considered, you have the option to pass the specific one-hot encoded features to use as a comma-separated string in the 'features' parameter. All available parameters are detailed in the Parameters section below.

```python
import pandas as pd
from absum import Augmentor

csv = 'path_to_csv'
df = pd.read_csv(csv)

augmentor = Augmentor(df, text_column='review_text')
df_augmented = augmentor.abs_sum_augment()

# Store resulting dataframe as a csv
df_augmented.to_csv(csv.replace('.csv', '-augmented.csv'), encoding='utf-8', index=False)
```

Running singular summarization on any chunk of text is simple:

```python
from absum import Augmentor

text = chunk_of_text_to_summarize
augmentor = Augmentor(min_length=100, max_length=200)
output = augmentor.get_abstractive_summarization(text)
```

When running any summarizations you may see the following warning message, which can be ignored:

"Token indices sequence length is longer than the specified maximum sequence length for this model (2987 > 512). Running this sequence through the model will result in indexing errors"

For more information refer to this issue.

Parameters

df (:class: pandas.Dataframe, optional, defaults to None)
Dataframe containing text and one-hot encoded features.

text_column (:obj: string, optional, defaults to "text")
Name of the column containing the text to summarize.
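The ceiling-threshold calculation from the Algorithm section can be sketched as follows. This is a minimal illustration, not Absum's actual implementation: `get_append_counts` is a hypothetical helper, and "uniquely have the feature" is interpreted as a one-hot value of 1 in that column and 0 in all others.

```python
import pandas as pd

def get_append_counts(df, features, threshold=100):
    """Hypothetical sketch: how many rows to append per feature, capped by a
    ceiling threshold. Features already at or above the threshold get 0."""
    counts = {}
    for feature in features:
        others = [f for f in features if f != feature]
        # Rows that uniquely have this feature (1 here, 0 in every other feature)
        unique_rows = df[(df[feature] == 1) & (df[others].sum(axis=1) == 0)]
        # Append only up to the ceiling; 0 if the feature is already well represented
        counts[feature] = max(threshold - len(unique_rows), 0)
    return counts

df = pd.DataFrame({
    'text':  ['a', 'b', 'c', 'd'],
    'happy': [1, 1, 1, 0],
    'sad':   [0, 0, 0, 1],
})
print(get_append_counts(df, ['happy', 'sad'], threshold=3))  # {'happy': 0, 'sad': 2}
```

With a ceiling of 3, 'happy' already has three unique rows and gets an append count of 0, while 'sad' has one and gets 2 — mirroring the 1000-rows-versus-ceiling-of-100 example above.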
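The task-array-plus-parallel-subroutine pattern described above can be sketched with `multiprocessing.Pool`. This is a simplified illustration, not Absum's internals: `summarize` is a stand-in for the real abstractive-summarization call (it just truncates the text so the example is self-contained and fast).

```python
from multiprocessing import Pool

def summarize(text):
    # Stand-in for an abstractive-summarization call (e.g. a T5 model);
    # truncation substitutes for a real generated summary.
    return text[:20]

# Each pending summarization call is collected as a task...
tasks = [
    'a long review text about a great product',
    'another long review text about a poor product',
]

# ...then the whole array is handed to a worker pool and run in parallel
with Pool(processes=2) as pool:
    summaries = pool.map(summarize, tasks)

print(summaries)
```

Because each summarization is independent of the others, fanning the task array out across processes cuts runtime roughly in proportion to the number of workers.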