
# Datasets

## Terminology
"Cleaning the ingredients" means removing amounts, comments, etc. from the ingredient strings.
For example, "2 m.-große Tomate(n)" and "400 g Tomate(n) , in Stückchen, frisch
oder aus der Dose" are both simplified to the actual ingredient "Tomate".
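
A minimal sketch of what such a cleaning step could look like for the two examples above. This is not the project's actual cleaning code: the function name `clean_ingredient`, the unit list `UNITS`, and the size-adjective pattern are assumptions made for illustration only.

```python
import re

# Assumed, incomplete list of units; the real cleaning rules are not documented here.
UNITS = {"g", "kg", "ml", "l", "EL", "TL"}

def clean_ingredient(raw: str) -> str:
    """Reduce a raw ingredient line to the bare ingredient name (illustrative only)."""
    # Drop a leading amount such as "2", "400" or "1,5".
    text = re.sub(r"^\s*\d+(?:[.,/]\d+)?\s*", "", raw)
    # Everything after the first comma is treated as a comment.
    text = text.split(",", 1)[0]
    words = text.split()
    # Drop a leading unit ("g", "ml", ...).
    if words and words[0] in UNITS:
        words = words[1:]
    # Drop a leading size adjective such as "m.-große" (assumed pattern).
    if words and re.fullmatch(r"(?:kl|m|gr)\.-große[rsn]?", words[0]):
        words = words[1:]
    # Remove optional plural endings written in parentheses, e.g. "Tomate(n)".
    text = re.sub(r"\([^)]*\)", "", " ".join(words))
    return text.strip()

print(clean_ingredient("2 m.-große Tomate(n)"))
print(clean_ingredient("400 g Tomate(n) , in Stückchen, frisch oder aus der Dose"))
```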

"Cleaning the steps/instructions" means replacing the shortened form of the ingredient
(e.g. "Spargel") with the full, cleaned ingredient (e.g. "Spargel_grün").
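
The replacement can be pictured roughly as follows. The sketch assumes that the short form is the part of the cleaned ingredient before the underscore; the README does not state how short forms are actually matched, and `clean_step` is a hypothetical helper.

```python
import re

def clean_step(step: str, cleaned_ingredients: list[str]) -> str:
    """Replace short ingredient forms in one step with the full cleaned ingredient."""
    for full in cleaned_ingredients:
        # Assumption: "Spargel_grün" has the short form "Spargel".
        short = full.split("_", 1)[0]
        if short == full:
            continue  # nothing to expand
        # \b keeps already cleaned tokens ("Spargel_grün") untouched,
        # because "_" counts as a word character.
        step = re.sub(rf"\b{re.escape(short)}\b", lambda _m, full=full: full, step)
    return step

print(clean_step("Den Spargel schälen.", ["Spargel_grün", "Butter"]))
```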

## Full Datasets
A full dataset contains the complete recipes and additional information. Each recipe is structured as follows:

    Recipe URL (String)

        image: Image URL (String)

        name: Name of the recipe (String)

        quantity: Amount the recipe makes, e.g. number of portions (String)

        ingredients: List of ingredients the recipe calls for (List of Strings)

        instructions: the instructions on how to make the recipe (String or List of Strings)

        comments: List of user comments (Strings) about the recipe
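
A short sketch of how a file with this structure might be read. It assumes the top level of each file is a JSON object that maps the recipe URL to the fields listed above; the sample recipe and the helper names `load_dataset` and `instructions_as_list` are made up for illustration.

```python
import json

def load_dataset(path):
    """Read a dataset file into a dict keyed by recipe URL (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def instructions_as_list(recipe):
    """Return the instructions as a list, whether stored as one String or a list."""
    instructions = recipe.get("instructions", [])
    if isinstance(instructions, str):
        return [instructions] if instructions else []
    return list(instructions)

# Made-up sample in the documented structure (not taken from the real data).
sample = {
    "https://example.org/recipe/1": {
        "image": "https://example.org/recipe/1.jpg",
        "name": "Tomatensuppe",
        "quantity": "4 Portionen",
        "ingredients": ["Tomate", "Zwiebel"],
        "instructions": "Tomaten würfeln. Zwiebel anbraten.",
        "comments": ["Sehr lecker!"],
    }
}

for url, recipe in sample.items():
    print(url, recipe["name"], instructions_as_list(recipe))
```

With a real file, `sample` would be replaced by something like `load_dataset("dataset_fin.json")`.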

**dataset_parts**: Directory containing the dataset parts that were pulled from Chefkoch.

**dataset_fin.json**: Entire dataset (combined dataset parts) as retrieved from Chefkoch. The instructions of each recipe are a single String.

**dataset_test.json**: A test dataset that contains only a few recipes. It was used to test code,
because running code on the entire dataset often takes a while. Ingredients are cleaned.
Instructions are separated into sentences (List of Strings).

**dataset_cleaned_nice.json**: Entire dataset with cleaned ingredients. Instructions of each recipe are one String.

**dataset_sep_sentences.json**: Entire dataset with cleaned ingredients. Instructions are separated into sentences (List of Strings).

**dataset_cleaned_steps.json**: Entire dataset with cleaned ingredients and instructions. Instructions are separated into sentences (List of Strings).

**dataset_cleaned_steps_not_empty.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
separated into sentences (List of Strings). Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.
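
The removal described here was done by hand. A programmatic equivalent could look like the sketch below, which assumes the dataset is a dict keyed by recipe URL; `drop_empty_instructions` is a hypothetical helper, not part of the repository.

```python
def drop_empty_instructions(dataset):
    """Keep only recipes whose "instructions" entry is present and non-empty."""
    return {
        url: recipe
        for url, recipe in dataset.items()
        if recipe.get("instructions")
    }

# Made-up example data.
dataset = {
    "https://example.org/recipe/1": {"instructions": ["Tomate würfeln."]},
    "https://example.org/recipe/2": {"instructions": []},
}
print(list(drop_empty_instructions(dataset)))
```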

**full_dataset.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
separated into blocks of multiple sentences with up to 512 tokens. Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.

## Occurrences
### Cleaned Ingredients
In these datasets each cleaned ingredient is only listed once (key). The value of each key
is the number of occurrences of that ingredient in all recipe ingredient lists.

**all_ingredients_nice.json**: Contains all cleaned ingredients. Umlauts are written as such.

**mult_ingredients_nice.json**: Contains all cleaned ingredients that occurred more than 20 times
(the threshold can be changed) in all ingredient lists. Umlauts are written as such.

**mult_ingredients_sorted.json**: Contains all cleaned ingredients that occurred more than 20
times (the threshold can be changed) in all ingredient lists. Umlauts are written as such.
Ingredients are sorted by number of occurrences.
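
The three files above can be thought of as the output of a counting step like the following sketch. It assumes a dataset dict keyed by recipe URL with cleaned `ingredients` lists, and it sorts in descending order, which the README does not specify; the function names are illustrative.

```python
from collections import Counter

def count_ingredients(dataset):
    """Count how often each cleaned ingredient appears across all ingredient lists."""
    counts = Counter()
    for recipe in dataset.values():
        counts.update(recipe.get("ingredients", []))
    return counts

def frequent_ingredients(counts, threshold=20):
    """Keep ingredients occurring more than `threshold` times, most frequent first."""
    kept = {name: n for name, n in counts.items() if n > threshold}
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))

# Made-up example data; a threshold of 1 is used so the tiny sample shows an effect.
dataset = {
    "a": {"ingredients": ["Salz", "Tomate"]},
    "b": {"ingredients": ["Tomate", "Salz", "Mehl"]},
    "c": {"ingredients": ["Tomate"]},
}
counts = count_ingredients(dataset)
print(frequent_ingredients(counts, threshold=1))
```

When writing such a dict with `json.dump`, passing `ensure_ascii=False` keeps umlauts as written instead of escaping them.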

### Cleaned Steps
**cleaned_steps_occurrance.json**: Contains all cleaned ingredients and how many times each
ingredient occurs in the cleaned steps. This is important for training the model later on,
because the model cannot be trained well on ingredients that occur only a few times in the steps.
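
A rough sketch of how such counts could be produced. It assumes that instructions are stored as a List of Strings and that cleaned ingredients appear in the steps as single tokens such as "Spargel_grün"; `count_in_steps` is a hypothetical helper.

```python
import re
from collections import Counter

def count_in_steps(dataset, ingredients):
    """Count how often each cleaned ingredient appears as a token in the cleaned steps."""
    wanted = set(ingredients)
    # Start every ingredient at 0 so rare or missing ingredients stay visible.
    counts = Counter({name: 0 for name in wanted})
    for recipe in dataset.values():
        for sentence in recipe.get("instructions", []):
            # \w+ treats "_" and umlauts as word characters, so "Spargel_grün" is one token.
            for token in re.findall(r"\w+", sentence):
                if token in wanted:
                    counts[token] += 1
    return counts

# Made-up example data.
dataset = {
    "u": {"instructions": ["Den Spargel_grün schälen.", "Spargel_grün und Butter erhitzen."]}
}
step_counts = count_in_steps(dataset, ["Spargel_grün", "Butter", "Salz"])
print(dict(step_counts))
```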

## Ingredient sets
Each of these JSON files contains all ingredients that could belong to one category.
The files are: **bread.json**, **fish.json**, **meats.json**, **pasta.json**, **rolls.json**, **sausage.json**

## Instructions Only
**cleaned_sep_sentences.json**: contains only the instructions of each recipe. Each set of
instructions is separated into its sentences. Each step is cleaned. The structure is:

     Recipe URL (String)

        instructions: the instructions on how to make the recipe (List of Strings)

**cleaned_sep_sentences_not_empty.json**: Same as above, but with recipes that have no instructions removed.

**complete_dataset512.json**: The instruction sentences of a recipe are combined until the token count approaches 512. No [SEP]
tokens.

**complete_dataset_SEP512.json**: The instruction sentences of a recipe are combined until the token count approaches 512. [SEP]
tokens included.
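
The combining step can be sketched as a greedy packing loop. This is an illustration under assumptions, not the code used to build the files: tokens are counted by whitespace splitting as a stand-in for the real tokenizer, and `pack_sentences` is a hypothetical name.

```python
def pack_sentences(sentences, max_tokens=512, sep=None,
                   count_tokens=lambda s: len(s.split())):
    """Greedily combine sentences into blocks of at most `max_tokens` tokens.

    If `sep` is given (e.g. "[SEP]"), it is placed between sentences and
    counted as one token each. A single sentence longer than `max_tokens`
    still becomes its own (oversized) block.
    """
    joiner = f" {sep} " if sep else " "
    sep_cost = 1 if sep else 0
    blocks, current, used = [], [], 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        extra = cost + (sep_cost if current else 0)
        if current and used + extra > max_tokens:
            # The current block is full: close it and start a new one.
            blocks.append(joiner.join(current))
            current, used = [], 0
            extra = cost
        current.append(sentence)
        used += extra
    if current:
        blocks.append(joiner.join(current))
    return blocks

sentences = ["a b c", "d e", "f g h i"]
print(pack_sentences(sentences, max_tokens=5))
print(pack_sentences(sentences, max_tokens=6, sep="[SEP]"))
```

To match a real model's limit, `count_tokens` would need to be replaced by a call to that model's tokenizer.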

**model_datapoints.txt** and **model_datapoints_SEP.txt**: list of only the datapoints from **complete_dataset... .json**

**training_data.txt**: instruction datapoints from recipes set aside for training

**testing_data.txt**: instruction datapoints from recipes set aside for testing

## Other
**ground_truth.json**: ground truth used for evaluation

**synonyms.json**: synonyms of ingredients found in the ground truth