added to README files, added full dataset versions to data

2021-04-15 20:19:09 +02:00
parent cf40ad15fb
commit 1ea0677029
9 changed files with 61 additions and 543 deletions
--- a/data/README.md
+++ b/data/README.md
@@ -25,8 +25,6 @@ A full dataset contains the complete recipes and additional information. Each re

        comments: List of user comments (Strings) about the recipe

-**dataset_parts**: Directory contains the dataset parts that were pulled from chefkoch.
-
 **dataset_fin.json**: Entire dataset (combined dataset parts) as retrieved from chefkoch. Instructions of each recipe are one String.

 **dataset_test.json**: A test dataset that only contains a few recipes. This dataset was used to test code, 
@@ -42,8 +40,14 @@ Instructions are separated into sentences (List of Strings).
 **dataset_cleaned_steps_not_empty.json**: Entire dataset with cleaned ingredients and instructions. Instructions are 
 separated into sentences (List of Strings). Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.

-**full_dataset.json**: Entire dataset with cleaned ingredients and instructions. Instructions are 
-separated into blocks of multiple sentences with up to 512 tokens. Recipes without instructions removed manually by searching for recipes containing **instructions": []**.
+**full_dataset_vers1.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
+separated into blocks of multiple sentences with up to 512 tokens, separated by [SEP]. Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.
+
+**full_dataset_vers2.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
+separated into blocks of multiple sentences with up to 512 tokens, not separated by [SEP]. Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.
+
+**full_dataset_vers3.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
+separated into sentences. Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.

 ## Occurrences
 ### Cleaned Ingredients
@@ -64,7 +68,7 @@ Ingredients are sorted by number of occurrences.
 ingredient occurs in the cleaned steps. This is important for training the model later on,
 as training will not be good for ingredients that only occur a few times in the steps.

-## Ingredient sets
+## food_categories directory
 These json files contain all ingredients that could belong to the different categories.
 These files are: **bread.json**, **fish.json**, **meats.json**, **pasta.json**, **rolls.json**, **sausage.json**

@@ -78,18 +82,7 @@ instructions is separated into its sentences. Each step is cleaned. The structur

 **cleaned_sep_sentences_not_empty.json**: same as above, but removed recipes that don't have any instructions

-**complete_dataset512.json**: sentences of instructions of a recipe are combined until token amount nears 512. Not [SEP]
-tokens.
-
-**complete_dataset_SEP512.json**: sentences of instructions of a recipe are combined until token amount nears 512. [SEP]
-tokens included.
-
-**model_datapoints.txt** and **model_datapoints_SEP.txt**: list of only the datapoints from **complete_dataset... .json**
-
-**training_data.txt**: instruction datapoints from recipes set aside for training
-
-**testing_data.txt**: instruction datapoints from recipes set aside for testing
-
 ## Other
 **ground_truth.json**: ground truth used for evaluation
+
 **synonyms.json**: synonyms of ingredients found in ground truth
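
The difference between the three full dataset versions can be illustrated with a rough Python sketch. It is not the code that produced these files. It assumes that each recipe is a dict with an `instructions` list of sentences, that [SEP] is placed between the sentences of a block, and that whitespace splitting is an acceptable stand-in for the real tokenizer. The names `count_tokens`, `pack_sentences` and `drop_empty` are hypothetical.

```python
SEP = "[SEP]"
MAX_TOKENS = 512


def count_tokens(sentence):
    # Assumption: whitespace tokens approximate the real tokenizer's count.
    return len(sentence.split())


def pack_sentences(sentences, max_tokens=MAX_TOKENS, use_sep=True):
    """Greedily combine sentences into blocks of at most max_tokens tokens.

    use_sep=True joins sentences with [SEP] (as described for vers1),
    use_sep=False joins them with a plain space (as described for vers2).
    A single sentence longer than max_tokens still becomes its own block.
    """
    joiner = f" {SEP} " if use_sep else " "
    blocks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # A separator costs one extra token when appended to a non-empty block.
        extra = 1 if (use_sep and current) else 0
        if current and current_len + extra + n > max_tokens:
            blocks.append(joiner.join(current))
            current, current_len, extra = [], 0, 0
        current.append(sentence)
        current_len += extra + n
    if current:
        blocks.append(joiner.join(current))
    return blocks


def drop_empty(recipes):
    # Programmatic equivalent of searching for "instructions": [] by hand.
    return [r for r in recipes if r.get("instructions")]
```

For example, `pack_sentences(["a b c", "d e", "f g h i"], max_tokens=6)` yields `["a b c [SEP] d e", "f g h i"]`, while the same call with `use_sep=False` yields `["a b c d e", "f g h i"]`. Leaving the sentence list unpacked corresponds to the layout described for vers3.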