added to README files, added full dataset versions to data
@@ -25,8 +25,6 @@ A full dataset contains the complete recipes and additional information. Each re
comments: List of user comments (strings) about the recipe

**dataset_parts**: Directory containing the dataset parts that were pulled from chefkoch.

**dataset_fin.json**: Entire dataset (all dataset parts combined) as retrieved from chefkoch. The instructions of each recipe are a single string.

**dataset_test.json**: A test dataset that contains only a few recipes. This dataset was used to test code,
@@ -42,8 +40,14 @@ Instructions are separated into sentences (List of Strings).
**dataset_cleaned_steps_not_empty.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
separated into sentences (list of strings). Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.

**full_dataset.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
separated into blocks of multiple sentences with up to 512 tokens each. Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.
**full_dataset_vers1.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
separated into blocks of multiple sentences with up to 512 tokens each, separated by [SEP]. Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.

**full_dataset_vers2.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
separated into blocks of multiple sentences with up to 512 tokens each, not separated by [SEP]. Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.

**full_dataset_vers3.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
separated into sentences. Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.
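
The manual removal of recipes without instructions could also be scripted. The following is only a minimal sketch, not the procedure used for these files: it assumes each dataset file is a JSON list of recipe objects that carry an `instructions` list, and the function names `drop_recipes_without_instructions` and `filter_dataset_file` are hypothetical.

```python
import json


def drop_recipes_without_instructions(recipes):
    """Keep only recipes whose "instructions" list is non-empty.

    Assumption: `recipes` is a list of dicts, each with an "instructions"
    key holding a list of strings. A missing key is treated like an
    empty list.
    """
    return [recipe for recipe in recipes if recipe.get("instructions")]


def filter_dataset_file(src_path, dst_path):
    """Read a dataset file, drop empty recipes, and write the result.

    Returns the number of recipes that were removed.
    """
    with open(src_path, encoding="utf-8") as src:
        recipes = json.load(src)
    kept = drop_recipes_without_instructions(recipes)
    with open(dst_path, "w", encoding="utf-8") as dst:
        # ensure_ascii=False keeps German umlauts readable in the output.
        json.dump(kept, dst, ensure_ascii=False, indent=2)
    return len(recipes) - len(kept)
```

If the real files are keyed differently (for example a dictionary keyed by recipe ID), the list comprehension would need to iterate over the dictionary values instead.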

## Occurrences
### Cleaned Ingredients
@@ -64,7 +68,7 @@ Ingredients are sorted by number of occurrences.
ingredient occurs in the cleaned steps. This is important for training the model later on,
as the model will not train well on ingredients that occur only a few times in the steps.
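
To illustrate what such an occurrence count might look like, here is a rough sketch. It is not the script that produced the occurrence files: it assumes single-word ingredient names, steps that are already cleaned and can be split on whitespace, and case-insensitive matching. The function name `count_ingredient_occurrences` is hypothetical.

```python
from collections import Counter


def count_ingredient_occurrences(ingredients, steps):
    """Count how often each ingredient appears as a token in the steps.

    Assumptions: ingredients are single words, and steps are cleaned
    strings that can be tokenized by splitting on whitespace.
    Returns (ingredient, count) pairs sorted by descending count.
    """
    # Start every ingredient at zero so rare or unused ones stay visible.
    counts = Counter({ingredient: 0 for ingredient in ingredients})
    for step in steps:
        tokens = step.lower().split()
        for ingredient in ingredients:
            counts[ingredient] += tokens.count(ingredient.lower())
    return counts.most_common()
```

Ingredients near the end of the returned list are the ones with few occurrences in the steps.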

## Ingredient sets
## food_categories directory
These JSON files list all ingredients that could belong to each category.
These files are: **bread.json**, **fish.json**, **meats.json**, **pasta.json**, **rolls.json**, **sausage.json**
@@ -78,18 +82,7 @@ instructions is separated into its sentences. Each step is cleaned. The structur
**cleaned_sep_sentences_not_empty.json**: Same as above, but with recipes that have no instructions removed.

**complete_dataset512.json**: Sentences of a recipe's instructions are combined until the token count nears 512. No [SEP]
tokens.

**complete_dataset_SEP512.json**: Sentences of a recipe's instructions are combined until the token count nears 512. [SEP]
tokens included.
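
The combining step described for these two files can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that generated the files: tokens are counted by whitespace splitting unless another counter is passed in (the tokenizer actually used is not stated here), `[SEP]` is assumed to sit between sentences inside a block and to count as one token, and the function name `combine_sentences` is hypothetical.

```python
def combine_sentences(sentences, max_tokens=512, use_sep=False, count_tokens=None):
    """Greedily merge consecutive sentences into blocks of at most max_tokens.

    Assumptions: tokens are counted by whitespace splitting unless a
    `count_tokens` callable is supplied, and a "[SEP]" marker between two
    sentences costs one extra token. A single sentence longer than
    max_tokens is kept as its own oversized block.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    joiner = " [SEP] " if use_sep else " "

    blocks = []
    current = []        # sentences in the block being built
    current_len = 0     # token count of that block so far
    for sentence in sentences:
        n = count_tokens(sentence)
        extra = 1 if (use_sep and current) else 0
        if current and current_len + extra + n > max_tokens:
            # Adding this sentence would overflow: close the current block.
            blocks.append(joiner.join(current))
            current, current_len, extra = [], 0, 0
        current.append(sentence)
        current_len += extra + n
    if current:
        blocks.append(joiner.join(current))
    return blocks
```

Calling it with `use_sep=False` corresponds to the variant without `[SEP]` tokens, and `use_sep=True` to the variant that includes them.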

**model_datapoints.txt** and **model_datapoints_SEP.txt**: Lists of only the datapoints from **complete_dataset... .json**

**training_data.txt**: Instruction datapoints from recipes set aside for training

**testing_data.txt**: Instruction datapoints from recipes set aside for testing

## Other
**ground_truth.json**: Ground truth used for evaluation

**synonyms.json**: Synonyms of ingredients found in the ground truth