Vocab
To create the vocab.txt file, run make_new_vocab.py.
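The exact logic of make_new_vocab.py is not shown here; a common approach (used by the English FoodBERT) is to fill the [unusedN] placeholder slots of the original BERT vocabulary with ingredient tokens. A minimal sketch under that assumption (the function name is hypothetical):

```python
def build_food_vocab(base_vocab_lines, ingredients):
    """Return a new vocab list where [unusedN] slots are filled with
    ingredients that are not already in the base vocabulary."""
    known = set(base_vocab_lines)
    # Only ingredients missing from the base vocab need a slot.
    new_tokens = [ing for ing in ingredients if ing not in known]
    vocab = []
    for tok in base_vocab_lines:
        if tok.startswith("[unused") and new_tokens:
            vocab.append(new_tokens.pop(0))  # fill a free slot
        else:
            vocab.append(tok)
    return vocab
```

Writing the returned list to vocab.txt (one token per line) keeps the vocabulary size, and therefore the embedding matrix shape, identical to bert-base-german-cased.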
Prep dataset
prep_dataset_training: Formats and splits the dataset so it can be used for training. Adjust which dataset version to generate before running.
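The split itself can be sketched as follows. Since the training command below uses --line_by_line, each example is one line of text; the function name and the 90/10 split ratio are assumptions, not taken from prep_dataset_training:

```python
import random

def split_dataset(examples, test_fraction=0.1, seed=42):
    """Shuffle the examples and split them into (train, test) lists.

    A fixed seed keeps the split reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

The two resulting lists would then be written line by line to training_data.txt and testing_data.txt.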
train German FoodBERT
language_modeling
This was executed on Google Colab with the following parameters:
!python /content/drive/MyDrive/masterarbeit/language_modeling.py \
    --output_dir="/content/drive/MyDrive/masterarbeit/output" \
    --model_type=bert \
    --model_name=bert-base-german-cased \
    --do_train \
    --train_data_file="/content/drive/MyDrive/masterarbeit/data/training_data.txt" \
    --do_eval \
    --eval_data_file="/content/drive/MyDrive/masterarbeit/data/testing_data.txt" \
    --mlm \
    --line_by_line \
    --per_device_train_batch_size=8 \
    --gradient_accumulation_steps=2 \
    --per_device_eval_batch_size=8 \
    --save_total_limit=5 \
    --save_steps=10000 \
    --logging_steps=10000 \
    --evaluation_strategy=epoch \
    --model_name_or_path="bert-base-german-cased"
The exclamation mark at the beginning of the line is only needed on Google Colab and can be omitted when running locally. Adjust the paths to your environment before executing.
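Since the command passes --do_eval and --mlm, the evaluation step reports a masked-language-model loss (mean cross-entropy in nats). If you want to compare checkpoints as perplexity, the conversion is simply the exponential of that loss; a small helper, assuming the loss is reported in nats:

```python
import math

def perplexity(mlm_loss):
    """Convert a mean masked-LM cross-entropy loss (nats) to perplexity."""
    return math.exp(mlm_loss)
```

Lower eval loss, and hence lower perplexity, indicates a better fit to testing_data.txt.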
Vocab Files:
bert-base-german-cased_tokenizer.json: original bert-base-german-cased tokenizer file
bert_vocab.txt: original bert-base-german-cased vocab
used_ingredients: all ingredients in the dataset
vocab.txt: German FoodBERT vocabulary