# Vocab

To create the vocab.txt file, run **make_new_vocab.py**.

# Prep dataset

**prep_dataset_training**: Formats and splits the dataset so it can be used for training. Adapt the script to the dataset version you want to build.

# Train German FoodBERT

**language_modeling** was executed on Google Colab with the following parameters:

```shell
!python /content/drive/MyDrive/masterarbeit/language_modeling.py \
    --output_dir="/content/drive/MyDrive/masterarbeit/output" \
    --model_type=bert \
    --model_name=bert-base-german-cased \
    --do_train \
    --train_data_file="/content/drive/MyDrive/masterarbeit/data/training_data.txt" \
    --do_eval \
    --eval_data_file="/content/drive/MyDrive/masterarbeit/data/testing_data.txt" \
    --mlm \
    --line_by_line \
    --per_device_train_batch_size=8 \
    --gradient_accumulation_steps=2 \
    --per_device_eval_batch_size=8 \
    --save_total_limit=5 \
    --save_steps=10000 \
    --logging_steps=10000 \
    --evaluation_strategy=epoch \
    --model_name_or_path="bert-base-german-cased"
```

The exclamation mark at the beginning of the line is only needed on Google Colab and can be omitted when executing locally. Paths need to be adjusted to your environment.

# Vocab files

- **bert-base-german-cased_tokenizer.json**: original bert-base-german-cased tokenizer file
- **bert_vocab.txt**: original bert-base-german-cased vocab
- **used_ingredients**: all ingredients in the dataset
- **vocab.txt**: German FoodBERT vocabulary
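The vocab step above (**make_new_vocab.py**) builds the German FoodBERT vocabulary from the original BERT vocab and the ingredient list. A minimal sketch of how such a vocabulary extension can work is shown below, assuming **used_ingredients** is a plain text file with one ingredient per line; the function name and logic here are illustrative assumptions, not the actual script:

```python
from pathlib import Path


def extend_vocab(base_vocab_path: str, ingredients_path: str, out_path: str) -> int:
    """Write a new vocab file: the base BERT vocab followed by any
    ingredient tokens not already present. Returns the number of
    tokens added. Illustrative sketch only, not make_new_vocab.py.
    """
    base_tokens = Path(base_vocab_path).read_text(encoding="utf-8").splitlines()
    known = set(base_tokens)
    added = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Keep the original vocab order intact; BERT token ids are positional.
        for token in base_tokens:
            out.write(token + "\n")
        # Append new ingredient tokens at the end so existing ids stay valid.
        for line in Path(ingredients_path).read_text(encoding="utf-8").splitlines():
            token = line.strip()
            if token and token not in known:
                out.write(token + "\n")
                known.add(token)
                added += 1
    return added
```

Appending new tokens at the end (rather than inserting) keeps the ids of the original bert-base-german-cased tokens stable, so the pretrained embedding matrix only needs to be resized, not reordered.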
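The prep step above formats and splits the dataset into the training_data.txt and testing_data.txt files consumed by the training command (one training example per line, matching the `--line_by_line` flag). A minimal sketch of such a split is shown below; the function name, split ratio, and seed are illustrative assumptions, not the actual **prep_dataset_training** logic:

```python
import random


def split_dataset(lines, test_fraction=0.1, seed=42):
    """Shuffle text lines and split them into (train, test) lists.

    Illustrative sketch only; the real prep step may differ. A fixed
    seed keeps the split reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

The two resulting lists would then be written line by line to the train and eval files passed via `--train_data_file` and `--eval_data_file`.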