https://editor.p5js.org/cy1503/sketches/NAq__zThk

I used the text of my WeChat posts since 2022 as a database (a fairly small one). I wanted to make a tool that helps me generate text for my posts, because I usually have trouble coming up with any.

Features of my post text:

Usually short; sometimes English, sometimes Chinese, sometimes emoji; no punctuation other than exclamation points.


Initially, I just swapped the English text file for my Chinese one, and the generator returned the original text unchanged. My first guess was that, because the text was in Chinese, the MarkovGenerator couldn't handle it as a string.

I asked ChatGPT how I should modify the model. Here are its answers:

Character handling: Chinese does not put spaces between words the way English does, so the substring() and charAt() methods in the code don't need to be changed. They already segment the text character by character, which is suitable for handling Chinese.

Construction of n-grams: we get n consecutive Chinese characters by substring(i, i + this.n) and process them as n-grams. These n-grams are stored in the n-gram dictionary for subsequent text generation.

Randomly select the next character: the weightedSelection() function selects the next character from the possible successors based on probability and adds it to the generated text.
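
To make the three points above concrete, here is a minimal sketch of a character-level Markov generator in plain JavaScript. It is not the code from my sketch: the function names (buildNgrams, generate) and the sample string are made up for illustration. One assumption I added: I split the text with Array.from() instead of substring(), because emoji take up two UTF-16 units and substring() could cut one in half, while Chinese characters are fine either way.

```javascript
// Hypothetical helper: map each n-gram to the list of characters that follow it.
// Duplicates are kept in the list, so picking a random element is already
// a probability-weighted choice.
function buildNgrams(text, n) {
  const chars = Array.from(text); // split by code point so emoji stay whole
  const ngrams = new Map();
  for (let i = 0; i + n < chars.length; i++) {
    const gram = chars.slice(i, i + n).join("");
    const next = chars[i + n];
    if (!ngrams.has(gram)) ngrams.set(gram, []);
    ngrams.get(gram).push(next);
  }
  return ngrams;
}

// Hypothetical helper: start from the first n characters of the text and keep
// appending a randomly chosen successor until `max` characters are reached
// or the current n-gram has no known successor.
function generate(text, n, max, random = Math.random) {
  const ngrams = buildNgrams(text, n);
  const out = Array.from(text).slice(0, n);
  while (out.length < max) {
    const current = out.slice(out.length - n).join("");
    const options = ngrams.get(current);
    if (!options) break;
    out.push(options[Math.floor(random() * options.length)]);
  }
  return out.join("");
}

// Made-up sample text, not from my WeChat database.
console.log(generate("今天很开心今天很累", 1, 8));
```

The random parameter is only there so the function can be tested with a fixed value; in a p5.js sketch, random() or Math.random would do the same job.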

But I still got the original text back from the database. So I looked into what the n-gram order (n) and the max value (the output length limit) actually do, and played around with both. I found that n = 1 with max at around 20 worked best for me.
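
My guess at why a larger n kept returning the original text: with a small database, most longer n-grams appear only once, so each one has exactly one possible next character and the generator can only retrace the source. The sketch below checks this on a made-up sample string by counting how many n-grams have more than one distinct successor. The function name branchingGrams is hypothetical, and this is only a rough way to look at it, not a proper analysis.

```javascript
// Hypothetical helper: count the distinct n-grams in the text and how many
// of them have more than one distinct next character ("branching" n-grams).
function branchingGrams(text, n) {
  const chars = Array.from(text); // split by code point so emoji stay whole
  const successors = new Map();
  for (let i = 0; i + n < chars.length; i++) {
    const gram = chars.slice(i, i + n).join("");
    if (!successors.has(gram)) successors.set(gram, new Set());
    successors.get(gram).add(chars[i + n]);
  }
  let branching = 0;
  for (const set of successors.values()) {
    if (set.size > 1) branching++;
  }
  return { total: successors.size, branching };
}

// Made-up sample text, not from my WeChat database.
const sample = "好开心好累好饿";
console.log(branchingGrams(sample, 1)); // n = 1: "好" can be followed by 开, 累 or 饿
console.log(branchingGrams(sample, 3)); // n = 3: no n-gram has a choice
```

When branching is 0, every step has only one option, so the output is just a copy of the source. With n = 1 there is at least some choice at each repeated character, which matches what I saw.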