【Research Note】Infrastructure of Text Analysis: Building and Evaluating the Lexicon for Taiwan Legislative Studies





Published date: June 2023


Isaac Shih-Hao Huang


A growing number of legislative studies apply natural language processing techniques in place of human labor to identify the topics, positions and sentiments underlying legislative texts. Taiwan’s parliament is the only democratic parliament in the world that uses Chinese as its official language, so studies of Taiwan’s legislative record are critical to the development of both comparative legislative studies and Chinese text analysis. However, the distribution of terms in legislative records differs from that of everyday conversation. Existing Chinese word segmentation tools have limited ability to recognize these terms, which biases estimates of term probabilities and compromises both the reliability and the validity of research findings. This study builds a lexicon for Taiwan legislative studies (LTLS), which collects more than 137,000 terms that may appear in legislative records and political texts. It also presents the first evaluation of three frequently used Chinese word segmentation systems, Jieba, CKIP and Articut, in terms of how closely each helps reproduce the topics that human coders identify in legislators’ oral interpellations. The results show that the choice of word segmentation alters the results of text analyses. Specifically, the three existing tools are far from perfect in helping identify the topics of texts. Without the LTLS, Articut performs best, followed by CKIP and then Jieba. More importantly, each tool performs better with the LTLS, and CKIP outperforms the others when the LTLS is added. This research provides empirical evaluations that help researchers choose among word segmentation tools. In addition, it demonstrates that the LTLS can be a cost-efficient and accessible way to improve Chinese word segmentation for Taiwan’s legislative studies. Its release and expansion may advance Chinese text analysis in legislative studies in Taiwan and serve as a building block for a more comprehensive Chinese lexicon for political studies in other fields.
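
To illustrate why a domain lexicon can change segmentation output, the following is a minimal sketch, not the method used in the study. It assumes a toy forward maximum matching segmenter; Jieba, CKIP and Articut use more sophisticated approaches. The function `segment`, the two small dictionaries, and the example sentence are all hypothetical, and the two added entries only stand in for the kind of term a legislative lexicon might contribute.

```python
def segment(text, lexicon, max_len=6):
    """Toy forward maximum matching segmenter.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical general-purpose vocabulary (assumption, for illustration only).
general = {"立法", "委員", "質詢", "行政", "院長"}

# Same vocabulary plus two hypothetical domain terms, standing in for
# entries a legislative lexicon might add.
domain = general | {"立法委員", "行政院長"}

sentence = "立法委員質詢行政院長"
baseline = segment(sentence, general)
with_lexicon = segment(sentence, domain)

print(baseline)
print(with_lexicon)
```

In this sketch the general vocabulary splits the sentence into five short tokens, while the extended vocabulary keeps the two multi-character political terms intact. Downstream term counts, and therefore any topic estimates built on them, would differ between the two outputs.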