Topic Modelling: Parliamentary Speeches · Shaun Khoo

Photo by Steven Lasry.

Topic Modelling: Parliamentary Speeches


Introduction

This was my submission for the final project in my course QMSS 5067 Natural Language Processing in the Social Sciences. Our task was to either find or scrape text data and apply NLP tools to produce some insights about the data. For my project, I chose to use the Latent Dirichlet Allocation (LDA) model to discover topics from parliamentary speeches in Singapore from 1988 to 2002. Here is the preview of the final product:

Details

I used Selenium to web scrape legislative speeches from the Singapore Parliament’s website, which was Javascript-based. I then cleaned and processed the data using Regex and Pandas, and split the corpus of each parliamentary session by agendas. Each agenda-corpus was processed using Gensim to create bigrams, and using NLTK to remove stopwords and for lemmatization.

As for the topic modelling, basic LDA models were created using various values of alpha and topic numbers via Gensim. In particular, a lower alpha was chosen based on the intuition that each agenda was likely to be focused on a few topics. The optimal topic number was chosen by running the basic LDA model several times with different values for the number of topics. The final LDA model was generated using Mallet, which implements Gibbs sampling to create its topic models. Models were evaluated based on coherence score, with the final model achieving a score of 0.53.

Thereafter, the matrix of probabilities of each topic for each documents was extracted from the model, and t-SNE was applied to reduce the dimensionality to make the data visualizable. Based on the top keywords for each topic generated by the model, I hand-labelled them into categories such as ‘Economy’ or ‘Defence’. The output was visualized using Bokeh, with each point representing an agenda’s corpus, and with a pop-up indicating information about the agenda title, predicted topic, and the speakers of that topic. I also included the model’s predicted probability of that topic being representative of that agenda-corpus, as well as the top keywords from that topic that appeared in that agenda-corpus.

One interesting observation is how closely this reflects our expectations - the topics are quite well-defined, and closely-aligned topics are naturally closer to each other in the t-SNE plot (such as healthcare, education, and social issues). Potential areas for extension include assessing how the distribution of topics change across parliaments or across different members of parliament.