Research Paper Topic Modeling with BERTopic

This project uses BERTopic to explore topics in a database of research papers, focusing on the thematic structure of scientific abstracts. The goal is to support topic discovery and make research insights easier to access.

Project Overview

This repository provides a comprehensive walkthrough of using BERTopic for topic modeling on a research paper dataset. The project covers:

Dataset Selection: Choosing a manageable dataset from Hugging Face for efficient topic modeling on Google Colab.
Data Processing: Preparing the data for BERTopic by cleaning text fields and structuring inputs.
Topic Modeling: Applying BERTopic to identify distinct clusters of topics in research abstracts.
Insights and Findings: Visualizing and analyzing the generated topics to understand prevalent themes.

Dataset

I selected the neuralwork/arxiver dataset from Hugging Face, which contains a variety of research paper abstracts. It was chosen because its size is manageable with the computational resources available to me.

Dataset Details:

Source: Hugging Face Hub
Content: 63,357 arXiv papers from different fields, published between January 2023 and October 2023 and converted to multi-markdown (.mmd) format. Each record includes the original arXiv article ID, title, abstract, authors, publication date, URL, and the corresponding markdown file.
Size: 63,357 rows
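
As a rough sketch of how the abstracts might be pulled in, the snippet below loads the dataset with the Hugging Face `datasets` library and extracts title/abstract pairs. The `train` split name and the `title` and `abstract` column names are assumptions based on the field list above, and both function names are hypothetical; check the dataset card before relying on them.

```python
from typing import Iterable, Mapping


def to_documents(records: Iterable[Mapping[str, str]]) -> tuple[list[str], list[str]]:
    """Return parallel lists of titles and abstracts, skipping empty abstracts."""
    titles, abstracts = [], []
    for record in records:
        abstract = (record.get("abstract") or "").strip()
        if not abstract:
            continue  # nothing to model for this paper
        titles.append((record.get("title") or "").strip())
        abstracts.append(abstract)
    return titles, abstracts


def load_arxiver() -> tuple[list[str], list[str]]:
    """Download neuralwork/arxiver and return (titles, abstracts)."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    dataset = load_dataset("neuralwork/arxiver", split="train")  # split name assumed
    return to_documents(dataset)
```

Keeping titles and abstracts in parallel lists makes it easy to join topic assignments back to titles later.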

Preprocessing

To prepare the dataset for topic modeling, the following preprocessing steps were applied:

Tokenization and Lemmatization: Abstracts were tokenized and lemmatized for consistency.
Stop Words and Unwanted Tokens Removal: Common stop words, numbers, and other irrelevant tokens were removed.
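
The steps above can be sketched with the standard library alone. This is an illustration, not the project's actual pipeline: the stop-word list is a tiny placeholder, and because the README does not name the lemmatizer used, lemmatization is left as a pluggable hook that defaults to returning each token unchanged.

```python
import re
from typing import Callable

# Tiny illustrative list; a real run would use a full list from an NLP library.
STOP_WORDS = frozenset({
    "a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are",
    "was", "were", "we", "this", "that", "for", "with", "by",
})

# Alphabetic tokens only, so numbers and punctuation are dropped.
TOKEN_RE = re.compile(r"[a-z]+")


def preprocess(text: str, lemmatize: Callable[[str], str] = lambda token: token) -> str:
    """Lowercase, tokenize, lemmatize, and remove stop words from one abstract."""
    tokens = TOKEN_RE.findall(text.lower())
    lemmas = (lemmatize(token) for token in tokens)
    kept = [lemma for lemma in lemmas if lemma not in STOP_WORDS and len(lemma) > 1]
    return " ".join(kept)


print(preprocess("We trained 3 models on the 2023 arXiv data."))
# prints: trained models arxiv data
```

A real lemmatizer can be passed in as `lemmatize`, for example a function wrapping an NLTK or spaCy lemmatizer.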

Model Configuration

The BERTopic model was set up with the following parameters:

Embedding Model: all-MiniLM-L6-v2 for efficient sentence embeddings.
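UMAP: n_neighbors=10, n_components=5, min_dist=0.1, metric='cosine' to reduce the embeddings to a low-dimensional space suited to clustering.
HDBSCAN: min_cluster_size=60, min_samples=15 to control how large and how dense a group of papers must be before it counts as a topic.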

Key Parameters:

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",
    umap_model=UMAP(
        n_neighbors=10,
        n_components=5,
        min_dist=0.1,
        metric='cosine'
    ),
    hdbscan_model=HDBSCAN(
        min_cluster_size=60,
        min_samples=15,
        metric='euclidean',
        cluster_selection_method='eom'
    ),
    top_n_words=10
)
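
Once configured, the model still has to be fitted. The sketch below assumes `titles` and `docs` are parallel lists of paper titles and preprocessed abstracts; both names and the `assign_topics` helper are hypothetical. BERTopic's `fit_transform` returns one topic id and one probability per document, with -1 marking outliers.

```python
def assign_topics(topic_model, titles, docs):
    """Fit the model on docs and return (title, topic_id, probability) rows."""
    # fit_transform embeds, reduces, and clusters the documents in one call.
    topics, probs = topic_model.fit_transform(docs)
    return [
        (title, int(topic), float(prob))
        for title, topic, prob in zip(titles, topics, probs)
    ]
```

In real use, `topic_model` would be the `BERTopic` instance configured above. UMAP is stochastic, so passing `random_state` to `UMAP` is one way to make runs repeatable.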

Results and Analysis

After running the BERTopic model, 108 topics were identified. A sample of the discovered topics:

Title	Topic	Probability
"Dynamics of Polymer Ejection from a Nano-Sphere"	61-polymer-cell-membrane	0.98
"Enhancing Health data interoperability with large language models: A FHIR study"	39-clinical-medical-health	1.0
"Benford's Law under Zeckendorf expansion"	27-integer-number-sum-prime	0.93
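
A table like this can be written to a CSV file for later analysis. The sketch below is one possible way to do it with the standard library; the column names mirror the table header, and the helper name is hypothetical.

```python
import csv


def write_topic_table(rows, path="titles_topics_probabilities.csv"):
    """Write (title, topic, probability) rows to a CSV file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Title", "Topic", "Probability"])
        writer.writerows(rows)
```

For example, `write_topic_table([("Benford's Law under Zeckendorf expansion", 27, 0.93)])` would produce a file with one header row and one data row.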

Observations:

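Topic Diversity: The topics span a wide range of research areas (the sample above alone covers polymer physics, clinical informatics, and number theory), which suggests the model is not limited to a single field.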
Outlier Removal: A threshold was set to filter out low-confidence classifications to improve the overall quality of the topics.
Visualization: The results were visualized using the built-in functions of BERTopic to illustrate the relationships between topics.
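
The outlier filtering described above can be sketched as a simple threshold on the assignment probability. The threshold value of 0.5 is a placeholder, since the value actually used is not stated here, and the helper name is hypothetical.

```python
def filter_confident(rows, threshold=0.5):
    """Keep (title, topic, probability) rows with a real topic and enough confidence."""
    return [
        row for row in rows
        if row[1] != -1 and row[2] >= threshold  # -1 is BERTopic's outlier topic
    ]


rows = [("a", 3, 0.9), ("b", -1, 0.99), ("c", 3, 0.2)]
print(filter_confident(rows))
# prints: [('a', 3, 0.9)]
```

BERTopic also provides a `reduce_outliers` method that reassigns outlier documents to existing topics instead of dropping them, which may be worth comparing.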

Visualization

We generated visualizations to explore topic distributions and hierarchical structures:

Heatmap Visualization: topic_model.visualize_heatmap()
Barchart Visualization: topic_model.visualize_barchart()
Intertopic Visualization: topic_model.visualize_topics()
Hierarchy Visualization: topic_model.visualize_hierarchy()
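
Each of these methods returns a Plotly figure, so the plots can be saved as standalone HTML files. The sketch below assumes a fitted model; the output directory, file names, and helper name are hypothetical.

```python
from pathlib import Path

# Output file stem -> BERTopic visualization method name.
VISUALIZATIONS = {
    "heatmap": "visualize_heatmap",
    "barchart": "visualize_barchart",
    "intertopic": "visualize_topics",
    "hierarchy": "visualize_hierarchy",
}


def save_visualizations(topic_model, out_dir="visualizations"):
    """Render each visualization and save it as an HTML file; return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, method in VISUALIZATIONS.items():
        figure = getattr(topic_model, method)()  # Plotly figure
        target = out_path / f"{name}.html"
        figure.write_html(str(target))
        written.append(target)
    return written
```

The saved files open in any browser and keep the interactive hover and zoom behavior.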

Practical Applications:

1. Monthly Topic Analysis

The Monthly Topic Analysis notebook applies the topic modeling results to analyze how often papers on each topic were published per month.

In this notebook, users can:

Analyze monthly publication frequencies for a specific topic by entering its topic number.
See how certain topics gained popularity over time, which helps identify hot topics in a field.
Track progress in specific research areas to support data-driven decisions about future work.
Spot areas where more research might be needed or where emerging topics are gaining attention.

This analysis is based on titles_topics_probabilities.csv and the publication dates from the original dataset, and it reveals temporal patterns in research publications.
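
The core of such an analysis is a per-month count for one topic. A minimal sketch, assuming `topics` and `published_dates` are parallel lists and that dates are already `datetime.date` objects (the names are hypothetical, and the notebook may do this differently):

```python
from collections import Counter
from datetime import date


def monthly_counts(topic_id, topics, published_dates):
    """Count papers assigned to topic_id per month, keyed as 'YYYY-MM'."""
    counts = Counter(
        published.strftime("%Y-%m")
        for topic, published in zip(topics, published_dates)
        if topic == topic_id
    )
    return dict(sorted(counts.items()))  # chronological order


topics = [39, 39, 27, 39]
dates = [date(2023, 1, 5), date(2023, 1, 20), date(2023, 1, 9), date(2023, 3, 2)]
print(monthly_counts(39, topics, dates))
# prints: {'2023-01': 2, '2023-03': 1}
```

If the dates are stored as ISO strings, `date.fromisoformat` can convert them first.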

2. Title vs Abstract Analysis

The Title vs Abstract Analysis notebook explores the correlation between research paper titles and abstracts.

In this notebook, users can:

Analyze the relationship between titles and their corresponding abstracts to identify patterns in the data.
Investigate how well the topics extracted from the abstracts align with the titles of the papers.
Use the correlation analysis to improve the coherence of topics and guide future research.

This notebook provides insight into how titles and abstracts relate to each other, which helps when interpreting the topic modeling results.
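
The notebook's exact method is not described here. One simple way to quantify how well abstract-derived topics align with titles is an agreement rate: assign a topic to each title (for example with `topic_model.transform(titles)`) and measure how often it matches the topic of the corresponding abstract. The function below is a hypothetical sketch of that idea, not the notebook's implementation.

```python
def topic_agreement(title_topics, abstract_topics, outlier=-1):
    """Fraction of papers whose title and abstract share a topic, ignoring outliers."""
    pairs = [
        (title_topic, abstract_topic)
        for title_topic, abstract_topic in zip(title_topics, abstract_topics)
        if title_topic != outlier and abstract_topic != outlier
    ]
    if not pairs:
        return 0.0  # no comparable pairs
    matches = sum(title_topic == abstract_topic for title_topic, abstract_topic in pairs)
    return matches / len(pairs)
```

For instance, with title topics `[1, 2, -1, 3]` and abstract topics `[1, 5, 2, 3]`, three pairs are comparable and two of them match.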

Conclusion

This project successfully applied BERTopic to uncover meaningful topics within a research paper dataset. By refining the topic model parameters and using targeted preprocessing, we achieved a balance between topic granularity and interpretability. The practical applications of this project are diverse:

Monthly Topic Analysis: By tracking topic popularity over time, researchers can identify emerging areas of interest and monitor trends in their field. This feature enables data-driven decisions about where to focus future research efforts.
Correlation Analysis Between Titles and Abstracts: This analysis helps uncover relationships between paper titles and their corresponding abstracts, improving the coherence of topics and contributing to better topic interpretations.

Overall, this tool can help researchers quickly identify relevant areas, explore trends in scientific literature, and better understand the connections between paper titles and abstracts.

Future Work

Parameter Optimization: Experiment with other sentence transformers for potentially finer-grained topic distinctions.
Outlier Analysis: Develop a more robust method to handle outliers, which could improve topic coherence.
Expansion: Scale the model to larger datasets as computational resources allow.
