NLP on GitHub comments
The dataset I am using in this project (github_comments.tsv) contains 4,000 comments published on pull requests on GitHub by developer teams.
Here is an explanation of the table columns (a loading sketch follows the list):
- Comment: the comment made by a developer on the pull request.
- Comment_date: date at which the comment was published.
- Is_merged: shows whether the pull request on which the comment was made has been accepted (therefore merged) or rejected.
- Merged_at: date at which the pull request was merged (if accepted).
- Request_changes: a binary label; the comment is labelled 1 if it is a request for a change in the code, and 0 otherwise.
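A minimal sketch of loading the dataset with pandas, assuming the file is tab-separated and the column names match the list above (adjust the casing or separator if the actual file differs):

```python
# Hedged sketch: load github_comments.tsv, assuming tab separation and
# the column names listed above.
import pandas as pd

df = pd.read_csv(
    "github_comments.tsv",
    sep="\t",
    parse_dates=["Comment_date", "Merged_at"],
)

print(df.shape)                              # expected: about 4000 rows, 5 columns
print(df["Request_changes"].value_counts())  # split between change requests and other comments
```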
The GitHub repository of the project can be found here:
The goal is to dig deeper into the nature of blockers and to analyze the requests for change. Where possible, the project tries to answer the following questions:
- What are the most common problems that appear in these comments?
- Can we cluster the problems by topic/problem type?
- How long is the resolution time after a change was requested? (see the sketch below)
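For the last question, one possible computation is sketched below: resolution time taken as the gap between a change-request comment and the merge of its pull request. The column names and the 1/0 encoding of Is_merged follow the list above and are assumptions about the actual file.

```python
# Hedged sketch: resolution time for change requests, assuming Merged_at
# is empty (NaT) for pull requests that were never merged.
import pandas as pd

df = pd.read_csv(
    "github_comments.tsv",
    sep="\t",
    parse_dates=["Comment_date", "Merged_at"],
)

# Keep change-request comments whose pull request was eventually merged.
requests = df[(df["Request_changes"] == 1) & (df["Is_merged"] == 1)].copy()

# Time between the request for change and the merge of the pull request.
requests["resolution_time"] = requests["Merged_at"] - requests["Comment_date"]

print(requests["resolution_time"].describe())
print("median:", requests["resolution_time"].median())
```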
Content
- Report.pdf is a PDF report that details my approach.
- images is a collection of the images that I included in my report.
- TopicModelling.ipynb is a Jupyter Notebook in which I do my analysis in Python.
- corpus.pkl, dictionary.gensim, and all files starting with model… are files generated in the notebook that I use to avoid re-running some steps (see the loading sketch below).
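A sketch of how these cached artifacts could be reloaded, assuming corpus.pkl was written with pickle and the gensim objects with their own save() methods; the model filename below is hypothetical, since the actual names are only listed as starting with model…:

```python
# Hedged sketch: reload the cached artifacts instead of re-running the
# preprocessing and training cells of the notebook.
import pickle
from gensim import corpora, models

with open("corpus.pkl", "rb") as f:
    corpus = pickle.load(f)                      # bag-of-words corpus

dictionary = corpora.Dictionary.load("dictionary.gensim")
lda = models.LdaModel.load("model5.gensim")      # hypothetical filename

# Inspect the recovered topics.
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)
```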
Theory covered
This project covers the following concepts (a minimal sketch of the pipeline follows the list):
- Topic Modelling using LDA
- Clustering through tf-idf and BoW
- Dimension reduction through t-SNE and truncated SVD
- Classification and Regression algorithms
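A minimal, self-contained sketch of the first three points, using gensim for LDA and scikit-learn for tf-idf, truncated SVD, and t-SNE. The regex tokenisation is a placeholder rather than the preprocessing used in the notebook, and num_topics=5 is an arbitrary choice.

```python
# Hedged sketch: LDA topic modelling plus a tf-idf -> truncated SVD -> t-SNE
# embedding of the comments. Tokenisation and parameters are placeholders.
import pandas as pd
from gensim import corpora, models
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

df = pd.read_csv("github_comments.tsv", sep="\t")
texts = df["Comment"].fillna("")

# --- Topic modelling with LDA on a bag-of-words corpus ---
tokens = texts.str.lower().str.findall(r"[a-z]+").tolist()
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(doc) for doc in tokens]
lda = models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)

# --- tf-idf features reduced with truncated SVD, then t-SNE for 2-D plots ---
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X = tfidf.fit_transform(texts)
X_svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_svd)
print(X_2d.shape)   # one 2-D point per comment, ready for plotting/clustering
```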