Web App hosted here: https://thetwitterpolice.herokuapp.com/
About the project
- Collect a minimum of 300 tweets (excluding retweets/replies) from 5 verified police accounts (India).
- Store the collected data in MongoDB Collections.
- Perform statistical analysis on the data.
- Display the results in the form of a web app.
- Data Collection: BeautifulSoup was used to scrape the web pages of the required Twitter accounts.
- Clean Up & Storage: Collected data was cleaned and stored in MongoDB collections hosted on mLab. It can be accessed at
- Analysis: All analysis was done in an IPython Notebook (Analysis.ipynb). Please take a look at it, since notebooks provide a very conducive environment for data analysis.
- Hosting: A simple Flask-based web app, deployed on Heroku, displays the results.
Primary areas of focus
- Simplicity of results with maximum information gain.
- Easily understandable code.
- Exclude Retweets: To overcome this, the scraper (bs4) was made to filter tweets on the basis of data-user-id, which accompanies every tweet and always traces back to the original source of the tweet.
- Exclude Replies: URLs taken were of the form
- Why create five different scripts to collect data? This could have been accomplished in a single script. However, given the size and nature of the data and the constant debugging required during collection, I felt it was better to create them individually. Also, when updating the collection of one particular account, re-collecting data from all the other accounts did not make much sense. (This is a personal preference.) Executing any of the five scripts automatically updates its respective collection in the online database, thus accommodating the changes.
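
The retweet filter described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's `HTMLParser` instead of BeautifulSoup so that it runs without extra dependencies, and the sample markup, the `data-tweet-id` attribute, and the user IDs are hypothetical rather than taken from the project's scripts.

```python
from html.parser import HTMLParser


class TweetCollector(HTMLParser):
    """Collect (tweet_id, user_id) pairs from tweet container elements."""

    def __init__(self):
        super().__init__()
        self.tweets = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Assumed markup: each tweet is a <div> carrying both attributes.
        if tag == "div" and "data-user-id" in attr_map and "data-tweet-id" in attr_map:
            self.tweets.append((attr_map["data-tweet-id"], attr_map["data-user-id"]))


def original_tweets(html, account_user_id):
    """Return IDs of tweets whose data-user-id matches the account itself.

    A retweet carries the user ID of the original author, so it is dropped.
    """
    parser = TweetCollector()
    parser.feed(html)
    return [tweet_id for tweet_id, user_id in parser.tweets if user_id == account_user_id]


# Hypothetical page fragment: user 100 is the account being scraped.
SAMPLE_HTML = """
<div class="tweet" data-tweet-id="1" data-user-id="100"><p>Own tweet</p></div>
<div class="tweet" data-tweet-id="2" data-user-id="999"><p>Retweeted from another account</p></div>
<div class="tweet" data-tweet-id="3" data-user-id="100"><p>Another own tweet</p></div>
"""

print(original_tweets(SAMPLE_HTML, "100"))
```

With BeautifulSoup the same comparison would be made on each tweet element's attributes; only the parsing step differs.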
Analysis.ipynb: All the analysis was done in an IPython Notebook, for two reasons. First, notebooks provide a rich environment that combines cell-by-cell code execution, mathematics and plots. Second, and more importantly, that is what I have been using for quite a while for my Kaggle competitions and projects.
- Frequency of Tweets (tweets/day): The average number of tweets made by the account per day (excluding retweets and replies).
- Frequent Hashtags (#): Includes 10 most frequently used hashtags.
- Sentiment Analysis: Done using TextBlob. A pie chart shows the total number of positive, negative and neutral tweets.
- Engagements: The average number of engagements (Favs + RTs), grouped by the type of content (Text/Media).
- Time Series: A time series shows how the activity of a law enforcement social handle changes over time, so I felt it was important to include it in the statistical analysis.
- Word Cloud: Word clouds have long been used to represent frequently occurring words. In addition, the word clouds generated in this analysis are masked with the map of the respective state/city (except Thane, for which I could not find a decent PNG map image).
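
A few of the statistics listed above can be sketched with the standard library alone. The records and field names below (`date`, `text`, `favs`, `rts`, `has_media`) are hypothetical and do not reflect the project's actual schema, and dividing by the calendar span is just one assumed way to define tweets/day; the notebook itself uses Pandas.

```python
import re
from collections import Counter, defaultdict
from datetime import date

# Hypothetical records; the field names are assumptions for illustration.
TWEETS = [
    {"date": date(2017, 6, 1), "text": "Drive safe #RoadSafety", "favs": 10, "rts": 5, "has_media": False},
    {"date": date(2017, 6, 1), "text": "Helpline numbers #RoadSafety #Help", "favs": 30, "rts": 20, "has_media": True},
    {"date": date(2017, 6, 3), "text": "Traffic update", "favs": 4, "rts": 1, "has_media": False},
    {"date": date(2017, 6, 4), "text": "Awareness camp #Help", "favs": 20, "rts": 10, "has_media": True},
]


def tweets_per_day(tweets):
    """Average tweets per calendar day between the first and last tweet (inclusive)."""
    days = [t["date"] for t in tweets]
    span_days = (max(days) - min(days)).days + 1
    return len(tweets) / span_days


def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags, compared case-insensitively."""
    counts = Counter(
        tag.lower() for t in tweets for tag in re.findall(r"#(\w+)", t["text"])
    )
    return counts.most_common(n)


def avg_engagement_by_type(tweets):
    """Average (favs + rts) per tweet, grouped into Text and Media content."""
    groups = defaultdict(list)
    for t in tweets:
        kind = "Media" if t["has_media"] else "Text"
        groups[kind].append(t["favs"] + t["rts"])
    return {kind: sum(values) / len(values) for kind, values in groups.items()}


print(tweets_per_day(TWEETS))
print(top_hashtags(TWEETS))
print(avg_engagement_by_type(TWEETS))
```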
app.py: For Flask deployment.
Analysis.py: Analysis of the collected data.
/FetchData/: Directory containing data collection scripts. To avoid altering the collections containing the original data, comment out the last few lines of code in these scripts; they will still execute and print the result. (Note: Please change the path to
style.css and other relevant files necessary for deployment.
/templates/: Contains the
/JSON_Files/: Contains JSON exports for all the accounts from MongoDB Collections
- @DelhiPolice - 358 Records
- @MumbaiPolice - 617 Records
- @PuneCityPolice - 347 Records
- @ThaneCityPolice - 313 Records
- @wbpolice - 393 Records
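
To check record counts like these against a JSON export, something along the following lines might work. It is a sketch under assumptions: the file name is hypothetical, and the loader accepts either one JSON document per line or a single JSON array, since the exact export layout is not stated here.

```python
import json
import os
import tempfile


def load_export(path):
    """Load records from a JSON export (one document per line, or a JSON array)."""
    with open(path, encoding="utf-8") as fh:
        content = fh.read().strip()
    if not content:
        return []
    if content.startswith("["):
        return json.loads(content)
    return [json.loads(line) for line in content.splitlines() if line.strip()]


# Demo on a throwaway file; "demo_account.json" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_account.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write('{"text": "first"}\n{"text": "second"}\n')
    records = load_export(demo_path)

print(len(records))
```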
- python3 (spyder-conda)
- IPython Notebook (For analysis)
- NumPy, Pandas, PyMongo, etc. (essential requirements)
- Matplotlib, Seaborn, WordCloud for visualizations
- TextBlob (Sentiment Analysis)
One of the key challenges for me was working out the web deployment, since I was just starting out with Python-based web frameworks; that took a day or two. Another, though straightforward, was getting around the limitations of the Twitter REST APIs (the 3200-tweet cap, filtering of RTs, etc.). I spent about a day looking for solutions with the APIs but ultimately went ahead with a web-scraping approach for data collection.
Further work:
1. Structure the code better (there may be some redundancy here and there).
2. Beyond the specific task at hand, explore machine learning techniques such as cluster analysis, along with some supervised models, to identify patterns in the factors that influence engagement. (Just a thought.)
3. Make the solution more dynamic in terms of deployment.