Thanks for coming to Open Data Shanghai 2016!
This was the very first edition of Open Data Shanghai, organized by Techyizu in collaboration with Coderbunker. We hope you enjoyed your weekend at Coderbunker, and we thank PatSnap for their sponsorship. Open Data Shanghai was held on December 10th and 11th. Experienced data scientists gave talks, and participants had 24 hours for data analysis. If you missed this awesome event, check out the overview below.
#1: China Economic Data
Macroeconomic data from the China National Bureau of Statistics
Chinese FDI Data
#2: China Air Quality Data
April to August 2014 Data from Berkeley Earth
#3: Shanghai Social Apps Data
#4: FreeCodeCamp Surveys
#5: US Election Day Tweets
#6: Image Recognition
Pragmatic Natural Language Processing (NLP), Matt Fortier
He offered a high-level framework covering preprocessing and vocabulary building, with an overview of the tooling: TextBlob, NLTK, Jieba & SnowNLP (for Chinese), and spaCy for text processing; gensim and scikit-learn for modeling. He gave a demo on categorizing news articles and visualizing the resulting categories.
Link to presentation: https://github.com/fortiema/notebooks/blob/master/Pragmatic%20NLP.ipynb
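The preprocessing and vocabulary-building steps he described can be sketched with the standard library alone (libraries like NLTK or spaCy do this more robustly; the tokenizer and example sentences below are illustrative assumptions, not from the talk):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract word-like tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents, min_count=1):
    """Count tokens across all documents and keep those seen at least min_count times."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {tok: n for tok, n in counts.items() if n >= min_count}

docs = [
    "Stocks rallied as the central bank held rates steady.",
    "The central bank signalled no change to interest rates.",
]
vocab = build_vocabulary(docs, min_count=2)
print(sorted(vocab))  # tokens that appear in both sentences
```

A real pipeline would add stopword removal and stemming or lemmatization before handing the vocabulary to a modeling library such as gensim or scikit-learn.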
Introduction to R Programming and Data Visualization, Yangyang Xu
He introduced the differences between R and closed-source alternatives, and highlighted R's features for creating graphs and for extracting, transforming, and loading data.
Tensorflow for Poets, Jingsheng Wang
Jingsheng is a Drupal developer who uses TensorFlow to label images. He contrasted wide learning with deep learning (which learns all features from the bottom up), using Image-net.org as a starting point along with as-yet-unrecognized images. TensorFlow is easy to install, and three commands are enough to classify new images with better-than-human accuracy.
Link to presentation: https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/
Applied Pandas for Data Transformation, Aurelien Petit
No, no, not the protected pandas. pandas is Python tooling. Aurelien Petit works in a company where data matters but which used to rely on spreadsheets, databases, and custom data structures. pandas DataFrames make it much more convenient to manipulate data, with good performance on large datasets; he uses them in a corporate environment on multi-gigabyte data.
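A small sketch of the kind of spreadsheet-replacing workflow he described (the sales figures and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical sales data of the kind that might otherwise live in a spreadsheet.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "revenue": [120.0, 80.0, 95.0, 130.0],
})
targets = pd.DataFrame({"region": ["North", "South"], "target": [180.0, 250.0]})

# Aggregate and join in a couple of lines instead of spreadsheet formulas.
summary = (sales.groupby("region", as_index=False)["revenue"].sum()
                .merge(targets, on="region"))
summary["met_target"] = summary["revenue"] >= summary["target"]
print(summary)
```

The same groupby/merge pattern scales from a few rows to the multi-gigabyte datasets mentioned in the talk.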
User focused data science and World Bank Open Data, Juha Suomalainen
Juha, a data nerd at Wiredcraft originally from Finland, talked about how to bring data to an audience: it is important to give the user context. He shared hints on serving visualizations, such as using Falcor or GraphQL and pre-processing the data as much as possible, plus some deep links into worldbank.org for accessing the data, all in the service of building web applications on top of big data.
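The "pre-process as much as possible" advice can be illustrated with a toy server-side aggregation step (the indicator rows below are invented stand-ins, not actual World Bank data):

```python
import json
import pandas as pd

# Hypothetical indicator rows in the long format open-data APIs often return.
raw = pd.DataFrame({
    "country": ["CHN", "CHN", "FIN", "FIN"],
    "year": [2014, 2015, 2014, 2015],
    "gdp_growth": [7.3, 6.9, -0.6, 0.5],
})

# Pre-aggregate on the server so the browser only downloads what it draws.
per_year = {
    int(year): dict(zip(grp["country"], grp["gdp_growth"]))
    for year, grp in raw.groupby("year")
}
payload = json.dumps(per_year, sort_keys=True)
print(payload)
```

Shipping this compact payload, rather than the raw table, is what keeps a chart responsive on top of a large dataset.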
Deep Convolutional Networks with Tensorflow, Shishi "Burness" Duan
Shishi Duan introduced two applications of deep learning: porn detection and resolution upscaling. He gave a high-level view of his workflow, from downloading and annotating images to training, and explained how to train on a new dataset starting from an existing pre-trained model.
Project link: https://github.com/burness/tf_super_resolution
Data sciences applied to startups, TJ Spross
Skills to pick up: SQL, R (dplyr, tidyr, R Markdown), Python (Jupyter, pandas). Startups usually do not start with data scientists; those hires come later. You do not need complex infrastructure or the latest in machine learning to do data science. It is preferable to have a dedicated analytics pipeline, which is made easier by the increasingly easy-to-use toolsets such as Luigi. He recommends looking towards open-source, self-hosted options; for dashboards, tools like Tableau or open-source alternatives like Superset. Ultimately, build the toolsets that answer the questions of the business.
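The dedicated-pipeline idea can be shown with a toy dependency runner in plain Python. This is only a sketch of the concept behind tools like Luigi (each step declares its inputs, and the runner executes steps in dependency order, caching finished results); the names and structure here are hypothetical and not the Luigi API:

```python
def run_pipeline(tasks):
    """Run {name: (func, dependency_names)} tasks in dependency order, caching results."""
    done = {}

    def run(name):
        if name in done:                      # already computed, skip rework
            return done[name]
        func, deps = tasks[name]
        done[name] = func(*[run(d) for d in deps])
        return done[name]

    for name in tasks:
        run(name)
    return done

tasks = {
    "extract": (lambda: [3, 1, 2], []),                 # pull raw rows
    "clean":   (lambda rows: sorted(rows), ["extract"]),  # normalize them
    "report":  (lambda rows: sum(rows), ["clean"]),       # aggregate for the business
}
results = run_pipeline(tasks)
print(results["report"])  # 6
```

Luigi adds the parts that matter in production on top of this idea: persisted outputs, scheduling, and retries.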
China Economic Data
The team got to know each other first, then started to look into the datasets. They found the terms used in the data hard to understand for people without a financial or business background. The team hypothesized that Chinese companies' FDI investment would correlate with their stock prices, and used Tableau to visualize the data and test that hypothesis.
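The team tested their hypothesis in Tableau, but the same correlation check is a one-liner in pandas. The figures below are synthetic placeholders, not the actual FDI or stock data:

```python
import pandas as pd

# Synthetic illustration: yearly FDI inflows and an average stock price.
df = pd.DataFrame({
    "fdi":   [100, 120, 140, 160, 180],
    "price": [10.0, 11.5, 12.0, 13.8, 15.1],
})

# Pearson correlation between the two series (1.0 = perfectly linear).
corr = df["fdi"].corr(df["price"])
print(round(corr, 3))
```

A high coefficient on data like this only signals co-movement, which is why the team framed it as a hypothesis rather than a causal claim.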
Chinese Social Apps
Miguel Fernandez & Phil Mackenzie
This team of two used Python, Jupyter, and pandas to analyze data retrieved from several social apps. After hesitating between focusing on a single dataset and combining several, they decided to focus on the Airbnb data. They created a heatmap in Python to investigate the relationship between an Airbnb listing's location and its price, found patterns in Weibo postings at different times of the day, and used Folium to visualize the datasets on a map.
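The location-versus-price heatmap boils down to binning coordinates onto a grid and averaging price per cell, which a map layer such as Folium's HeatMap then renders visually. The listings below are synthetic points with made-up prices, not the team's data:

```python
import pandas as pd

# Synthetic listings (hypothetical coordinates roughly around Shanghai).
listings = pd.DataFrame({
    "lat":   [31.21, 31.23, 31.27, 31.28, 31.32],
    "lon":   [121.41, 121.43, 121.47, 121.48, 121.52],
    "price": [500, 520, 300, 310, 200],
})

# Bin coordinates onto a coarse grid and average the price per cell.
listings["cell"] = (listings["lat"].round(1).astype(str) + ","
                    + listings["lon"].round(1).astype(str))
heat = listings.groupby("cell")["price"].mean()
print(heat)
```

Each cell's average becomes one "hot" point on the map, so expensive neighbourhoods stand out at a glance.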
US Election Tweets
Sylvain, presented by Ricky
Sylvain from Coderbunker worked on analyzing the tweets. After cleaning and processing the data with Jupyter and pandas, he found the most-used words in tweets before and after the election. Before the election most tweets were about voting, while afterwards tweets were dominated by keywords such as "senate", "democrats", and "republicans".
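Finding the dominant words comes down to tokenizing, dropping stopwords, and counting. A stdlib sketch of that step, using invented example tweets rather than the real dataset:

```python
from collections import Counter

# Hypothetical tweet texts standing in for the post-election dataset.
tweets_after = [
    "The senate race is tight #senate",
    "Democrats and republicans react to the result",
    "Senate control now with republicans",
]

def top_words(tweets, stopwords=("the", "and", "is", "to", "with", "now")):
    """Count words across tweets, ignoring case, punctuation, and stopwords."""
    words = Counter()
    for tweet in tweets:
        for w in tweet.split():
            w = w.strip("#.,!").lower()
            if w and w not in stopwords:
                words[w] += 1
    return words.most_common(3)

top = top_words(tweets_after)
print(top)
```

On the real corpus this is where "senate", "democrats", and "republicans" would rise to the top after election day.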
Shanghai Social Apps Data
Karl Ritchie, Matt Fortier, Yunfang Lu and Sheng Li
The team set out to build a predictive model using Python, pandas, Folium, and scikit-learn. First they cleaned up the datasets retrieved from Airbnb, Weibo, Dianping, and Flickr and inserted them into MongoDB, then used MongoDB's GeoJSON 2dsphere indexing capabilities to find all points of interest near a listing. Their application would be able to estimate the attractiveness of an Airbnb listing.
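What a 2dsphere radius query computes under the hood is great-circle distance, which can be sketched in plain Python with the haversine formula (the two points of interest and their approximate coordinates below are illustrative, not the team's data):

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def points_within(points, center, radius_km):
    """Filter (name, lat, lon) points to those within radius_km of center."""
    clat, clon = center
    return [name for name, lat, lon in points
            if haversine_km(clat, clon, lat, lon) <= radius_km]

# Approximate coordinates: the Bund vs Pudong airport, queried from People's Square.
pois = [("bund", 31.2336, 121.4908), ("pudong_airport", 31.1443, 121.8083)]
nearby = points_within(pois, (31.2304, 121.4737), 5.0)
print(nearby)
```

MongoDB's 2dsphere index makes the equivalent query fast over millions of points instead of scanning them all like this sketch does.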
Mary Polites, Eva Xiao, Stephane Laurent, Jerry Liu, Ignacio López Busón
The team analyzed Weibo check-in data looking for patterns among Weibo influencers. They classified users into three groups: normal users, influencers, and super-influencers, and built a very nice visualization clustering super-influencers and hotspots. The model of the city was built in Rhinoceros 5, with the Grasshopper 3D plugin connecting geometry with data. The judges selected this team as the winner for their practical insights and visualizations.
Shanghai Air Quality
Ilya, Bart, Hang
The team decided to focus on the air quality data for Shanghai. They collected additional data about Shanghai's weather and built a predictive model from features such as wind strength and temperature. They used Python, Jupyter, and pandas for data processing and visualization, and the popular scikit-learn to build a regression model.
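A regression of air quality on weather features along those lines can be sketched with scikit-learn. The data here is synthetic, with coefficients chosen for the example (real PM2.5 and weather measurements behave far less cleanly):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: wind speed (m/s) and temperature (C) vs a PM2.5 level
# generated from assumed coefficients plus noise.
rng = np.random.default_rng(0)
wind = rng.uniform(0, 10, 200)
temp = rng.uniform(-5, 35, 200)
pm25 = 120 - 8 * wind + 0.5 * temp + rng.normal(0, 5, 200)

# Fit a linear model: PM2.5 as a function of the two weather features.
X = np.column_stack([wind, temp])
model = LinearRegression().fit(X, pm25)
print(model.coef_, model.intercept_)  # recovers roughly -8, 0.5, and 120

# Predict air quality for a hypothetical windy, mild day.
pred = model.predict([[6.0, 20.0]])
print(pred[0])  # near 82 for these synthetic coefficients
```

The team's real model would swap in observed weather and sensor readings, but the fit/predict workflow is the same.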
Congratulations again to all the teams who survived the 24-hour hackathon. Thanks to all the speakers and attendees who joined the talks. Special thanks to all our sponsors, and to the PatSnap team who sponsored and joined our event. The event was a blast, and we can't wait to hold the next Open Data Shanghai! Hope to see you all again!