
First edition of Open Data Shanghai

Data Scientists' Party

Thanks for coming to Open Data Shanghai 2016!

This was the very first edition of Open Data Shanghai, organized by Techyizu in collaboration with Coderbunker. We hope you enjoyed your weekend at Coderbunker, and we thank PatSnap for its sponsorship. Open Data Shanghai was held on December 10th and 11th. Experienced data scientists gave talks, and participants had 24 hours for data analysis. If you missed this awesome event, check out the overview below.

Datasets

#1: China Economic Data

  • Macroeconomic data from the China National Bureau of Statistics

  • Chinese FDI Data

  • World Bank data on China

#2: China Air Quality Data

  • April to August 2014 Data from Berkeley Earth

  • Air Quality data from 2000 to 2015

#3: Shanghai Social Apps Data

  • Airbnb Listings

  • Dazhong Dianping

  • Flickr

  • Weibo

#4: FreeCodeCamp Surveys

  • A survey of 40+ questions sent to 15k developers

#5: US Election Day Tweets

  • 500k tweets collected during US Election Day

#6: Image Recognition

  • Cats vs Dogs

  • Caltech 256 data

Talks

Pragmatic Natural Language Processing (NLP), Matt Fortier

Matt offered a high-level NLP framework covering preprocessing and vocabulary building, with an overview of the tooling: TextBlob, NLTK, Jieba and SnowNLP (for Chinese), and spaCy for preprocessing, and gensim and scikit-learn for modeling. He showed a demo of categorizing news articles into different categories and visualizing the results.

Link to presentation: https://github.com/fortiema/notebooks/blob/master/Pragmatic%20NLP.ipynb
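
As a rough illustration of the kind of pipeline Matt described (not his actual code), here is a minimal sketch that segments Chinese headlines with Jieba, builds TF-IDF vectors with scikit-learn, and clusters them into categories; the sample headlines and cluster count are invented for the example.

```python
# Minimal preprocess -> vectorize -> cluster sketch (illustrative only).
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical news headlines standing in for a real corpus.
headlines = [
    "上海地铁将新增三条线路",
    "央行宣布下调存款准备金率",
    "人工智能创业公司完成新一轮融资",
    "股市今日收盘小幅上涨",
]

# Preprocessing: segment Chinese text into space-separated tokens.
tokenized = [" ".join(jieba.cut(h)) for h in headlines]

# Build a vocabulary and TF-IDF matrix, then cluster into two rough categories.
vectors = TfidfVectorizer().fit_transform(tokenized)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for headline, label in zip(headlines, labels):
    print(label, headline)
```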

Introduction to R Programming and Data Visualization, Yangyang Xu

Yangyang introduced the differences between R and closed-source alternatives, and highlighted the features R offers for extracting, transforming, and loading data and for creating graphs.

Link to presentation: https://docs.google.com/presentation/d/1GX6PzxSvKtZvxmEMCItBxLXwRhNfjQvXmlI_hDVpM-c/edit#slide=id.p

Tensorflow for Poets, Jingsheng Wang

Jingsheng is a Drupal developer who uses TensorFlow to label images. He contrasted wide learning with deep learning (which learns all its features from the bottom up), using Image-net.org as a starting point and applying the model to as-yet-unrecognized images. TensorFlow is easy to install, and it takes only three commands to retrain a model and predict labels for new images with better-than-human accuracy.

Link to presentation: https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/
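
For flavor, here is a minimal sketch in the spirit of the codelab's final "label an image" step, using the TensorFlow 1.x graph-loading API. The file names and tensor names (retrained_graph.pb, retrained_labels.txt, final_result, DecodeJpeg/contents) follow the codelab's conventions but should be treated as assumptions, not a drop-in script.

```python
# Sketch of classifying one image with a graph retrained via the codelab.
# TensorFlow 1.x era API; file and tensor names are assumptions based on the codelab.
import tensorflow as tf

GRAPH_PATH = "retrained_graph.pb"     # assumed output of retrain.py
LABELS_PATH = "retrained_labels.txt"  # assumed output of retrain.py
IMAGE_PATH = "my_cat.jpg"             # any JPEG you want to label

labels = [line.strip() for line in tf.gfile.GFile(LABELS_PATH)]

with tf.gfile.GFile(GRAPH_PATH, "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

image_data = tf.gfile.GFile(IMAGE_PATH, "rb").read()

with tf.Session(graph=graph) as sess:
    probs = sess.run("final_result:0", {"DecodeJpeg/contents:0": image_data})[0]
    for i in probs.argsort()[-3:][::-1]:
        print(f"{labels[i]}: {probs[i]:.3f}")
```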

Applied Pandas for Data Transformation, Aurelien Petit

No, not the protected pandas: pandas is a Python data-analysis library. Aurelien Petit works in a company where data matters but which used to rely on spreadsheets, databases, and custom data structures. Pandas DataFrames make it much more convenient to manipulate data, with good performance on large datasets, and he uses them in a corporate environment on multi-gigabyte data.
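
As a hedged sketch of the kind of transformation Aurelien described, replacing spreadsheet-style joins and pivots with DataFrames, here is a small example; the CSV paths and column names are hypothetical.

```python
# Illustrative pandas transformation: join two tables, then aggregate by month.
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])  # one row per order
products = pd.read_csv("products.csv")                          # product_id -> category

# Join the two tables, then sum revenue per category per month.
merged = orders.merge(products, on="product_id", how="left")
monthly = (
    merged
    .assign(month=merged["order_date"].dt.to_period("M"))
    .groupby(["month", "category"])["amount"]
    .sum()
    .reset_index()
)
print(monthly.head())
```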

User focused data science and World Bank Open Data, Juha Suomalainen

Juha, a data nerd at Wiredcraft originally from Finland, talked about how to bring data to its audience: it is important to give the user context. He shared hints on serving visualizations, such as using Falcor or GraphQL and pre-processing the data as much as possible, along with some deep links into worldbank.org for accessing the data and thoughts on building web applications on top of big data.
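
As a minimal sketch of pulling one indicator from the World Bank open data API and pre-processing it into a tidy frame ready to serve to a front end: the indicator shown (NY.GDP.MKTP.CD, GDP in current US$) is a standard World Bank code, but treat the exact request as an assumption rather than Juha's recipe.

```python
# Fetch China's GDP series from the World Bank API and flatten it for a front end.
import requests
import pandas as pd

URL = "https://api.worldbank.org/v2/country/CHN/indicator/NY.GDP.MKTP.CD"
resp = requests.get(URL, params={"format": "json", "per_page": 100})
resp.raise_for_status()

# The API returns [metadata, rows]; keep only year/value pairs.
_, rows = resp.json()
df = (
    pd.DataFrame([{"year": r["date"], "gdp_usd": r["value"]} for r in rows])
    .dropna()
    .sort_values("year")
)
print(df.tail())
```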

Deep Convolutional Networks with Tensorflow, Shishi "Burness" Duan

Shishi Duan introduced two applications of deep learning: porn detection and resolution upscaling. He gave a high-level view of his workflow, from downloading and annotating the images to training the model, and explained how to train on a new dataset starting from an existing pre-trained model.

Project link: https://github.com/burness/tf_super_resolution
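
The "train a new dataset on an existing pre-trained model" idea is transfer learning. A minimal sketch of that idea using the Keras API (not Shishi's actual code; the dataset and class count are placeholders) looks roughly like this:

```python
# Sketch of fine-tuning a pre-trained ImageNet model on a new 2-class dataset.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(weights="imagenet", include_top=False,
                   pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained convolutional features

model = models.Sequential([
    base,
    layers.Dense(2, activation="softmax"),  # new classifier head for the new dataset
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_images, train_labels, epochs=5)  # train_* are your own arrays
```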

Data sciences applied to startups, TJ Spross

Skills to pick up: SQL, R (dplyr, tidyr, R Markdown), and Python (Jupyter, pandas). Startups usually do not start out with data scientists; that hire usually comes later. You do not need a complex infrastructure or the latest in machine learning to do data science. It is preferable to have a dedicated analytics pipeline, which has become easier thanks to increasingly easy-to-use toolsets such as Luigi. He recommends looking toward open-source and self-hosted options; for dashboards, tools like Tableau or open-source alternatives such as Superset. Ultimately, build the toolset that answers the questions of the business.

Link to presentation: https://docs.google.com/presentation/d/1EIcxrmlUbl4GuhpEXacDyaU_GsIUULY9eOI5DRsBpeM/edit?usp=sharing
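
As an illustration of the "dedicated analytics pipeline" point, here is a hedged sketch of a two-step Luigi pipeline; the task names, file paths, and metrics are invented for the example.

```python
# Sketch of a minimal Luigi analytics pipeline; paths and columns are hypothetical.
import luigi
import pandas as pd

class ExtractEvents(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/events_{self.date}.csv")

    def run(self):
        # In practice this would query the app's database; here we fake two rows.
        df = pd.DataFrame({"user_id": [1, 2], "event": ["signup", "purchase"]})
        with self.output().open("w") as f:
            df.to_csv(f, index=False)

class DailyMetrics(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractEvents(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/metrics/counts_{self.date}.csv")

    def run(self):
        with self.input().open() as f:
            df = pd.read_csv(f)
        counts = df.groupby("event").size().rename("count")
        with self.output().open("w") as f:
            counts.to_csv(f)

if __name__ == "__main__":
    luigi.run()  # e.g. python pipeline.py DailyMetrics --date 2016-12-10 --local-scheduler
```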

Team Presentations

Working schedule: the team got to know each other first, then started looking into the datasets. They found the terms used in the data hard to understand for people without a finance or business background. The team hypothesized that Chinese companies' FDI investment would correlate with their stock prices, and they used Tableau to visualize the data and test the hypothesis.
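
The correlation check itself is simple once the two series are lined up; here is a hedged pandas sketch with an invented file and column names, not the team's Tableau workbook.

```python
# Illustrative test of the FDI-vs-stock-price hypothesis; CSV and columns are hypothetical.
import pandas as pd

df = pd.read_csv("fdi_vs_stock.csv")  # e.g. columns: company, year, fdi_outflow, stock_return

# Pearson correlation between FDI outflow and stock return across company-years.
corr = df["fdi_outflow"].corr(df["stock_return"])
print(f"correlation: {corr:.2f}")
```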

Chinese Social Apps

Miguel Fernandez & Phil Mackenzie

This team of two used Python, Jupyter, and pandas to analyze data retrieved from different social apps. They hesitated between focusing on a single dataset and combining several, and eventually decided to focus on the Airbnb data. They created a heatmap in Python to investigate the relationship between an Airbnb listing's location and its price, and they found patterns in Weibo postings at different times of the day. They also used Folium to visualize the datasets on a map.
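
A heatmap like the one they described can be built with Folium's HeatMap plugin; a minimal sketch, assuming a listings file with latitude, longitude, and price columns:

```python
# Sketch of a Folium heatmap of Airbnb listings weighted by price (columns are assumed).
import folium
import pandas as pd
from folium.plugins import HeatMap

listings = pd.read_csv("airbnb_listings.csv")  # assumed columns: latitude, longitude, price

m = folium.Map(location=[31.23, 121.47], zoom_start=12)  # centred on Shanghai
HeatMap(listings[["latitude", "longitude", "price"]].values.tolist()).add_to(m)
m.save("airbnb_heatmap.html")
```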

US Election Tweets

Sylvain, presented by Ricky 

Sylvain from Coderbunker worked on analyzing the tweets. After cleaning and processing the data with Jupyter and pandas, he found the most common keywords in tweets before and after the election. Before the election most tweets were about voting, while afterwards tweets were dominated by keywords such as "senate", "democrats", and "republicans".
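
Counting the dominant keywords before and after election day takes only a few lines of pandas; a hedged sketch with assumed file and column names:

```python
# Sketch of finding the most common words before vs after election day (columns assumed).
from collections import Counter
import pandas as pd

tweets = pd.read_csv("election_tweets.csv", parse_dates=["created_at"])  # assumed columns
cutoff = pd.Timestamp("2016-11-09")

def top_words(texts, n=10):
    words = Counter()
    for text in texts:
        words.update(w.lower().strip("#@.,!?") for w in str(text).split())
    return words.most_common(n)

print("before:", top_words(tweets.loc[tweets["created_at"] < cutoff, "text"]))
print("after: ", top_words(tweets.loc[tweets["created_at"] >= cutoff, "text"]))
```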

Shanghai Social Apps Data

Karl Ritchie, Matt Fortier, Yunfang Lu and Sheng Li

The team tried to build a predictive model using Python, pandas, Folium and scikit-learn. They first cleaned up the datasets retrieved from Airbnb, Weibo, Dianping and Flickr, then inserted them into MongoDB. They used MongoDB's GeoJSON 2dsphere indexing capabilities to find all nearby points of interest, so their application would be able to estimate the attractiveness of an Airbnb listing.
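
The geospatial lookup they described maps to a 2dsphere index plus a $nearSphere query; a minimal pymongo sketch, assuming hypothetical database, collection, and field names:

```python
# Sketch of finding points of interest near a listing with MongoDB's 2dsphere index.
from pymongo import MongoClient, GEOSPHERE

client = MongoClient()
pois = client.opendata.points_of_interest  # database/collection names are assumptions

# Each document stores a GeoJSON point:
# {"name": "...", "location": {"type": "Point", "coordinates": [lng, lat]}}
pois.create_index([("location", GEOSPHERE)])

listing = [121.47, 31.23]  # lng, lat of a hypothetical Airbnb listing in Shanghai
nearby = pois.find({
    "location": {
        "$nearSphere": {
            "$geometry": {"type": "Point", "coordinates": listing},
            "$maxDistance": 1000,  # metres
        }
    }
})
for poi in nearby:
    print(poi.get("name"))
```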

Weibo Influencers

Mary Polites, Eva Xiao, Stephane Laurent, Jerry Liu, Ignacio López Busón

The team analyzed Weibo check-in data to find patterns among Weibo influencers. They classified users into three groups: normal users, influencers, and super influencers, and built a very nice visualization by clustering super influencers and hotspots. The model of the city was built in Rhinoceros 5, and the Grasshopper 3D plugin for Rhino allowed them to connect geometry with data. The judges selected this team as the winner for their practical insights and visualizations.
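
The three-tier split and hotspot clustering could be prototyped with pandas and scikit-learn; a hedged sketch with invented columns and thresholds, not the team's Rhino/Grasshopper workflow:

```python
# Illustrative clustering of Weibo check-in hotspots; columns and thresholds are assumptions.
import pandas as pd
from sklearn.cluster import KMeans

checkins = pd.read_csv("weibo_checkins.csv")  # assumed columns: user_id, lat, lng, followers

# Tier users by follower count: normal, influencer, super influencer.
checkins["tier"] = pd.cut(checkins["followers"],
                          bins=[0, 10_000, 1_000_000, float("inf")],
                          labels=["normal", "influencer", "super"])

# Cluster the check-in locations of super influencers into hotspots.
super_pts = checkins.loc[checkins["tier"] == "super", ["lat", "lng"]].copy()
super_pts["hotspot"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(super_pts)
print(super_pts.groupby("hotspot").size())
```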

Shanghai Air Quality

Ilya, Bart, Hang

The team decided to focus on the air quality data for Shanghai. They collected additional data about Shanghai's weather and built a predictive model based on wind strength, temperature, and other features. The team used Python, Jupyter, and pandas for data processing and visualization, and the popular scikit-learn library to build a regression model.
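
A hedged sketch of the kind of regression model they described, with an assumed data file and feature names:

```python
# Illustrative air-quality regression; the CSV and feature/target names are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

data = pd.read_csv("shanghai_air_weather.csv").dropna()
features = ["wind_strength", "temperature", "humidity"]  # hypothetical weather features
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["pm25"], test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```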

Final thoughts

Congratulations again to all the teams who survived the 24-hour hackathon. Thanks to all our speakers and attendees who joined the talks. Special thanks to all our sponsors and to the PatSnap team, who sponsored and joined our event. The event was a blast, and we can't wait to hold the next Open Data Shanghai soon. Hope to see you all again!
