I found some good projects on GitHub. We need to implement our own project according to the course requirements.
Prerequisites
- Big Data
- Spark
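Existing projects
Spark-based movie recommendation system
A movie recommendation system built on a big data filtering engine: the "懂你" movie website, which includes a crawler, the movie website (front end and back end), an admin system, and a recommendation system (Spark)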
News_recommend
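A news recommendation system built on a big data computing engine: the "今日小站" site, which includes a crawler, the news website (front end and back end), and a recommendation system (Spark)
Shenzhen Metro big data passenger flow analysis system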
!: This one is FUCKING HARD
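It mainly analyzes Shenzhen Tong card-swipe data, studying Shenzhen Metro's passenger capacity from a big data perspective and exploring directions for improving Shenzhen Metro's service
Spark 2.x-based real-time big data analysis and visualization system for a news site
Changes
Following our earlier proposal, we need to take a Spark-based big data system, combine it with AWS, and apply it to COVID or some other disease.
Foot traffic + epidemic peak
Scrape policy + peak data, then choose the best policy based on the epidemic situation and different political requirements
Scrape Twitter / Reddit or some other media, predict the new version of the virus
Scrape real-time Twitter / Reddit or some other media data, visualize the COVID graph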
David
Well, assuming we can get data for a non-COVID virus, a recommender system would, as you said, probably be a good fit for the type of question we are asking, which would essentially be using the policy data to predict virus amount. It might also be possible to get away with a simpler regression if we convert some of the categorical data into something numeric.
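To make the "simpler regression" idea concrete, here is a minimal sketch in plain Python. Everything specific in it is an assumption for illustration: the three policy levels, the ordinal mapping `POLICY_LEVELS`, and the case counts in `records` are invented toy values, not real data, and a real version would more likely run on Spark with many more features.

```python
# Toy illustration only: policy levels and case counts are invented.
# Hypothetical ordinal encoding of a categorical policy field.
POLICY_LEVELS = {"none": 0, "recommended": 1, "required": 2}


def encode_policy(level):
    """Map a categorical policy level to a number."""
    return POLICY_LEVELS[level]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# (policy level, weekly cases) pairs; made-up numbers.
records = [
    ("none", 120), ("none", 100),
    ("recommended", 80), ("recommended", 70),
    ("required", 40), ("required", 30),
]

xs = [encode_policy(level) for level, _ in records]
ys = [cases for _, cases in records]
slope, intercept = fit_line(xs, ys)

# Predicted weekly cases for a region whose policy is "recommended".
predicted = slope * encode_policy("recommended") + intercept
```

An ordinal mapping only makes sense when the categories have a natural order (stricter vs. looser); for unordered categories, one-hot encoding would be the usual alternative.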
A completely different question, one that wouldn't require finding data for any other virus, just more recent COVID data, would be whether past policy data combined with COVID numbers predicts current policy, which could be a decision tree. The thinking there is that as time has gone by, lots of places have relaxed their policies, so how did virus numbers impact those decisions? We would be detecting trends in how virus numbers impacted policy, and not the other way around.
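The decision tree idea could look roughly like the sketch below: a small CART-style classifier written in plain Python that splits on Gini impurity. The feature layout (past policy level, weekly cases per 100k), the labels "relaxed" / "strict", and all the numbers are hypothetical toy values chosen for illustration; in practice a library implementation would probably replace the hand-written tree.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Build a tree of (feature, threshold, left, right) tuples; leaves are labels."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority label
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    )


def predict(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Invented rows: (past policy level 0-2, weekly cases per 100k) -> current policy.
rows = [(2, 300), (2, 250), (2, 40), (1, 30), (1, 200), (0, 20), (0, 180), (1, 15)]
labels = ["strict", "strict", "relaxed", "relaxed",
          "strict", "relaxed", "strict", "relaxed"]

tree = build_tree(rows, labels)
guess = predict(tree, (2, 35))
```

On this toy data the tree splits only on the case-count feature, which is the kind of pattern the question is after: virus numbers, rather than past policy, driving the current policy.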
| | Proposal 1 | Proposal 2 |
|---|---|---|
| data1 | data for non-COVID virus (like flu) | COVID data |
| data2 | policy data | policy data (focus on COVID data) |
| difficulty* | we may fail to collect enough data for a common virus (CDC can't offer detailed data) | # |
| data cleaning | convert some of the categorical data into something numeric | / |
| method | regression | decision tree |
| result | recommendation system | combine past policy data and COVID numbers to predict current policy |
| goal | predict virus amount | detect trends in how virus numbers impacted policy and not the other way around |