e Computer Science homework help. Objectives
- Gain in depth experience playing around with big data tools (Hive, SparkRDDs, and Spark SQL).
- Solve challenging big data processing tasks by finding highly efficient solutions; very inefficient solutions will lose marks.
- Experience processing three different types of real data
- Standard multi-attribute data (video game sales data)
- Time series data (Twitter feed)
- Bag of words data
- Practice using programming APIs to find the best API calls to solve your problem. Here are the API descriptions for Hive, Spark (especially spark look under RDD. There are a lot of really useful API calls).
- [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual
- [Spark] http://spark.apache.org/docs/latest/api/scala/index.html#package
- [Spark SQL] https://spark.apache.org/docs/latest/sql-programming-guide.html
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
https://spark.apache.org/docs/latest/api/sql/index.html
- Hint: If you are not sure what a spark API call does, try to write a small example and try it in the spark shell
Submission checklist
- Ensure that all of your solutions read their input from the full data files (not the small example versions)
- Check that all of your solutions run without crashing on the CloudxLab interface.
- Delete all output files
- Zip the whole assignment folder and submit via LMS
Copying, Plagiarism
This is an individual assignment. You are not permitted to work as a part of a group when writing this assignment.
Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. For individual assignments, plagiarism includes the case where two or more students work collaboratively on the assignment. The Department of Computer Science and Computer Engineering treats plagiarism very seriously. When it is detected, penalties are strictly imposed.
Expected quality of solutions
- In general, writing more efficient code (less reading/writing from/into HDFS and less data shuffles) will be rewarded with more marks.
- This entire assignment can be done using the CloudxLab servers and the supplied data sets without running out of memory. It is time to show your skills!
- We are not too fussed about the layout of the output. As long as it looks similar to the example outputs for each task. That will be good enough. The idea is not to spend too much time massaging the output to be the right format but instead to spend the time to solve problems.
- For Hive queries. We prefer answers that use less tables.
The questions in the assignment will be labelled using the following:
- [Hive]
- Means this question needs to be done using Hive
- [Spark RDD]
- Means this question needs to be done using Spark RDDs, you are not allowed to use any Spark SQL features like dataframe or datasets.
- [Spark SQL]
- Means this question needs to be done using Spark SQL and therefore not allowed to use RDDs. You need to do these questions using the Spark dataframe or dataset API – do not use SQL syntax. i.e. you must not use the spark.sql() function; instead you must use the individual functions (e.g. .select() and .join())
- For example, write code like this:
- Don’t write code like this:
Do the entire assignment using CloudxLab.
Assignment structure:
- For each Hive question a skeleton .hql file is provided for you to write your solution in. You can run these just like you did in labs:
$ hive -f Task_XX.hql
- For each Spark question, a skeleton project is provided for you. Write your solution in the .scala file. Run your Spark code using:
$ spark-shell -i Task_XX.scala
Tips:
- Look at the data files before you begin each task. Try to understand what you are dealing with! The data for Spark questions is already on HDFS so you don’t need to write the data-loading.
- For each subtask we provide small example input and the corresponding output in the assignment specifications below. These small versions of the files are also supplied with the assignment (they have “-small” in the name). It’s a good idea to get your solution working on the small inputs first before moving on to the full files.
- In addition to testing the correctness of your code using the very small example input, you should also use the large input files that we provide to test the scalability of your solutions. If your solution takes longer than 10 minutes to run on the large input, it definitely has scalability problems.
- It can take some time to run Spark applications. So for the Spark questions it’s best to experiment using the spark-shell interpreter first to figure out a working solution, and then put your code into the .scala files afterwards.
Task 1: Analysing Video Game Sales [37 marks total]
We will be doing some analytics on data that contains a list of video games with sales greater than 100k copies[1]. The data is stored in a semicolon (“;”) delimited format.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version |
t1/Input_data/vgsales-small.csv |
Full version |
t1/Input_data/vgsales.csv |
HDFS:
Small version |
hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv |
Full version |
hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales.csv |
The data has the following attributes
Attribute index | Attribute name | Description |
0 | Name | The game name |
1 | Platform | Platform of the games release (Categorical: “Wii”,”NES”,”GB”,”DS”,”X360″,”PS3″,”PS2″,”SNES”,”GBA”, “3DS”,”PS4″,”N64″,”PS”,”XB”,”PC”,”2600″,”PSP”,”XOne”, “GC”,”WiiU”,”GEN”,”DC”,”PSV”,”SAT”,”SCD”,”WS”,”NG”, “TG16″,”3DO”,”GG”,”PCFX”) |
2 | Year | Year of the game’s release |
3 | Genre | Genre of the game (categorical: “Sports”,”Platform”,”Racing”, “Role-Playing”,”Puzzle”,”Misc”,”Shooter”,”Simulation”, “Action”,”Fighting”,”Adventure”,”Strategy”) |
4 | Publisher | Publisher of the game (categorical: “Nintendo”,”Microsoft Game Studios”,”Take-Two Interactive”,”Sony Computer Entertainment”,”Activision”,”Ubisoft”,”Bethesda Softworks”, “Electronic Arts”,”Sega”,”SquareSoft”,”Atari”,”505 Games”, …, and 567 other unique records) |
5 | NA_Sales | Sales in North America (in millions) |
6 | EU_Sales | Sales in Europe (in millions) |
7 | JP_Sales | Sales in Japan (in millions) |
8 | Other_Sales | Sales in the rest of the world (in millions) |
Here is a small example of the data that we will use to illustrate the subtasks below:
Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | |
Gran Turismo 3: A-Spec | PS2 | 2001 | Racing | Sony Computer Entertainment | 6.85 | 5.09 | 1.87 | 1.16 | |
Call of Duty: Modern Warfare 3 | X360 | 2011 | Shooter | Activision | 9.03 | 4.28 | 0.13 | 1.32 | |
Pokemon Yellow: Special Pikachu Edition | GB | 1998 | Role-Playing | Nintendo | 5.89 | 5.04 | 3.12 | 0.59 | |
Call of Duty: Black Ops | X360 | 2010 | Shooter | Activision | 9.67 | 3.73 | 0.11 | 1.13 | |
Pokemon HeartGold/Pokemon SoulSilver | DS | 2009 | Action | Nintendo | 4.4 | 2.77 | 3.96 | 0.77 | |
High Heat Major League Baseball 2003 | PS2 | 2002 | Sports | 3DO | 0.18 | 0.14 | 0 | 0.05 | |
Panzer Dragoon | SAT | 1995 | Shooter | Sega | 0 | 0 | 0.37 | 0 | |
Corvette | GBA | 2003 | Racing | TDK Mediactive | 0.2 | 0.07 | 0 | 0.01 | |
Resident Evil: Revelations | PS3 | 2013 | Action | Capcom | 0.14 | 0.32 | 0.22 | 0.12 | |
Mission: Impossible – Operation Surma | PS2 | 2003 | Platform | Atari | 0.14 | 0.11 | 0 | 0.04 | |
Suikoden Tierkreis | DS | 2008 | Role-Playing | Konami Digital Entertainment | 0.09 | 0.01 | 0.15 | 0.01 | |
Dragon Ball GT: Final Bout | PS | 1997 | Fighting | Namco Bandai Games | 0.02 | 0.02 | 0.22 | 0.02 | |
Harry Potter and the Sorcerer’s Stone | PS2 | 2003 | Action | Electronic Arts | 0.14 | 0.11 | 0 | 0.04 | |
Tropico 4 | PC | 2011 | Strategy | Kalypso Media | 0.1 | 0.13 | 0 | 0.04 | |
Call of Juarez: The Cartel | X360 | 2011 | Shooter | Ubisoft | 0.14 | 0.11 | 0 | 0.03 | |
Prince of Persia: The Two Thrones | GC | 2005 | Action | Ubisoft | 0.11 | 0.03 | 0 | 0 | |
NHL 07 | PSP | 2006 | Sports | Electronic Arts | 0.13 | 0.02 | 0 | 0.02 | |
Disney’s Winnie the Pooh’s Rumbly Tumbly Adventure | GBA | 2005 | Platform | Ubisoft | 0.09 | 0.03 | 0 | 0 | |
Batman & Robin | PS | 1998 | Action | Acclaim Entertainment | 0.06 | 0.04 | 0 | 0.01 | |
Spider-Man: Battle for New York | DS | 2006 | Platform | Activision | 0.12 | 0 | 0 | 0.01 | |
Please note we specify whether you should use [Hive] or [Spark RDD] for each subtask at the beginning of each subtask. There is a skeleton file provided for each of the tasks.
- [Hive] Report the top 3 games for the ‘X360’ console in North America based on their sales in descending order. Write the results to “Task_1a-out”. For the above small example data set you would report the following (Name, Genre, NA_Sales):
Call of Duty: Black Ops Shooter 9.67
Call of Duty: Modern Warfare 3 Shooter 9.03
Call of Juarez: The Cartel Shooter 0.14
[7 marks]
- [Hive] Report the number of games for each pair of platform and genre, in descending order of count. Write the results to “Task_1b-out”. Hint you can group data by more than one column. For the small example data set you would report the following (Platform, Genre, count):
X360 Shooter 3
PS Action 1
SAT Shooter 1
PSP Sports 1
PS3 Action 1
PS2 Sports 1
PS2 Racing 1
PS2 Platform 1
PS2 Action 1
PS Fighting 1
PC Strategy 1
GC Action 1
GBA Racing 1
GBA Platform 1
GB Role-Playing 1
DS Role-Playing 1
DS Platform 1
DS Action 1
[7 marks]
- [Spark RDD] Find the highest and lowest selling genre based on global sale. Print the result to the terminal output using println. For the small example data set you should get the following results:
(Formula🙂 [Global Sales = NA_Sales + EU_Sales + JP_Sales]
Highest selling Genre: Shooter Global Sale (in millions): 27.57
Lowest selling Genre: Strategy Global Sale (in millions): 0.23
[9 marks]
- [Spark RDD] Report the top 50 publishers by market share (percentage of games made by the publisher among all games), along with the total number of games they made. Output should be in descending order of market share. Print the result to the terminal output using println. Hint: The Spark RDD method take might be useful. For the small example data set you would report the following (publisher name, number of games made by the published, percentage of games made by the publisher among all games):
(Ubisoft,3,15.0)
(Activision,3,15.0)
(Electronic Arts,2,10.0)
(Nintendo,2,10.0)
(Acclaim Entertainment,1,5.0)
(Sega,1,5.0)
(3DO,1,5.0)
(Namco Bandai Games,1,5.0)
(TDK Mediactive,1,5.0)
(Sony Computer Entertainment,1,5.0)
(Konami Digital Entertainment,1,5.0)
(Kalypso Media,1,5.0)
(Capcom,1,5.0)
(Atari,1,5.0)
[14 marks]
Task 2: Analysing Twitter Time Series Data [30 marks]
In this task we will be doing some analytics on real Twitter data[2]. The data is stored in a tab (“\t”) delimited format.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version |
t2/Input_data/twitter-small.tsv |
Full version |
t2/Input_data/twitter.tsv |
HDFS:
Small version |
hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter-small.tsv |
Full version |
hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter.tsv |
The data has the following attributes
Attribute index | Attribute name | Description |
0 | tokenType | In our data set all rows have Token type of hashtag. So this attribute is useless for this assignment. |
1 | month | The year and month specified like the following: YYYYMM. So 4 digits for year followed by 2 digits for month. So like the following 200905, meaning the year 2009 and month of May |
2 | count | An integer representing the number tweets of this hash tag for the given year and month |
3 | hashtagName | The #tag name, e.g. babylove, mydate, etc. |
Here is a small example of the Twitter data that we will use to illustrate the subtasks below:
Token type | Month | count | Hash Tag Name |
hashtag | 200910 | 2 | Babylove |
hashtag | 200911 | 2 | babylove |
hashtag | 200912 | 90 | babylove |
hashtag | 200812 | 100 | mycoolwife |
hashtag | 200901 | 201 | mycoolwife |
hashtag | 200910 | 1 | mycoolwife |
hashtag | 200912 | 500 | mycoolwife |
hashtag | 200905 | 23 | abc |
hashtag | 200907 | 1000 | abc |
- [Hive] Find the top 5 months with the highest number of cumulative tweets and sort it according to the number of tweets of each month. Write the results to “Task_2a-out”. The month with the highest number of cumulative tweets should come first. So, for the above small example dataset the result would be:
month numtweets
200907 1000
200912 590
200901 201
200812 100
200905 23
[7 marks]
- [Spark RDD] Find the year with the maximum number of cumulative tweets in the entire dataset across all months. Report the total number of tweets for that year. Print the result to the terminal output using println. For the above small example dataset the output would be:
2009 1819
[9 marks]
- [Spark RDD] Given two months x and y, where y > x, find the hashtag name that has increased the number of tweets the most from month x to month y. We have already written code in your code template that reads the x and y values from the keyboard. Ignore the tweets in the months between x and y, so just compare the number of tweets at month x and at month y. Report the hashtag name, the number of tweets in months x and y. Ignore any hashtag names that had no tweets in either month x or y. You can assume that the combination of hashtag and month is unique. Print the result to the terminal output using println. For the above small example data set the output should be the following:
Input x = 200910, y = 200912
Output hashtagName: mycoolwife, countX: 1, countY: 500
[14 marks]
Task 3: Indexing Bag of Words data [33 marks]
In this task you are asked to create a partitioned index of words to documents that contain the words. Using this index you can search for all the documents that contain a particular word efficiently.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version |
t3/Input_data/docword-small.txt |
t3/Input_data/vocab-small.txt |
Full version |
t3/Input_data/docword.txt |
t3/Input_data/vocab.txt |
HDFS:
Small version |
hdfs:///user/ashhall1616/bdc_data/assignment/t3/docword-small.txt |
hdfs:///user/ashhall1616/bdc_data/assignment/t3/vocab-small.txt |
Full version |
hdfs:///user/ashhall1616/bdc_data/assignment/t3/docword.txt |
hdfs:///user/ashhall1616/bdc_data/assignment/t3/vocab.txt |
The first file is called docword.txt, which contains the contents of all the documents stored in the following format:
Attribute index | Attribute name | Description |
0 | docId | The ID of the document that contains the word |
1 | vocabId | Instead of storing the word itself, we store an ID from the vocabulary file. |
2 | count | An integer representing the number of times this word occurred in this document. |
The second file called vocab.txt contains each word in the vocabulary, which is indexed by vocabIndex from the docword.txt file.
Here is a small example content of the docword.txt file.
docId | vocabId | Count |
3 | 3 | 600 |
2 | 3 | 702 |
1 | 2 | 120 |
2 | 5 | 200 |
2 | 2 | 500 |
3 | 1 | 100 |
3 | 5 | 2000 |
3 | 4 | 122 |
1 | 3 | 1200 |
1 | 1 | 1000 |
Here is an example of the vocab.txt file
vocabId | Word |
1 | Plane |
2 | Car |
3 | Motorbike |
4 | Truck |
5 | Boat |
Complete the following subtasks using Spark:
- [spark SQL] Calculate the total count of each word across all documents. List the words ordered by count in descending order. Write the results in csv format at “file:///home/USERNAME/Task_3a-out”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the dataframe that you saved. So, for the above small example input the output would be the following:
+———+———-+
| word|sum(count)|
+———+———-+
|motorbike| 2502|
| boat| 2200|
| plane| 1100|
| car| 620|
| truck| 122|
+———+———-+
Note: spark SQL will give the output in multiple files. You should ensure that the data is sorted globally across all the files (parts). So, all words in part 0, will be alphabetically before the words in part 1.
[10 marks]
- [spark SQL]
- Create a dataframe containing rows with four fields: (word, docId, count, firstLetter). You should add the firstLetter column by using a UDF which extracts the first letter of word as a String. Write the results in parquet format partitioned by firstLetter at “file:///home/USERNAME/t3_docword_index_part.parquet”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the partitioned dataframe that you saved.
So, for the above example input, you should see the following output (the exact ordering is not important):
- Create a dataframe containing rows with four fields: (word, docId, count, firstLetter). You should add the firstLetter column by using a UDF which extracts the first letter of word as a String. Write the results in parquet format partitioned by firstLetter at “file:///home/USERNAME/t3_docword_index_part.parquet”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the partitioned dataframe that you saved.
+———+—–+—–+———–+
| word|docId|count|firstLetter|
+———+—–+—–+———–+
| plane| 1| 1000| p|
| plane| 3| 100| p|
| car| 2| 500| c|
| car| 1| 120| c|
|motorbike| 1| 1200| m|
|motorbike| 2| 702| m|
|motorbike| 3| 600| m|
| truck| 3| 122| t|
| boat| 3| 2000| b|
| boat| 2| 200| b|
+———+—–+—–+———–+
[9 marks]
- [spark SQL] Load the dataframe stored in partitioned parquet format from subtask b). The task template includes code to obtain a list of query words from the user. For each word in the queryWords list use println to display the following: the word and; the docId with the largest count for that word (you can break ties arbitrarily). Skip any query words that aren’t found in the dataset. To iterate over the query words, use a normal Scala for loop, like this:
for(queryWord <- queryWords) {
// …put Spark query code here…
}
For this subtask there is a big optimisation that can be made because of how the data is partitioned. So, think carefully about how you filter the data.
If queryWords contains “car”, “dog”, and “truck”, then the output for the example dataset would be:
[car,2]
[truck,3]
[7 marks]
- [spark SQL] Again load the dataframe stored in partitioned parquet format from subtask b). The task template includes code to obtain a list of document IDs from the user. For each document ID in the docIds list use println to display the following: the document ID and; the word with the most occurrences in that document, along with the number of occurrences of that word in the document (you can break ties arbitrarily). Skip any document IDs that aren’t found in the dataset. Note in this task we are searching via document ID instead of query word so we cannot take advantage of the same optimization you used for part c). So, you need to find some other kind of optimization. Hint: think about the fact you are repeatedly reusing the same dataframe in the scala for loop as you process each docId.
If docIds contains “2” and “3”, then the output for the example dataset would be:
[2, motorbike, 702]
[3, boat, 2000]
[7 marks]
Bonus Marks:
- Using spark, perform the following task using the Twitter dataset from task 2.
[Spark RDD or Spark SQL] Find the hash tag name that has increased the number of tweets the most from among any two consecutive months of any hash tag name. Consecutive month means for example, 200801 to 200802, or 200902 to 200903, etc. Report the hash tag name, the 1st month count, and the 2nd month count using println.
For the small example data set of task 2 the output would be:
Hashtag name: mycoolwife
count of month 200812: 100
count of month 200901: 201
Note: To get the bonus marks, you need to only complete the task using spark RDD OR SQL – not both. Both template files have been provided for you to choose between.
[10 marks]
Total Marks:
Please note that the total mark for this assignment is capped at 100. If your marks add to more than 100 then your final mark will be 100.
Return of Assignments
Departmental Policy requires that assignments be returned within three weeks of the submission date. We will endeavour to have your assignment returned before the BDC exam. The time and place of the return will be posted on LMS
[1] : Video games Sales data source: https://www.kaggle.com/gregorut/videogamesales
[2] : Twitter data source: http://www.infochimps.com/datasets/twitter-census-conversation-metrics-one-year-of-urls-hashtags-sm