Objectives
 

  1. Gain in-depth experience using big data tools (Hive, Spark RDDs, and Spark SQL).
  2. Solve challenging big data processing tasks by finding highly efficient solutions; very inefficient solutions will lose marks.
  3. Experience processing three different types of real data:
    1. Standard multi-attribute data (video game sales data)
    2. Time series data (Twitter feed)
    3. Bag of words data

 

  4. Practice using programming APIs to find the best API calls to solve your problem. The API documentation for Hive, Spark, and Spark SQL is linked below (for Spark, look under RDD; there are a lot of really useful API calls).

 

  1. [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual
  2. [Spark] http://spark.apache.org/docs/latest/api/scala/index.html#package
  3. [Spark SQL] https://spark.apache.org/docs/latest/sql-programming-guide.html

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
https://spark.apache.org/docs/latest/api/sql/index.html

  • Hint: If you are not sure what a Spark API call does, write a small example and run it in the spark-shell, as in the sketch below.
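For instance, if you are not sure what reduceByKey does, a toy experiment like the following (the data here is made up purely for illustration) answers the question immediately:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // sc is the SparkContext created by the spark-shell
pairs.reduceByKey(_ + _).collect()                              // Array((a,4), (b,2)); order may vary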

 
 
Submission checklist
 

  • Ensure that all of your solutions read their input from the full data files (not the small example versions)
  • Check that all of your solutions run without crashing on the CloudxLab interface.
  • Delete all output files
  • Zip the whole assignment folder and submit it via LMS

 
 
Copying, Plagiarism
 
This is an individual assignment. You are not permitted to work as a part of a group when writing this assignment.
 
Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. For individual assignments, plagiarism includes the case where two or more students work collaboratively on the assignment.  The Department of Computer Science and Computer Engineering treats plagiarism very seriously.  When it is detected, penalties are strictly imposed.
 
Expected quality of solutions
 

  1. In general, writing more efficient code (less reading/writing from/into HDFS and less data shuffles) will be rewarded with more marks.
  2. This entire assignment can be done using the CloudxLab servers and the supplied data sets without running out of memory. It is time to show your skills!
  3. We are not too fussed about the layout of the output; as long as it looks similar to the example output for each task, that will be good enough. The idea is not to spend too much time massaging the output into exactly the right format, but to spend the time solving the problems.
  4. For Hive queries, we prefer answers that use fewer tables.

 
The questions in the assignment will be labelled using the following:

  • [Hive]
    • Means this question needs to be done using Hive
  • [Spark RDD]
    • Means this question needs to be done using Spark RDDs; you are not allowed to use any Spark SQL features such as dataframes or datasets.
  • [Spark SQL]
    • Means this question needs to be done using Spark SQL, so you are not allowed to use RDDs. You need to do these questions using the Spark dataframe or dataset API; do not use SQL syntax, i.e. you must not use the spark.sql() function. Instead, use the individual dataframe functions (e.g. .select() and .join()).
    • For example, write code in the dataframe-API style shown in the first part of the sketch below, and don’t write code like the spark.sql() call in the second part.
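As a minimal illustrative sketch (the dataframe and its contents below are made up and are not the assignment data), the contrast between the two styles looks like this in the spark-shell:

// Toy dataframe; spark and its implicits are provided by the spark-shell.
import spark.implicits._
val gamesDf = Seq(("Halo", "Shooter"), ("FIFA", "Sports")).toDF("Name", "Genre")

// Allowed for [Spark SQL] questions: call the dataframe/dataset methods directly.
val shooters = gamesDf.select("Name", "Genre").filter($"Genre" === "Shooter")
shooters.show()

// Not allowed for [Spark SQL] questions: passing SQL text to spark.sql().
gamesDf.createOrReplaceTempView("games")
val shootersSql = spark.sql("SELECT Name, Genre FROM games WHERE Genre = 'Shooter'")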

 
Do the entire assignment using CloudxLab.
Assignment structure:
 
 

  • For each Hive question a skeleton .hql file is provided for you to write your solution in. You can run these just like you did in labs:

$ hive -f Task_XX.hql
 

  • For each Spark question, a skeleton project is provided for you. Write your solution in the .scala file. Run your Spark code using:

$ spark-shell -i Task_XX.scala
 
 
Tips:
 

  1. Look at the data files before you begin each task. Try to understand what you are dealing with! The data for the Spark questions is already on HDFS, so you don’t need to write the data-loading code. (See the short sketch after these tips for a quick way to peek at a file from the spark-shell.)
  2. For each subtask we provide small example input and the corresponding output in the assignment specifications below. These small versions of the files are also supplied with the assignment (they have “-small” in the name). It’s a good idea to get your solution working on the small inputs first before moving on to the full files.
  3. In addition to testing the correctness of your code using the very small example input, you should also use the large input files that we provide to test the scalability of your solutions. If your solution takes longer than 10 minutes to run on the large input, it definitely has scalability problems.
  4. It can take some time to run Spark applications. So for the Spark questions it’s best to experiment using the spark-shell interpreter first to figure out a working solution, and then put your code into the .scala files afterwards.
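For example (illustrative only), you can peek at the small Task 1 file from the spark-shell before writing any solution code, using the HDFS path listed under Task 1 below:

val lines = sc.textFile("hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv")
lines.take(3).foreach(println)                             // look at a few raw rows
lines.first().split(";").zipWithIndex.foreach(println)     // attribute values paired with their index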

Task 1: Analysing Video Game Sales [37 marks total]
 
We will be doing some analytics on data that contains a list of video games with sales greater than 100k copies[1]. The data is stored in a semicolon (“;”) delimited format.
 
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
 
Local filesystem:

Small version
t1/Input_data/vgsales-small.csv
Full version
t1/Input_data/vgsales.csv

 
HDFS:

Small version
hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv
Full version
hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales.csv

 
 
The data has the following attributes
 

Attribute index Attribute name Description
0 Name The game name
1 Platform Platform of the game's release (categorical: "Wii", "NES", "GB", "DS", "X360", "PS3", "PS2", "SNES", "GBA",
"3DS", "PS4", "N64", "PS", "XB", "PC", "2600", "PSP", "XOne",
"GC", "WiiU", "GEN", "DC", "PSV", "SAT", "SCD", "WS", "NG",
"TG16", "3DO", "GG", "PCFX")
2 Year Year of the game’s release
3 Genre Genre of the game (categorical: “Sports”,”Platform”,”Racing”,
“Role-Playing”,”Puzzle”,”Misc”,”Shooter”,”Simulation”,
“Action”,”Fighting”,”Adventure”,”Strategy”)
4 Publisher Publisher of the game (categorical: “Nintendo”,”Microsoft Game Studios”,”Take-Two Interactive”,”Sony Computer Entertainment”,”Activision”,”Ubisoft”,”Bethesda Softworks”,
“Electronic Arts”,”Sega”,”SquareSoft”,”Atari”,”505 Games”, …, and 567 other unique records)
5 NA_Sales Sales in North America (in millions)
6 EU_Sales Sales in Europe (in millions)
7 JP_Sales Sales in Japan (in millions)
8 Other_Sales Sales in the rest of the world (in millions)

 
 
Here is a small example of the data that we will use to illustrate the subtasks below:
 

Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales
Gran Turismo 3: A-Spec PS2 2001 Racing Sony Computer Entertainment 6.85 5.09 1.87 1.16
Call of Duty: Modern Warfare 3 X360 2011 Shooter Activision 9.03 4.28 0.13 1.32
Pokemon Yellow: Special Pikachu Edition GB 1998 Role-Playing Nintendo 5.89 5.04 3.12 0.59
Call of Duty: Black Ops X360 2010 Shooter Activision 9.67 3.73 0.11 1.13
Pokemon HeartGold/Pokemon SoulSilver DS 2009 Action Nintendo 4.4 2.77 3.96 0.77
High Heat Major League Baseball 2003 PS2 2002 Sports 3DO 0.18 0.14 0 0.05
Panzer Dragoon SAT 1995 Shooter Sega 0 0 0.37 0
Corvette GBA 2003 Racing TDK Mediactive 0.2 0.07 0 0.01
Resident Evil: Revelations PS3 2013 Action Capcom 0.14 0.32 0.22 0.12
Mission: Impossible – Operation Surma PS2 2003 Platform Atari 0.14 0.11 0 0.04
Suikoden Tierkreis DS 2008 Role-Playing Konami Digital Entertainment 0.09 0.01 0.15 0.01
Dragon Ball GT: Final Bout PS 1997 Fighting Namco Bandai Games 0.02 0.02 0.22 0.02
Harry Potter and the Sorcerer’s Stone PS2 2003 Action Electronic Arts 0.14 0.11 0 0.04
Tropico 4 PC 2011 Strategy Kalypso Media 0.1 0.13 0 0.04
Call of Juarez: The Cartel X360 2011 Shooter Ubisoft 0.14 0.11 0 0.03
Prince of Persia: The Two Thrones GC 2005 Action Ubisoft 0.11 0.03 0 0
NHL 07 PSP 2006 Sports Electronic Arts 0.13 0.02 0 0.02
Disney’s Winnie the Pooh’s Rumbly Tumbly Adventure GBA 2005 Platform Ubisoft 0.09 0.03 0 0
Batman & Robin PS 1998 Action Acclaim Entertainment 0.06 0.04 0 0.01
Spider-Man: Battle for New York DS 2006 Platform Activision 0.12 0 0 0.01

 
Please note we specify whether you should use [Hive] or [Spark RDD] for each subtask at the beginning of each subtask. There is a skeleton file provided for each of the tasks.
 

  a. [Hive] Report the top 3 games for the ‘X360’ platform based on their sales in North America, in descending order. Write the results to “Task_1a-out”. For the above small example data set you would report the following (Name, Genre, NA_Sales):

 
Call of Duty: Black Ops       Shooter          9.67
Call of Duty: Modern Warfare 3     Shooter          9.03
Call of Juarez: The Cartel  Shooter          0.14
 
[7 marks]
 

  b. [Hive] Report the number of games for each pair of platform and genre, in descending order of count. Write the results to “Task_1b-out”. Hint: you can group data by more than one column. For the small example data set you would report the following (Platform, Genre, count):

 
X360   Shooter          3
PS       Action             1
SAT    Shooter          1
PSP    Sports             1
PS3    Action             1
PS2    Sports             1
PS2    Racing           1
PS2    Platform         1
PS2    Action             1
PS       Fighting         1
PC      Strategy         1
GC      Action             1
GBA    Racing           1
GBA    Platform         1
GB      Role-Playing 1
DS      Role-Playing 1
DS      Platform         1
DS      Action             1
[7 marks]
 

  c. [Spark RDD] Find the highest and lowest selling genres based on global sales. Print the result to the terminal output using println. For the small example data set you should get the following results:

Formula: Global Sales = NA_Sales + EU_Sales + JP_Sales
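For example, using the small data set above, the Shooter genre totals (9.03 + 4.28 + 0.13) + (9.67 + 3.73 + 0.11) + (0 + 0 + 0.37) + (0.14 + 0.11 + 0) = 27.57 million.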
 
Highest selling Genre: Shooter     Global Sale (in millions): 27.57
Lowest selling Genre: Strategy     Global Sale (in millions): 0.23
 
[9 marks]
 

  d. [Spark RDD] Report the top 50 publishers by market share (percentage of games made by the publisher among all games), along with the total number of games they made. Output should be in descending order of market share. Print the result to the terminal output using println. Hint: the Spark RDD method take might be useful. For the small example data set you would report the following (publisher name, number of games made by the publisher, percentage of games made by the publisher among all games):

 
(Ubisoft,3,15.0)
(Activision,3,15.0)
(Electronic Arts,2,10.0)
(Nintendo,2,10.0)
(Acclaim Entertainment,1,5.0)
(Sega,1,5.0)
(3DO,1,5.0)
(Namco Bandai Games,1,5.0)
(TDK Mediactive,1,5.0)
(Sony Computer Entertainment,1,5.0)
(Konami Digital Entertainment,1,5.0)
(Kalypso Media,1,5.0)
(Capcom,1,5.0)
(Atari,1,5.0)
 
[14 marks]
 
Task 2: Analysing Twitter Time Series Data [30 marks]
 
In this task we will be doing some analytics on real Twitter data[2]. The data is stored in a tab (“\t”) delimited format.
 
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
 
Local filesystem:

Small version
t2/Input_data/twitter-small.tsv
Full version
t2/Input_data/twitter.tsv

 
HDFS:

Small version
hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter-small.tsv
Full version
hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter.tsv

 
 
The data has the following attributes
 

Attribute index Attribute name Description
0 tokenType In our data set all rows have the token type “hashtag”, so this attribute is not useful for this assignment.
1 month The year and month in YYYYMM format: 4 digits for the year followed by 2 digits for the month, e.g. 200905 means May 2009.
2 count An integer representing the number of tweets of this hashtag for the given year and month
3 hashtagName The #tag name, e.g. babylove, mydate, etc.

 
 
 
 
Here is a small example of the Twitter data that we will use to illustrate the subtasks below:
 

Token type Month count Hash Tag Name
hashtag 200910 2 Babylove
hashtag 200911 2 babylove
hashtag 200912 90 babylove
hashtag 200812 100 mycoolwife
hashtag 200901 201 mycoolwife
hashtag 200910 1 mycoolwife
hashtag 200912 500 mycoolwife
hashtag 200905 23 abc
hashtag 200907 1000 abc

 
 
 
 

  a. [Hive] Find the top 5 months with the highest number of cumulative tweets, sorted by the number of tweets in each month. Write the results to “Task_2a-out”. The month with the highest number of cumulative tweets should come first. So, for the above small example dataset the result would be:

 
 
month             numtweets
200907           1000
200912           590
200901           201
200812           100
200905           23
 
 
[7 marks]
 
 

  b. [Spark RDD] Find the year with the maximum number of cumulative tweets across all months in the entire dataset. Report the total number of tweets for that year. Print the result to the terminal output using println. For the above small example dataset the output would be:

 
2009   1819
 
[9 marks]
 

  c. [Spark RDD] Given two months x and y, where y > x, find the hashtag name whose number of tweets has increased the most from month x to month y. We have already written code in your code template that reads the x and y values from the keyboard. Ignore the tweets in the months between x and y; just compare the number of tweets at month x and at month y. Report the hashtag name and the number of tweets in months x and y. Ignore any hashtag names that had no tweets in either month x or y. You can assume that the combination of hashtag and month is unique. Print the result to the terminal output using println. For the above small example data set the output should be the following:

 
Input              x = 200910, y = 200912
 
Output           hashtagName: mycoolwife, countX: 1, countY: 500
 
[14 marks]
 
 
Task 3: Indexing Bag of Words data [33 marks]
 
In this task you are asked to create a partitioned index of words to documents that contain the words. Using this index you can search for all the documents that contain a particular word efficiently.
 
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
 
Local filesystem:

Small version
t3/Input_data/docword-small.txt
t3/Input_data/vocab-small.txt
Full version
t3/Input_data/docword.txt
t3/Input_data/vocab.txt

 
HDFS:

Small version
hdfs:///user/ashhall1616/bdc_data/assignment/t3/docword-small.txt
hdfs:///user/ashhall1616/bdc_data/assignment/t3/vocab-small.txt
Full version
hdfs:///user/ashhall1616/bdc_data/assignment/t3/docword.txt
hdfs:///user/ashhall1616/bdc_data/assignment/t3/vocab.txt

 
The first file, docword.txt, contains the contents of all the documents, stored in the following format:
 

Attribute index Attribute name Description
0 docId The ID of the document that contains the word
1 vocabId Instead of storing the word itself, we store an ID from the vocabulary file.
2 count An integer representing the number of times this word occurred in this document.

 
The second file, vocab.txt, contains each word in the vocabulary, indexed by the vocabId attribute from the docword.txt file.
 
 
 
Here is a small example content of the docword.txt file.
 

docId vocabId Count
3 3 600
2 3 702
1 2 120
2 5 200
2 2 500
3 1 100
3 5 2000
3 4 122
1 3 1200
1 1 1000

 
Here is an example of the vocab.txt file
 

vocabId Word
1 Plane
2 Car
3 Motorbike
4 Truck
5 Boat

 
Complete the following subtasks using Spark:

  a. [Spark SQL] Calculate the total count of each word across all documents. List the words ordered by count in descending order. Write the results in CSV format at “file:///home/USERNAME/Task_3a-out”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the dataframe that you saved. So, for the above small example input the output would be the following:

+---------+----------+
|     word|sum(count)|
+---------+----------+
|motorbike|      2502|
|     boat|      2200|
|    plane|      1100|
|      car|       620|
|    truck|       122|
+---------+----------+
 
Note: Spark SQL will give the output in multiple files (parts). You should ensure that the data is sorted globally across all of the files, so that all rows in part 0 come before the rows in part 1 according to the required sort order.
[10 marks]
 

  b. [Spark SQL] Create a dataframe containing rows with four fields: (word, docId, count, firstLetter). You should add the firstLetter column by using a UDF which extracts the first letter of word as a String (a toy illustration of the udf and partitionBy APIs follows this subtask). Write the results in parquet format partitioned by firstLetter at “file:///home/USERNAME/t3_docword_index_part.parquet”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the partitioned dataframe that you saved.
      So, for the above example input, you should see the following output (the exact ordering is not important):

+---------+-----+-----+-----------+
|     word|docId|count|firstLetter|
+---------+-----+-----+-----------+
|    plane|    1| 1000|          p|
|    plane|    3|  100|          p|
|      car|    2|  500|          c|
|      car|    1|  120|          c|
|motorbike|    1| 1200|          m|
|motorbike|    2|  702|          m|
|motorbike|    3|  600|          m|
|    truck|    3|  122|          t|
|     boat|    3| 2000|          b|
|     boat|    2|  200|          b|
+---------+-----+-----+-----------+
[9 marks]
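As a toy illustration of the two APIs involved here (this is not the assignment solution; the animals dataframe and the /tmp output path are made up), a UDF plus a partitioned parquet write can be tried in the spark-shell like this:

import org.apache.spark.sql.functions.udf
import spark.implicits._

// A UDF that returns the first letter of a word as a String.
val firstLetterUdf = udf((w: String) => w.substring(0, 1))

val animals = Seq(("cat", 3), ("cow", 1), ("dog", 2)).toDF("word", "count")
  .withColumn("firstLetter", firstLetterUdf($"word"))

// Write parquet partitioned by firstLetter, then read it back and show it.
animals.write.mode("overwrite").partitionBy("firstLetter").parquet("file:///tmp/animals.parquet")
spark.read.parquet("file:///tmp/animals.parquet").show()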
 

  c. [Spark SQL] Load the dataframe stored in partitioned parquet format from subtask b). The task template includes code to obtain a list of query words from the user. For each word in the queryWords list, use println to display the word and the docId with the largest count for that word (you can break ties arbitrarily). Skip any query words that aren’t found in the dataset. To iterate over the query words, use a normal Scala for loop, like this:

 
for (queryWord <- queryWords) {
  // …put Spark query code here…
}
For this subtask there is a big optimisation that can be made because of how the data is partitioned. So, think carefully about how you filter the data.
 
If queryWords contains “car”, “dog”, and “truck”, then the output for the example dataset would be:
 
[car,2]
[truck,3]
 
[7 marks]

  d. [Spark SQL] Again load the dataframe stored in partitioned parquet format from subtask b). The task template includes code to obtain a list of document IDs from the user. For each document ID in the docIds list, use println to display the document ID and the word with the most occurrences in that document, along with the number of occurrences of that word in the document (you can break ties arbitrarily). Skip any document IDs that aren’t found in the dataset. Note that in this task we are searching by document ID instead of query word, so we cannot take advantage of the same optimisation you used for part c). So, you need to find some other kind of optimisation. Hint: think about the fact that you are repeatedly reusing the same dataframe in the Scala for loop as you process each docId.

 
If docIds contains “2” and “3”, then the output for the example dataset would be:
 
[2, motorbike, 702]
[3, boat, 2000]
 
[7 marks]
 
 
 
Bonus Marks:
 

  1. Using Spark, perform the following task on the Twitter dataset from Task 2.

 
[Spark RDD or Spark SQL] Find the hashtag name that has had the largest increase in the number of tweets between any two consecutive months, for any hashtag name. Consecutive months means, for example, 200801 to 200802, or 200902 to 200903, etc. Report the hashtag name, the first month's count, and the second month's count using println.
 
For the small example data set of task 2 the output would be:
 
Hashtag name: mycoolwife
count of month 200812: 100
count of month 200901: 201
 
Note: to get the bonus marks, you only need to complete the task using Spark RDD OR Spark SQL, not both. Both template files have been provided for you to choose between.
 
[10 marks]
 
 
 
 
 
Total Marks:
Please note that the total mark for this assignment is capped at 100. If your marks add to more than 100 then your final mark will be 100.
 

Return of Assignments

Departmental Policy requires that assignments be returned within three weeks of the submission date. We will endeavour to have your assignment returned before the BDC exam. The time and place of the return will be posted on LMS
 
[1] Video game sales data source: https://www.kaggle.com/gregorut/videogamesales
[2] Twitter data source: http://www.infochimps.com/datasets/twitter-census-conversation-metrics-one-year-of-urls-hashtags-sm
