2. Instructions & Brief
Task A
We will analyse the top emoticons found in the messages of tweets, from the ‘msgraw_sample.txt’ data used in the tutorial of Week 7. Note this should be done on a Linux machine, or similar, where bash is supported.
Task A.1 (4 marks)
The first sub-task is to extract the top 20 emoticons and their counts from the tweets. This must not be done entirely manually, and it can only be done using a single shell script. So you need to write a single shell script ‘tweet2emo.sh’ that will input ‘msgraw_sample.txt’ from stdin and produce a CSV file ‘potential_emoticon.csv’ giving a list of candidate emoticons with their occurrence counts. The important word here is “candidate”. Perhaps only 1 in 5 of your candidates are emoticons. Then you need to edit this by hand, deleting non-emoticons, and deleting less frequent ones, to get your final list, ‘emoticon.csv’.
So for this task, you must submit:
(1) a single bash script, ‘tweet2emo.sh’ : this must output, one per line, a candidate emoticon and a count of occurrence, and cannot have any Python or R programmes embedded in it. More details on how to do this below.
(2) the candidate list of emoticons generated by the script, ‘potential_emoticon.csv’ : CSV file, TAB delimited file with (count, text-emoticon).
(3) the final list of emoticons selected, ‘emoticon.csv’ : CSV file, TAB delimited file with (count, text-emoticon); these should be the 20 most frequent emoticons from ‘potential_emoticon.csv’, but you will have to select yourself, manually by editing, which are actually emoticons. To do this, you may use an externally provided list of recognised emoticons, but it should not be used in step (2).
(4) a description of this task in your final PDF report, covering the method used for the bash script, and then the method used to edit the file to get the file for step (3).
Your bash scripts might take anywhere from 2 to 10 lines and might require storing intermediate files.
The following single-line commands, which process a file from stdin and generate stdout, should be useful for this task:
perl -p -e 's/\s+/\n/g;'
— tokenise each line of text by converting space characters to newlines;
NOTE: this reportedly also works on Windows, where the newline character is different
perl -p -e 's/&gt;/>/g; s/&lt;/</g;'
— convert embedded HTML escapes for ‘>’ and ‘<’ back into the characters themselves
— you need to do this if you want to capture emoticons using the ‘<’ or the ‘>’ characters, like ‘<3’
sort | uniq -c | perl -p -e 's/^\s+//; s/ /\t/;'
— assumes the input file has one item per line
— sort and count the items, generating a TAB delimited file with (count, item) entries
Specifically, in order to recognise potential emoticons, you will need to write suitable greps. Here are some examples:
grep -e '^_^'
— match lines containing the string “^_^”
grep -e '^^_^'
— match lines starting with the string “^_^”; the initial “^”, called an anchor, says match the start of the line
grep -e '^_^$'
— match lines ending with the string “^_^”; the final “$”, called an anchor, says match the end of the line
grep -e '^^_^$'
— match lines made exactly of the string “^_^”, using beginning and ending anchors
grep -e '^0_0$'
— match lines made exactly of the string “0_0”
grep -e '^^_^$' -e '^0_0$'
— match lines made exactly of the string “^_^” or the string “0_0”; the two match strings are ORed
grep -e '^[.:^]$'
— match lines made exactly of the characters in the set “.:^”
— the construction “[ … ]” means “characters in the set …”, but be warned that some characters used inside have strange effects, like “-”; see next
grep -e '^[0-9ABC]$'
— match lines made exactly of the digits (“0-9” means in the range “0” to “9”) or the characters “ABC”
grep -e '^[-0-9ABC]$'
— match lines made exactly of the dash “-”, the digits, or the characters “ABC”
— we place “-” at the front to stop it meaning “range”
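As a further illustration only (this particular pattern is a guess at one possible candidate filter, not the required answer), the character-class construct can be combined with anchors to catch a whole family of “eyes, optional nose, mouth” emoticons:

```shell
# Hypothetical candidate filter: eyes [:;=], optional nose [-o'], mouth [()DPp].
# Double quotes are used because the bracket set itself contains a single quote.
printf ':)\n;-)\n=D\nhello\n7\n' | grep -e "^[:;=][-o']*[()DPp]$"
```

This keeps ‘:)’, ‘;-)’ and ‘=D’ but drops ‘hello’ and ‘7’.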
For more detail on grep see:
https://opensourceforu.com/2012/06/beginners-guide-gnu-grep-basics-regular-expressions/
But my advice is “keep it simple”: stick with the above constructs. Remember, you get to edit the final results by hand anyway. But if your grep match strings say “7” is an emoticon, your filter probably isn’t strong enough.
Task A.2 (4 marks)
We would like to compute word co-occurrence with emoticons. So suppose we have the tweet:
loved the results of the game ;-)
then this means that the emoticon ‘;-)’ co-occurs once with each of the words in the list ‘loved the results of the game’.
You can use the supplied Python program ‘emoword.py’, which takes a single emoticon as an argument, reads ‘msgraw_sample.txt’ from stdin, and outputs a raw list of co-occurring tokens.
./emoword.py ':))'
Note the emoticon is enclosed in single quotes because the punctuation can cause bash to do weird things otherwise.
You can also put this in a bash loop to run over your emoticon list, like so:
for E in ';)' ':)' '<3' ; do
echo running this emoticon "$E"
done
or counting them too, using
CNT=1
for E in ';)' ':)' '<3' ; do
echo running this emoticon "$E" > $CNT.out
CNT=$(( $CNT + 1 )) # this is arithmetic in bash
done
But be warned, bash does strange things with punctuation: it treats punctuation differently because it plays a role in the language. So while you can have a loop like this:
for E in ';)' ':)' '<3' ; do
where you have edited in your emoticons, and used the single quotes to tell bash the quoted text is a single token, if instead you try to be clever and read them from a file
for E in `cat emoticons.txt` ; do
then bash will see the individual punctuation and probably fail to work in the way you want.
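If you do want the emoticons in a file, a while-read loop is the safe alternative, because each line is handed over verbatim with no word splitting or glob expansion (the file name ‘emoticons.txt’ is just an assumption here):

```shell
# Read one emoticon per line; $E receives the line intact, punctuation and all.
# IFS= keeps leading/trailing whitespace; -r stops backslash interpretation.
while IFS= read -r E ; do
  echo "running this emoticon $E"
done < emoticons.txt
```

Unlike the backtick-cat version, this loops once per line of the file, not once per whitespace-separated word.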
For each emoticon in your list ‘emoticon.csv’, find a list of the 10-20 most commonly occurring interesting words. Report on these words in your final PDF report. Note that words like “the” and “in” are called stop words, see https://en.wikipedia.org/wiki/Stop_words, and are uninteresting, so try to exclude these from your report.
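One simple way to strip stop words, assuming you keep a one-per-line stop word file (the file name below is made up), is an inverted fixed-string grep over the token stream:

```shell
# Illustrative stop word filter: -F fixed strings, -x whole-line match,
# -v invert so only tokens NOT in the list survive.
printf 'the\nin\nof\n' > stopwords.txt
printf 'the\ngame\nin\nresults\n' | grep -v -x -F -f stopwords.txt
```

Only ‘game’ and ‘results’ survive the filter; a real stop word list would of course be much longer.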
So for this task, you must submit:
(1) a single bash script, ‘emowords.sh’ : as used to support your answers, perhaps calling ‘emoword.py’; this should output, for each of your 20 emoticons, the most frequent words co-occurring with it (in tweets); use whatever format suits, as the results will be transferred and written up in your report.
(2) a description for this task is included in your final PDF report describing the method used for the bash script, and then the final list of selected interesting words per emoticon, and how you got them.
Task A.3 (2 marks)
See if there is other interesting information you can get about these emoticons. For instance, is there anything about countries/cities and emoticons? Which emoticons have long or short messages? What sorts of messages are attached to different emoticons?
You can use the Python program ‘emodata.py’, which reads your ‘emoticon.csv’ file, takes ‘msgraw_sample.txt’ as stdin, and outputs selected data from the tweet file.
./emodata.py
Report on this in your final PDF report. Use any technique or coding you like to get this information. Your report should describe what you did and your results.
Task B
Consider the two files ‘train.csv’ and ‘test.csv’.
Task B.1 (2 marks)
Plot histograms of X1, X2, X3 and X4 in train.csv respectively, and answer: which variable(s) is (are) most likely to be samples drawn from normal distributions?
Task B.2 (4 marks)
Fit two linear regression models using train.csv.
Model 1: Y~X1+X2+X3+X4
Model 2: Y~X2+X3+X4
Which model has higher Multiple R-squared value?
Task B.3 (4 marks)
Now use the coefficients of Models 1 and 2 respectively to predict the Y values of test.csv, then calculate the Mean Squared Error (MSE) between the predictions and the true values. Which model has the smaller MSE? Which model is better? More complex models always have a higher R-squared, but are they always better?
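The MSE itself is just the mean of the squared residuals. Although Task B is naturally done in R, you can sanity-check the arithmetic with the same shell tools as Task A; a sketch, assuming a TAB-separated file with the true Y in column 1 and the prediction in column 2:

```shell
# MSE = average of (y - yhat)^2 over all rows.
printf '1\t2\n3\t5\n' \
  | awk -F'\t' '{ d = $1 - $2; sum += d * d; n++ } END { print sum / n }'
# squared errors are 1 and 4, so this prints 2.5
```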
3. Assessment Criteria
The work required to prepare data, explore data and explain your findings should be all your own. If you use resources elsewhere, make sure that you acknowledge all of them in your PDF report. You may need to review the FIT citation style tutorial to make yourself familiar with appropriate citing and referencing for this assessment. Also, review the demystifying citing and referencing resource for help.
3.1 Grading Rubric
The following outlines the criteria which you will be assessed against:
- Ability to read data files and process them using bash and R commands;
- Ability to wrangle and process data into the required formats;
- Ability to use various graphical and non-graphical tools for performing exploratory data analysis and visualisation;
- Ability to use basic tools for managing and processing big data;
- Ability to communicate your findings in your report.
The marks are allocated as follows:
- Task 1: 10 (=5% of total)
- Task 2: 10 (=5% of total)
3.2 Penalties
- Late submission: for all assessment items handed in after the official due date, and without an agreed extension, a 5% penalty applies to the student’s mark for each day after the due date (including weekends, and public holidays) for up to 7 days. Assessment items handed in after 7 days will not be considered.
- Word limit: There are no firm word limits on the report. However, as general guidance, it should not exceed 1000 words, excluding supplementary materials (e.g., slides and transcript). Lengthy reports (i.e., over 1000 words) may incur a loss of marks, because of the limited time a marker will spend reading the report. For instance, they may only read the first 1000 words and omit the rest of the report. Note that references consisting of URLs can be given at the end of the entry and are not included in the word count.
4. How to Submit
Once you have completed your work, take the following steps to submit your work.
- Include the following materials in your submission:
- A report in PDF containing your answers to all the questions.
- You can use Word or other word processing software to format your submission. Just save the final copy to a PDF before submitting.
- Make sure to include in the PDF screenshots/images of any graphs you generate in order to justify your answers to all the questions for both parts A and B.
- The two bash scripts, named as specified above, containing the code to complete Tasks A.1 and A.2.
- The two CSV files ‘potential_emoticon.csv’ and ‘emoticon.csv’.
- A Jupyter notebook file or plain text R file containing the R code you write to prepare and plot the data for Task B.
- If you’ve chosen other tools, please include the process of data preparation in your report as an appendix included with the PDF.