DATA641 assignment1

Assignment 1: Normalizing text and exploring a corpus

Overview

  • Starting with a typical “raw” dataset

    We’ll be using a dataset of speeches from the U.S. Congressional Record during 2020, acquired using code at https://github.com/ahoho/congressional-record. This is publicly available material.

  • Extracting relevant text to create one or more corpora

    We’ll restrict ourselves to the Senate, and create subcorpora of speeches by Democrats and Republicans.

  • Tokenizing text

    We’ll use the spaCy tokenizer

  • Normalizing text

    We’ll use case folding and also a stopword list.

  • Extracting potentially useful ngrams

    In this assignment we’ll focus on bigrams

The files you’ll be working with

You’ll be working with the following:

  • Files in jsonlines format containing raw data
    • test_speeches.jsonl.gz - small example data for testing
    • speeches2020_jan_to_jun.jsonl.gz - main data you’ll run on
  • Files containing code
    • assignment.py - code skeleton that you’ll fill in
    • public_tests_obj.py - code to run for unit testing
  • Other resources
    • mallet_en_stoplist.txt - the stopword list from the widely used Mallet toolkit
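The stopword list is a plain text file, and loading it is the natural first step. A minimal sketch, assuming one word per line in mallet_en_stoplist.txt (the function name here is illustrative, not necessarily the one in assignment.py):

```python
def load_stopwords(filepath):
    """Load a stopword file (one word per line, e.g. mallet_en_stoplist.txt)
    into a set, so membership tests during filtering are O(1)."""
    with open(filepath, encoding="utf-8") as f:
        # Strip surrounding whitespace and skip blank lines
        return {line.strip() for line in f if line.strip()}
```

A set rather than a list is the idiomatic choice here, since every bigram check below does two membership lookups.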

What you should do

  • Check out this repo

  • Execute python assignment.py

    • It should run successfully from end to end with progress messages on the output

    • If it does not, most likely it’s because it is using packages you don’t have installed. Install them (see: requirements.txt)

      • If you use conda, we recommend installing a fresh conda env and putting your classwork dependencies there.

      • Execute

        conda create --name YOURCONDAENVIRONMENT python=3.8
        conda activate YOURCONDAENVIRONMENT
        which pip
        
      • Ensure that your pip lives in its own env, like: /anaconda3/envs/YOURCONDAENVIRONMENT/bin/pip

      • Execute pip install -r requirements.txt

  • Execute python public_tests_obj.py -v

    • The code should run, but will report on tests that have failed.
  • Read and modify assignment.py.

    • Each function has a detailed comment about input, output, and what it does.

    • You can look at public_tests_obj.py for examples of the function calls.

    • You will find a comment like # ASSIGNMENT: replace this with your code everywhere you have work to do.

      Lines 41, 57, 67, 81, 92, and 122.

    • Keep working until all the tests pass when you run public_tests_obj.py.

  • Code to be graded. Once all tests pass, submit assignment.py to Canvas. This will be the basis for grading your code.

  • Analysis to be graded. For the analysis part of the assignment, look at the output of assignment.py and submit a brief but clear written response in a PDF file, named writeup.pdf. (Note that, particularly if you are not very familiar with U.S. politics, you are welcome to discuss the data you’re looking at with other people – as long as you state explicitly in your writeup that you have done so, and of course you need to write your answers in your own words.)

    Looking at frequency. The first set of outputs are lists of the top Democratic and Republican bigrams by frequency. Looking at these lists, how similar or different are the most-frequent bigrams used by members of the two parties? Are there any generalizations you can make about the two parties, at least during this time period, based on this information? If yes, discuss. If you think the answer is no, clearly explain why. Support your answer with examples.
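The frequency ranking behind those lists can be done with collections.Counter; a minimal sketch (function and variable names are illustrative, not the assignment's):

```python
from collections import Counter

def top_bigrams(tokenized_speeches, n=10):
    """Count bigrams across all tokenized speeches and
    return the n most frequent as ((w1, w2), count) pairs."""
    counts = Counter()
    for tokens in tokenized_speeches:
        # Pair each token with its successor to form the bigrams
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)

speeches = [["health", "care", "reform"], ["health", "care", "workers"]]
print(top_bigrams(speeches, n=2))
# → [(('health', 'care'), 2), (('care', 'reform'), 1)]
```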

Running assignment.py

At the start, the files speeches_dem.txt and speeches_rep.txt are both 0 KB.

(base) PS E:\My Drive\DATA641\Assignment1> & C:/ProgramData/Anaconda3/python.exe "e:/My Drive/DATA641/Assignment1/assignment.py"

Processing text from input file ./speeches2020_jan_to_jun.jsonl.gz

Reading and cleaning text from ./speeches2020_jan_to_jun.jsonl.gz
13964it [00:00, 38118.32it/s]

Writing Democrats' speeches to ./speeches_dem.txt
Democrat speeches being written to ./speeches_dem.txt
0it [00:00, ?it/s]

Writing Republicans' speeches to ./speeches_rep.txt
Republican speeches being written to ./speeches_rep.txt
0it [00:00, ?it/s]

Getting Dem unigram and bigram counts
Collecting bigram counts with stopword-filtered bigrams
Initializing spacy
0it [00:00, ?it/s]

Top Dem bigrams by frequency

Getting Rep unigram and bigram counts
Collecting bigram counts with stopword-filtered bigrams
Initializing spacy
0it [00:00, ?it/s]

Top Rep bigrams by frequency

# Read in congressional speeches jsonlines, i.e. a file with one well-formed json element per line.
#
# Limiting to just speeches where the chamber was the Senate, return a list of strings
# in the following format:
#
#     'TAB'
#
# where the two fields refer to the elements of those names in the json.
#
# Make sure to replace line-internal whitespace (one or more newlines, tabs, spaces, etc.)
# in the text with a single space.
#
# For information on how to read from a gzipped file, rather than uncompressing and reading, see
# https://stackoverflow.com/questions/10566558/python-read-lines-from-compressed-text-files#30868178
#
# For info on parsing jsonlines, see https://www.geeksforgeeks.org/json-loads-in-python/
# (There are other ways of doing it, of course.)
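The reading step described in that comment can be sketched as follows. The "chamber" and "speech" field names here are assumptions about the json schema, to be verified against the actual data; the whitespace requirement is handled with a regex:

```python
import gzip
import json
import re

def read_senate_speeches(filepath):
    """Read one json object per line from a gzipped jsonlines file, keep only
    Senate speeches, and collapse runs of whitespace to a single space.
    The "chamber" and "speech" field names are assumptions about the schema."""
    speeches = []
    # gzip.open in "rt" mode decompresses on the fly, no temp file needed
    with gzip.open(filepath, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("chamber") == "Senate":
                # One or more newlines, tabs, spaces, etc. become one space
                text = re.sub(r"\s+", " ", record["speech"]).strip()
                speeches.append(text)
    return speeches
```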

Running public_tests_obj.py

(base) PS E:\My Drive\DATA641\Assignment1> & C:/ProgramData/Anaconda3/python.exe "e:/My Drive/DATA641/Assignment1/public_tests_obj.py"
FFFF
Reading and cleaning text from ./test_speeches.jsonl.gz
1000it [00:00, 33302.40it/s]
F
======================================================================
FAIL: test_filter_stopword_bigrams (__main__.TestBigramPublic)
----------------------------------------------------------------------
Traceback (most recent call last):
File "e:\My Drive\DATA641\Assignment1\public_tests_obj.py", line 56, in test_filter_stopword_bigrams
self.assertCountEqual(filtered_bigrams, output_bigrams,
AssertionError: Element counts were not equal:
First has 2, Second has 0: ['the', 'pretty']
First has 1, Second has 0: ['fox', 'in']
First has 1, Second has 0: ['in', 'the'] : Test failed: test_filter_stopword_bigrams

======================================================================
FAIL: test_load_stopwords (__main__.TestBigramPublic)
----------------------------------------------------------------------
Traceback (most recent call last):
File "e:\My Drive\DATA641\Assignment1\public_tests_obj.py", line 29, in test_load_stopwords
self.assertTrue(all(x in stopwords for x in correct_stopwords))
AssertionError: False is not true

======================================================================
FAIL: test_ngrams (__main__.TestBigramPublic)
----------------------------------------------------------------------
Traceback (most recent call last):
File "e:\My Drive\DATA641\Assignment1\public_tests_obj.py", line 39, in test_ngrams
self.assertEqual(bigrams, correct_bigrams,
AssertionError: Lists differ: [] != [['the', 'pretty'], ['pretty', 'brown'], [[81 chars]ds']]

Second list contains 7 additional elements.
First extra element 0:
+ ['fox', 'in'],
+ ['in', 'the'],
+ ['the', 'pretty'],
+ ['pretty', 'woods']] : Test failed: test_ngrams got the wrong bigrams

======================================================================
FAIL: test_normalize_tokens (__main__.TestBigramPublic)
----------------------------------------------------------------------
Traceback (most recent call last):
File "e:\My Drive\DATA641\Assignment1\public_tests_obj.py", line 67, in test_normalize_tokens
self.assertListEqual(normalized_toks, correct_output,
AssertionError: Lists differ: ['I', 'saw', '@psresnik', "'s", 'page', 'at[39 chars]url'] != ['i', 'saw', "'s", 'page', 'at', 'http://um[26 chars]url']

First differing element 0:
'I'
'i'

First list contains 1 additional elements.
First extra element 6:
'http://umiacs.umd.edu/~resnik/this_url'

+ ['i', 'saw', "'s", 'page', 'at', 'http://umiacs.umd.edu/~resnik/this+url']
- ['I',
- 'saw',
- '@psresnik',
- "'s",
- 'page',
- 'at',
- 'http://umiacs.umd.edu/~resnik/this_url'] : Test failed: test_normalize_tokens

======================================================================
FAIL: test_read_and_clean_lines (__main__.TestBigramPublic)
----------------------------------------------------------------------
Traceback (most recent call last):
File "e:\My Drive\DATA641\Assignment1\public_tests_obj.py", line 20, in test_read_and_clean_lines
self.assertEqual(len(lines), correct_lines,
AssertionError: 0 != 317 : Test failed: Count of Senate speeches in ./test_speeches.jsonl.gz is incorrect, should be 317

----------------------------------------------------------------------
Ran 5 tests in 0.156s

FAILED (failures=5)
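The failure messages above effectively specify the expected behavior: bigrams are contiguous token pairs, stopword filtering drops any bigram containing a stopword in either position, and normalization case-folds tokens and drops @-mentions while keeping URLs. A minimal sketch consistent with those expectations (not necessarily the intended implementation):

```python
def ngrams(tokens, n):
    """All contiguous n-token windows, e.g. bigrams for n=2, as lists."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def filter_stopword_bigrams(bigrams, stopwords):
    """Keep only bigrams where neither word is a stopword."""
    return [[w1, w2] for w1, w2 in bigrams
            if w1 not in stopwords and w2 not in stopwords]

def normalize_tokens(tokens):
    """Case-fold tokens and drop @-mentions, per the expected output above."""
    return [t.lower() for t in tokens if not t.startswith("@")]
```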

Tips

All tests pass.
