Assignment 1: Normalizing text and exploring a corpus
Overview
Starting with a typical “raw” dataset
We’ll be using a dataset of speeches from the U.S. Congressional Record during 2020, acquired using code at https://github.com/ahoho/congressional-record. This is publicly available material.
Extracting relevant text to create one or more corpora
We’ll restrict ourselves to the Senate, and create subcorpora of speeches by Democrats and Republicans.
Tokenizing text
We’ll use the spaCy tokenizer.
Normalizing text
We’ll use case folding and also a stopword list.
Extracting potentially useful ngrams
In this assignment we’ll focus on bigrams.
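The normalization and bigram steps above can be sketched on a toy sentence. This is only an illustration, not the code in assignment.py: the function names `normalize` and `bigrams` are hypothetical, a plain whitespace split stands in for the spaCy tokenizer so that the sketch runs without extra packages, and the three-word stopword set is a placeholder for the Mallet list.

```python
from collections import Counter

# Placeholder stopword set; the assignment uses mallet_en_stoplist.txt instead.
TOY_STOPWORDS = {"the", "of", "i"}


def normalize(tokens, stopwords):
    """Case-fold each token, then drop tokens that are on the stopword list."""
    folded = [tok.lower() for tok in tokens]
    return [tok for tok in folded if tok not in stopwords]


def bigrams(tokens):
    """Return adjacent token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


# Whitespace split is a stand-in for the spaCy tokenizer (an assumption made
# only to keep this sketch dependency-free).
tokens = "I thank the Senator of the great State".split()
normalized = normalize(tokens, TOY_STOPWORDS)
bigram_counts = Counter(bigrams(normalized))

print(normalized)
print(bigram_counts.most_common(2))
```

Note that this sketch removes stopwords before pairing tokens, so words that were separated by a stopword end up in the same bigram. Whether filtering should happen before or after forming bigrams is a design choice; follow the function comments in assignment.py for what the assignment expects.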
The files you’ll be working with
You’ll be working with the following:
- Files in jsonlines format containing raw data
  - test_speeches.jsonl.gz: small example data for testing
  - speeches2020_jan_to_jun.jsonl.gz: main data you’ll run on
- Files containing code
  - assignment.py: code skeleton that you’ll fill in
  - public_tests_obj.py: code to run for unit testing
- Other resources
  - mallet_en_stoplist.txt: the stopword list from the widely used Mallet toolkit
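Before writing any processing code, it can help to look at what a record in the raw data contains. The following is a minimal sketch, assuming only that each line of the gzipped file is one JSON object; the helper name `peek_jsonl_gz` is hypothetical and is not part of assignment.py.

```python
import gzip
import json
import os


def peek_jsonl_gz(path, n=1):
    """Return the first n records of a gzipped jsonlines file as dicts."""
    records = []
    # Mode "rt" reads the compressed file as text, so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as infile:
        for line in infile:
            if len(records) >= n:
                break
            line = line.strip()
            if line:  # ignore blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Only runs if the small example file is in the current directory.
    if os.path.exists("test_speeches.jsonl.gz"):
        for record in peek_jsonl_gz("test_speeches.jsonl.gz"):
            print(sorted(record.keys()))
```

Printing the sorted keys of the first record shows which field names are actually available, without assuming any of them in advance.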
What you should do
- Check out this repo.
- Execute
  python assignment.py
  It should run successfully from end to end, with progress messages on the output.
- If it does not, the most likely cause is that it uses packages you don’t have installed. Install them (see requirements.txt).
  - If you use conda, we recommend creating a fresh conda env and putting your classwork dependencies there. Execute
    conda create --name YOURCONDAENVIRONMENT python=3.8
    conda activate YOURCONDAENVIRONMENT
    which pip
  - Ensure that your pip lives in its own env, like: /anaconda3/envs/YOURCONDAENVIRONMENT/bin/pip
  - Execute
    pip install -r requirements.txt
- Execute
  python public_tests_obj.py -v
  The code should run, but will report on tests that have failed.
- Read and modify assignment.py. Each function has a detailed comment about input, output, and what it does. You can look at public_tests_obj.py for examples of the function calls. You will find a comment like
  # ASSIGNMENT: replace this with your code
  everywhere you have work to do. Lines: 41, 57, 67, 81, 92, 122.
- Keep working until all the tests pass when you run public_tests_obj.py.
- Code to be graded. Once all tests pass, submit assignment.py to Canvas. This will be the basis for grading your code.
- Analysis to be graded. For the analysis part of the assignment, look at the output of assignment.py and submit a brief but clear written response in a PDF file named writeup.pdf. (Note that, particularly if you are not very familiar with U.S. politics, you are welcome to discuss the data you’re looking at with other people, as long as you state explicitly in your writeup that you have done so; of course, you still need to write your answers in your own words.)
  - Looking at frequency. The first set of outputs are lists of the top Democratic and Republican bigrams by frequency. Looking at these lists, how similar or different are the most-frequent bigrams used by members of the two parties? Are there any generalizations you can make about the two parties, at least during this time period, based on this information? If yes, discuss. If you think the answer is no, clearly explain why. Support your answer with examples.
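One way to make the comparison in the frequency question concrete is to split the two top-k lists into shared and party-specific bigrams. The sketch below is an assumption about how one might do this, not part of the assignment code: `compare_top_bigrams` is a hypothetical helper, and the counts are made-up placeholders rather than results from the data.

```python
from collections import Counter


def compare_top_bigrams(counts_a, counts_b, k=10):
    """Split the top-k bigrams of two counters into shared and one-sided sets."""
    top_a = {bigram for bigram, _ in counts_a.most_common(k)}
    top_b = {bigram for bigram, _ in counts_b.most_common(k)}
    return top_a & top_b, top_a - top_b, top_b - top_a


# Placeholder counts for illustration only; these are NOT results from the data.
first = Counter({("alpha", "beta"): 5, ("gamma", "delta"): 4, ("only", "first"): 2})
second = Counter({("gamma", "delta"): 6, ("alpha", "beta"): 3, ("only", "second"): 2})

shared, only_first, only_second = compare_top_bigrams(first, second, k=3)
print("shared:", sorted(shared))
print("only in first:", sorted(only_first))
print("only in second:", sorted(only_second))
```

A large shared set suggests the top of both lists is dominated by common procedural language, while the one-sided sets are where party-specific examples for the writeup are more likely to be found.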
Running the code
Running assignment.py:
At first, the two files speeches_dem.txt and speeches_rep.txt are both 0 KB.
Processing text from input file ./speeches2020_jan_to_jun.jsonl.gz
# Read in congressional speeches jsonlines, i.e. a file with one well-formed JSON element per line.
# Limiting to just speeches where the chamber was the Senate, return a list of strings in the following format:
# ‘ TAB ‘ where and refer to the elements of those names in the JSON. Make sure to replace line-internal whitespace (one or more newlines, tabs, spaces, etc.) in text with a single space.
# For information on how to read from a gzipped file, rather than uncompressing and reading, see https://stackoverflow.com/questions/10566558/python-read-lines-from-compressed-text-files#30868178
# For info on parsing jsonlines, see https://www.geeksforgeeks.org/json-loads-in-python/. (There are other ways of doing it, of course.)
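What this comment describes can be sketched as follows. This is a minimal illustration, not the reference solution: the JSON field names `chamber`, `party`, and `text`, the value `"Senate"`, and the function name `read_senate_speeches` are all assumptions that should be checked against the actual records and the skeleton in assignment.py.

```python
import gzip
import json
import re


def read_senate_speeches(path):
    """Return 'party<TAB>text' strings for Senate speeches in a .jsonl.gz file."""
    lines = []
    # Read the gzipped file directly in text mode instead of uncompressing it.
    with gzip.open(path, "rt", encoding="utf-8") as infile:
        for raw in infile:
            if not raw.strip():
                continue  # skip blank lines
            record = json.loads(raw)
            # Assumed field name and value; verify against the real data.
            if record.get("chamber") != "Senate":
                continue
            # Replace each run of whitespace (newlines, tabs, spaces) with one space.
            text = re.sub(r"\s+", " ", record["text"])
            lines.append(f"{record['party']}\t{text}")
    return lines
```

Using a single regular expression for the whitespace step keeps the tab that separates the two fields unambiguous, since no tab can remain inside the text itself.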
Running public_tests_obj.py
(base) PS E:\My Drive\DATA641\Assignment1> & C:/ProgramData/Anaconda3/python.exe "e:/My Drive/DATA641/Assignment1/public_tests_obj.py"