
01/25/2022 DATA641

Syllabus & Some notes

This course will introduce fundamental concepts and techniques involved in getting computers to deal more intelligently with human language. It is focused primarily on text (as opposed to speech) and will offer a grounding in core NLP methods for text processing (such as lexical analysis, sequential tagging, syntactic parsing, semantic representations, text classification, unsupervised discovery of latent structure), key ideas in the application
of deep learning to language tasks, and consideration of the role of language technology in modern society.
The content of this course will be substantially similar to Computational Linguistics I, though with some adjustments geared toward longer/fewer lectures and emphasizing practical rather than theoretical concerns.

Something I learned is that doing audio speech recognition plus translation can actually be easier than doing text-only translation. For large datasets with a normal frequency of repeated tokens, a GRU autoencoder outperforms an LSTM autoencoder, and both outperform a plain autoencoder for text translation.
An LSTM autoencoder, however, performs much better than a GRU on datasets with a high frequency of word/token repetition.
We used dask, pandas, tensorflow, keras, and scikit-learn for preprocessing, tokenizing the data, setting up models, and splitting the data into train and test sets.
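The tokenize-then-split step can be sketched with the standard library alone. This is a minimal stand-in for what keras' `Tokenizer` and scikit-learn's `train_test_split` do; the function names here are illustrative, not the actual library API.

```python
import random

def tokenize(texts):
    """Build a word->id vocabulary and encode each text as a list of ids.

    A stdlib stand-in for a keras/sklearn text tokenizer.
    """
    vocab = {}
    encoded = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # 0 reserved for padding
            ids.append(vocab[word])
        encoded.append(ids)
    return vocab, encoded

def train_test_split(data, test_frac=0.2, seed=0):
    """Shuffle deterministically, then split off the last test_frac as test."""
    data = list(data)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_frac))
    return data[:cut], data[cut:]
```

For example, `tokenize(["the cat sat", "the dog ran"])` reuses the id for "the" across both texts, and `train_test_split(range(10))` yields an 8/2 split.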

Mixing and matching LSTM, GRU, and vanilla RNN layers in one model doesn't work out well and doesn't confer any increase in test accuracy.

One of the big use cases I have seen at my work is using NLP pragmatically just to speed up or automate tasks that would take people a long time, like classifying PDF documents by whether or not they need a signature. There is no machine understanding of the text documents, but it saves a lot of human time when there are hundreds or thousands of documents.
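That kind of "needs a signature or not" routing is just binary text classification. A toy sketch with a from-scratch Naive Bayes classifier (the training texts and labels below are made up for illustration; a real pipeline would use scikit-learn on extracted PDF text):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label) pairs. Returns per-label token counts and label counts."""
    counts = defaultdict(Counter)
    labels = Counter()
    for text, label in docs:
        labels[label] += 1
        counts[label].update(text.lower().split())
    return counts, labels

def predict_nb(counts, labels, text):
    """Pick the label with the highest log-posterior under add-one smoothing."""
    vocab = {w for c in counts.values() for w in c}
    total = sum(labels.values())
    best, best_lp = None, float("-inf")
    for label in labels:
        lp = math.log(labels[label] / total)  # log prior
        denom = sum(counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((counts[label][w] + 1) / denom)  # smoothed likelihood
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Hypothetical training examples, not real data:
docs = [
    ("please sign and date this agreement", "needs_signature"),
    ("signature required on the last page", "needs_signature"),
    ("monthly newsletter with company updates", "no_signature"),
    ("quarterly report for your records", "no_signature"),
]
```

With these four examples, `predict_nb` routes "please sign here" to `needs_signature` and "monthly company report" to `no_signature` purely from word statistics, with no understanding of the documents, which is the point of the use case above.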

How does language work? What makes language a language?

Properties:

semanticity, particularly the arbitrary relation of sounds to meanings

([linguistics] to make semantic)

latent structure, as opposed simply to sequential patterns

(related term: latent structure model)

recursion, and particularly center-embedding recursion

(mathematics, uncountable) the process of repeating a function, each time applying it to the result of the previous stage; i.e., recursion
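That dictionary definition translates directly into code: a function applied repeatedly, each time to the result of the previous stage. Both an iterative and a recursive version (function names here are my own):

```python
def iterate(f, x, n):
    """Apply f to x a total of n times, each time to the previous result."""
    for _ in range(n):
        x = f(x)
    return x

def iterate_rec(f, x, n):
    """Same computation, expressed recursively: the function calls itself."""
    return x if n == 0 else iterate_rec(f, f(x), n - 1)
```

For example, `iterate(lambda x: 2 * x, 1, 5)` doubles 1 five times (1 → 2 → 4 → 8 → 16 → 32), and the recursive version gives the same answer.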

Zipfian curve
  • Rank of words vs. frequency of words

  • Engineers like to focus on the first section (the high-frequency head)

  • Scientists like to focus on the latter section (the long tail of rare words)
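The rank-vs-frequency table behind a Zipfian curve is easy to compute: count the words, sort by descending frequency, and number the rows. A stdlib sketch (the helper name is my own):

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, frequency) tuples sorted by descending frequency."""
    counts = Counter(text.lower().split())
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]
```

Zipf's law predicts frequency roughly proportional to 1/rank, so rank × frequency stays roughly constant: in real corpora the rank-1 word occurs about twice as often as the rank-2 word, three times as often as the rank-3 word, and so on.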
