Archive for December, 2006

Personal Statement

2006 Dec 22 in Soapbox

The following is my personal statement for one of my grad school applications, essentially responding to the question of what obstacles to education I have overcome, and how my graduate studies will increase student diversity or serve disadvantaged groups. You may recognize some sentences from other things I have posted here, but perhaps this repackaging is of interest.

I have not personally experienced much financial hardship nor particularly challenging cultural barriers, and the truth is, my pursuit of an advanced degree is an admission that I am unsuited to more direct interaction with world problems. If I were completely committed to serving the less fortunate, I would be working in a favela in Recife or in a refugee camp in Afghanistan. The most pressing problems in the world today have little or nothing to do with linguistic research. But the most pressing problems are also the most overwhelming, and any improvement on these issues is difficult to notice. I don’t believe I am optimistic enough to deal directly with the most serious problems, and I need measurable signs of progress. Because of that, and because I am probably incorrigibly intellectual, I need to find a place for myself in academia, where I can do some public good without having to measure my life’s work against an insurmountable problem.

On the other hand, I have perhaps lived an unusual life thus far, and I do indeed feel a responsibility to serve the public good. I grew up on the Colombia center of SIL International, in a rural region near the border between the Amazon and Orinoco river valleys. It was a small community of people from all over, and I had frequent contact with indigenous and mestizo cultures. My family is deeply religious, but highly values critical thinking. My own thinking eventually led me to reject the faith I grew up with, but I was left with a sense of personal responsibility to serve the less fortunate. We felt quite poor when we periodically returned to the States, but we were wealthy compared to the people of the neighboring communities. I saw true poverty there, which I have since seen in other countries, but never in the US. With that beginning, I have never felt quite at home in American culture, though after so many years in school I am beginning to feel that academia is a home culture. I speak Spanish fluently, Portuguese okay, and Mandarin not too poorly, and I know something of quite a few countries, but that is not so abnormal in a linguistics graduate program.

My motivations for research arise primarily from curiosity about how mind and society work. Language is a subset of human behavior that helps us understand how our brains work and how our interpersonal relationships are built. This is a long way from solving any world problems, but I do hope my research will help us build in that direction.

Statement of Purpose

2006 Dec 2 in Soapbox

This is a generalized version of the statement of purpose I used in my grad school applications.

I believe language science has suffered from a wealth of theories and a poverty of empirical research, but a revolution has already begun, which is bringing many exciting discoveries. I am pursuing further linguistics study because I want to participate in this research. Many areas of research intrigue me, since there are many interesting behaviors to study, but they are tied together by the mathematical similarities of their models and methods. My research interests lie at the intersection of emergentist models, natural corpus data, and mathematical methods.

I particularly appreciate the recent reawakening to the statistical and inductive nature of language learning and to the interconnectedness of brain functions. Linguistics has been heavily concerned with how much language works like symbolic logic, and there has been much insight there. Indeed, one of the biggest factors that distinguish human language from communication in other animals is how much human language is symbolic and abstract. But as developments in cognitive psychology and neuroscience have revealed, and as should have been apparent all along, human language and cognition are still an awful lot like communication and cognition in other animals, and in that respect are unlike symbolic languages and computer processing. In humans, symbolic behavior is built on a foundation of statistical learning, just as in computers we can build statistical learning on a foundation of symbolic processing. The higher level description is quite useful, but behavior in the observed layer is still influenced by what lies underneath. We are now beginning to see more concretely how statistical behaviors arise from networks, and how the statistical systems produce symbolic behaviors. We are beginning to see how very complex language is, the result of interactions of many different factors, both structural and contextual. I am also intrigued by the parallels between the multi-layered system in the brain and the multi-layered system of quantum mechanics through biochemistry, and also the parallels between those systems and social networks. In one sense they are very different systems, yet the similarities there could yield valuable insights.

Furthermore, the current explosion of easily accessible machine-readable natural language corpora is opening up a wealth of new research possibilities. Some research that even fifty years ago would have required thousands of person-hours can now be done by one person in a few minutes, using only a computer, open-source software, and internet access, and much more can be done when annotated corpora and the resources for further annotation are available. In addition, corpus data can guide the design of psycholinguistic experiments and provide authentic stimuli, and comparable studies using corpus data and psychology methods provide a more complete perspective than either one alone could. Well-developed methodologies for corpus research hold the prospect of steady progress in empirical validation.

The combination of these two trends means that there are many reasons to be using math to study language. Many mathematical formalisms developed in other fields are finding use in language science, as isomorphisms are found in signal and image processing, bio-informatics, statistical analysis, and statistical mechanics. The inter-relatedness of many of the mathematical formalisms used in machine learning and statistics, including those based on neural networks, indicates that those methods are in fact, to a greater or lesser extent, crude models of human learning processes. To the extent that those methods offer analyzable descriptions of what has been learned, language science is being handed a suite of descriptive and predictive models, and we can learn much by taking full advantage of this gift. I have rarely had much interest in math for math’s sake, but applying math to the real world has always fascinated me. I particularly like the prospect of using math to explore language.

I am unusually well-prepared to pursue research at this intersection of language and math. I was raised in South America on a research center of SIL International, surrounded by linguists, and I absorbed much elementary linguistics before college. As an undergraduate at UCSB, I completed a BA in linguistics as well as a BS in physics, which provided me the mathematical frameworks for much of the work I do now. In the computational linguistics program at San Diego State, I have now added computational skills and knowledge of statistical methods, as well as further study in linguistics. As a result of my early education, I speak Spanish fluently, and the Chinese I began to learn as an undergraduate should be brought up to a reasonably high level this year, as I am living in China, studying the language and teaching English. I hope to take better advantage of the many Chinese-language corpora, and I am also interested in the local languages (the Wu dialect family here), which currently have many speakers but face an uncertain future due to the national language policy.

In the last few years I have carried out several research projects, some of which I hope to pursue further.

  • My thesis research developed a method for evaluating hierarchical discourse segmentation, i.e. shallow discourse parsing or unlabeled outlining. Such segmentation is difficult to evaluate because section breaks differ in importance and their locations are intrinsically imprecise. The research involved recruiting several dozen students to annotate passages via a web form interface, developing a method for deriving a gold standard from conflicting annotations, adapting two segmentation programs to produce hierarchical segmentations, and developing a statistical measure suitable to the peculiarities of hierarchical discourse segmentation. This work is only the first step in a research program I hope to continue.
  • At an internship with Beth Sundheim at SPAWAR, I worked on several projects related to multi-lingual text processing and named entity recognition, including a study derived from the TDT 2002 link detection task, comparing the relative value of general lexical features, temporal expressions, and named entities for the identification of event-based topics, and comparing the published temporal expression vector spaces to some I developed. The internship was mostly language engineering rather than language science, but this study does begin to address how people express time concepts and conceptualize temporal similarity.
  • With two fellow students I worked on developing a system to distinguish degrees of bias in politically oriented websites, approaching it from two directions: as a language classification problem, like distinguishing subjective language from objective language; and as a network partitioning problem, using position within the hyperlink network to identify affiliation. We harvested the test corpus from the internet, and hand-annotated the target classes. For the linguistic approach we used standard machine learning methods with linguistically-informed features, and for the network approach we used mathematical methods from social network analysis. There are now published papers addressing this problem from the network analysis perspective, but I believe the comparison between the social network and the linguistic features could help illuminate our understanding of partisanship, subjectivity and motivated reasoning.
  • In another project with three other students, we worked on the problem of semantic role labeling, as proposed in the CoNLL 2005 Shared Task. We implemented a system that derived a wide variety of rule-based syntactic and semantic features for the sentences in the CoNLL 2005 corpus, to train a conditional random fields model of the target series of semantic role labels. A more complete system, with multiple alternative feature sets, could act as an analyzable model of human behavior, helping establish which factors are more significant in human assignment of semantic roles.
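
To make the evaluation problem in the first project above more concrete, here is a minimal sketch of WindowDiff, a standard measure for flat (non-hierarchical) segmentation that gives partial credit for near-miss boundaries. It is not the measure developed in the thesis, which is not described here and also has to handle hierarchy and differing break importance. The function name, the 0/1 boundary encoding, the window size, and the example data are all assumptions made for illustration.

```python
from typing import Sequence


def window_diff(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> float:
    """WindowDiff for flat segmentations (illustrative sketch).

    Each sequence has one entry per gap between adjacent text units:
    1 if a section break falls in that gap, 0 otherwise.
    A window of k gaps slides over both sequences; a window counts as
    an error when the two sequences contain a different number of
    breaks inside it. The score is the fraction of windows in error
    (0.0 means the segmentations agree in every window).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of gaps")
    if not 1 <= k <= len(reference):
        raise ValueError("window size k must be between 1 and the number of gaps")

    n_windows = len(reference) - k + 1
    errors = 0
    for start in range(n_windows):
        ref_breaks = sum(reference[start:start + k])
        hyp_breaks = sum(hypothesis[start:start + k])
        if ref_breaks != hyp_breaks:
            errors += 1
    return errors / n_windows


# Hypothetical data: ten text units, so nine gaps.
gold      = [0, 0, 1, 0, 0, 0, 1, 0, 0]
near_miss = [0, 0, 0, 1, 0, 0, 1, 0, 0]  # first break placed one gap late
missed    = [0, 0, 0, 0, 0, 0, 1, 0, 0]  # first break omitted entirely

score_exact = window_diff(gold, gold, k=3)
score_near = window_diff(gold, near_miss, k=3)
score_missed = window_diff(gold, missed, k=3)
```

With this hand-picked window size, the near miss is penalized less than the outright omission, which is the kind of tolerance for imprecise break locations that a strict boundary-by-boundary comparison lacks.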

I believe my talents and training in both math and languages make me well-suited to a career doing this sort of research. For many years I have been interested in discourse, as a linearization of a complex belief network, and recent work in language induction (morphology, syntax, and semantics) is also quite intriguing, as are studies examining the relationships between language and general cognition, studies of motivated reasoning and perception, and many others. I have known I belonged in academia longer than I have known what my specialization should be, and even now my interests are diverse, extending well beyond the narrow definition of linguistics used by many. But in any case, this application of abundant language data, via statistical and computational methods, is what I most hope to work on.