List of papers we cover during our weekly paper reading session. For past and missing links/notes, check out the (private) wiki.
What’s Hidden in a Randomly Weighted Neural Network?
Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead
Presents a way to collect alphanumeric strings at the dialog level via an ASR. Two main ideas:
- Skip listing over n-best hypotheses across turns (attempts)
- Chunking and confirming pieces one by one
The self-supervision signal here comes from a model that tries to predict whether a provided tuple of turns is in order or not. Plugging this in as the discriminator in generative-discriminative dialog systems, they find better results.
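A minimal sketch (our own illustration, not the paper's code) of how such ordered-vs-shuffled training pairs could be generated from unlabelled dialogues; the function name and window size are made up:

```python
# Build (turn_tuple, label) pairs: 1 if the turns are in their natural order,
# 0 if they have been shuffled. Any binary classifier trained on these pairs
# can then act as the order-prediction discriminator described above.
import random

def make_order_examples(dialogue, window=3):
    examples = []
    for i in range(len(dialogue) - window + 1):
        turns = dialogue[i:i + window]
        examples.append((tuple(turns), 1))      # naturally ordered tuple
        shuffled = turns[:]
        while shuffled == turns:                # make sure the order actually changes
            random.shuffle(shuffled)
        examples.append((tuple(shuffled), 0))   # misordered tuple
    return examples

print(make_order_examples(["hi", "hello, how can I help?", "book a cab", "sure, where to?"]))
```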
This is an approach to collecting a supervision signal from deployment data. There are three tasks for the system (a chatbot that ranks candidate responses):
- Dialogue. The main task. Given the turns till now, the bot ranks which response to utter.
- Satisfaction. Given turns till now, last being user utterance, predict whether the user is satisfied.
- Feedback. After asking for feedback from the user, predict user’s response (feedback) based on the turns till now.
The models share weights, mostly between tasks 1 and 3.
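A rough PyTorch sketch of our reading of this setup (the layer choices are ours, not the paper's): one shared encoder serves the two ranking tasks, with a small classifier head for satisfaction.

```python
import torch
import torch.nn as nn

class SelfFeedingBot(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)       # shared context/candidate encoder
        self.satisfaction = nn.Linear(dim, 2)                # task 2: satisfied / not satisfied

    def encode(self, token_ids):
        return self.embed(token_ids)

    def rank(self, context_ids, candidate_ids):
        # tasks 1 and 3: score candidate responses against the encoded context
        ctx = self.encode(context_ids)                       # (batch, dim)
        cands = self.encode(candidate_ids)                   # (n_candidates, dim)
        return ctx @ cands.t()                               # (batch, n_candidates)

    def predict_satisfaction(self, context_ids):
        return self.satisfaction(self.encode(context_ids))  # (batch, 2)
```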
This paper explores a new direction in language modelling. The idea is still to learn the underlying distribution of sequences of characters, but here they do it by learning the quantum analogue of the classical probability distribution function. Unlike in the classical case, the marginal distributions there carry enough information to reconstruct the joint distribution. This is the central idea of the paper, and is explained in the first half. The second half explains the theory and implementation of the training algorithm, with a simple example. Future work would be to apply the algorithm to more complicated examples, and to adapt it to variable-length sequences.
This paper suggests improvements to DeepVoice and Tacotron, and also proposes a way to add trainable speaker embeddings. The speaker embeddings are initialized randomly and trained jointly through backpropagation. The paper lists some patterns that lead to better performance (a rough sketch follows the list):
- Transforming speaker embeddings to an appropriate dimension and form for every place they are added to the model. The transformed embeddings are called site-specific speaker embeddings.
- Initializing recurrent layer hidden states with the site-specific speaker embeddings.
- Concatenating the site-specific speaker embedding to the input at every timestep of the recurrent layer.
- Multiplying layer activations element-wise with the site-specific speaker embeddings.
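A condensed PyTorch sketch of how these patterns might fit together (our interpretation; the layer names and the tanh/sigmoid choices are ours, not the paper's):

```python
import torch
import torch.nn as nn

class SpeakerConditionedRNN(nn.Module):
    def __init__(self, n_speakers, spk_dim=16, in_dim=80, hid_dim=128):
        super().__init__()
        self.speaker = nn.Embedding(n_speakers, spk_dim)     # randomly initialised, trained jointly
        self.to_h0   = nn.Linear(spk_dim, hid_dim)           # site 1: initial hidden state
        self.to_in   = nn.Linear(spk_dim, spk_dim)           # site 2: concatenated at every timestep
        self.to_gate = nn.Linear(spk_dim, hid_dim)           # site 3: multiplies activations
        self.rnn = nn.GRU(in_dim + spk_dim, hid_dim, batch_first=True)

    def forward(self, x, speaker_id):
        # x: (batch, time, in_dim); speaker_id: (batch,)
        s = self.speaker(speaker_id)                          # (batch, spk_dim)
        h0 = torch.tanh(self.to_h0(s)).unsqueeze(0)           # site 1
        s_in = self.to_in(s).unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.rnn(torch.cat([x, s_in], dim=-1), h0)   # site 2
        return out * torch.sigmoid(self.to_gate(s)).unsqueeze(1)  # site 3: element-wise gating
```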
This talks about an API for framing L2S (learning to search) style problems in the style of an imperative program, which allows for two optimizations:
- memoization
- forced path collapse, getting losses without going to the last state
The main reduction here is to a cost-sensitive classification problem.
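A toy illustration of the imperative style (hypothetical API of our own, not the paper's actual interface): the learner calls predict() once per decision and declares the loss at the end, and the framework behind those calls is then free to memoize repeated states or cut a rollout short without changing this program.

```python
class GreedyReferenceSearcher:
    """Stub searcher that just follows the gold label; a real system would mix in a learned policy."""
    def predict(self, features, gold):
        return gold

    def loss(self, value):
        self.last_loss = value

def run_pos_tagging(searcher, tokens, gold_tags):
    mistakes = 0
    prev_tag = None
    for tok, gold in zip(tokens, gold_tags):
        tag = searcher.predict(features=(tok, prev_tag), gold=gold)  # one decision per token
        mistakes += int(tag != gold)
        prev_tag = tag
    searcher.loss(mistakes)   # declare the task loss; this is where the cost-sensitive reduction hooks in
    return mistakes

run_pos_tagging(GreedyReferenceSearcher(), ["dogs", "bark"], ["NOUN", "VERB"])
```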
Introductory paper on the general approach used in the learn project. The idea is to learn various generalizable syntactic and semantic relations from an unannotated corpus. The relations are expressed using graphs sitting on top of link grammar and meaning-text theory (MTT). While the general approach is sketched out decently enough, there are details to be filled in at various steps and experiments to run (as of the writing in 2014).
On another note, the document is a nice read because of the many interesting ways of looking at various ideas in understanding languages and going from syntax to reasoning via semantics.
We came here via OpenCog's learn project. It is also a nice perspective setter if you are missing a formal introduction to grammars. Overall, a link grammar defines connectors on the left and right side of a word, combined through disjunctions and conjunctions, which then link together to form a sentence under certain constraints.
This specific paper shows the formulation and creates a parser for English, covering many (but not all) linguistic phenomena.
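To make the connector idea concrete, here is a toy matcher over a drastically simplified dictionary (our own illustration; real link grammar adds disjunctions, costs, and planarity constraints):

```python
# Each word lists connectors that must hook left (-) or right (+); a sentence is
# accepted when every connector finds a matching partner.
toy_dict = {
    "the":    ["D+"],        # needs a noun to its right
    "cat":    ["D-", "S+"],  # needs a determiner to its left and a verb to its right
    "sleeps": ["S-"],        # needs a subject to its left
}

def links(sentence):
    """Greedily pair each X+ connector with the nearest later X- connector."""
    open_right, made = [], []
    for i, word in enumerate(sentence):
        for conn in toy_dict[word]:
            label, direction = conn[:-1], conn[-1]
            if direction == "+":
                open_right.append((label, i))
            else:  # try to close against an earlier matching '+' connector
                for j, (lab, src) in enumerate(open_right):
                    if lab == label:
                        made.append((sentence[src], sentence[i], label))
                        open_right.pop(j)
                        break
    return made if not open_right else None

print(links(["the", "cat", "sleeps"]))   # [('the', 'cat', 'D'), ('cat', 'sleeps', 'S')]
```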
This paper is a development over their previous work, the tuple-based end-to-end (TE2E) loss, for speaker identification. They generalize the cosine similarity used in TE2E by building a similarity matrix between utterance embeddings and speaker centroids. Two losses are suggested in the paper:
- Softmax loss
- Contrast loss
Both loss functions have two components: one that pulls a speaker's utterances together, and one that pushes apart utterances from different speakers. Of the two, the contrast loss is more rigorous.
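A rough numpy sketch of the two losses as we understood them (illustrative only; it ignores the learned scale and bias on the similarity, and the detail of excluding an utterance from its own speaker's centroid):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def ge2e_losses(embeddings):
    """embeddings: array of shape (n_speakers, n_utterances, dim)."""
    n_spk, n_utt, _ = embeddings.shape
    centroids = embeddings.mean(axis=1)                   # one centroid per speaker
    # similarity matrix: every utterance vs every speaker centroid
    sim = np.array([[[cosine(embeddings[j, i], centroids[k]) for k in range(n_spk)]
                     for i in range(n_utt)]
                    for j in range(n_spk)])               # (n_spk, n_utt, n_spk)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    softmax_loss = contrast_loss = 0.0
    for j in range(n_spk):
        for i in range(n_utt):
            row = sim[j, i]
            softmax_loss += -row[j] + np.log(np.exp(row).sum())    # pull to own centroid, push from all
            hardest = max(sigmoid(row[k]) for k in range(n_spk) if k != j)
            contrast_loss += 1 - sigmoid(row[j]) + hardest          # push only from the closest other speaker
    return softmax_loss, contrast_loss

print(ge2e_losses(np.random.randn(3, 4, 8)))
```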
This paper talks about developing an end-to-end model for intent recognition from speech. Current systems have several components like ASR and NLU, each with errors of their own that degrade the quality of the speech-to-intent pipeline. Experiments were performed with the model on two tasks, speech-to-domain and speech-to-intent. The architecture is mostly inspired by end-to-end speech synthesis models. A unique feature of the architecture is sub-sampling after the first GRU layer to reduce the length of the sequence and to tackle the problem of vanishing gradients.
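A small PyTorch sketch of the sub-sampling trick (our illustration; the layer sizes are made up): run the first GRU over all frames, then keep every other output frame before the next layer.

```python
import torch
import torch.nn as nn

class SpeechToIntent(nn.Module):
    def __init__(self, n_mels=40, hid=128, n_intents=10):
        super().__init__()
        self.gru1 = nn.GRU(n_mels, hid, batch_first=True)
        self.gru2 = nn.GRU(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, n_intents)

    def forward(self, frames):                 # frames: (batch, time, n_mels)
        h, _ = self.gru1(frames)
        h = h[:, ::2, :]                       # sub-sample: drop every other timestep
        h, last = self.gru2(h)
        return self.out(last[-1])              # intent logits from the final state

logits = SpeechToIntent()(torch.randn(2, 200, 40))
print(logits.shape)                            # torch.Size([2, 10])
```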
They take a regular classifier, pick out the logits before the softmax, and reinterpret them as an energy based model that can also give (unnormalized) density estimates p(x) in addition to the usual p(y|x).
Although the learning mechanism is a little fragile and needs work to be generally stable, the results are neat.
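A one-function sketch of the reinterpretation (our own, using the usual logits-as-negative-energies reading):

```python
import torch

def energy(logits):
    """E(x) = -logsumexp_y f(x)[y]; lower energy means higher unnormalised density
    for x, while softmax(logits) still gives p(y|x) as usual."""
    return -torch.logsumexp(logits, dim=-1)

print(energy(torch.tensor([[2.0, 0.5, -1.0]])))
```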
This is more about managing supervision than about models. There are three problems that they are trying to solve:
- Fine grained quality monitoring,
- Support for multi-component pipelines, and
- Updating supervision
For this, they build easy-to-use abstractions for describing supervision and developing models. They also do a lot of multitask learning and Snorkel-style weak supervision, including the recent slicing abstractions for fine-grained quality control.
While you have to adapt a few pieces for your own case (and scale), Overton is a nice testimony to the success of things like weak supervision and higher-level development abstractions in production.
This takes Snorkel's labelling function idea and uses it to group data instances into slices: segments that are interesting to us from an overall quality perspective. These slicing functions are important not only for identifying and narrowing down to specific kinds of data instances, but also for learning slice-specific representations, which works out to be a computationally cheap way (there are other benefits too) of replicating a Mixture-of-Experts style model.
As with labelling functions, slice membership is predicted using noisy heuristics. This membership value, along with slice representations (and slice prediction confidences), helps create the slice-aware representation used for the final task. The appendix has a few good examples of slicing functions.
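An illustrative slicing function in the spirit described above (our own toy example, not tied to any particular library's API):

```python
def slice_short_location_queries(example):
    """Membership heuristic for a slice we might care about: very short queries
    mentioning a location word, where a ranker often does poorly."""
    words = example["text"].lower().split()
    return len(words) <= 3 and bool(set(words) & {"near", "nearby", "in"})

examples = [{"text": "pizza near me"}, {"text": "book a table for four tomorrow evening"}]
print([slice_short_location_queries(e) for e in examples])   # [True, False]
```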
- Moody, C. E., Mixing dirichlet topic models and word embeddings to make lda2vec, arXiv preprint arXiv:1605.02019 (2016). (cite:moody2016mixing)
- Ren, L., Xie, K., Chen, L., & Yu, K., Towards universal dialogue state tracking, arXiv preprint arXiv:1810.09587 (2018). (cite:ren2018towards)
- Coucke, A., Saade, A., Ball, A., Bluche, T., Caulier, A., Leroy, D., Doumouro, C., …, Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces, CoRR, abs/1805.10190 (2018). (cite:DBLP:journals/corr/abs-1805-10190)
- Kim, S., Eriksson, T., Kang, H., & Hee Youn, D., A pitch synchronous feature extraction method for speaker recognition (2004). (cite:PSMFCC)
- Chen, J., Elements of human voice (2016). (cite:HumanVoice)
- Ghorbani, A., & Zou, J., Data shapley: equitable valuation of data for machine learning, arXiv preprint arXiv:1904.02868 (2019). (cite:ghorbani2019data)
- Shen, G., Horikawa, T., Majima, K., & Kamitani, Y., Deep image reconstruction from human brain activity, PLoS computational biology, 15(1), 1006633 (2019). (cite:shen2019deep)
- Daumé III, H., Frustratingly easy domain adaptation, arXiv preprint arXiv:0907.1815 (2009). (cite:daume2009frustratingly)
- Belkin, M., Hsu, D., Ma, S., & Mandal, S., Reconciling modern machine learning and the bias-variance trade-off, arXiv preprint arXiv:1812.11118 (2018). (cite:belkin2018reconciling)
- Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, B., & Bachem, O., Challenging common assumptions in the unsupervised learning of disentangled representations, arXiv preprint arXiv:1811.12359 (2018). (cite:locatello2018challenging)
- Advani, M. S., & Saxe, A. M., High-dimensional dynamics of generalization error in neural networks, arXiv preprint arXiv:1710.03667 (2017). (cite:advani2017high)
- Friedman, J., Hastie, T., & Tibshirani, R., The elements of statistical learning (pp. 51–61) (2001). Springer Series in Statistics, New York. (cite:friedman2001elements)
- Barham, P., & Isard, M., Machine learning systems are stuck in a rut, In Proceedings of the Workshop on Hot Topics in Operating Systems (pp. 177–183) (2019). New York, NY, USA: ACM. (cite:barham2019machine)
- Hastie, T., Montanari, A., Rosset, S., & Tibshirani, R. J., Surprises in high-dimensional ridgeless least squares interpolation, arXiv preprint arXiv:1903.08560 (2019). (cite:hastie2019surprises)
- Levitan, S. I., Mishra, T., & Bangalore, S., Automatic identification of gender from speech, In Proceedings of Speech Prosody (pp. 84–88) (2016). (cite:levitan2016automatic)
- Graf, S., Herbig, T., Buck, M., & Schmidt, G., Features for voice activity detection: a comparative analysis, EURASIP Journal on Advances in Signal Processing, 2015(1), 91 (2015). (cite:graf2015features)
- Welling, M., & Teh, Y. W., Bayesian learning via stochastic gradient langevin dynamics, In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 681–688) (2011). (cite:welling2011bayesian)
- Goodman, J., A bit of progress in language modeling, arXiv preprint arXiv:cs/0108005 (2001). (cite:goodman2001progress)
- Cotterell, R., Mielke, S. J., Eisner, J., & Roark, B., Are all languages equally hard to language-model?, In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 536–541) (2018). New Orleans, Louisiana: Association for Computational Linguistics. (cite:cotterell-etal-2018-languages)
- Reynolds, D. A., Quatieri, T. F., & Dunn, R. B., Speaker verification using adapted gaussian mixture models, Digital signal processing, 10(1-3), 19–41 (2000). (cite:reynolds2000speaker)
- Snoek, J., Larochelle, H., & Adams, R. P., Practical bayesian optimization of machine learning algorithms, arXiv preprint arXiv:1206.2944 (2012). (cite:snoek2012practical)
- Breck, E., Zinkevich, M., Polyzotis, N., Whang, S., & Roy, S., Data validation for machine learning, In Proceedings of SysML (2019). (cite:breck2019data)
- Carbonell, J. G., Learning by analogy: formulating and generalizing plans from past experience, In Machine Learning (pp. 137–161) (1983). Springer. (cite:carbonell1983learning)
- Liu, B., Wang, L., Liu, M., & Xu, C., Lifelong federated reinforcement learning: a learning architecture for navigation in cloud robotic systems, CoRR, abs/1901.06455 (2019). (cite:Liu2019LifelongFR)
- Mohri, M., Pereira, F., & Riley, M., Weighted finite-state transducers in speech recognition, Computer Speech & Language, 16(1), 69–88 (2002). (cite:MOHRI200269)
- Ueffing, N., Bisani, M., & Vozila, P., Improved models for automatic punctuation prediction for spoken and written text, In Interspeech (pp. 3097–3101) (2013). (cite:ueffing2013improved)
- Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., & Yu, S. X., Large-scale long-tailed recognition in an open world, arXiv preprint arXiv:1904.05160 (2019). (cite:liu2019large)
- Iyer, A., Jonnalagedda, M., Parthasarathy, S., Radhakrishna, A., & Rajamani, S. K., Synthesis and machine learning for heterogeneous extraction, In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (pp. 301–315) (2019). (cite:iyer2019synthesis)
- Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P., Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798 (2010). (cite:dehak2010front)
- Dehak, N., Dehak, R., Kenny, P., Brümmer, N., Ouellet, P., & Dumouchel, P., Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification, In Tenth Annual Conference of the International Speech Communication Association (2009). (cite:dehak2009support)
- Sutton, C., & McCallum, A., An introduction to conditional random fields for relational learning, In Introduction to Statistical Relational Learning (2006). (cite:sutton06introduction)
- Mendis, C., Droppo, J., Maleki, S., Musuvathi, M., Mytkowicz, T., & Zweig, G., Parallelizing wfst speech decoders, In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5325–5329) (2016). (cite:mendis2016parallelizing)
- Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., & others, A tutorial on thompson sampling, Foundations and Trends® in Machine Learning, 11(1), 1–96 (2018). (cite:russo2018tutorial)
- Gravano, A., Jansche, M., & Bacchiani, M., Restoring punctuation and capitalization in transcribed speech, In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4741–4744) (2009). (cite:gravano2009restoring)
- Mintz, M., Bills, S., Snow, R., & Jurafsky, D., Distant supervision for relation extraction without labeled data, In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 (pp. 1003–1011) (2009). (cite:mintz2009distant)
- Beygelzimer, A., Daumé, H., Langford, J., & Mineiro, P., Learning reductions that really work, Proceedings of the IEEE, 104(1), 136–147 (2016). (cite:beygelzimer2016learning)
- Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., …, Hidden technical debt in machine learning systems, In Advances in Neural Information Processing Systems (pp. 2503–2511) (2015). (cite:sculley2015hidden)
- Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., …, Google's neural machine translation system: bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144 (2016). (cite:wu2016google)
- Ghahramani, Z., Unsupervised learning, In Summer School on Machine Learning (pp. 72–112) (2003). (cite:ghahramani2003unsupervised)
- Hundman, K., Constantinou, V., Laporte, C., Colwell, I., & Soderstrom, T., Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding, In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 387–395) (2018). (cite:hundman2018detecting)