PaperHub
Overall rating: 4.8 / 10 (withdrawn) · 4 reviewers
Ratings: 5, 8, 3, 3 (min 3, max 8, std 2.0)
Average confidence: 3.8
ICLR 2024

Characterizing Long-Tail Categories on Graphs via A Theory-Driven Framework

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-03-26

Abstract

Keywords
long-tail learning

Reviews and Discussion

Review (Rating: 5)

The paper confronts a significant challenge in long-tailed classification on graphs. While most prior research concentrates on mitigating bias, this paper offers a fresh perspective by introducing a theoretical framework for characterizing long-tail categories and improving generalization in real-world scenarios. The authors present the TAIL2LEARN framework, encompassing hierarchical task grouping and long-tailed balanced contrastive learning. Notably, the experimental results demonstrate promising performance, outperforming state-of-the-art methods.

Strengths

  • The proposed approach is novel and addresses a significant gap in the existing literature by providing a theoretical foundation for long-tail classification on graphs. The motivation for this work is well-defined and highlights the need for a deeper understanding of class imbalances and generalization performance.
  • A notable strength of the paper is its comprehensive theoretical analysis, which includes the development of a Generalization Error Bound that substantiates the effectiveness of the proposed method.
  • The experimental results effectively illustrate the superiority of the proposed TAIL2LEARN framework. By showcasing its effectiveness in characterizing long-tail categories on real-world graph datasets, the authors provide practical evidence of their method's capabilities.

Weaknesses

  • One potential weakness of the paper is that the hierarchical task grouping approach employed by the authors seems similar to existing techniques like Graph U-Net [1]. Although the authors have extended these prior methods to facilitate multi-task learning and task grouping with theoretical backing, clarification is needed on what sets the TAIL2LEARN framework apart from the existing Graph U-Net. A more detailed comparison between the two would help readers understand the novelty and differentiation of the proposed framework.
  • While the authors have approached long-tailed classification as a multi-task learning problem, they have configured the number of tasks in the second layer to align with the number of categories. It might be worth considering whether the authors have explored the possibility of subdividing the samples into more finely-grained subclasses, which means increasing the number of tasks in the second layer beyond the number of categories.
  • The authors claimed that $\mathcal{L}_{BCL}$ potentially controls the range of losses for different tasks. However, the paper lacks experimental results to support this claim, which could contribute to a more robust evaluation of the method's effectiveness.

[1] Gao, H., and Ji, S. Graph U-Nets. International Conference on Machine Learning (ICML), PMLR, 2019: 2083-2092.

Questions

See above.

Comment

Q1: It may require clarification about what sets the TAIL2LEARN framework apart from the existing Graph U-Net.

A1: We are the first to consider the long-tail problem in the task space instead of the instance space. The proposed theoretical analysis and framework take the complexity of the task space into account. In particular, our model differs from Graph U-Net in the following aspects:

  • one of our key ideas is to group similar tasks into a hypertask, while Graph U-Net samples a subset of important nodes;

  • Graph U-Net keeps the indexes of the selected nodes for its unpooling operation, while our method preserves the hierarchical task information to capture the complex relationships among tasks (a hypothetical sketch of this task-grouping idea is given below).
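To make the contrast concrete, here is a minimal, hypothetical sketch of the task-grouping idea as a DiffPool-style soft assignment applied in the task space; the function name, tensor shapes, and learnable assignment matrix are illustrative assumptions, not the authors' actual implementation:

```python
import torch

def group_tasks_into_hypertasks(task_prototypes, assign_logits):
    """Hypothetical sketch of grouping class-level tasks into hypertasks.

    task_prototypes: (T, d)  one learned prototype per class-level task.
    assign_logits:   (T, T') learnable logits mapping T tasks to T' hypertasks.

    Unlike Graph U-Net, which keeps the indexes of the selected top-k nodes for
    its unpooling step, here the soft assignment matrix itself is retained so
    that the task -> hypertask hierarchy can be reused by later layers.
    """
    assign = torch.softmax(assign_logits, dim=1)           # soft task-to-hypertask grouping
    hypertask_prototypes = assign.t() @ task_prototypes    # (T', d) pooled hypertask prototypes
    return hypertask_prototypes, assign

# Usage: group 70 class-level tasks into 35 hypertasks (cf. the [70, 35] setting later in the thread).
protos = torch.randn(70, 64)
logits = torch.nn.Parameter(torch.randn(70, 35))
hyper_protos, assignment = group_tasks_into_hypertasks(protos, logits)
```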

 

Q2: It might be worth considering whether the authors have explored the possibility of subdividing the samples into more finely-grained subclasses.

A2: Thank you for your insightful comment!

  • Yes, the samples can be grouped at a finer granularity by increasing the number of tasks. There has been some work on overclustering, i.e., setting the number of clusters larger than the number of categories. The works [1,2] posit that overclustering can help the model learn expressive features.

  • We showed the impact of varying the number of tasks on model performance in Appendix F (last subplot of Figure 6) and have conducted additional experiments to explore this further. The results show that our model achieves strong performance when the hyperparameters lie within a reasonable range, with only a slight degradation when the number of hypertasks is small.

| Cora_Full | bAcc | Macro-F1 | G-Means | Acc |
| --- | --- | --- | --- | --- |
| [198, 70] | 55.5 | 56.7 | 74.2 | 64.6 |
| [70, 35] | 55.8 | 57.1 | 74.5 | 64.7 |
| [2, 1] | 54.9 | 56.8 | 73.9 | 65.5 |

[1] Ji, Xu, Andrea Vedaldi, and João F. Henriques. "Invariant information clustering for unsupervised image classification and segmentation." ICCV, 2019.

[2] Kim, Yunji, and Jung-Woo Ha. "Contrastive fine-grained class clustering via generative adversarial networks." ICLR, 2022.

 

Q3: The authors claimed that $\mathcal{L}_{BCL}$ potentially controls the range of losses for different tasks. However, the paper lacks experimental results to support this claim.

A3: The loss range is defined as $\max_{t}\frac{1}{n_t}\sum_{i=1}^{n_t}l(f_t(h(\mathbf{x}^t_{i})), y^t_{i}) - \min_{t}\frac{1}{n_t}\sum_{i=1}^{n_t}l(f_t(h(\mathbf{x}^t_{i})), y^t_{i})$. Intuitively, we assume that the model with $\mathcal{L}_{BCL}$ maintains good performance on head tasks, so the second term of the range remains almost constant; the balanced contrastive learning, however, improves performance on the tail tasks, so the first term decreases, and therefore the loss range is reduced. Moreover, in the table below we empirically show that using $\mathcal{L}_{BCL}$ improves the model performance from 54.6 to 55.8 (bAcc) on the Cora_Full dataset, which also demonstrates the efficacy of our balanced contrastive learning. (A small code sketch of the loss-range computation is given after the table below.)

| M1 | $\mathcal{L}_{BCL}$ | $\mathcal{L}_{SCL}$ | $\mathcal{L}_{CE}$ | bAcc. | Macro-F1 | G-Means | Acc. |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  | **55.8**±0.5 | **57.1**±0.5 | **74.5**±0.3 | **64.7**±0.7 |
|  |  |  |  | 54.6±0.4 | 56.2±0.4 | 73.7±0.3 | 64.3±0.4 |
|  |  |  |  | 52.8±0.6 | 54.5±0.7 | 72.5±0.4 | 62.7±0.5 |
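To make the quantity concrete, here is a minimal NumPy sketch of the loss-range computation defined above; the dictionary keys and toy loss values are illustrative assumptions, not numbers from the paper:

```python
import numpy as np

def loss_range(per_task_losses):
    """Loss range: max_t (1/n_t) sum_i l_i  minus  min_t (1/n_t) sum_i l_i.

    per_task_losses: dict mapping task t -> array of per-sample losses
                     l(f_t(h(x_i^t)), y_i^t) over the n_t samples of task t.
    """
    mean_losses = [np.mean(v) for v in per_task_losses.values()]
    return max(mean_losses) - min(mean_losses)

# Toy illustration: a head task with low average loss and a tail task with high loss.
example = {
    "head_task": np.array([0.20, 0.30, 0.25, 0.22]),
    "tail_task": np.array([1.40, 1.10]),
}
print(loss_range(example))  # ~1.01; lowering the tail-task loss shrinks the range
```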
Comment

The author has partially addressed my concerns and I will keep my score unchanged.

Review (Rating: 8)

While current methods focusing on the long-tail problem in graphs have shown notable improvements, this work, Tail2Learn, approaches it from a different perspective and formulates long-tail classification as a multi-task learning problem. Built upon theoretical findings, it controls the complexity of the task space and the loss range of the task-specific classifiers by offering remedies such as hierarchical task grouping and long-tailed balanced contrastive learning. The experiments on node classification show the efficacy of Tail2Learn on real-world long-tailed graph datasets.

Strengths

  1. I quite enjoyed reading this paper. Overall, the claims of this paper are well-formulated, and its remedies for the theoretical findings are well-supported.

  2. The proposed Definition 1, Long-Tailedness Ratio, is intuitive and straightforward. This metric generalizes to balanced cases: for example, 5 classes with 20 training samples each would give a value of 4 at the 80th percentile. This contribution would further enrich the long-tail GNN community.

  3. The empirical performance aligns with the theoretical motivation. Also, the paper is well-written and easy to follow.

Weaknesses

Major

  1. In M1. Hierarchical Task Grouping, I agree that this approach can reduce label scarcity and task complexity. However, I am concerned whether hierarchical grouping across different classes might compromise the distinctiveness of each class. That is, there could be a trade-off between achieving reduced complexity and maintaining distinctiveness among classes. Although there exists a module for contrastive loss between different classes, its contribution remains unclear. A more detailed discussion of such situations should be provided.

  2. In M2. Long-Tail Balanced Contrastive Learning, the utilization of supervised contrastive loss seems reasonable. However, given the long-tail situation, there would be very few training samples with labels for tail classes. Consequently, the positive pairs within tail classes would be significantly fewer compared to the head classes. Can you elucidate how Tail2Learn can work effectively in this scenario?

  3. Although the overall performance of Tail2Learn is effective in current datasets, can you provide more details about the improvements made in tail classes as shown in Figure 4 in LTE4G [1]? This would offer a more comprehensive understanding of Tail2Learn's efficacy in terms of improvement in tail classes without sacrificing performance in head classes.

  4. Can Tail2Learn generalize well on graph datasets having a relatively small number of classes such as Cora, CiteSeer, and PubMed?

Minor

  1. Although Definition 1, Long-Tailedness Ratio, is well-designed, at first glance I expected its semantic meaning to be "how severely long-tailed the data distribution is." In actuality, however, a larger value indicates a lower severity of long-tailedness. Have you considered the reciprocal version of the current long-tailedness ratio?

  2. The notation (e.g., subscripts) in Equation 6 and Equation 7 appears to be exactly the same, while the underlying meaning is different. For clarity, I suggest differentiating the notations that denote specific classes and specific hypertasks, as they do not necessarily have to be the same value.

  3. The performance of ImGAGN [2] in Table 1 seems unusually low compared to classical GNN, although it is originally designed to alleviate class long-tailedness. Can you provide further explanations for this?

If the above concerns are properly addressed, I would be very happy to raise my score on the current rating.

[1] [CIKM 2022] LTE4G: Long-Tail Experts for Graph Neural Networks
[2] [KDD 2021] ImGAGN: Imbalanced Network Embedding via Generative Adversarial Graph Networks

Questions

See the Weaknesses.

Comment

Q4: Can Tail2Learn generalize well on graph datasets having a relatively small number of classes such as Cora, CiteSeer, and PubMed?

A4:

  • The datasets Cora, Citeseer, and PubMed are widely used graph datasets. However, they exhibit a relatively limited number of categories, a relatively low degree of imbalance, and do not follow a long-tail distribution. Hence these datasets were initially excluded from our paper.

  • We have evaluated our model on these datasets against the baselines. The experimental results show that our model performs strongly on them. We will provide the complete results in the appendix of a later version.

PubMed

| Method | bAcc | Macro-F1 | G-Means | Acc |
| --- | --- | --- | --- | --- |
| Origin | 84.5±0.2 | 84.6±0.2 | 88.3±0.2 | 85.1±0.2 |
| Over-sampling | 83.3±0.4 | 83.7±0.4 | 87.4±0.3 | 84.4±0.4 |
| Re-weight | 84.5±0.2 | 84.6±0.2 | 88.3±0.2 | 85.1±0.2 |
| SMOTE | 83.5±0.5 | 83.9±0.4 | 87.6±0.4 | 84.5±0.4 |
| Embed-SMOTE | 84.5±0.3 | 84.6±0.2 | 88.3±0.2 | 85.1±0.2 |
| GraphSMOTE_T | 84.3±0.3 | 84.4±0.2 | 88.2±0.2 | 85.1±0.2 |
| GraphSMOTE_O | 84.3±0.2 | 84.3±0.3 | 88.1±0.2 | 85.0±0.3 |
| LTE4G | 85.5±0.2 | 84.9±0.2 | 89.0±0.1 | 85.4±0.2 |
| Ours | 85.8±0.3 | 86.0±0.2 | 89.6±0.2 | 86.6±0.2 |

Cora

| Method | bAcc | Macro-F1 | G-Means | Acc |
| --- | --- | --- | --- | --- |
| Origin | 78.5±1.0 | 80.0±0.7 | 87.1±0.6 | 81.6±0.6 |
| Over-sampling | 77.2±1.0 | 79.2±0.7 | 86.4±0.6 | 81.0±0.7 |
| Re-weight | 79.0±0.7 | 80.1±0.8 | 87.4±0.5 | 81.7±0.7 |
| SMOTE | 76.7±1.1 | 78.8±0.8 | 86.0±0.7 | 80.7±0.7 |
| Embed-SMOTE | 78.9±0.9 | 80.0±0.7 | 87.3±0.5 | 81.6±0.6 |
| GraphSMOTE_T | 80.8±1.0 | 81.5±1.1 | 88.5±0.6 | 83.0±0.9 |
| GraphSMOTE_O | 80.5±1.1 | 81.3±1.3 | 88.3±0.7 | 82.8±1.1 |
| LTE4G | 81.4±0.8 | 81.3±0.9 | 88.9±0.5 | 83.0±0.9 |
| Ours | 81.4±0.6 | 82.3±0.6 | 88.9±0.4 | 83.6±0.6 |

 

Q5: Have you considered the reciprocal version of the current long-tailedness ratio?

A5: Thanks for your suggestion! We will provide detailed discussions and data statistics using the reciprocal of the long-tailedness ratio in a later version.

 

Q6: The notation (e.g., subscripts) in Equation 6 and Equation 7 appears to be exactly the same, while the underlying meaning is different.

A6: Thanks for your suggestion! We propose using $\prime$ to distinguish classes from hypertasks. For example, $j \in \mathcal{V}_t \backslash i$ would denote all nodes belonging to category $t$ except $i$, and $j \in \mathcal{V}_{t^\prime} \backslash i \cup \{z_{t^\prime}\}$ would denote all nodes belonging to hypertask $t^\prime$ except $i$, including the prototype $z_{t^\prime}$ of hypertask $t^\prime$. We seek your perspective on the proposed change and intend to apply it in a later version if you find it clearer.

 

Q7: The performance of ImGAGN [2] in Table 1 seems unusually low compared to classical GNN.

A7: In the default setting of ImGAGN, it considers only one class as the tail category and generates synthetic minority nodes. Therefore, the performance of the method is relatively weak. We adjusted the number of categories considered as tail categories in ImGAGN, and the experimental results did show an improvement in performance.

| Cora_Full | bAcc | Macro-F1 | G-Means | Acc |
| --- | --- | --- | --- | --- |
| ImGAGN (1 tail) | 9.3±1.1 | 6.6±1.0 | 30.2±1.9 | 20.9±2.1 |
| ImGAGN (7 tail) | 15.8±1.6 | 12.4±1.6 | 39.5±2.1 | 32.2±2.7 |
Comment

I appreciate the author's efforts during the rebuttal process.
All of my concerns have been alleviated, and as a result, I will raise my score to 8.

Comment

Q1: There could be a trade-off between achieving reduced complexity and maintaining distinctiveness among classes.

A1: Thank you for your insightful comment!

  • Yes, there has been some work on overclustering, i.e., setting the number of clusters larger than the number of categories. The works [1,2] posit that overclustering can help the model learn expressive features.

  • We showed the impact of varying the number of hypertasks on model performance in Appendix F (last subplot of Figure 6). We have also conducted additional experiments covering the cases where the number of hypertasks is greater than, equal to, or less than the number of categories. The results show that our model (which also uses cross-entropy loss and contrastive loss) ensures sufficient distinctiveness and achieves strong performance. There is only a slight performance degradation when the number of hypertasks is small (i.e., far smaller than the number of classes), potentially attributable to loss of distinctiveness.

| Cora_Full | bAcc | Macro-F1 | G-Means | Acc |
| --- | --- | --- | --- | --- |
| [198, 70] | 55.5 | 56.7 | 74.2 | 64.6 |
| [70, 35] | 55.8 | 57.1 | 74.5 | 64.7 |
| [2, 1] | 54.9 | 56.8 | 73.9 | 65.5 |

[1] Ji, Xu, Andrea Vedaldi, and João F. Henriques. "Invariant information clustering for unsupervised image classification and segmentation." ICCV, 2019.

[2] Kim, Yunji, and Jung-Woo Ha. "Contrastive fine-grained class clustering via generative adversarial networks." ICLR, 2022.

 

Q2: Given the long-tail situation, there would be very few training samples with labels for tail classes. Can you elucidate how Tail2Learn can work effectively in this scenario?

A2: Yes, we use the balanced contrastive loss to mitigate this problem, shown below (Section 3.2, M2) alongside the classic supervised contrastive loss for comparison.

$$\mathcal{L}_{SCL}(\mathbf{z}_i) = -\frac{1}{n_t - 1} \sum_{j \in \mathcal{V}_t \backslash i} \log \frac{\exp\left(\mathbf{z}_i \cdot \mathbf{z}_j / \tau\right)}{\sum_{1 \leq q \leq T} \frac{1}{n_q} \sum_{k \in \mathcal{V}_q} \exp\left(\mathbf{z}_i \cdot \mathbf{z}_k / \tau\right)}$$

$$\mathcal{L}(\mathbf{z}_i) = -\frac{1}{n_t - 1} \sum_{j \in \mathcal{V}_t \backslash i} \log \frac{\exp\left(\mathbf{z}_i \cdot \mathbf{z}_j / \tau\right)}{\sum_{k \in \mathcal{V}} \exp\left(\mathbf{z}_i \cdot \mathbf{z}_k / \tau\right)}$$

Note that:

  • Since there are significant differences in the number of samples between the head and tail categories, the head categories dominate the denominator of the classic contrastive loss. To address this issue, our loss function incorporates a category-based normalization in the denominator, mitigating the dominance of the head categories (a minimal code sketch of this normalization is given after this list).

  • In addition, we normalize by the number of positive pairs ($n_t - 1$) to compute an average instead of a sum, which alleviates the label imbalance issue.

  • When applied to hypertasks, the node set $\mathcal{V}_t$ includes both the task nodes belonging to hypertask $t$ and the prototypes of hypertask $t$, which increases the number of positive pairs and especially helps tail categories.
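To illustrate the category-based normalization above, here is a minimal PyTorch-style sketch of the first (class-balanced) loss for a single anchor node; the function name, tensor shapes, and the toy example are illustrative assumptions, not the authors' released code:

```python
import torch

def balanced_contrastive_loss(z, labels, i, tau=0.5):
    """Class-balanced contrastive loss for a single anchor node i.

    z:      (N, d) node embeddings (assumed L2-normalized).
    labels: (N,) integer class / hypertask assignment of each node.
    Assumes node i has at least one other node with the same label.
    The denominator averages exp-similarities within each class before summing
    over classes, so head classes with many samples no longer dominate it.
    """
    sims = torch.exp(z @ z[i] / tau)           # exp(z_i . z_k / tau) for every node k
    pos_mask = labels == labels[i]
    pos_mask[i] = False                        # positives: same class as i, excluding i
    n_t = int(pos_mask.sum()) + 1              # class size n_t (anchor included)

    denom = torch.zeros(())
    for q in torch.unique(labels):             # class-balanced denominator
        mask_q = labels == q
        denom = denom + sims[mask_q].mean()    # (1 / n_q) * sum_{k in V_q} exp(...)

    log_probs = torch.log(sims[pos_mask] / denom)
    return -log_probs.sum() / (n_t - 1)        # average over the n_t - 1 positive pairs

# Toy usage: class 0 is a "head" class with 4 nodes, class 1 a "tail" class with 2.
z = torch.nn.functional.normalize(torch.randn(6, 8), dim=1)
labels = torch.tensor([0, 0, 0, 0, 1, 1])
print(balanced_contrastive_loss(z, labels, i=4))
```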

 

Q3: Can you provide more details about the improvements made in tail classes as shown in Figure 4 in LTE4G [1]?

A3: Thanks for your valuable suggestion! In Figure 1, we plot the model performance on each category, showing that our model outperforms the original GCN method. In addition, we have added the following figure to provide more details (https://anonymous.4open.science/r/Tail2Learn-CE08/figs/resultPerClass.pdf).

Review (Rating: 3)

The paper studies long-tail categories in graphs. It proposes a generalization bound for long-tail classification on graphs, as well as a method, TAIL2LEARN, for long-tailed classification on graphs. The method includes a hierarchical task grouping module to reduce the complexity of the task space and a contrastive learning module to balance the gradients of head and tail classes. Experiments are conducted to evaluate the method.

Strengths

  1. The paper provides theoretical studies and arrives at a generalization bound.

  2. The paper presentation includes rich contents, with tables and figures well organized.

  3. The conducted experiments look correct with ablation studies included and code provided.

Weaknesses

  1. Related work is not well addressed. The long-tail category problem studied in this paper is the same as the node-level imbalanced-class problem on graphs. The imbalanced-class problem has been studied intensively for graphs and is closely related to this work, yet it is not sufficiently discussed in the related work section. The paper lacks a thorough review of the related literature. Some missing related works are [1-6].

  2. Following the above point, the experiments should include some of the missing imbalanced class baselines.

  3. The correctness of Corollary 1 is unclear. Why can contrastive learning guarantee to learn predictors $f_1^{(l)}, \ldots, f_T^{(l)}$ with $\mathrm{Range}(f_1^{(l)}, \ldots, f_T^{(l)}) < \mathrm{Range}(f_1, \ldots, f_T)$? In its proof, why do we only need to compare the relationship between $\sum_t 1/n_t^{(l)}$ and $\sum_t 1/n_t$? And how is the special case of all nodes in one hypertask generalized to prove $\sum_t 1/n_t^{(l)} \leq \sum_t 1/n_t$? The proof should be clearly given step by step instead of ambiguously stated.

[1] Imgcl: Revisiting graph contrastive learning on imbalanced node classification

[2] Boosting-GNN: boosting algorithm for graph networks on imbalanced node classification

[3] Graph neural network with curriculum learning for imbalanced node classification

[4] Co-Modality Graph Contrastive Learning for Imbalanced Node Classification

[5] Diving into Unified Data-Model Sparsity for Class-Imbalanced Graph Representation Learning

[6] TAM: topology-aware margin loss for class-imbalanced node classification

Questions

Please see Weaknesses.

Comment

Q: Related works not well addressed. The long-tail categories studied in the paper is the same as the node-level imbalanced-class problem in graph.

A: Thank you for your suggestion!

  • We would like to respectfully emphasize, as identified in the introduction, that there are three fundamental challenges in long-tail classification (C1. highly skewed data distribution, C2. label scarcity, and C3. task complexity). While there is some similarity between long-tail and imbalanced classification, long-tail learning focuses on the challenges arising from a significantly larger number of tail classes. Consequently, the two problem domains are not the same.

  • We have cited works [1-6] on class imbalance in the related work section. Furthermore, we have introduced the TAM method as a baseline in our experiments. We will provide the complete experimental results in a later version.

Email

| Method | bAcc | Macro-F1 | G-Means | Acc |
| --- | --- | --- | --- | --- |
| Origin | 48.9±4.5 | 45.2±4.3 | 69.5±3.2 | 66.7±2.1 |
| TAM | 48.9±4.6 | 40.3±3.5 | 69.5±3.3 | 54.6±4.3 |
| Ours | 50.5±3.0 | 46.6±3.0 | 70.7±2.1 | 65.4±1.7 |

Wiki

| Method | bAcc | Macro-F1 | G-Means | Acc |
| --- | --- | --- | --- | --- |
| Origin | 48.2±1.5 | 49.9±1.9 | 68.6±1.1 | 64.2±0.9 |
| TAM | 49.3±2.1 | 49.1±1.6 | 69.3±1.5 | 62.9±1.5 |
| Ours | 52.8±2.0 | 54.1±2.3 | 71.9±1.4 | 67.2±1.1 |

Amazon-Clothing

| Method | bAcc | Macro-F1 | G-Means | Acc |
| --- | --- | --- | --- | --- |
| Origin | 9.9±0.2 | 9.5±0.2 | 31.3±0.3 | 9.9±0.2 |
| TAM | 10.9±2.5 | 9.8±2.4 | 32.6±3.9 | 10.9±2.5 |
| Ours | 17.1±0.5 | 16.8±0.6 | 41.1±0.6 | 17.1±0.5 |

Amazon-Electronics

| Method | bAcc | Macro-F1 | G-Means | Acc |
| --- | --- | --- | --- | --- |
| Origin | 16.9±0.2 | 15.2±0.2 | 41.0±0.3 | 16.9±0.2 |
| TAM | 18.5±3.4 | 15.8±3.5 | 42.7±4.0 | 18.5±3.4 |
| Ours | 23.6±0.9 | 21.0±1.3 | 48.5±1.0 | 23.6±0.9 |
Review (Rating: 3)

This paper investigates long-tailed classification on graphs by providing a PAC generalization bound in a multi-task learning fashion, characterized by the number of tasks and the overall loss range. As a solution, the authors propose Tail2Learn, a learning framework for long-tailed node classification that reduces task complexity by hierarchically grouping tasks and adopts a contrastive loss to adaptively balance the gradients of head and tail classes in order to control the loss range.

Strengths

  • The method presented in this paper is straightforward and easily comprehensible.

  • The approach of addressing the long-tail problem through a multi-task learning perspective appears to be original.

  • The empirical results in the experiment section indicate a promising improvement compared to the baseline methods.

Weaknesses

  • The theoretical aspect of the paper appears to be quite preliminary, lacking in-depth analysis and original contributions. It appears to heavily rely on Theorem 8 from a previous work [1]. Additionally, some statements and derivations are unclear and contain errors, making it challenging to verify their correctness. For specific concerns, please refer to the detailed questions.

  • While the shift in perspective towards multi-task learning is novel, the proposed method is essentially a combination of existing, well-known techniques, such as hierarchy graph pooling [2] and contrastive loss [3].

  • The discussion and comparison of some relevant work (such as [4][5]) are missing in this paper.

[1] Maurer et al., The Benefit of Multitask Representation Learning, 2016

[2] Ying et al., Hierarchical graph representation learning with differentiable pooling, 2018

[3] Zheng et al., Tackling Oversmoothing of GNNs with Contrastive Learning, 2021

[4] Zhang et al., Graph-less Neural Networks: Teaching Old MLPs New Tricks via Distillation, 2022

[5] Zheng et al., Cold Brew: Distilling Graph Node Representations with Incomplete or Missing Neighborhood, 2022

Questions

  1. In Lemma 1, after applying the PAC bound provided by [1], how is the normalization term $1/T$ eliminated? If $1/T$ needs to remain, I would question one of the main claims in the abstract: "generalization performance of long-tailed classification is dominated by the total number of tasks." This is because, after accounting for $1/T$, such a bound may no longer scale with the number of tasks. Actually, Theorem 2 in [1] even suggests the generalization error decays as $O(1/\sqrt{T})$ when transferred to a new task.

  2. In Lemma 3, why is the definition of $R(F)$ (Eq. 11) different from Eq. 4 in [1]? Note that Eq. 4 has $f(y) - f(y')$ in the numerator, while Eq. 11 has $l(f(y) - f(y'))$ in the numerator.

  3. In the proof of Corollary 1, it remains unclear to me how the inequality between the last terms (under the square root) is established.

  4. Why is this analysis focused specifically on long-tail classification on graphs? Can it be extended to the general long-tail learning problem?

  5. Can the authors explain why the proposed approach, Tail2Learn, takes on the form of $f \circ h$, resembling a general multi-task learning framework?

  6. It is known that task complexity is not always harmful. Instead, improving task diversity can be helpful for multi-task learning [2][3]. Does this contradict this paper's claim?

[1] Maurer et al., The Benefit of Multitask Representation Learning, 2016

[2] Tripuraneni et al., On the Theory of Transfer Learning: The Importance of Task Diversity, 2020

[3] Du et al., Few-Shot Learning via Learning the Representation, Provably, 2020

Comment

Thank you for your careful review; it is very helpful for enhancing the quality of our paper. Yes, Theorem 1 should retain the normalization term $1/T$. We are still revising our paper to accommodate this change.

In addition, we think that 'improving task diversity can be helpful for multi-task learning' does not contradict our paper. While our paper incorporates the paradigm of multi-task learning, it is crucial to highlight the fundamental differences between long-tail learning and multi-task learning: long-tail learning not only involves a large number of classes but also exhibits substantial class-membership imbalance.

  • In the context of multi-task learning discussed in previous work [1], each task corresponds to $n$ observed input samples, and enhancing task diversity proves beneficial for performance. Increasing the number of tasks $T$ decreases the cost associated with estimating the representation $h$, while increasing the number of samples $n$ decreases the cost of estimating the task-specific predictors.

  • On the other hand, in long-tail learning, where we define task complexity for situations in which the vast majority of samples belong to a few head classes and only a minority of samples are allocated to tail classes, increasing the number of classes (particularly tail classes) may increase the difficulty of learning a salient representation and of depicting the support regions of each class. In long-tail learning, task complexity depends on $n_1, \ldots, n_T$ and $T$, so increasing it does not reduce the generalization error.

  • Our experimental results show that hierarchical grouping yields strong performance (Table 1), especially in the harsh long-tail setting (Table 2).

Further, we argue that in the long-tail setting, the increase in task complexity, especially considering that tail categories have very few samples, raises the difficulty of correct classification. This is because the shared representation $h$ can be considered a consensus of different views. The cost of maximizing consistency and learning a uniform representation from multiple subspaces increases as the number of categories (especially tail categories) grows. Moreover, when the number of categories increases, the loss range remains constant or increases.

[1] Maurer et al., The Benefit of Multitask Representation Learning, 2016.